# Specification Dilemma Experiment

A small empirical probe of the claim that sparse-specification prompts
produce more homogeneous outputs across users than dense-specification
prompts. See `experiment.pdf` for the full specification.
## Design note: matched pairs

`experiment.tex` describes a between-groups design with 30 independently drawn
sparse prompts and 30 independently drawn dense prompts. The prompts in this
repo follow a tighter variant: **matched pairs**. Each of 30 imagined users
has a fixed underlying intent (audience, thesis, tone, voice, opening move,
structural constraint). `prompts/dense.json[i]` expresses that user's full
intent; `prompts/sparse.json[i]` is what the same user would type when
underspecifying: topic only, in roughly their natural register. The sparse
prompts carry no audience, thesis, tone, or structural specification.

The statistical comparison is unchanged (cross-user pairwise similarity in
each condition), but the two conditions now sample the same population of
underlying intents. This tests the sharper claim: when users with divergent
intents underspecify, outputs converge (priors dominate); when they specify
fully, outputs diverge (intents dominate).
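Concretely, a matched pair might look like the following (both strings are
invented for illustration; the real pairs live in `prompts/dense.json` and
`prompts/sparse.json`):

```
# Hypothetical matched pair at index i (illustration only, not from the repo).
dense_prompt = (
    "Write a 500-word letter to my city council opposing the Route 9 "
    "rezoning. Audience: skeptical council members. Open with a traffic "
    "count, keep a civil but firm tone, and end with one specific ask."
)
sparse_prompt = "write something about the route 9 rezoning"
```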
## Setup

1. Install LMStudio and download a strong instruction-tuned model
   (e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct).
2. Start the LMStudio local server (default: `localhost:1234`).
3. Create the environment and install dependencies with `uv`:

   ```
   uv sync
   ```

   (or `pip install -r requirements.txt` inside a venv if not using uv)
4. Edit `config.yaml` if your LMStudio model name or port differs from
   the defaults. If LMStudio is on a remote host, point `lmstudio.base_url`
   at that host (e.g. `http://<host>:1234/v1`).
5. Smoke-test the endpoint (checks connectivity, seed-honoring, and
   approximate per-generation latency):

   ```
   uv run python smoke_test.py
   ```
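`smoke_test.py` is the supported check; if you want to probe the endpoint by
hand first, a minimal seed-honoring test looks roughly like the sketch below.
The `/v1/chat/completions` route is LMStudio's OpenAI-compatible API; the
model name is a placeholder, and the `seed` field is the OpenAI-style
parameter the smoke test verifies:

```
import requests

URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen2.5-72b-instruct",  # placeholder; use the name LMStudio reports
    "messages": [{"role": "user", "content": "Name one color."}],
    "temperature": 0.7,
    "seed": 42,  # two runs with the same seed should return identical text
    "max_tokens": 16,
}

texts = []
for _ in range(2):
    r = requests.post(URL, json=payload, timeout=60)
    r.raise_for_status()
    texts.append(r.json()["choices"][0]["message"]["content"])

print("seed honored:", texts[0] == texts[1])
```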
## Running

Freeze your prompts in `prompts/sparse.json` and `prompts/dense.json`
before generating anything.
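A quick structural check on the frozen files, assuming each is a flat JSON
array of 30 strings (the schema here is an assumption; adapt it if the real
files differ):

```
import json

with open("prompts/sparse.json") as f:
    sparse = json.load(f)
with open("prompts/dense.json") as f:
    dense = json.load(f)

# Matched-pairs invariant: index i in both files is the same imagined user.
assert len(sparse) == len(dense) == 30
# Heuristic, not a guarantee: a sparse prompt should be much shorter than
# its dense counterpart, since it drops audience/thesis/tone/structure.
for i, (s, d) in enumerate(zip(sparse, dense)):
    assert len(s) < len(d), f"pair {i}: sparse prompt is not shorter"
```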
Then run the full pipeline:
```
uv run python run_all.py
```
Or run steps individually:
```
uv run python generate.py     # LMStudio generations
uv run python embed.py        # sentence embeddings
uv run python similarity.py   # pairwise cosine similarities
uv run python stats.py        # t-test, Mann-Whitney, bootstrap, Cohen's d
uv run python plot.py         # violin plot
```
## Outputs

- `outputs/{sparse,dense}/NN.txt` : raw model completions
- `embeddings/{sparse,dense}.npy` : L2-normalized embedding matrices
- `results/pairwise.csv` : all pairwise similarities
- `results/stats.json` : test statistics and summary
- `results/plot.png` : similarity distribution plot
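Because each embedding matrix has L2-normalized rows, cosine similarity
reduces to a dot product, so the cross-user comparison amounts to something
like this sketch (upper triangle only, so each user pair is counted once):

```
import numpy as np

emb = np.load("embeddings/sparse.npy")   # shape (30, dim), rows L2-normalized
sims = emb @ emb.T                       # cosine similarity == dot product here
iu = np.triu_indices(len(emb), k=1)      # distinct pairs, excluding self-pairs
pairwise = sims[iu]                      # 30 * 29 / 2 = 435 similarities
print(f"mean cross-user similarity: {pairwise.mean():.3f}")
```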
## Interpretation

A positive result: sparse-condition mean pairwise similarity is
meaningfully higher than dense-condition mean similarity, the
bootstrap 95% CI on the difference excludes 0, and Cohen's d is
large (>0.8).
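`stats.py` is the source of truth for the computation; the quantities named
above follow the standard definitions, roughly as below (`a` and `b` are the
sparse- and dense-condition similarity arrays; pooled-SD Cohen's d, percentile
bootstrap on the difference in means):

```
import numpy as np

def cohens_d(a, b):
    # Pooled-standard-deviation form of Cohen's d.
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def bootstrap_ci(a, b, n_boot=10_000, seed=0):
    # Percentile bootstrap 95% CI on the difference in mean similarity.
    rng = np.random.default_rng(seed)
    diffs = [rng.choice(a, size=len(a)).mean() - rng.choice(b, size=len(b)).mean()
             for _ in range(n_boot)]
    return np.percentile(diffs, [2.5, 97.5])
```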
A null or inverted result is also interesting and should be reported
honestly.