# Specification Dilemma Experiment

A small empirical probe of the claim that sparse-specification prompts
produce more homogeneous outputs across users than dense-specification
prompts. See `experiment.pdf` for the full specification.
## Design note: matched pairs

`experiment.tex` describes a between-groups design with 30 independently drawn
sparse prompts and 30 independently drawn dense prompts. The prompts in this
repo follow a tighter variant: **matched pairs**. Each of 30 imagined users
has a fixed underlying intent (audience, thesis, tone, voice, opening move,
structural constraint). `prompts/dense.json[i]` expresses that user's full
intent; `prompts/sparse.json[i]` is what the same user would type when
underspecifying: topic only, in roughly their natural register. The sparse
prompts carry no audience, thesis, tone, or structural specification.

The statistical comparison is unchanged (cross-user pairwise similarity in
each condition), but the two conditions now sample the same population of
underlying intents. This tests the sharper claim: when users with divergent
intents underspecify, outputs converge (priors dominate); when they specify
fully, outputs diverge (intents dominate).
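Concretely, a matched pair might look like the following (both strings are
invented for illustration; the real pairs live in `prompts/dense.json` and
`prompts/sparse.json`):

```
# Hypothetical matched pair at index i (illustration only, not from the repo).
dense_prompt = (
    "Write a 500-word letter to my city council opposing the Route 9 "
    "rezoning. Audience: skeptical council members. Open with a traffic "
    "count, keep a civil but firm tone, and end with one specific ask."
)
sparse_prompt = "write something about the route 9 rezoning"
```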
## Setup

1. Install LMStudio and download a strong instruction-tuned model
   (e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct).
2. Start the LMStudio local server (default: `localhost:1234`).
3. Create the environment and install dependencies with `uv`:

   ```
   uv sync
   ```

   (or `pip install -r requirements.txt` inside a venv if not using uv)
4. Edit `config.yaml` if your LMStudio model name or port differs from
   the defaults. If LMStudio is on a remote host, point `lmstudio.base_url`
   at that host (e.g. `http://<host>:1234/v1`).
5. Smoke-test the endpoint (checks connectivity, seed-honoring, and
   approximate per-generation latency):

   ```
   uv run python smoke_test.py
   ```
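`smoke_test.py` is the supported check; if you want to probe the endpoint by
hand first, a minimal seed-honoring test looks roughly like the sketch below.
The `/v1/chat/completions` route is LMStudio's OpenAI-compatible API; the
model name is a placeholder, and the `seed` field is the OpenAI-style
parameter the smoke test verifies:

```
import requests

URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen2.5-72b-instruct",  # placeholder; use the name LMStudio reports
    "messages": [{"role": "user", "content": "Name one color."}],
    "temperature": 0.7,
    "seed": 42,  # two runs with the same seed should return identical text
    "max_tokens": 16,
}

texts = []
for _ in range(2):
    r = requests.post(URL, json=payload, timeout=60)
    r.raise_for_status()
    texts.append(r.json()["choices"][0]["message"]["content"])

print("seed honored:", texts[0] == texts[1])
```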
## Running

Freeze your prompts in `prompts/sparse.json` and `prompts/dense.json`
before generating anything.
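A quick structural check on the frozen files, assuming each is a flat JSON
array of 30 strings (the schema here is an assumption; adapt it if the real
files differ):

```
import json

with open("prompts/sparse.json") as f:
    sparse = json.load(f)
with open("prompts/dense.json") as f:
    dense = json.load(f)

# Matched-pairs invariant: index i in both files is the same imagined user.
assert len(sparse) == len(dense) == 30
# Heuristic, not a guarantee: a sparse prompt should be much shorter than
# its dense counterpart, since it drops audience/thesis/tone/structure.
for i, (s, d) in enumerate(zip(sparse, dense)):
    assert len(s) < len(d), f"pair {i}: sparse prompt is not shorter"
```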
Then run the full pipeline:
```
uv run python run_all.py
```
Or run steps individually:
```
uv run python generate.py     # LMStudio generations
uv run python embed.py        # sentence embeddings
uv run python similarity.py   # pairwise cosine similarities
uv run python stats.py        # t-test, Mann-Whitney, bootstrap, Cohen's d
uv run python plot.py         # violin plot
```
## Outputs

- `outputs/{sparse,dense}/NN.txt` : raw model completions
- `embeddings/{sparse,dense}.npy` : L2-normalized embedding matrices
- `results/pairwise.csv` : all pairwise similarities
- `results/stats.json` : test statistics and summary
- `results/plot.png` : similarity distribution plot
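Because each embedding matrix has L2-normalized rows, cosine similarity
reduces to a dot product, so the cross-user comparison amounts to something
like this sketch (upper triangle only, so each user pair is counted once):

```
import numpy as np

emb = np.load("embeddings/sparse.npy")   # shape (30, dim), rows L2-normalized
sims = emb @ emb.T                       # cosine similarity == dot product here
iu = np.triu_indices(len(emb), k=1)      # distinct pairs, excluding self-pairs
pairwise = sims[iu]                      # 30 * 29 / 2 = 435 similarities
print(f"mean cross-user similarity: {pairwise.mean():.3f}")
```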
## Interpretation

A positive result: sparse-condition mean pairwise similarity is
meaningfully higher than dense-condition mean similarity, the
bootstrap 95% CI on the difference excludes 0, and Cohen's d is
large (>0.8).
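`stats.py` is the source of truth for the computation; the quantities named
above follow the standard definitions, roughly as below (`a` and `b` are the
sparse- and dense-condition similarity arrays; pooled-SD Cohen's d, percentile
bootstrap on the difference in means):

```
import numpy as np

def cohens_d(a, b):
    # Pooled-standard-deviation form of Cohen's d.
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def bootstrap_ci(a, b, n_boot=10_000, seed=0):
    # Percentile bootstrap 95% CI on the difference in mean similarity.
    rng = np.random.default_rng(seed)
    diffs = [rng.choice(a, size=len(a)).mean() - rng.choice(b, size=len(b)).mean()
             for _ in range(n_boot)]
    return np.percentile(diffs, [2.5, 97.5])
```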
A null or inverted result is also interesting and should be reported
honestly.