cosmetic fixes

This commit is contained in:
Levi Neuwirth 2026-04-24 10:23:29 -04:00
parent 4b7c734d96
commit bd6e38c01d
4 changed files with 26 additions and 1039 deletions

.gitignore

@@ -15,6 +15,10 @@ __pycache__/
# OS
.DS_Store
# Spec (lives outside version control)
experiment.tex
experiment.pdf
# LaTeX build artifacts
*.aux
*.log

README.md

@@ -1,87 +1,34 @@
# Specification Dilemma Experiment
Small empirical probe for the claim that sparse-specification prompts
produce more homogeneous outputs across users than dense-specification
prompts.
## Files
See `experiment.pdf` for the full specification.
## Design note: matched pairs
`experiment.tex` describes a between-groups design with 30 independently-drawn
sparse prompts and 30 independently-drawn dense prompts. The prompts in this
repo follow a tighter variant: **matched pairs**. Each of 30 imagined users
has a fixed underlying intent (audience, thesis, tone, voice, opening move,
structural constraint). `prompts/dense.json[i]` expresses that user's full
intent; `prompts/sparse.json[i]` is what the same user would type when
underspecifying — topic only, in roughly their natural register. The sparse
prompts carry no audience, thesis, tone, or structural specification.
The statistical comparison is unchanged — cross-user pairwise similarity in
each condition — but the two conditions now sample the same population of
underlying intents. This tests the sharper claim: when users with divergent
intents underspecify, outputs converge (priors dominate); when they specify
fully, outputs diverge (intents dominate).
- `config.yaml` — LMStudio endpoint, model, generation and analysis parameters
- `prompts/sparse.json` — 30 sparse prompts
- `prompts/dense.json` — 30 dense prompts (matched to sparse by index)
- `smoke_test.py` — pre-flight: connectivity, seed-honoring, per-generation latency
- `generate.py` — runs completions against LMStudio
- `embed.py` — sentence embeddings
- `similarity.py` — pairwise cosine similarities
- `stats.py` — t-test, Mann-Whitney, bootstrap, Cohen's d
- `plot.py` — violin plot
- `run_all.py` — orchestrator (runs the five pipeline scripts in order)
- `pyproject.toml`, `uv.lock` — uv-managed environment
- `requirements.txt` — pip fallback
- `outputs/{sparse,dense}/NN.txt` — model completions (generated)
- `embeddings/{sparse,dense}.npy` — L2-normalized embedding matrices (generated)
- `results/pairwise.csv`, `results/stats.json`, `results/plot.png` — analysis artifacts (generated)
## Setup
1. Install LMStudio and download a strong instruction-tuned model
(e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct).
2. Start the LMStudio local server (default: localhost:1234).
3. Create the environment and install dependencies with `uv`:
1. Install LMStudio, load a strong instruction-tuned model, start the local server.
2. `uv sync`
3. Edit `config.yaml` for your LMStudio host, port, and model name.
4. `uv run python smoke_test.py` — verifies the endpoint and reports whether `seed` is honored.
```
uv sync
```
(or `pip install -r requirements.txt` inside a venv if not using uv)
4. Edit `config.yaml` if your LMStudio model name or port differs from
the defaults. If LMStudio is on a remote host, point `lmstudio.base_url`
at that host (e.g. `http://<host>:1234/v1`).
5. Smoke-test the endpoint (checks connectivity, seed-honoring, and
approximate per-generation latency):
```
uv run python smoke_test.py
```
## Running
Freeze your prompts in `prompts/sparse.json` and `prompts/dense.json`
before generating anything.
Then run the full pipeline:
## Run
```
uv run python run_all.py
```
Or run steps individually:
```
uv run python generate.py # LMStudio generations
uv run python embed.py # sentence embeddings
uv run python similarity.py # pairwise cosine similarities
uv run python stats.py # t-test, Mann-Whitney, bootstrap, Cohen's d
uv run python plot.py # violin plot
```
## Outputs
- `outputs/{sparse,dense}/NN.txt` : raw model completions
- `embeddings/{sparse,dense}.npy` : L2-normalized embedding matrices
- `results/pairwise.csv` : all pairwise similarities
- `results/stats.json` : test statistics and summary
- `results/plot.png` : similarity distribution plot
## Interpretation
A positive result: sparse-condition mean pairwise similarity is
meaningfully higher than dense-condition mean similarity, the
bootstrap 95% CI on the difference excludes 0, and Cohen's d is
large (>0.8).
A null or inverted result is also interesting and should be reported
honestly.
Or step-by-step: `generate.py` → `embed.py` → `similarity.py` → `stats.py` → `plot.py`.

experiment.pdf: binary file not shown.

experiment.tex

@@ -1,964 +0,0 @@
\documentclass[11pt]{article}
% --- Packages ---
\usepackage[margin=1in]{geometry}
\usepackage{microtype}
\usepackage{parskip}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{titlesec}
\usepackage{fancyhdr}
\usepackage{amsmath}
% --- Hyperref setup ---
\hypersetup{
colorlinks=true,
linkcolor=black,
urlcolor=blue!60!black,
citecolor=black,
pdftitle={The Specification Dilemma: Experiment Specification},
pdfauthor={}
}
% --- Code listing setup ---
\definecolor{codebg}{rgb}{0.97,0.97,0.97}
\definecolor{codekw}{rgb}{0.20,0.20,0.55}
\definecolor{codestr}{rgb}{0.25,0.50,0.25}
\definecolor{codecmt}{rgb}{0.45,0.45,0.45}
\lstdefinestyle{pythonstyle}{
backgroundcolor=\color{codebg},
basicstyle=\ttfamily\footnotesize,
keywordstyle=\color{codekw}\bfseries,
stringstyle=\color{codestr},
commentstyle=\color{codecmt}\itshape,
numbers=left,
numberstyle=\tiny\color{gray},
numbersep=6pt,
frame=single,
framesep=4pt,
rulecolor=\color{gray!30},
breaklines=true,
breakatwhitespace=true,
showstringspaces=false,
columns=fullflexible,
language=Python,
literate=
{->}{{$\rightarrow$}}2
{>=}{{$\geq$}}2
{<=}{{$\leq$}}2,
tabsize=2,
}
\lstdefinestyle{yamlstyle}{
backgroundcolor=\color{codebg},
basicstyle=\ttfamily\footnotesize,
keywordstyle=\color{codekw}\bfseries,
stringstyle=\color{codestr},
commentstyle=\color{codecmt}\itshape,
frame=single,
framesep=4pt,
rulecolor=\color{gray!30},
breaklines=true,
showstringspaces=false,
columns=fullflexible,
tabsize=2,
}
\lstdefinestyle{jsonstyle}{
backgroundcolor=\color{codebg},
basicstyle=\ttfamily\footnotesize,
stringstyle=\color{codestr},
frame=single,
framesep=4pt,
rulecolor=\color{gray!30},
breaklines=true,
showstringspaces=false,
columns=fullflexible,
tabsize=2,
}
\lstset{style=pythonstyle}
% --- Section formatting ---
\titleformat{\section}{\Large\bfseries}{\thesection}{1em}{}
\titleformat{\subsection}{\large\bfseries}{\thesubsection}{1em}{}
\titlespacing*{\section}{0pt}{18pt}{8pt}
\titlespacing*{\subsection}{0pt}{12pt}{6pt}
% --- Header/footer ---
\pagestyle{fancy}
\fancyhf{}
\fancyhead[L]{\small\itshape The Specification Dilemma}
\fancyhead[R]{\small\itshape Experiment Spec}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0.4pt}
% --- Title ---
\title{\textbf{The Specification Dilemma}\\[4pt]
\large Experiment Specification and Scaffolding}
\author{}
\date{April 2026}
\begin{document}
\maketitle
\thispagestyle{empty}
\begin{abstract}
\noindent This document specifies a small empirical probe for the claim that as specification sparsity increases, pairwise semantic similarity across outputs generated from plausibly-varied user prompts increases. The experiment is designed to run locally against a strong open-weights model served by LMStudio, using sentence embeddings to measure output homogeneity in a sparse-vs-dense matched-pairs design, where 30 imagined users with distinct underlying intents each contribute both a sparse and a dense prompt. The document includes the full specification, Python scaffolding for generation and analysis, prompt files, and a project layout ready to execute.
\end{abstract}
\vspace{1em}
\hrule
\vspace{1em}
\tableofcontents
\newpage
% =====================================================================
\section{Hypothesis}
% =====================================================================
As specification sparsity increases (fewer tokens / less detail in the prompt), pairwise semantic similarity across outputs generated from plausibly-varied user prompts increases. Equivalently: sparse specification produces homogenized output across users, even when those users phrase their requests differently.
This probes the core empirical claim of the essay ``The Specification Dilemma'': that the mechanism by which inference-heavy collaboration produces homogeneity is the convergence of outputs onto the model's shared priors when human specification is insufficient to push the model off its modes.
% =====================================================================
\section{Design}
% =====================================================================
A two-condition design with \textbf{matched pairs at the user level}, comparing output similarity distributions in a \textbf{sparse specification} condition versus a \textbf{dense specification} condition.
The unit of variation is \emph{the user}. Each condition contains $N$ prompts, and the two conditions share the same $N$ imagined users. Each user has a fixed underlying intent (audience, thesis, tone, voice, opening move, structural constraint), and that intent is expressed twice: once as a sparse prompt (topic only, in the user's natural register) and once as a dense prompt (the intent in full). The sparse condition simulates those 30 users when they vibe; the dense condition simulates the same 30 users when they specify. Controlling the underlying intents across conditions tightens the contrast: any divergence between the two similarity distributions is attributable to specification completeness rather than to differences between the populations of users sampled.
\subsection{Task selection}
One task: \textbf{``Write the opening 300 words of a blog post about remote work.''} Rationale: a task many users actually perform with LLMs, enough creative latitude for homogenization to matter, output length tractable for embedding-based similarity.
\subsection{Sample size}
$N = 30$ per condition (60 prompts total). This yields $\binom{30}{2} = 435$ pairwise comparisons per condition, which is enough for a reasonably tight confidence interval on mean similarity.
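For concreteness, the pairwise-comparison count per condition is
\[
\binom{N}{2} = \frac{N(N-1)}{2} = \frac{30 \cdot 29}{2} = 435 .
\]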
\subsection{Prompt generation}
\textbf{Procedure.} First, write $N = 30$ underlying intents --- one per imagined user, spanning distinct audiences, theses, tones, voices, opening moves, and structural constraints. Then, for each intent, produce a matched pair of prompts:
\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt]
\item \textbf{Sparse prompt} (5--20 tokens, one sentence): what that user would type when underspecifying --- topic only, in their natural register (formal vs.\ casual, polite vs.\ terse, with or without a length cue). The sparse prompt must carry no audience, thesis, tone, voice, or structural specification. Smuggling those in would collapse the contrast the experiment is trying to measure.
\item \textbf{Dense prompt} (150--300 tokens): the intent in full --- target audience, thesis, tone, structural choices, vocabulary register, things to avoid, and author voice.
\end{itemize}
The two prompt files preserve order: \texttt{sparse.json[i]} and \texttt{dense.json[i]} are the same imagined user's two expressions. Different imagined users should specify \emph{different} target audiences, tones, and angles, so the 30 dense prompts span the space of plausible divergent intents.
\textbf{Pre-registration discipline:} write and freeze all 30 intents and all 60 prompts before running any generation. Do not iterate on prompts after seeing outputs.
\subsection{Generation parameters}
\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt]
\item \textbf{Model:} one strong open-weights instruction-tuned model on LMStudio (e.g.\ \texttt{Qwen2.5-72B-Instruct} or \texttt{Llama-3.3-70B-Instruct}), full precision or Q8. Held constant across all 60 generations.
\item \textbf{Temperature:} 0.7
\item \textbf{Top-p:} 0.95
\item \textbf{Max tokens:} 500
\item \textbf{Seed:} per-prompt deterministic seed (\texttt{seed = prompt\_index}) for reproducibility
\item \textbf{Generations per prompt:} 1
\end{itemize}
\subsection{Similarity measurement}
\textbf{Embedding model:} \texttt{sentence-transformers/all-mpnet-base-v2} as the default; \texttt{BAAI/bge-large-en-v1.5} as an optional robustness check.
\textbf{Metric:} cosine similarity between embedding vectors of full 300-word outputs.
\textbf{Aggregation:} for each condition, compute all 435 pairwise cosine similarities, producing two distributions of similarity scores.
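Because the embedding matrices are L2-normalized at encode time (see \texttt{embed.py}), cosine similarity reduces to a dot product: for output embeddings $u$ and $v$,
\[
\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = u \cdot v
\qquad \text{when } \|u\| = \|v\| = 1 .
\]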
\subsection{Statistical analysis}
\textbf{Primary test:} two-sample t-test (or Mann--Whitney U if distributions are non-normal) comparing sparse and dense similarity distributions.
\textbf{Reported statistics:} mean similarity per condition, standard deviation, test statistic, $p$-value, and Cohen's $d$.
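Cohen's $d$ here is the pooled-standard-deviation version implemented in \texttt{stats.py} below:
\[
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\]
where $\bar{x}_i$, $s_i^2$, and $n_i$ are the mean, variance, and count of pairwise similarities in each condition.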
\textbf{Secondary visualization:} overlaid violin plots of the two similarity distributions. More rhetorically useful in the essay than the $p$-value alone.
\textbf{Dependence caveat:} the 435 pairwise similarities per condition are not independent (each output appears in 29 pairs), so the effective sample size is smaller than 435. A bootstrap at the output level (resample outputs with replacement, recompute mean pairwise similarity, repeat $10{,}000$ times) gives a more conservative interval. Both the naive t-test and the bootstrap CI should be reported; the scaffolding below computes both.
\subsection{Expected result and falsification criteria}
\textbf{If the hypothesis holds:} sparse-condition mean pairwise similarity is meaningfully higher than dense-condition mean similarity, with large effect size (Cohen's $d > 0.8$) and $p < 0.01$.
\textbf{What would weaken or falsify it:}
\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt]
\item Similar means across conditions (no homogenization effect).
\item Small effect size even if statistically significant.
\item Dense condition \emph{higher} than sparse --- would be surprising and worth investigating (possibly an artifact of shared specification language leaking into outputs).
\end{itemize}
\subsection{Optional robustness checks}
In priority order, if time permits:
\begin{enumerate}[leftmargin=*,itemsep=2pt,topsep=4pt]
\item Re-run with a second task to check task-generality.
\item Re-run with a different embedding model to check metric-robustness.
\item Re-run with a different base LLM to check model-generality.
\end{enumerate}
% =====================================================================
\section{Project Layout}
% =====================================================================
\begin{lstlisting}[style=yamlstyle,language={}]
experiment/
prompts/
sparse.json # 30 sparse prompts, matched to dense by index
dense.json # 30 dense prompts, matched to sparse by index
outputs/
sparse/ # 30 .txt files, one per generation
dense/ # 30 .txt files, one per generation
embeddings/
sparse.npy # (30, D) embedding matrix
dense.npy # (30, D) embedding matrix
results/
pairwise.csv # all pairwise similarities, condition labeled
stats.json # test statistics and summary metrics
plot.png # violin + histogram comparison
config.yaml # model name, temperature, paths
smoke_test.py # pre-flight: connectivity, seed, latency
generate.py # runs LMStudio generations
embed.py # computes sentence embeddings
similarity.py # computes pairwise cosine similarities
stats.py # t-test, Mann-Whitney, bootstrap, effect size
plot.py # violin plot of similarity distributions
run_all.py # orchestrator: runs the full pipeline
pyproject.toml # uv-managed environment
requirements.txt # pip fallback
README.md
\end{lstlisting}
\paragraph{Dependencies.}
\texttt{openai} (LMStudio client), \texttt{sentence-transformers}, \texttt{numpy}, \texttt{scipy}, \texttt{matplotlib}, \texttt{seaborn}, \texttt{pandas}, \texttt{pyyaml}, \texttt{tqdm}.
\paragraph{Runtime estimate.}
Generation: 60 prompts $\times$ $\sim$15--30s each $\approx$ 15--30 min on strong hardware. Embedding + analysis + plotting: $<$1 min. Total wall time: well under an hour once prompts are frozen.
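\paragraph{Smoke test sketch.}
The layout lists \texttt{smoke\_test.py}, but this document does not reproduce its listing. The following is a minimal sketch of the intended pre-flight check --- connectivity, seed-honoring, and approximate per-generation latency; the probe prompt, token budget, and exact reporting are illustrative assumptions rather than the shipped script.
\begin{lstlisting}
"""Pre-flight check: connectivity, seed-honoring, approximate latency.

Illustrative sketch only; the shipped smoke_test.py may differ in detail.
"""
from __future__ import annotations

import time

import yaml
from openai import OpenAI


def load_config(path: str = "config.yaml") -> dict:
    with open(path, "r") as f:
        return yaml.safe_load(f)


def main() -> None:
    cfg = load_config()
    client = OpenAI(
        base_url=cfg["lmstudio"]["base_url"],
        api_key=cfg["lmstudio"]["api_key"],
    )
    model = cfg["lmstudio"]["model"]

    # Connectivity: the LMStudio server should list at least one loaded model.
    available = [m.id for m in client.models.list().data]
    print(f"Server reachable. Models: {available}")

    def probe(seed: int) -> tuple[str, float]:
        """One short generation; returns (text, elapsed seconds)."""
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": "Write one sentence about remote work."}],
            temperature=cfg["generation"]["temperature"],
            top_p=cfg["generation"]["top_p"],
            max_tokens=60,
            seed=seed,
        )
        return response.choices[0].message.content or "", time.perf_counter() - start

    # Seed-honoring: two calls with the same seed should return identical text.
    text_a, t_a = probe(seed=0)
    text_b, t_b = probe(seed=0)
    print(f"Seed honored: {text_a == text_b}")
    print(f"Approximate latency per short generation: {(t_a + t_b) / 2:.1f}s")


if __name__ == "__main__":
    main()
\end{lstlisting}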
\newpage
% =====================================================================
\section{Configuration}
% =====================================================================
\subsection{\texttt{config.yaml}}
\begin{lstlisting}[style=yamlstyle,language={}]
# LMStudio server settings
lmstudio:
base_url: "http://localhost:1234/v1"
api_key: "lm-studio" # placeholder; LMStudio ignores this
model: "qwen2.5-72b-instruct" # name as it appears in LMStudio
# Generation parameters
generation:
temperature: 0.7
top_p: 0.95
max_tokens: 500
# Embedding model
embedding:
model: "sentence-transformers/all-mpnet-base-v2"
# Paths (relative to project root)
paths:
prompts_dir: "prompts"
outputs_dir: "outputs"
embeddings_dir: "embeddings"
results_dir: "results"
# Analysis
analysis:
bootstrap_iterations: 10000
random_seed: 42
\end{lstlisting}
\subsection{\texttt{requirements.txt}}
\begin{lstlisting}[style=yamlstyle,language={}]
openai>=1.40.0
sentence-transformers>=3.0.0
numpy>=1.26.0
scipy>=1.13.0
pandas>=2.2.0
matplotlib>=3.8.0
seaborn>=0.13.0
pyyaml>=6.0
tqdm>=4.66.0
\end{lstlisting}
% =====================================================================
\section{Prompt Files}
% =====================================================================
The 30 sparse prompts below are the frozen set shipped with this experiment. \texttt{sparse.json[i]} and \texttt{dense.json[i]} correspond to the same imagined user: the dense prompt expresses that user's full intent, and the sparse prompt is what the same user would type when underspecifying. The sparse set below carries no audience, thesis, tone, voice, or structural direction; only register, punctuation, and length cues differ across users. Freeze your final set before running any generations.
\subsection{\texttt{prompts/sparse.json}}
\begin{lstlisting}[style=jsonstyle,language={}]
[
"remote work blog post intro, 300 words",
"write the opening of a blog post on remote work",
"Draft an opening for a blog post about remote work.",
"can you write a blog post intro about remote work",
"blog intro about working from home, 300 words",
"need a remote work blog intro, ~300 words",
"Please draft the opening 300 words of a blog post about remote work.",
"I'd like the opening of a blog post about remote work, around 300 words.",
"remote work blog post opener, 300 words",
"blog intro, remote work, 300 words",
"first 300 words of a blog post on remote work",
"opening 300 words, blog post about remote work",
"hey can you write a remote work blog intro",
"write the first 300 words of a blog on remote work",
"Draft the opening paragraph of a blog post on remote work.",
"Please write the opening of a blog post about remote work, approximately 300 words.",
"Write the opening of a blog post about remote work, around 300 words.",
"blog post, remote work, first 300 words",
"could you write the intro to a blog post on remote work",
"Write the opening 300 words of a blog post about remote work.",
"opening for a blog post on remote work, ~300 words",
"write a blog intro about remote work please",
"Please draft the opening of a blog post about remote work.",
"blog post opening on remote work, about 300 words",
"Draft opening 300 words --- blog post, remote work.",
"write me a blog intro about remote work",
"Please produce the opening of a blog post on remote work, approximately 300 words.",
"can u write a remote work blog intro",
"need the opener for a blog post about remote work, ~300 words",
"Write the opening of a blog post about remote work."
]
\end{lstlisting}
\subsection{\texttt{prompts/dense.json} (example entry; see repo for all 30)}
The dense prompts should each push the model in a different direction. The first entry, shown below, pairs with the first sparse prompt above and illustrates the shape:
\begin{lstlisting}[style=jsonstyle,language={}]
[
"Write the opening 300 words of a blog post about remote work. The target audience is mid-career software engineers at 50-500 person startups who have worked remotely for 3+ years and are tired of both 'remote work is utopia' and 'return to office' takes. The tone should be dry, slightly weary, and specific rather than abstract. Open with a concrete observation about a small, texture-of-daily-life detail rather than a statistic or rhetorical question. Avoid the words 'unprecedented', 'new normal', 'journey', 'landscape', and 'game-changer'. The thesis should be that remote work's real cost is not productivity or culture but the erosion of ambient professional development - the kind of learning that happens when you overhear a senior engineer debug something. Voice should resemble a technical writer who also reads literary essays. No bullet points. End the opening paragraph on a sentence that turns the observation into a question the rest of the post will answer."
// ... 29 more, each specifying a different audience, thesis, tone, and
// structural constraint. See guidance below.
]
\end{lstlisting}
\paragraph{Guidance for producing the 30 intents and matched prompts.}
Vary each of these axes independently across the 30 intents: target audience (engineers, designers, HR leaders, small-business owners, parents, new grads, academics, tradespeople, freelancers, founders, managers, journalists, lawyers, clinicians, etc.); thesis (productivity, loneliness, asynchrony, hiring, real estate, inequality, rituals, surveillance, accessibility, capex amortization, etc.); tone (dry, earnest, skeptical, celebratory, investigative, personal-essay, confessional, technocratic, etc.); opening move (anecdote, statistic-that-turns, quote, scene, counter-intuitive claim, literary reference, quoted memo, etc.); voice (technical writer, essayist, journalist, first-person blogger, trade-publication editor, policy analyst, etc.); structural constraint (no bullets, one-sentence paragraphs, tight lede, delayed thesis, circular, etc.). For each intent, write the dense prompt that expresses it fully; then write the sparse prompt as what that same user would type when underspecifying --- topic-only, in their natural register, carrying none of the specification content. The goal is that if the mechanism holds, the sparse-condition outputs will converge despite 30 different underlying intents, while the dense-condition outputs will diverge because the specifications actually push the model toward different regions of output space.
\newpage
% =====================================================================
\section{Generation Script}
% =====================================================================
\subsection{\texttt{generate.py}}
\begin{lstlisting}
"""Generate completions for sparse and dense prompts via LMStudio.
LMStudio exposes an OpenAI-compatible server (default: localhost:1234).
Start the server from LMStudio's "Local Server" tab before running.
"""
from __future__ import annotations
import json
from pathlib import Path
import yaml
from openai import OpenAI
from tqdm import tqdm
def load_config(path: str = "config.yaml") -> dict:
with open(path, "r") as f:
return yaml.safe_load(f)
def make_client(cfg: dict) -> OpenAI:
return OpenAI(
base_url=cfg["lmstudio"]["base_url"],
api_key=cfg["lmstudio"]["api_key"],
)
def generate_one(
client: OpenAI,
model: str,
prompt: str,
temperature: float,
top_p: float,
max_tokens: int,
seed: int,
) -> str:
"""Single completion. Returns the assistant message content."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
top_p=top_p,
max_tokens=max_tokens,
seed=seed,
)
return response.choices[0].message.content or ""
def run_condition(
client: OpenAI,
cfg: dict,
condition: str,
) -> None:
prompts_path = Path(cfg["paths"]["prompts_dir"]) / f"{condition}.json"
outputs_dir = Path(cfg["paths"]["outputs_dir"]) / condition
outputs_dir.mkdir(parents=True, exist_ok=True)
with open(prompts_path, "r") as f:
prompts = json.load(f)
gen_cfg = cfg["generation"]
model = cfg["lmstudio"]["model"]
for i, prompt in enumerate(tqdm(prompts, desc=f"{condition}")):
out_file = outputs_dir / f"{i:02d}.txt"
if out_file.exists():
continue # resume support
text = generate_one(
client=client,
model=model,
prompt=prompt,
temperature=gen_cfg["temperature"],
top_p=gen_cfg["top_p"],
max_tokens=gen_cfg["max_tokens"],
seed=i,
)
out_file.write_text(text, encoding="utf-8")
def main() -> None:
cfg = load_config()
client = make_client(cfg)
for condition in ("sparse", "dense"):
run_condition(client, cfg, condition)
print("Generation complete.")
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{Embedding Script}
% =====================================================================
\subsection{\texttt{embed.py}}
\begin{lstlisting}
"""Compute sentence embeddings for each generation in each condition."""
from __future__ import annotations
from pathlib import Path
import numpy as np
import yaml
from sentence_transformers import SentenceTransformer
def load_config(path: str = "config.yaml") -> dict:
with open(path, "r") as f:
return yaml.safe_load(f)
def load_outputs(outputs_dir: Path) -> list[str]:
"""Load all .txt outputs from a condition directory, sorted by filename."""
files = sorted(outputs_dir.glob("*.txt"))
return [f.read_text(encoding="utf-8") for f in files]
def embed_condition(
model: SentenceTransformer,
texts: list[str],
) -> np.ndarray:
"""Return (N, D) embedding matrix. L2-normalized for cosine similarity."""
embeddings = model.encode(
texts,
batch_size=8,
show_progress_bar=True,
convert_to_numpy=True,
normalize_embeddings=True,
)
return embeddings
def main() -> None:
cfg = load_config()
model = SentenceTransformer(cfg["embedding"]["model"])
outputs_root = Path(cfg["paths"]["outputs_dir"])
emb_root = Path(cfg["paths"]["embeddings_dir"])
emb_root.mkdir(parents=True, exist_ok=True)
for condition in ("sparse", "dense"):
texts = load_outputs(outputs_root / condition)
if not texts:
print(f"No outputs found for {condition}; skipping.")
continue
print(f"Embedding {len(texts)} {condition} outputs...")
embeddings = embed_condition(model, texts)
np.save(emb_root / f"{condition}.npy", embeddings)
print(f"Saved {condition}.npy with shape {embeddings.shape}")
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{Similarity Computation}
% =====================================================================
\subsection{\texttt{similarity.py}}
\begin{lstlisting}
"""Compute pairwise cosine similarities within each condition."""
from __future__ import annotations
from itertools import combinations
from pathlib import Path
import numpy as np
import pandas as pd
import yaml
def load_config(path: str = "config.yaml") -> dict:
with open(path, "r") as f:
return yaml.safe_load(f)
def pairwise_cosine(embeddings: np.ndarray) -> tuple[np.ndarray, list[tuple[int, int]]]:
"""Return (similarities, index_pairs) for all i<j pairs.
Assumes embeddings are L2-normalized, so cosine = dot product.
"""
n = embeddings.shape[0]
pairs = list(combinations(range(n), 2))
sims = np.array([
float(embeddings[i] @ embeddings[j]) for i, j in pairs
])
return sims, pairs
def main() -> None:
cfg = load_config()
emb_root = Path(cfg["paths"]["embeddings_dir"])
results_root = Path(cfg["paths"]["results_dir"])
results_root.mkdir(parents=True, exist_ok=True)
rows = []
for condition in ("sparse", "dense"):
emb_path = emb_root / f"{condition}.npy"
if not emb_path.exists():
print(f"Missing embeddings for {condition}; skipping.")
continue
embeddings = np.load(emb_path)
sims, pairs = pairwise_cosine(embeddings)
for (i, j), s in zip(pairs, sims):
rows.append({
"condition": condition,
"i": i,
"j": j,
"cosine": s,
})
print(
f"{condition}: n_outputs={embeddings.shape[0]}, "
f"n_pairs={len(sims)}, mean={sims.mean():.4f}, "
f"std={sims.std(ddof=1):.4f}"
)
df = pd.DataFrame(rows)
df.to_csv(results_root / "pairwise.csv", index=False)
print(f"Saved {results_root / 'pairwise.csv'}")
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{Statistical Analysis}
% =====================================================================
\subsection{\texttt{stats.py}}
\begin{lstlisting}
"""Statistical tests and summary metrics for the similarity comparison.
Reports:
- Per-condition descriptive statistics
- Naive two-sample t-test on pairwise similarities
- Mann-Whitney U (nonparametric check)
- Cohen's d effect size
- Output-level bootstrap 95% CI for the difference in mean similarity
(corrects for pairwise dependence)
"""
from __future__ import annotations
import json
from itertools import combinations
from pathlib import Path
import numpy as np
import pandas as pd
import yaml
from scipy import stats
def load_config(path: str = "config.yaml") -> dict:
with open(path, "r") as f:
return yaml.safe_load(f)
def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
"""Pooled-SD Cohen's d for two independent samples."""
na, nb = len(a), len(b)
va, vb = a.var(ddof=1), b.var(ddof=1)
pooled_sd = np.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
return (a.mean() - b.mean()) / pooled_sd
def mean_pairwise(embeddings: np.ndarray) -> float:
"""Mean pairwise cosine for an L2-normalized embedding matrix."""
n = embeddings.shape[0]
sims = [
float(embeddings[i] @ embeddings[j])
for i, j in combinations(range(n), 2)
]
return float(np.mean(sims))
def bootstrap_diff(
sparse_emb: np.ndarray,
dense_emb: np.ndarray,
n_iter: int,
rng: np.random.Generator,
) -> tuple[float, float, np.ndarray]:
"""Output-level bootstrap of (mean_sparse - mean_dense).
Resamples outputs (not pairs) with replacement, recomputes mean
pairwise similarity in each condition, returns 95% CI.
"""
n_s, n_d = sparse_emb.shape[0], dense_emb.shape[0]
diffs = np.empty(n_iter, dtype=float)
for k in range(n_iter):
idx_s = rng.integers(0, n_s, size=n_s)
idx_d = rng.integers(0, n_d, size=n_d)
ms = mean_pairwise(sparse_emb[idx_s])
md = mean_pairwise(dense_emb[idx_d])
diffs[k] = ms - md
lo, hi = np.percentile(diffs, [2.5, 97.5])
return float(lo), float(hi), diffs
def main() -> None:
cfg = load_config()
emb_root = Path(cfg["paths"]["embeddings_dir"])
results_root = Path(cfg["paths"]["results_dir"])
results_root.mkdir(parents=True, exist_ok=True)
sparse_emb = np.load(emb_root / "sparse.npy")
dense_emb = np.load(emb_root / "dense.npy")
df = pd.read_csv(results_root / "pairwise.csv")
sparse_sims = df.loc[df["condition"] == "sparse", "cosine"].to_numpy()
dense_sims = df.loc[df["condition"] == "dense", "cosine"].to_numpy()
# Descriptive
desc = {
"sparse": {
"n_outputs": int(sparse_emb.shape[0]),
"n_pairs": int(len(sparse_sims)),
"mean": float(sparse_sims.mean()),
"std": float(sparse_sims.std(ddof=1)),
"median": float(np.median(sparse_sims)),
},
"dense": {
"n_outputs": int(dense_emb.shape[0]),
"n_pairs": int(len(dense_sims)),
"mean": float(dense_sims.mean()),
"std": float(dense_sims.std(ddof=1)),
"median": float(np.median(dense_sims)),
},
}
# Naive t-test (note: pairwise dependence means this is optimistic)
t_stat, t_p = stats.ttest_ind(sparse_sims, dense_sims, equal_var=False)
# Nonparametric check
u_stat, u_p = stats.mannwhitneyu(
sparse_sims, dense_sims, alternative="two-sided"
)
# Effect size
d = cohens_d(sparse_sims, dense_sims)
# Output-level bootstrap (the honest test)
rng = np.random.default_rng(cfg["analysis"]["random_seed"])
lo, hi, _ = bootstrap_diff(
sparse_emb,
dense_emb,
n_iter=cfg["analysis"]["bootstrap_iterations"],
rng=rng,
)
summary = {
"descriptive": desc,
"naive_welch_t_test": {"t": float(t_stat), "p": float(t_p)},
"mann_whitney_u": {"u": float(u_stat), "p": float(u_p)},
"cohens_d": float(d),
"bootstrap_diff_in_means": {
"point_estimate": float(sparse_sims.mean() - dense_sims.mean()),
"ci_low": lo,
"ci_high": hi,
"n_iter": int(cfg["analysis"]["bootstrap_iterations"]),
},
}
with open(results_root / "stats.json", "w") as f:
json.dump(summary, f, indent=2)
# Pretty print
print(json.dumps(summary, indent=2))
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{Plotting}
% =====================================================================
\subsection{\texttt{plot.py}}
\begin{lstlisting}
"""Violin + strip plot of pairwise similarity distributions per condition."""
from __future__ import annotations
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import yaml
def load_config(path: str = "config.yaml") -> dict:
with open(path, "r") as f:
return yaml.safe_load(f)
def main() -> None:
cfg = load_config()
results_root = Path(cfg["paths"]["results_dir"])
df = pd.read_csv(results_root / "pairwise.csv")
sns.set_theme(style="whitegrid", context="talk")
fig, ax = plt.subplots(figsize=(8, 6))
sns.violinplot(
data=df,
x="condition",
y="cosine",
order=["sparse", "dense"],
inner="quartile",
cut=0,
ax=ax,
)
sns.stripplot(
data=df,
x="condition",
y="cosine",
order=["sparse", "dense"],
color="black",
alpha=0.25,
size=2,
ax=ax,
)
ax.set_xlabel("Specification condition")
ax.set_ylabel("Pairwise cosine similarity")
ax.set_title("Output similarity by specification density")
fig.tight_layout()
fig.savefig(results_root / "plot.png", dpi=200)
print(f"Saved {results_root / 'plot.png'}")
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{Pipeline Orchestrator}
% =====================================================================
\subsection{\texttt{run\_all.py}}
\begin{lstlisting}
"""Run the full pipeline end-to-end.
Usage:
python run_all.py
"""
from __future__ import annotations
import subprocess
import sys
STEPS = [
("Generating completions via LMStudio", "generate.py"),
("Embedding outputs", "embed.py"),
("Computing pairwise similarities", "similarity.py"),
("Running statistical tests", "stats.py"),
("Plotting", "plot.py"),
]
def main() -> None:
for title, script in STEPS:
print(f"\n=== {title} ({script}) ===")
result = subprocess.run([sys.executable, script])
if result.returncode != 0:
print(f"Step failed: {script}")
sys.exit(result.returncode)
print("\nPipeline complete. See results/ for outputs.")
if __name__ == "__main__":
main()
\end{lstlisting}
\newpage
% =====================================================================
\section{README}
% =====================================================================
\subsection{\texttt{README.md}}
\begin{lstlisting}[style=yamlstyle,language={}]
# Specification Dilemma Experiment
Small empirical probe for the claim that sparse-specification prompts
produce more homogeneous outputs across users than dense-specification
prompts.
See experiment.pdf for the full specification.
## Design note: matched pairs
The prompts in this repo follow a matched-pairs structure. Each of 30
imagined users has a fixed underlying intent (audience, thesis, tone,
voice, opening move, structural constraint). prompts/dense.json[i]
expresses that user's full intent; prompts/sparse.json[i] is what the
same user would type when underspecifying -- topic only, in roughly
their natural register. The sparse prompts carry no audience, thesis,
tone, or structural specification.
The statistical comparison is unchanged -- cross-user pairwise
similarity in each condition -- but the two conditions now sample the
same population of underlying intents. This tests the sharper claim:
when users with divergent intents underspecify, outputs converge
(priors dominate); when they specify fully, outputs diverge (intents
dominate).
## Setup
1. Install LMStudio and download a strong instruction-tuned model
(e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct).
2. Start the LMStudio local server (default: localhost:1234).
3. Create the environment and install dependencies with uv:
uv sync
(or pip install -r requirements.txt inside a venv if not using uv)
4. Edit config.yaml if your LMStudio model name or port differs from
the defaults. If LMStudio is on a remote host, point
lmstudio.base_url at that host (e.g. http://<host>:1234/v1).
5. Smoke-test the endpoint (checks connectivity, seed-honoring, and
approximate per-generation latency):
uv run python smoke_test.py
## Running
Freeze your prompts in prompts/sparse.json and prompts/dense.json
before generating anything.
Then run the full pipeline:
uv run python run_all.py
Or run steps individually:
uv run python generate.py # LMStudio generations
uv run python embed.py # sentence embeddings
uv run python similarity.py # pairwise cosine similarities
uv run python stats.py # t-test, Mann-Whitney, bootstrap, Cohen's d
uv run python plot.py # violin plot
## Outputs
- outputs/{sparse,dense}/NN.txt : raw model completions
- embeddings/{sparse,dense}.npy : L2-normalized embedding matrices
- results/pairwise.csv : all pairwise similarities
- results/stats.json : test statistics and summary
- results/plot.png : similarity distribution plot
## Interpretation
A positive result: sparse-condition mean pairwise similarity is
meaningfully higher than dense-condition mean similarity, the
bootstrap 95% CI on the difference excludes 0, and Cohen's d is
large (>0.8).
A null or inverted result is also interesting and should be reported
honestly.
\end{lstlisting}
% =====================================================================
\section{Essay Integration Notes}
% =====================================================================
The experiment should occupy roughly 300--500 words in the essay itself, including setup, result, and caveat. One paragraph describing the design, one paragraph reporting the result with the plot, and one sentence acknowledging limitations.
The honest framing, to place in a footnote:
\begin{quote}
\small This is a single-task, single-model probe with one embedding-based similarity metric. A fuller treatment would need to show the effect holds across tasks, models, and metrics. I ran this to check whether my intuition survived contact with data; report whichever result you find.
\end{quote}
If the experiment's word budget starts expanding past 500 words, it has begun competing with the argument rather than serving it. The mechanism section and the ``why this is worse than individual decline'' section are the essay's center of gravity; the experiment is a short piece of evidentiary scaffolding that lets those sections claim more than pure argument would permit.
\end{document}