diff --git a/.gitignore b/.gitignore index fd8ada6..d0d4e76 100644 --- a/.gitignore +++ b/.gitignore @@ -15,6 +15,10 @@ __pycache__/ # OS .DS_Store +# Spec (lives outside version control) +experiment.tex +experiment.pdf + # LaTeX build artifacts *.aux *.log diff --git a/README.md b/README.md index 4a14a15..25ce361 100644 --- a/README.md +++ b/README.md @@ -1,87 +1,34 @@ # Specification Dilemma Experiment -Small empirical probe for the claim that sparse-specification prompts -produce more homogeneous outputs across users than dense-specification -prompts. +## Files -See `experiment.pdf` for the full specification. - -## Design note: matched pairs - -`experiment.tex` describes a between-groups design with 30 independently-drawn -sparse prompts and 30 independently-drawn dense prompts. The prompts in this -repo follow a tighter variant: **matched pairs**. Each of 30 imagined users -has a fixed underlying intent (audience, thesis, tone, voice, opening move, -structural constraint). `prompts/dense.json[i]` expresses that user's full -intent; `prompts/sparse.json[i]` is what the same user would type when -underspecifying — topic only, in roughly their natural register. The sparse -prompts carry no audience, thesis, tone, or structural specification. - -The statistical comparison is unchanged — cross-user pairwise similarity in -each condition — but the two conditions now sample the same population of -underlying intents. This tests the sharper claim: when users with divergent -intents underspecify, outputs converge (priors dominate); when they specify -fully, outputs diverge (intents dominate). 
+- `config.yaml` — LMStudio endpoint, model, generation and analysis parameters +- `prompts/sparse.json` — 30 sparse prompts +- `prompts/dense.json` — 30 dense prompts (matched to sparse by index) +- `smoke_test.py` — pre-flight: connectivity, seed-honoring, per-generation latency +- `generate.py` — runs completions against LMStudio +- `embed.py` — sentence embeddings +- `similarity.py` — pairwise cosine similarities +- `stats.py` — t-test, Mann-Whitney, bootstrap, Cohen's d +- `plot.py` — violin plot +- `run_all.py` — orchestrator (runs the five pipeline scripts in order) +- `pyproject.toml`, `uv.lock` — uv-managed environment +- `requirements.txt` — pip fallback +- `outputs/{sparse,dense}/NN.txt` — model completions (generated) +- `embeddings/{sparse,dense}.npy` — L2-normalized embedding matrices (generated) +- `results/pairwise.csv`, `results/stats.json`, `results/plot.png` — analysis artifacts (generated) ## Setup -1. Install LMStudio and download a strong instruction-tuned model - (e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct). -2. Start the LMStudio local server (default: localhost:1234). -3. Create the environment and install dependencies with `uv`: +1. Install LMStudio, load a strong instruction-tuned model, start the local server. +2. `uv sync` +3. Edit `config.yaml` for your LMStudio host, port, and model name. +4. `uv run python smoke_test.py` — verifies the endpoint and reports whether `seed` is honored. - ``` - uv sync - ``` - - (or `pip install -r requirements.txt` inside a venv if not using uv) - -4. Edit `config.yaml` if your LMStudio model name or port differs from - the defaults. If LMStudio is on a remote host, point `lmstudio.base_url` - at that host (e.g. `http://<host>:1234/v1`). - -5.
Smoke-test the endpoint (checks connectivity, seed-honoring, and - approximate per-generation latency): - - ``` - uv run python smoke_test.py - ``` - -## Running - -Freeze your prompts in `prompts/sparse.json` and `prompts/dense.json` -before generating anything. - -Then run the full pipeline: +## Run ``` uv run python run_all.py ``` -Or run steps individually: - -``` -uv run python generate.py # LMStudio generations -uv run python embed.py # sentence embeddings -uv run python similarity.py # pairwise cosine similarities -uv run python stats.py # t-test, Mann-Whitney, bootstrap, Cohen's d -uv run python plot.py # violin plot -``` - -## Outputs - -- `outputs/{sparse,dense}/NN.txt` : raw model completions -- `embeddings/{sparse,dense}.npy` : L2-normalized embedding matrices -- `results/pairwise.csv` : all pairwise similarities -- `results/stats.json` : test statistics and summary -- `results/plot.png` : similarity distribution plot - -## Interpretation - -A positive result: sparse-condition mean pairwise similarity is -meaningfully higher than dense-condition mean similarity, the -bootstrap 95% CI on the difference excludes 0, and Cohen's d is -large (>0.8). - -A null or inverted result is also interesting and should be reported -honestly. +Or step-by-step: `generate.py` → `embed.py` → `similarity.py` → `stats.py` → `plot.py`. 
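The statistic the pipeline ultimately compares is mean pairwise cosine similarity within each condition. A minimal pure-Python sketch of that quantity on toy vectors (illustration only; the real pipeline embeds the outputs with sentence-transformers first):

```python
from itertools import combinations
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_pairwise(vectors: list[list[float]]) -> float:
    """Mean cosine over all i < j pairs: the per-condition statistic."""
    sims = [cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]
    return sum(sims) / len(sims)

# Toy "conditions": homogeneous outputs cluster; divergent ones spread out.
homogeneous = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.1]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise(homogeneous) > mean_pairwise(diverse))  # True
```

A positive result corresponds to the sparse condition looking like `homogeneous` and the dense condition looking like `diverse`.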
diff --git a/experiment.pdf b/experiment.pdf deleted file mode 100644 index dd31a0d..0000000 Binary files a/experiment.pdf and /dev/null differ diff --git a/experiment.tex b/experiment.tex deleted file mode 100644 index 3fc05bf..0000000 --- a/experiment.tex +++ /dev/null @@ -1,964 +0,0 @@ -\documentclass[11pt]{article} - -% --- Packages --- -\usepackage[margin=1in]{geometry} -\usepackage{microtype} -\usepackage{parskip} -\usepackage{enumitem} -\usepackage{hyperref} -\usepackage{xcolor} -\usepackage{listings} -\usepackage{titlesec} -\usepackage{fancyhdr} -\usepackage{amsmath} - -% --- Hyperref setup --- -\hypersetup{ - colorlinks=true, - linkcolor=black, - urlcolor=blue!60!black, - citecolor=black, - pdftitle={The Specification Dilemma: Experiment Specification}, - pdfauthor={} -} - -% --- Code listing setup --- -\definecolor{codebg}{rgb}{0.97,0.97,0.97} -\definecolor{codekw}{rgb}{0.20,0.20,0.55} -\definecolor{codestr}{rgb}{0.25,0.50,0.25} -\definecolor{codecmt}{rgb}{0.45,0.45,0.45} - -\lstdefinestyle{pythonstyle}{ - backgroundcolor=\color{codebg}, - basicstyle=\ttfamily\footnotesize, - keywordstyle=\color{codekw}\bfseries, - stringstyle=\color{codestr}, - commentstyle=\color{codecmt}\itshape, - numbers=left, - numberstyle=\tiny\color{gray}, - numbersep=6pt, - frame=single, - framesep=4pt, - rulecolor=\color{gray!30}, - breaklines=true, - breakatwhitespace=true, - showstringspaces=false, - columns=fullflexible, - language=Python, - literate= - {->}{{$\rightarrow$}}2 - {>=}{{$\geq$}}2 - {<=}{{$\leq$}}2, - tabsize=2, -} - -\lstdefinestyle{yamlstyle}{ - backgroundcolor=\color{codebg}, - basicstyle=\ttfamily\footnotesize, - keywordstyle=\color{codekw}\bfseries, - stringstyle=\color{codestr}, - commentstyle=\color{codecmt}\itshape, - frame=single, - framesep=4pt, - rulecolor=\color{gray!30}, - breaklines=true, - showstringspaces=false, - columns=fullflexible, - tabsize=2, -} - -\lstdefinestyle{jsonstyle}{ - backgroundcolor=\color{codebg}, - 
basicstyle=\ttfamily\footnotesize, - stringstyle=\color{codestr}, - frame=single, - framesep=4pt, - rulecolor=\color{gray!30}, - breaklines=true, - showstringspaces=false, - columns=fullflexible, - tabsize=2, -} - -\lstset{style=pythonstyle} - -% --- Section formatting --- -\titleformat{\section}{\Large\bfseries}{\thesection}{1em}{} -\titleformat{\subsection}{\large\bfseries}{\thesubsection}{1em}{} -\titlespacing*{\section}{0pt}{18pt}{8pt} -\titlespacing*{\subsection}{0pt}{12pt}{6pt} - -% --- Header/footer --- -\pagestyle{fancy} -\fancyhf{} -\fancyhead[L]{\small\itshape The Specification Dilemma} -\fancyhead[R]{\small\itshape Experiment Spec} -\fancyfoot[C]{\thepage} -\renewcommand{\headrulewidth}{0.4pt} - -% --- Title --- -\title{\textbf{The Specification Dilemma}\\[4pt] - \large Experiment Specification and Scaffolding} -\author{} -\date{April 2026} - -\begin{document} - -\maketitle -\thispagestyle{empty} - -\begin{abstract} -\noindent This document specifies a small empirical probe for the claim that as specification sparsity increases, pairwise semantic similarity across outputs generated from plausibly-varied user prompts increases. The experiment is designed to run locally against a strong open-weights model served by LMStudio, using sentence embeddings to measure output homogeneity in a sparse-vs-dense matched-pairs design, where 30 imagined users with distinct underlying intents each contribute both a sparse and a dense prompt. The document includes the full specification, Python scaffolding for generation and analysis, prompt files, and a project layout ready to execute. 
-\end{abstract} - -\vspace{1em} -\hrule -\vspace{1em} - -\tableofcontents - -\newpage - -% ===================================================================== -\section{Hypothesis} -% ===================================================================== - -As specification sparsity increases (fewer tokens / less detail in the prompt), pairwise semantic similarity across outputs generated from plausibly-varied user prompts increases. Equivalently: sparse specification produces homogenized output across users, even when those users phrase their requests differently. - -This probes the core empirical claim of the essay ``The Specification Dilemma'': that the mechanism by which inference-heavy collaboration produces homogeneity is the convergence of outputs onto the model's shared priors when human specification is insufficient to push the model off its modes. - -% ===================================================================== -\section{Design} -% ===================================================================== - -A two-condition design with \textbf{matched pairs at the user level}, comparing output similarity distributions in a \textbf{sparse specification} condition versus a \textbf{dense specification} condition. - -The unit of variation is \emph{the user}. Each condition contains $N$ prompts, and the two conditions share the same $N$ imagined users. Each user has a fixed underlying intent (audience, thesis, tone, voice, opening move, structural constraint), and that intent is expressed twice: once as a sparse prompt (topic only, in the user's natural register) and once as a dense prompt (the intent in full). The sparse condition simulates those 30 users when they vibe; the dense condition simulates the same 30 users when they specify. 
Controlling the underlying intents across conditions tightens the contrast: any divergence between the two similarity distributions is attributable to specification completeness rather than to differences between the populations of users sampled. - -\subsection{Task selection} - -One task: \textbf{``Write the opening 300 words of a blog post about remote work.''} Rationale: a task many users actually perform with LLMs, enough creative latitude for homogenization to matter, output length tractable for embedding-based similarity. - -\subsection{Sample size} - -$N = 30$ per condition (60 prompts total). This yields $\binom{30}{2} = 435$ pairwise comparisons per condition, which is enough for a reasonably tight confidence interval on mean similarity. - -\subsection{Prompt generation} - -\textbf{Procedure.} First, write $N = 30$ underlying intents --- one per imagined user, spanning distinct audiences, theses, tones, voices, opening moves, and structural constraints. Then, for each intent, produce a matched pair of prompts: - -\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt] - \item \textbf{Sparse prompt} (5--20 tokens, one sentence): what that user would type when underspecifying --- topic only, in their natural register (formal vs.\ casual, polite vs.\ terse, with or without a length cue). The sparse prompt must carry no audience, thesis, tone, voice, or structural specification. Smuggling those in would collapse the contrast the experiment is trying to measure. - \item \textbf{Dense prompt} (150--300 tokens): the intent in full --- target audience, thesis, tone, structural choices, vocabulary register, things to avoid, and author voice. -\end{itemize} - -The two prompt files preserve order: \texttt{sparse.json[i]} and \texttt{dense.json[i]} are the same imagined user's two expressions. Different imagined users should specify \emph{different} target audiences, tones, and angles, so the 30 dense prompts span the space of plausible divergent intents. 
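The matched-pairs invariant (sparse.json[i] and dense.json[i] express the same user's intent at very different levels of detail) can be checked mechanically before the files are frozen. A hypothetical sketch, not part of the shipped scaffolding; word counts stand in roughly for the 5--20 / 150--300 token guidance:

```python
"""Hypothetical pre-flight check for the matched-pairs invariant.

Illustration only: this helper is not part of the repo's scaffolding,
and word counts are a rough proxy for the token-length guidance.
"""

def check_matched_pairs(sparse: list[str], dense: list[str]) -> list[str]:
    """Return a list of violations; empty means the pairs look frozen-ready."""
    problems: list[str] = []
    if len(sparse) != len(dense):
        problems.append(f"index mismatch: {len(sparse)} sparse vs {len(dense)} dense")
    for i, (s, d) in enumerate(zip(sparse, dense)):
        if not s.strip() or len(s.split()) > 20:
            problems.append(f"sparse[{i}] is not a short topic-only prompt")
        if len(d.split()) < 100:
            problems.append(f"dense[{i}] looks too short to carry a full intent")
    return problems

# Toy stand-ins for prompts/{sparse,dense}.json entries:
sparse = ["remote work blog post intro, 300 words"] * 30
dense = [" ".join(["full-intent"] * 200)] * 30
print(check_matched_pairs(sparse, dense))  # []
```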
- -\textbf{Pre-registration discipline:} write and freeze all 30 intents and all 60 prompts before running any generation. Do not iterate on prompts after seeing outputs. - -\subsection{Generation parameters} - -\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt] - \item \textbf{Model:} one strong open-weights instruction-tuned model on LMStudio (e.g.\ \texttt{Qwen2.5-72B-Instruct} or \texttt{Llama-3.3-70B-Instruct}), full precision or Q8. Held constant across all 60 generations. - \item \textbf{Temperature:} 0.7 - \item \textbf{Top-p:} 0.95 - \item \textbf{Max tokens:} 500 - \item \textbf{Seed:} per-prompt deterministic seed (\texttt{seed = prompt\_index}) for reproducibility - \item \textbf{Generations per prompt:} 1 -\end{itemize} - -\subsection{Similarity measurement} - -\textbf{Embedding model:} \texttt{sentence-transformers/all-mpnet-base-v2} as the default; \texttt{BAAI/bge-large-en-v1.5} as an optional robustness check. - -\textbf{Metric:} cosine similarity between embedding vectors of full 300-word outputs. - -\textbf{Aggregation:} for each condition, compute all 435 pairwise cosine similarities, producing two distributions of similarity scores. - -\subsection{Statistical analysis} - -\textbf{Primary test:} two-sample t-test (or Mann--Whitney U if distributions are non-normal) comparing sparse and dense similarity distributions. - -\textbf{Reported statistics:} mean similarity per condition, standard deviation, test statistic, $p$-value, and Cohen's $d$. - -\textbf{Secondary visualization:} overlaid violin plots of the two similarity distributions. More rhetorically useful in the essay than the $p$-value alone. - -\textbf{Dependence caveat:} the 435 pairwise similarities per condition are not independent (each output appears in 29 pairs), so the effective sample size is smaller than 435. A bootstrap at the output level (resample outputs with replacement, recompute mean pairwise similarity, repeat $10{,}000$ times) gives a more conservative interval. 
Both the naive t-test and the bootstrap CI should be reported; the scaffolding below computes both. - -\subsection{Expected result and falsification criteria} - -\textbf{If the hypothesis holds:} sparse-condition mean pairwise similarity is meaningfully higher than dense-condition mean similarity, with large effect size (Cohen's $d > 0.8$) and $p < 0.01$. - -\textbf{What would weaken or falsify it:} -\begin{itemize}[leftmargin=*,itemsep=2pt,topsep=4pt] - \item Similar means across conditions (no homogenization effect). - \item Small effect size even if statistically significant. - \item Dense condition \emph{higher} than sparse --- would be surprising and worth investigating (possibly an artifact of shared specification language leaking into outputs). -\end{itemize} - -\subsection{Optional robustness checks} - -In priority order, if time permits: -\begin{enumerate}[leftmargin=*,itemsep=2pt,topsep=4pt] - \item Re-run with a second task to check task-generality. - \item Re-run with a different embedding model to check metric-robustness. - \item Re-run with a different base LLM to check model-generality. 
-\end{enumerate} - -% ===================================================================== -\section{Project Layout} -% ===================================================================== - -\begin{lstlisting}[style=yamlstyle,language={}] -experiment/ - prompts/ - sparse.json # 30 sparse prompts, matched to dense by index - dense.json # 30 dense prompts, matched to sparse by index - outputs/ - sparse/ # 30 .txt files, one per generation - dense/ # 30 .txt files, one per generation - embeddings/ - sparse.npy # (30, D) embedding matrix - dense.npy # (30, D) embedding matrix - results/ - pairwise.csv # all pairwise similarities, condition labeled - stats.json # test statistics and summary metrics - plot.png # violin + histogram comparison - config.yaml # model name, temperature, paths - smoke_test.py # pre-flight: connectivity, seed, latency - generate.py # runs LMStudio generations - embed.py # computes sentence embeddings - similarity.py # computes pairwise cosine similarities - stats.py # t-test, Mann-Whitney, bootstrap, effect size - plot.py # violin plot of similarity distributions - run_all.py # orchestrator: runs the full pipeline - pyproject.toml # uv-managed environment - requirements.txt # pip fallback - README.md -\end{lstlisting} - -\paragraph{Dependencies.} -\texttt{openai} (LMStudio client), \texttt{sentence-transformers}, \texttt{numpy}, \texttt{scipy}, \texttt{matplotlib}, \texttt{seaborn}, \texttt{pandas}, \texttt{pyyaml}, \texttt{tqdm}. - -\paragraph{Runtime estimate.} -Generation: 60 prompts $\times$ $\sim$15--30s each $\approx$ 15--30 min on strong hardware. Embedding + analysis + plotting: $<$1 min. Total wall time: well under an hour once prompts are frozen. 
- -\newpage - -% ===================================================================== -\section{Configuration} -% ===================================================================== - -\subsection{\texttt{config.yaml}} - -\begin{lstlisting}[style=yamlstyle,language={}] -# LMStudio server settings -lmstudio: - base_url: "http://localhost:1234/v1" - api_key: "lm-studio" # placeholder; LMStudio ignores this - model: "qwen2.5-72b-instruct" # name as it appears in LMStudio - -# Generation parameters -generation: - temperature: 0.7 - top_p: 0.95 - max_tokens: 500 - -# Embedding model -embedding: - model: "sentence-transformers/all-mpnet-base-v2" - -# Paths (relative to project root) -paths: - prompts_dir: "prompts" - outputs_dir: "outputs" - embeddings_dir: "embeddings" - results_dir: "results" - -# Analysis -analysis: - bootstrap_iterations: 10000 - random_seed: 42 -\end{lstlisting} - -\subsection{\texttt{requirements.txt}} - -\begin{lstlisting}[style=yamlstyle,language={}] -openai>=1.40.0 -sentence-transformers>=3.0.0 -numpy>=1.26.0 -scipy>=1.13.0 -pandas>=2.2.0 -matplotlib>=3.8.0 -seaborn>=0.13.0 -pyyaml>=6.0 -tqdm>=4.66.0 -\end{lstlisting} - -% ===================================================================== -\section{Prompt Files} -% ===================================================================== - -The 30 sparse prompts below are the frozen set shipped with this experiment. \texttt{sparse.json[i]} and \texttt{dense.json[i]} correspond to the same imagined user: the dense prompt expresses that user's full intent, and the sparse prompt is what the same user would type when underspecifying. The sparse set below carries no audience, thesis, tone, voice, or structural direction; only register, punctuation, and length cues differ across users. Freeze your final set before running any generations. 
- -\subsection{\texttt{prompts/sparse.json}} - -\begin{lstlisting}[style=jsonstyle,language={}] -[ - "remote work blog post intro, 300 words", - "write the opening of a blog post on remote work", - "Draft an opening for a blog post about remote work.", - "can you write a blog post intro about remote work", - "blog intro about working from home, 300 words", - "need a remote work blog intro, ~300 words", - "Please draft the opening 300 words of a blog post about remote work.", - "I'd like the opening of a blog post about remote work, around 300 words.", - "remote work blog post opener, 300 words", - "blog intro, remote work, 300 words", - "first 300 words of a blog post on remote work", - "opening 300 words, blog post about remote work", - "hey can you write a remote work blog intro", - "write the first 300 words of a blog on remote work", - "Draft the opening paragraph of a blog post on remote work.", - "Please write the opening of a blog post about remote work, approximately 300 words.", - "Write the opening of a blog post about remote work, around 300 words.", - "blog post, remote work, first 300 words", - "could you write the intro to a blog post on remote work", - "Write the opening 300 words of a blog post about remote work.", - "opening for a blog post on remote work, ~300 words", - "write a blog intro about remote work please", - "Please draft the opening of a blog post about remote work.", - "blog post opening on remote work, about 300 words", - "Draft opening 300 words --- blog post, remote work.", - "write me a blog intro about remote work", - "Please produce the opening of a blog post on remote work, approximately 300 words.", - "can u write a remote work blog intro", - "need the opener for a blog post about remote work, ~300 words", - "Write the opening of a blog post about remote work." 
-] -\end{lstlisting} - -\subsection{\texttt{prompts/dense.json} (example entry; see repo for all 30)} - -The dense prompts should each push the model in a different direction. The first entry, shown below, pairs with the first sparse prompt above and illustrates the shape: - -\begin{lstlisting}[style=jsonstyle,language={}] -[ - "Write the opening 300 words of a blog post about remote work. The target audience is mid-career software engineers at 50-500 person startups who have worked remotely for 3+ years and are tired of both 'remote work is utopia' and 'return to office' takes. The tone should be dry, slightly weary, and specific rather than abstract. Open with a concrete observation about a small, texture-of-daily-life detail rather than a statistic or rhetorical question. Avoid the words 'unprecedented', 'new normal', 'journey', 'landscape', and 'game-changer'. The thesis should be that remote work's real cost is not productivity or culture but the erosion of ambient professional development - the kind of learning that happens when you overhear a senior engineer debug something. Voice should resemble a technical writer who also reads literary essays. No bullet points. End the opening paragraph on a sentence that turns the observation into a question the rest of the post will answer." - // ... 29 more, each specifying a different audience, thesis, tone, and - // structural constraint. See guidance below. 
-] -\end{lstlisting} - -\paragraph{Guidance for producing the 30 intents and matched prompts.} -Vary each of these axes independently across the 30 intents: target audience (engineers, designers, HR leaders, small-business owners, parents, new grads, academics, tradespeople, freelancers, founders, managers, journalists, lawyers, clinicians, etc.); thesis (productivity, loneliness, asynchrony, hiring, real estate, inequality, rituals, surveillance, accessibility, capex amortization, etc.); tone (dry, earnest, skeptical, celebratory, investigative, personal-essay, confessional, technocratic, etc.); opening move (anecdote, statistic-that-turns, quote, scene, counter-intuitive claim, literary reference, quoted memo, etc.); voice (technical writer, essayist, journalist, first-person blogger, trade-publication editor, policy analyst, etc.); structural constraint (no bullets, one-sentence paragraphs, tight lede, delayed thesis, circular, etc.). For each intent, write the dense prompt that expresses it fully; then write the sparse prompt as what that same user would type when underspecifying --- topic-only, in their natural register, carrying none of the specification content. The goal is that if the mechanism holds, the sparse-condition outputs will converge despite 30 different underlying intents, while the dense-condition outputs will diverge because the specifications actually push the model toward different regions of output space. - -\newpage - -% ===================================================================== -\section{Generation Script} -% ===================================================================== - -\subsection{\texttt{generate.py}} - -\begin{lstlisting} -"""Generate completions for sparse and dense prompts via LMStudio. - -LMStudio exposes an OpenAI-compatible server (default: localhost:1234). -Start the server from LMStudio's "Local Server" tab before running. 
-""" -from __future__ import annotations - -import json -import os -from pathlib import Path - -import yaml -from openai import OpenAI -from tqdm import tqdm - - -def load_config(path: str = "config.yaml") -> dict: - with open(path, "r") as f: - return yaml.safe_load(f) - - -def make_client(cfg: dict) -> OpenAI: - return OpenAI( - base_url=cfg["lmstudio"]["base_url"], - api_key=cfg["lmstudio"]["api_key"], - ) - - -def generate_one( - client: OpenAI, - model: str, - prompt: str, - temperature: float, - top_p: float, - max_tokens: int, - seed: int, -) -> str: - """Single completion. Returns the assistant message content.""" - response = client.chat.completions.create( - model=model, - messages=[{"role": "user", "content": prompt}], - temperature=temperature, - top_p=top_p, - max_tokens=max_tokens, - seed=seed, - ) - return response.choices[0].message.content or "" - - -def run_condition( - client: OpenAI, - cfg: dict, - condition: str, -) -> None: - prompts_path = Path(cfg["paths"]["prompts_dir"]) / f"{condition}.json" - outputs_dir = Path(cfg["paths"]["outputs_dir"]) / condition - outputs_dir.mkdir(parents=True, exist_ok=True) - - with open(prompts_path, "r") as f: - prompts = json.load(f) - - gen_cfg = cfg["generation"] - model = cfg["lmstudio"]["model"] - - for i, prompt in enumerate(tqdm(prompts, desc=f"{condition}")): - out_file = outputs_dir / f"{i:02d}.txt" - if out_file.exists(): - continue # resume support - text = generate_one( - client=client, - model=model, - prompt=prompt, - temperature=gen_cfg["temperature"], - top_p=gen_cfg["top_p"], - max_tokens=gen_cfg["max_tokens"], - seed=i, - ) - out_file.write_text(text, encoding="utf-8") - - -def main() -> None: - cfg = load_config() - client = make_client(cfg) - for condition in ("sparse", "dense"): - run_condition(client, cfg, condition) - print("Generation complete.") - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% 
===================================================================== -\section{Embedding Script} -% ===================================================================== - -\subsection{\texttt{embed.py}} - -\begin{lstlisting} -"""Compute sentence embeddings for each generation in each condition.""" -from __future__ import annotations - -from pathlib import Path - -import numpy as np -import yaml -from sentence_transformers import SentenceTransformer -from tqdm import tqdm - - -def load_config(path: str = "config.yaml") -> dict: - with open(path, "r") as f: - return yaml.safe_load(f) - - -def load_outputs(outputs_dir: Path) -> list[str]: - """Load all .txt outputs from a condition directory, sorted by filename.""" - files = sorted(outputs_dir.glob("*.txt")) - return [f.read_text(encoding="utf-8") for f in files] - - -def embed_condition( - model: SentenceTransformer, - texts: list[str], -) -> np.ndarray: - """Return (N, D) embedding matrix. L2-normalized for cosine similarity.""" - embeddings = model.encode( - texts, - batch_size=8, - show_progress_bar=True, - convert_to_numpy=True, - normalize_embeddings=True, - ) - return embeddings - - -def main() -> None: - cfg = load_config() - model = SentenceTransformer(cfg["embedding"]["model"]) - - outputs_root = Path(cfg["paths"]["outputs_dir"]) - emb_root = Path(cfg["paths"]["embeddings_dir"]) - emb_root.mkdir(parents=True, exist_ok=True) - - for condition in ("sparse", "dense"): - texts = load_outputs(outputs_root / condition) - if not texts: - print(f"No outputs found for {condition}; skipping.") - continue - print(f"Embedding {len(texts)} {condition} outputs...") - embeddings = embed_condition(model, texts) - np.save(emb_root / f"{condition}.npy", embeddings) - print(f"Saved {condition}.npy with shape {embeddings.shape}") - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% ===================================================================== -\section{Similarity Computation} -% 
===================================================================== - -\subsection{\texttt{similarity.py}} - -\begin{lstlisting} -"""Compute pairwise cosine similarities within each condition.""" -from __future__ import annotations - -from itertools import combinations -from pathlib import Path - -import numpy as np -import pandas as pd -import yaml - - -def load_config(path: str = "config.yaml") -> dict: - with open(path, "r") as f: - return yaml.safe_load(f) - - -def pairwise_cosine(embeddings: np.ndarray) -> tuple[np.ndarray, list[tuple[int, int]]]: - """Return (similarities, index_pairs) for all i < j.""" - n = embeddings.shape[0] - pairs = list(combinations(range(n), 2)) - sims = np.array([float(embeddings[i] @ embeddings[j]) for i, j in pairs]) - return sims, pairs - - -def main() -> None: - cfg = load_config() - emb_root = Path(cfg["paths"]["embeddings_dir"]) - results_root = Path(cfg["paths"]["results_dir"]) - results_root.mkdir(parents=True, exist_ok=True) - - rows = [] - for condition in ("sparse", "dense"): - emb_path = emb_root / f"{condition}.npy" - if not emb_path.exists(): - print(f"Missing embeddings for {condition}; skipping.") - continue - embeddings = np.load(emb_path) - sims, pairs = pairwise_cosine(embeddings) - for (i, j), s in zip(pairs, sims): - rows.append({ - "condition": condition, - "i": i, - "j": j, - "cosine": s, - }) - print( - f"{condition}: n_outputs={embeddings.shape[0]}, " - f"n_pairs={len(sims)}, mean={sims.mean():.4f}, " - f"std={sims.std(ddof=1):.4f}" - ) - - df = pd.DataFrame(rows) - df.to_csv(results_root / "pairwise.csv", index=False) - print(f"Saved {results_root / 'pairwise.csv'}") - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% ===================================================================== -\section{Statistical Analysis} -% ===================================================================== - -\subsection{\texttt{stats.py}} - -\begin{lstlisting} -"""Statistical tests and summary metrics for the similarity comparison.
- -Reports: - - Per-condition descriptive statistics - - Naive two-sample t-test on pairwise similarities - - Mann-Whitney U (nonparametric check) - - Cohen's d effect size - - Output-level bootstrap 95% CI for the difference in mean similarity - (corrects for pairwise dependence) -""" -from __future__ import annotations - -import json -from itertools import combinations -from pathlib import Path - -import numpy as np -import pandas as pd -import yaml -from scipy import stats - - -def load_config(path: str = "config.yaml") -> dict: - with open(path, "r") as f: - return yaml.safe_load(f) - - -def cohens_d(a: np.ndarray, b: np.ndarray) -> float: - """Pooled-SD Cohen's d for two independent samples.""" - na, nb = len(a), len(b) - va, vb = a.var(ddof=1), b.var(ddof=1) - pooled_sd = np.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) - return (a.mean() - b.mean()) / pooled_sd - - -def mean_pairwise(embeddings: np.ndarray) -> float: - """Mean pairwise cosine for an L2-normalized embedding matrix.""" - n = embeddings.shape[0] - sims = [ - float(embeddings[i] @ embeddings[j]) - for i, j in combinations(range(n), 2) - ] - return float(np.mean(sims)) - - -def bootstrap_diff( - sparse_emb: np.ndarray, - dense_emb: np.ndarray, - n_iter: int, - rng: np.random.Generator, -) -> tuple[float, float, np.ndarray]: - """Output-level bootstrap of (mean_sparse - mean_dense). - - Resamples outputs (not pairs) with replacement, recomputes mean - pairwise similarity in each condition, returns 95% CI. 
- """ - n_s, n_d = sparse_emb.shape[0], dense_emb.shape[0] - diffs = np.empty(n_iter, dtype=float) - for k in range(n_iter): - idx_s = rng.integers(0, n_s, size=n_s) - idx_d = rng.integers(0, n_d, size=n_d) - ms = mean_pairwise(sparse_emb[idx_s]) - md = mean_pairwise(dense_emb[idx_d]) - diffs[k] = ms - md - lo, hi = np.percentile(diffs, [2.5, 97.5]) - return float(lo), float(hi), diffs - - -def main() -> None: - cfg = load_config() - emb_root = Path(cfg["paths"]["embeddings_dir"]) - results_root = Path(cfg["paths"]["results_dir"]) - results_root.mkdir(parents=True, exist_ok=True) - - sparse_emb = np.load(emb_root / "sparse.npy") - dense_emb = np.load(emb_root / "dense.npy") - - df = pd.read_csv(results_root / "pairwise.csv") - sparse_sims = df.loc[df["condition"] == "sparse", "cosine"].to_numpy() - dense_sims = df.loc[df["condition"] == "dense", "cosine"].to_numpy() - - # Descriptive - desc = { - "sparse": { - "n_outputs": int(sparse_emb.shape[0]), - "n_pairs": int(len(sparse_sims)), - "mean": float(sparse_sims.mean()), - "std": float(sparse_sims.std(ddof=1)), - "median": float(np.median(sparse_sims)), - }, - "dense": { - "n_outputs": int(dense_emb.shape[0]), - "n_pairs": int(len(dense_sims)), - "mean": float(dense_sims.mean()), - "std": float(dense_sims.std(ddof=1)), - "median": float(np.median(dense_sims)), - }, - } - - # Naive t-test (note: pairwise dependence means this is optimistic) - t_stat, t_p = stats.ttest_ind(sparse_sims, dense_sims, equal_var=False) - - # Nonparametric check - u_stat, u_p = stats.mannwhitneyu( - sparse_sims, dense_sims, alternative="two-sided" - ) - - # Effect size - d = cohens_d(sparse_sims, dense_sims) - - # Output-level bootstrap (the honest test) - rng = np.random.default_rng(cfg["analysis"]["random_seed"]) - lo, hi, _ = bootstrap_diff( - sparse_emb, - dense_emb, - n_iter=cfg["analysis"]["bootstrap_iterations"], - rng=rng, - ) - - summary = { - "descriptive": desc, - "naive_welch_t_test": {"t": float(t_stat), "p": float(t_p)}, - 
"mann_whitney_u": {"u": float(u_stat), "p": float(u_p)}, - "cohens_d": float(d), - "bootstrap_diff_in_means": { - "point_estimate": float(sparse_sims.mean() - dense_sims.mean()), - "ci_low": lo, - "ci_high": hi, - "n_iter": int(cfg["analysis"]["bootstrap_iterations"]), - }, - } - - with open(results_root / "stats.json", "w") as f: - json.dump(summary, f, indent=2) - - # Pretty print - print(json.dumps(summary, indent=2)) - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% ===================================================================== -\section{Plotting} -% ===================================================================== - -\subsection{\texttt{plot.py}} - -\begin{lstlisting} -"""Violin + strip plot of pairwise similarity distributions per condition.""" -from __future__ import annotations - -from pathlib import Path - -import matplotlib.pyplot as plt -import pandas as pd -import seaborn as sns -import yaml - - -def load_config(path: str = "config.yaml") -> dict: - with open(path, "r") as f: - return yaml.safe_load(f) - - -def main() -> None: - cfg = load_config() - results_root = Path(cfg["paths"]["results_dir"]) - df = pd.read_csv(results_root / "pairwise.csv") - - sns.set_theme(style="whitegrid", context="talk") - fig, ax = plt.subplots(figsize=(8, 6)) - - sns.violinplot( - data=df, - x="condition", - y="cosine", - order=["sparse", "dense"], - inner="quartile", - cut=0, - ax=ax, - ) - sns.stripplot( - data=df, - x="condition", - y="cosine", - order=["sparse", "dense"], - color="black", - alpha=0.25, - size=2, - ax=ax, - ) - - ax.set_xlabel("Specification condition") - ax.set_ylabel("Pairwise cosine similarity") - ax.set_title("Output similarity by specification density") - fig.tight_layout() - fig.savefig(results_root / "plot.png", dpi=200) - print(f"Saved {results_root / 'plot.png'}") - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% ===================================================================== 
-\section{Pipeline Orchestrator} -% ===================================================================== - -\subsection{\texttt{run\_all.py}} - -\begin{lstlisting} -"""Run the full pipeline end-to-end. - -Usage: - python run_all.py -""" -from __future__ import annotations - -import subprocess -import sys - - -STEPS = [ - ("Generating completions via LMStudio", "generate.py"), - ("Embedding outputs", "embed.py"), - ("Computing pairwise similarities", "similarity.py"), - ("Running statistical tests", "stats.py"), - ("Plotting", "plot.py"), -] - - -def main() -> None: - for title, script in STEPS: - print(f"\n=== {title} ({script}) ===") - result = subprocess.run([sys.executable, script]) - if result.returncode != 0: - print(f"Step failed: {script}") - sys.exit(result.returncode) - print("\nPipeline complete. See results/ for outputs.") - - -if __name__ == "__main__": - main() -\end{lstlisting} - -\newpage - -% ===================================================================== -\section{README} -% ===================================================================== - -\subsection{\texttt{README.md}} - -\begin{lstlisting}[style=yamlstyle,language={}] -# Specification Dilemma Experiment - -Small empirical probe for the claim that sparse-specification prompts -produce more homogeneous outputs across users than dense-specification -prompts. - -See experiment.pdf for the full specification. - -## Design note: matched pairs - -The prompts in this repo follow a matched-pairs structure. Each of 30 -imagined users has a fixed underlying intent (audience, thesis, tone, -voice, opening move, structural constraint). prompts/dense.json[i] -expresses that user's full intent; prompts/sparse.json[i] is what the -same user would type when underspecifying -- topic only, in roughly -their natural register. The sparse prompts carry no audience, thesis, -tone, or structural specification. 
-
-The statistical comparison is unchanged -- cross-user pairwise
-similarity in each condition -- but the two conditions now sample the
-same population of underlying intents. This tests the sharper claim:
-when users with divergent intents underspecify, outputs converge
-(priors dominate); when they specify fully, outputs diverge (intents
-dominate).
-
-## Setup
-
-1. Install LMStudio and download a strong instruction-tuned model
-   (e.g. Qwen2.5-72B-Instruct or Llama-3.3-70B-Instruct).
-2. Start the LMStudio local server (default: localhost:1234).
-3. Create the environment and install dependencies with uv:
-
-       uv sync
-
-   (or pip install -r requirements.txt inside a venv if not using uv)
-
-4. Edit config.yaml if your LMStudio model name or port differs from
-   the defaults. If LMStudio is on a remote host, point
-   lmstudio.base_url at that host (e.g. http://<host>:1234/v1).
-
-5. Smoke-test the endpoint (checks connectivity, seed-honoring, and
-   approximate per-generation latency):
-
-       uv run python smoke_test.py
-
-## Running
-
-Freeze your prompts in prompts/sparse.json and prompts/dense.json
-before generating anything.
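-
-Assuming the prompt files are flat JSON arrays of 30 strings (the
-repo only guarantees that sparse and dense are matched by index:
-sparse.json[i] and dense.json[i] come from the same imagined user),
-a sketch of what sparse.json might look like -- placeholder entries,
-not the real prompts:
-
-    [
-      "write something about remote work",
-      "can you do a post about city noise",
-      ...
-    ]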
-
-Then run the full pipeline:
-
-    uv run python run_all.py
-
-Or run steps individually:
-
-    uv run python generate.py    # LMStudio generations
-    uv run python embed.py       # sentence embeddings
-    uv run python similarity.py  # pairwise cosine similarities
-    uv run python stats.py       # t-test, Mann-Whitney, bootstrap, Cohen's d
-    uv run python plot.py        # violin plot
-
-## Outputs
-
-- outputs/{sparse,dense}/NN.txt : raw model completions
-- embeddings/{sparse,dense}.npy : L2-normalized embedding matrices
-- results/pairwise.csv : all pairwise similarities
-- results/stats.json : test statistics and summary
-- results/plot.png : similarity distribution plot
-
-## Interpretation
-
-A positive result: sparse-condition mean pairwise similarity is
-meaningfully higher than dense-condition mean similarity, the
-bootstrap 95% CI on the difference excludes 0, and Cohen's d is
-large (>0.8).
-
-A null or inverted result is also interesting and should be reported
-honestly.
-\end{lstlisting}
-
-% =====================================================================
-\section{Essay Integration Notes}
-% =====================================================================
-
-The experiment should occupy roughly 300--500 words in the essay itself, including setup, result, and caveat. One paragraph describing the design, one paragraph reporting the result with the plot, and one sentence acknowledging limitations.
-
-The honest framing, to place in a footnote:
-
-\begin{quote}
-\small This is a single-task, single-model probe with one embedding-based similarity metric. A fuller treatment would need to show the effect holds across tasks, models, and metrics. I ran this to check whether my intuition survived contact with data, and I report the result either way.
-\end{quote}
-
-If the experiment's word budget starts expanding past 500 words, it has begun competing with the argument rather than serving it.
The mechanism section and the ``why this is worse than individual decline'' section are the essay's center of gravity; the experiment is a short piece of evidentiary scaffolding that lets those sections claim more than pure argument would permit. - -\end{document}