# LeVCS: A Technical Report
**Status:** v0.1.0 — protocol substrate complete, workflow surface deferred.
**Audience:** engineers evaluating LeVCS for their own projects, or designing
the workflow tooling that will sit on top of it.
---
## TL;DR
- **LeVCS is a distributed version control system** in the same lineage as
git, fossil, pijul, and sapling — content-addressed objects, signed
history, three-way merge.
- **Five things are different by design:** identity is in the protocol,
federation is a first-class concept, the merge engine is a cascading
pipeline of format-aware handlers, hashes are BLAKE3, and releases are
signed objects rather than ad-hoc tags.
- **It is a substrate, not a workflow tool.** There is no PR review
surface, no issue tracker, no CI integration, no web UI today. Those
are the next layer up.
- **You can host an instance on a small VPS** behind nginx or Caddy. The
protocol terminates over HTTP; signing is at the application layer.
- **The codebase is small** (~10 crates), and `cargo test` finishes in under a
minute on a laptop. 194 tests pass at v0.1.0; baseline benchmarks are
in the repo.
---
## 1. Why a new VCS?
Git is the dominant DVCS. It is also a tool that grew up in 2005 around
a hashing algorithm that had visible cracks (SHA-1) and a federation
model that was, fundamentally, "your remote is a URL string." Twenty
years on, the world git serves looks different:
- Identity is no longer optional. Many projects need to know not just
*who claims to have authored a commit* but who is *authorized* to
alter the repo's history.
- Replication is more complex than push/pull. Mirrors, archives,
cold-storage replicas, and read-only forks are all common — and git
treats them with the same primitives as the source-of-truth remote.
- Merge conflicts are still mostly resolved at the line level. JSON,
YAML, TOML, source code with semantic structure — the line-diff
treatment is wrong for all of these and produces false conflicts on
reformats every team has hit.
- SHA-1 is broken; git's SHA-256 transition has been "in progress" for
most of a decade.
LeVCS is an attempt at a clean restart that takes the DAG model and
content addressing as obvious wins, and rebuilds identity, federation,
merging, and hashing as **protocol-level concerns** rather than
conventions or sidecar tools.
---
## 2. The shape of the system
LeVCS is layered:
```
┌───────────────────────────────────────────────────────┐
│ Workflow tools (TBD: review, issues, web UI) │
├───────────────────────────────────────────────────────┤
│ CLI: `levcs init / commit / push / merge / release` │
├───────────────────────────────────────────────────────┤
│ Federation HTTP API (instances, mirrors, releases) │
├───────────────────────────────────────────────────────┤
│ Object model: Blob / Tree / Commit / Release / Authority │
│ Merge engine: textual → format-aware → tree-sitter │
│ Trust root: signed authority chain (Ed25519) │
│ Content addressing: BLAKE3 │
└───────────────────────────────────────────────────────┘
```
Five object kinds, all content-addressed by their BLAKE3 digest:
- **Blob** — raw file contents.
- **Tree** — `(name, type, mode, hash)` entries; sorted, no duplicates.
- **Commit** — tree + parents + authority + author key + message,
signed.
- **Release** — first-class artifact: tree + predecessor commit + parent
release + label + notes, signed by a maintainer or owner.
- **Authority** — the *membership document* for a repo: who has what
role, signed and chained.
A repository is the set of these objects plus a `refs/` map (`branches/*`,
`releases/*`, `authority/{genesis,current}`) indexing into them. The
`repo_id` is the BLAKE3 of the genesis authority — globally unique by
construction, no central registrar needed.
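The Tree invariant above ("sorted, no duplicates") can be checked in one pass. The following is a minimal sketch, not the project's code: the `TreeEntry` and `EntryKind` names, the field types, and the choice of byte-wise name ordering are all assumptions made for illustration.

```rust
/// Kind of object a tree entry points at (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntryKind {
    Blob,
    Tree,
}

/// One `(name, type, mode, hash)` entry. Field types are assumptions;
/// the hash is a 32-byte digest as described in the report.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TreeEntry {
    pub name: String,
    pub kind: EntryKind,
    pub mode: u32,
    pub hash: [u8; 32],
}

/// Returns true when entry names are strictly ascending. Strict ordering
/// implies both "sorted" and "no duplicates". Byte-wise comparison is an
/// assumption of this sketch.
pub fn is_canonical(entries: &[TreeEntry]) -> bool {
    entries
        .windows(2)
        .all(|pair| pair[0].name.as_bytes() < pair[1].name.as_bytes())
}

/// Hypothetical helper that builds a blob entry with a dummy hash.
pub fn entry(name: &str) -> TreeEntry {
    TreeEntry {
        name: name.to_string(),
        kind: EntryKind::Blob,
        mode: 0o644,
        hash: [0u8; 32],
    }
}
```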
---
## 3. What's different from git
| Axis | git | LeVCS |
|---|---|---|
| Hash | SHA-1 (deprecated, transitioning) | BLAKE3 |
| Identity | Author string in commit | Signed authority object with explicit roles |
| Push authorization | Server-side hook or hosting platform | Protocol-level role check (Reader/Contributor/Maintainer/Owner) |
| Force-push rule | Server policy (off-protocol) | Protocol enforces maintainer-or-owner role |
| Federation | URL-bound remotes | Global `repo_id` + replicating instances |
| Mirror replication | `git fetch --mirror` (best-effort) | First-class with three storage modes |
| Tags / releases | Mutable string refs (often) | Signed objects with predecessor + parent_release chain |
| Merge granularity | Line-level (Myers / patience) | Cascade: textual → format → tree-sitter → plugin |
| Merge audit | No artifact | `.levcs/merge-record` TOML, signed with the commit |
| Web UI / issues | Provided by hosting platform | Out of scope for v1 |
The rest of this section unpacks each axis.
### 3.1 Identity in the protocol, not on top
Git stores `Author: Name <email>` and `Committer: Name <email>` strings
in commits. There is nothing cryptographic about either. Signed commits
are opt-in (GPG via `--gpg-sign`, since 2012; SSH keys via
`gpg.format = ssh`, since 2021), but even when signed they answer "did
*some* key sign this?" rather than "is the signer authorized to write to
this repo right now?"
LeVCS makes membership a first-class object. An **authority body** has:
```
schema_version repo_id previous_authority version created_micros
members: [(public_key, handle, role, added_micros, added_by), ...]
policy: [(key, value), ...]
```
Roles are a strict ordering: `Reader < Contributor < Maintainer < Owner`.
Every commit references the authority hash that was current when it was
signed. Updating membership is a versioned operation: you write a new
authority object, signed by an Owner, with `previous_authority` pointing
at the prior one. The instance walks the chain on push and rejects any
push whose author key isn't a current member.
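A role check of this kind might look like the sketch below. The type and function names are hypothetical, and the thresholds are partly assumed: the report states that force-push needs Maintainer or Owner, while requiring at least Contributor for an ordinary push is this sketch's guess.

```rust
/// Roles in ascending order; deriving `Ord` on the declaration order gives
/// `Reader < Contributor < Maintainer < Owner`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    Reader,
    Contributor,
    Maintainer,
    Owner,
}

/// Subset of a member record from the authority body (names assumed).
pub struct Member {
    pub public_key: [u8; 32],
    pub handle: String,
    pub role: Role,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PushError {
    NotAMember,
    InsufficientRole { have: Role, need: Role },
}

/// Looks up the pushing key in the current authority and compares its role
/// against the threshold for the requested operation.
pub fn authorize_push(
    members: &[Member],
    key: &[u8; 32],
    force: bool,
) -> Result<Role, PushError> {
    let member = members
        .iter()
        .find(|m| &m.public_key == key)
        .ok_or(PushError::NotAMember)?;
    // Assumed thresholds: Contributor for a normal push, Maintainer for force.
    let need = if force { Role::Maintainer } else { Role::Contributor };
    if member.role >= need {
        Ok(member.role)
    } else {
        Err(PushError::InsufficientRole { have: member.role, need })
    }
}
```

A real instance would first verify the authority chain and the commit signatures; this only shows the final comparison.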
The practical consequence: "give Bob push access" is not a
hosting-platform toggle. It is a signed authority update that travels in the
repo and is auditable for the lifetime of the project.
### 3.2 Federation, not "remotes"
A git remote is a URL plus some credentials. There is no fact of the
matter about whether two URLs refer to the *same* repository: git can
compare commit graphs, but "same project" holds only by convention.
LeVCS has a **global repo_id**. It is the BLAKE3 of the genesis
authority object, so two clones of the same project have the same
`repo_id` even if they live on instances on opposite continents. An
instance is a federation peer: it serves `/levcs/v1/repos/<repo_id>/...`
endpoints and replicates state from other instances when configured to.
Mirroring is the protocol's normal mode, not a `git fetch --mirror` cron
job.
This composes with three **storage modes** (§4.3 of the spec):
- **Full** — every reachable object. The source-of-truth instance.
- **Release** — only release objects, their reachable trees and blobs,
and the authority chain. Skips inter-release commits. For
long-lived archive replicas.
- **Metadata** — authority objects, release headers, signed refs only.
No content. For "is this project still alive?" pings.
The instance enforces these on push: a release-mode replica refuses
pushes that update branches; a metadata-mode replica refuses all
pushes (it's populated by mirroring).
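The push rules for the three modes reduce to a small predicate. This is a sketch under assumptions: the `StorageMode` enum and `accepts_push` function are illustrative names, and "updates a branch" is detected here by the `refs/branches/` prefix used elsewhere in this report.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StorageMode {
    Full,
    Release,
    Metadata,
}

/// Decides whether an instance in `mode` accepts a push that updates the
/// given ref names.
pub fn accepts_push(mode: StorageMode, updated_refs: &[&str]) -> bool {
    match mode {
        // Source-of-truth instances accept any (role-checked) push.
        StorageMode::Full => true,
        // Release replicas refuse anything that moves a branch.
        StorageMode::Release => !updated_refs
            .iter()
            .any(|name| name.starts_with("refs/branches/")),
        // Metadata replicas are populated by mirroring only.
        StorageMode::Metadata => false,
    }
}
```

Role checks still apply on top of this; the predicate only captures the storage-mode part of the decision.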
### 3.3 The merge cascade
This is the technical centerpiece.
A traditional three-way merge (git, Mercurial, fossil) works at the
line level. It is correct for prose and acceptable for code, but it
generates false conflicts on:
- Reformats (linters, prettifiers, whitespace-policy bumps).
- Key reorderings in JSON / YAML / TOML.
- Import lists in source files that two branches both edited.
- Markdown files where two contributors modified disjoint sections of
the same paragraph.
LeVCS dispatches per-file to a **handler cascade** ranked by aggressiveness:
```
rank 0 textual universal line-level fallback
rank 1 format-aware json | yaml | toml | xml | markdown | prose
rank 2 tree-sitter rust | python | js | ts | go | c | cpp |
java | ruby | bash
rank 3 plugin wasm-sandboxed, user-supplied
```
A repo's `.levcs/merge.toml` maps glob patterns to handlers. Per-user
`.levcs/merge.local.toml` can **demote** but never promote, so a
distrusted plugin can be locally turned off without a repo edit. Each
merged file produces a `FileRecord` in `.levcs/merge-record` listing the
handler used and its hash; the merge-record blob is committed alongside
the resolved tree, so every merge in history is auditable.
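One way to read the "demote but never promote" rule is as a cap: the local file can lower the rank chosen by the repo, and any attempt to raise it is ignored. The sketch below encodes that reading. `HandlerRank` and `effective_rank` are hypothetical names, and treating demotion as a minimum over ranks is an assumption rather than the spec's wording.

```rust
/// Handler ranks from the cascade table, in ascending aggressiveness.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HandlerRank {
    Textual = 0,
    FormatAware = 1,
    TreeSitter = 2,
    Plugin = 3,
}

/// Combines the repo-level choice with an optional per-user override.
/// Taking the minimum means a local setting can only lower the rank.
pub fn effective_rank(repo: HandlerRank, local: Option<HandlerRank>) -> HandlerRank {
    match local {
        Some(local_rank) => repo.min(local_rank),
        None => repo,
    }
}
```

With this rule, a user who distrusts a plugin can force `Textual` locally, while a local request for `Plugin` on a file the repo maps to `FormatAware` has no effect.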
Format-aware example: in `package.json`, Alice adds a dependency at
the top of `dependencies` and Bob adds one at the bottom. When the block
is short, git reports a conflict because the two edits land in adjacent
or overlapping hunks. The JSON handler parses both sides, computes the
structural diff against the base, and merges them: both new entries
appear in the output, with no conflict.
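The structural idea can be shown on a flat name-to-version map, which is roughly what a `dependencies` object is. This is a simplified sketch, not the JSON handler itself: it ignores nesting, key order, and formatting, and the `Deps`, `deps`, and `merge_maps` names are made up for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A flat `name -> version` map standing in for a JSON object.
pub type Deps = BTreeMap<String, String>;

/// Hypothetical helper for building a map from string pairs.
pub fn deps(pairs: &[(&str, &str)]) -> Deps {
    pairs
        .iter()
        .map(|(name, version)| (name.to_string(), version.to_string()))
        .collect()
}

/// Key-wise three-way merge. Returns the merged map, or the list of keys
/// that both sides changed in different ways.
pub fn merge_maps(base: &Deps, ours: &Deps, theirs: &Deps) -> Result<Deps, Vec<String>> {
    let keys: BTreeSet<&String> = base
        .keys()
        .chain(ours.keys())
        .chain(theirs.keys())
        .collect();
    let mut merged = Deps::new();
    let mut conflicts = Vec::new();
    for key in keys {
        let (b, o, t) = (base.get(key), ours.get(key), theirs.get(key));
        let chosen = if o == t {
            o // both sides agree (including "both deleted")
        } else if o == b {
            t // only theirs changed this key
        } else if t == b {
            o // only ours changed this key
        } else {
            conflicts.push(key.clone());
            continue;
        };
        if let Some(value) = chosen {
            merged.insert(key.clone(), value.clone());
        }
    }
    if conflicts.is_empty() {
        Ok(merged)
    } else {
        Err(conflicts)
    }
}
```

Two additions under different keys merge cleanly wherever they sit in the file; only a key that both sides changed to different values is reported as a conflict.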
Tree-sitter example: two contributors add unrelated `use` statements at
the same position in a Rust file. A line diff conflicts. The tree-sitter
handler treats the `use_declaration` list as an ordered set and merges
both additions, with no conflict.
The cascade is fail-safe: a tree-sitter handler that bails on a syntax
error falls through to the format-aware handler if applicable, then to
textual. The textual handler always merges — it might produce
conflicts, but it never fails to produce *some* output.
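The fall-through behaviour can be sketched with handlers that return `None` when they bail. Everything here is illustrative: the `Handler` signature, the `run_cascade` function, and both toy handlers are assumptions, and the "textual" stand-in works on whole files rather than lines.

```rust
/// A handler returns `None` when it cannot process the inputs (for example
/// on a parse error), which makes the cascade try the next one.
pub type Handler = fn(base: &str, ours: &str, theirs: &str) -> Option<String>;

/// Whole-file stand-in for the textual fallback: it always produces output,
/// with conflict markers when both sides changed differently.
pub fn textual(base: &str, ours: &str, theirs: &str) -> String {
    if ours == theirs || theirs == base {
        ours.to_string()
    } else if ours == base {
        theirs.to_string()
    } else {
        format!("<<<<<<< ours\n{}\n=======\n{}\n>>>>>>> theirs\n", ours, theirs)
    }
}

/// Toy "structured" handler: bails unless every input has balanced braces.
pub fn toy_structured(base: &str, ours: &str, theirs: &str) -> Option<String> {
    let parses = |s: &str| s.matches('{').count() == s.matches('}').count();
    if parses(base) && parses(ours) && parses(theirs) {
        Some(textual(base, ours, theirs))
    } else {
        None
    }
}

/// Tries handlers in the order given (most aggressive first) and falls back
/// to `textual`. Returns the name of the handler that produced the output.
pub fn run_cascade(
    handlers: &[(&'static str, Handler)],
    base: &str,
    ours: &str,
    theirs: &str,
) -> (&'static str, String) {
    for (name, handler) in handlers {
        if let Some(output) = handler(base, ours, theirs) {
            return (*name, output);
        }
    }
    ("textual", textual(base, ours, theirs))
}
```

The returned handler name is the piece of information that a merge record would need to capture for auditability.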
### 3.4 Hashing
Git uses SHA-1. SHAttered (2017) was a practical collision. The
SHA-256 transition is still incomplete in 2026 and is unlikely to ever
finish for the long tail of git infrastructure.
LeVCS uses BLAKE3 from day one. It is faster than SHA-256 in practice
(the benchmarks in `bench-results/` show ~5 GiB/s on a laptop for blob
serialize+hash), it is tree-hashed, and it carries no commitment to a
specific length-tag convention. Object IDs are 32 bytes everywhere.
### 3.5 Releases as objects
Git tags are refs that point to commits — or to tag objects, if you
remember to use `-a`. Either way, they are *names*, not artifacts. A
release in LeVCS is a signed object:
```
tree commit's root tree
predecessor commit being released
parent_release prior release in the chain (or zero)
authority authority hash at release time
declarer_key public key of the signing maintainer/owner
timestamp Unix micros
label "v1.0.0" or similar
notes release notes (UTF-8, up to 4 GiB)
```
The chain `parent_release → parent_release → ...` gives you a clean
release history independent of branch topology. The replica modes
above can replicate just releases (and their trees and authority) for
archive instances that don't need the inter-release commit history.
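Walking the release chain needs only the `parent_release` links. The sketch below assumes an in-memory map from object ID to a trimmed-down `Release` struct; both are hypothetical, and `Option` stands in for the "or zero" sentinel in the field list above.

```rust
use std::collections::HashMap;

pub type ObjectId = [u8; 32];

/// Only the fields needed to walk the chain (a subset of the real object).
pub struct Release {
    pub label: String,
    pub parent_release: Option<ObjectId>,
}

/// Returns release labels from `head` back to the first release.
pub fn release_history(store: &HashMap<ObjectId, Release>, head: ObjectId) -> Vec<String> {
    let mut labels = Vec::new();
    let mut cursor = Some(head);
    while let Some(id) = cursor {
        match store.get(&id) {
            Some(release) => {
                labels.push(release.label.clone());
                cursor = release.parent_release;
            }
            // Missing object: stop rather than guess at the rest of the chain.
            None => break,
        }
        // Defensive bound in case the map was built with a cycle.
        if labels.len() > store.len() {
            break;
        }
    }
    labels
}
```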
---
## 4. How you use it
### 4.1 Bootstrap
```sh
levcs key generate --label primary
levcs init --key primary
levcs track --all
levcs commit -m "initial import"
```
After `init`, `.levcs/` exists alongside your tree. The genesis authority
names your key as the sole Owner; the `repo_id` is fixed forever. After
`commit`, you have one commit on `refs/branches/main`.
### 4.2 Branch and merge
```sh
levcs branch feature/x
# ... edit files ...
levcs commit -m "wip on x"
levcs branch main # switch back
levcs merge feature/x
```
If the merge produces conflicts, drop into the resolution TUI:
```sh
levcs merge --resolve
```
The TUI shows each conflicted file with the ours/base/theirs panes the
handler emitted, plus the cascade decision (which handler ran, why it
fell through if it did). On accept, it writes the resolved file and a
signed `.levcs/merge-record` entry.
### 4.3 Release
```sh
levcs release v1.0.0 --notes "first release"
```
Writes a Release object with the current commit as `predecessor`, signs
it with your active key, and adds `refs/releases/v1.0.0`. If you've cut
prior releases, `parent_release` chains to the most recent one
automatically.
### 4.4 Federation
```sh
levcs instance --set https://levcs.example.com/levcs/v1
levcs push refs/branches/main
```
The first push to a fresh instance auto-inits the repo using your
genesis authority. Subsequent pushes are role-checked. Pulls are
public-read by default (the `public_read` policy bit on the genesis
authority).
To migrate to a new home:
```sh
levcs migrate https://new-host.example.com/levcs/v1 --set-active
```
`migrate` re-inits and replays the full history at the destination, then
points your local repo at it. The `repo_id` is unchanged — it's the
same project at a new location.
---
## 5. Operating an instance
A single binary, `levcs-instance`, reads a TOML config and listens on
HTTP. Production deployments terminate TLS at a reverse proxy; the
instance binds to localhost. See `deploy/README.md` for a full
walkthrough — systemd unit, Caddy and nginx examples, firewall, and the
laptop-side bootstrap.
The protocol surface is small:
```
GET /health
GET /levcs/v1/instance/info
GET /levcs/v1/instance/peers
GET /levcs/v1/repos/<repo_id>/info
GET /levcs/v1/repos/<repo_id>/refs
GET /levcs/v1/repos/<repo_id>/objects/<hash>
GET /levcs/v1/repos/<repo_id>/pack?have=...&want=...
POST /levcs/v1/repos/<repo_id>/init
POST /levcs/v1/repos/<repo_id>/push
```
That's it. No admin endpoints, no users-and-passwords table, no web UI
to firewall. POSTs require a signed `LeVCS-Signature` header
(Ed25519-over-canonical-request, with timestamp and nonce for replay
protection). GETs are public unless the genesis authority's policy
turned that off.
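The timestamp-and-nonce part of replay protection might be sketched as follows. This is not the instance's implementation: the `ReplayGuard` type, the 16-byte nonce, and the acceptance window are assumptions, and signature verification (which would have to happen first) is left out.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    Stale,
    Replayed,
}

/// Tracks nonces seen so far. A real server would also evict nonces once
/// their timestamps fall outside the window.
pub struct ReplayGuard {
    window_micros: u64,
    seen: HashSet<[u8; 16]>,
}

impl ReplayGuard {
    pub fn new(window_micros: u64) -> Self {
        ReplayGuard { window_micros, seen: HashSet::new() }
    }

    /// Accepts a request only if its timestamp is within the window around
    /// `now_micros` and its nonce has not been used before.
    pub fn check(
        &mut self,
        now_micros: u64,
        timestamp_micros: u64,
        nonce: [u8; 16],
    ) -> Result<(), Reject> {
        if now_micros.abs_diff(timestamp_micros) > self.window_micros {
            return Err(Reject::Stale);
        }
        // `insert` returns false when the nonce was already present.
        if !self.seen.insert(nonce) {
            return Err(Reject::Replayed);
        }
        Ok(())
    }
}
```

Checking the timestamp first keeps stale requests from filling the nonce set.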
Storage is a directory tree. Objects are written atomically via
temp-then-rename, and a per-repo mutex serializes pushes. A consistent
backup is just a snapshot of `/var/lib/levcs`.
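The temp-then-rename pattern can be sketched with the standard library alone. The function name, the temp-file naming scheme, and the flat directory layout are assumptions for illustration; the sketch relies on `rename` being atomic within one filesystem, and it skips the write when the object already exists because a content-addressed name implies identical bytes.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `dir/name` so that readers never observe a partial file.
pub fn write_atomic(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let final_path = dir.join(name);
    if final_path.exists() {
        // Content-addressed store: same name means same content.
        return Ok(());
    }
    // Temp file lives in the same directory so the rename stays on one filesystem.
    let tmp_path = dir.join(format!(".tmp-{}-{}", std::process::id(), name));
    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    // Flush data to disk before the file becomes visible under its final name.
    file.sync_all()?;
    fs::rename(&tmp_path, &final_path)
}
```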
---
## 6. What LeVCS isn't (yet)
The honest list of things you'd want for a full project home that LeVCS
does not provide:
- **Code review.** No PR object, no review threads, no comments. The
workflow spec coming next defines these.
- **Issue tracking.** Same — protocol substrate doesn't cover it.
- **CI integration.** No webhooks. CI systems would need to poll `/refs`
on a cadence, which is fine but not turnkey.
- **Web UI.** No branch browser, no diff view, no blame. These can be
built atop the existing GET endpoints; nothing in the protocol is
hostile to a UI, but none ship.
- **Search.** No `git grep` equivalent on the server side. Local-only.
- **Submodules / monorepo tooling.** No analog yet.
If your use case requires any of the above today, run LeVCS *in parallel*
with your existing platform. Forgejo, GitHub, or Gitea continues to host
the workflow; the LeVCS instance acts as a dogfood replica that receives
the same commits via a `push-both` wrapper. When the workflow surface
lands, the migration story flips.
---
## 7. What is true today (and how we know)
The repo at v0.1.0 has 194 passing tests covering:
- The full §2-§7 object model and protocol surface.
- A 14-scenario merge conformance corpus; in eight of the scenarios git
reports a false conflict and the cascade resolves cleanly.
- Property tests on the pack codec and object parsers (fuzz + structured
proptest round-trip).
- An end-to-end "dogfood" integration test that stands up three
instances (source-of-truth, peer, mirror), pushes a chain of commits
plus a release, replicates via mirror sync, migrates to the peer,
and asserts byte-for-byte object equality across all three.
A baseline microbenchmark suite is checked in (`scripts/bench.sh`). On
a Ryzen 7 laptop:
- Pack decode of a 10 × 1 MiB pack: ~2.3 ms (4.3 GiB/s).
- BLAKE3+serialize on 1 MiB blobs: ~190 µs (5.1 GiB/s).
- Textual three-way merge of a 100 KiB document: ~4.6 ms (~80 MiB/s).
- Encode is the bottleneck — zstd level 3 at ~380 MiB/s on
incompressible data.
Numbers are reproducible via `scripts/bench.sh --quick`.
---
## 8. Where the project goes next
The immediate roadmap, in order:
1. **Workflow spec** — the missing layer above. PR/review object,
discussion threads, CI hook conventions, web UI design. This is the
document the rest of v1 builds toward.
2. **Reference workflow tools** — a minimal web UI that reads the
federation API and lets you browse, review, and merge. Probably a
separate repo and process, not bundled into the instance.
3. **CI conventions** — a published webhook protocol so existing CI
systems can integrate without polling.
4. **Plugin handler examples** — a few real wasm handlers (e.g.
protobuf, SQL migrations) to validate the plugin protocol.
5. **Git import** — a one-way import path so existing projects can
adopt LeVCS without hand-replaying history.
If you're reading this because you might write that workflow spec: the
substrate guarantees you have are
(a) signed objects with a verifiable authority chain,
(b) per-file merge records that travel with each commit,
(c) a content-addressed object store that doesn't care what kind of
content it stores, and
(d) federation as a normal operating mode rather than a special case.
The workflow surface is free to use these as building blocks: a "PR" is
just an object kind we don't have yet, an "issue" is another, and the
storage modes already define how a CI system would replicate the
metadata it needs without pulling source.
---
## 9. Trying it
Build:
```sh
git clone <this repo>
cd <cloned directory>
cargo build --release
sudo install -m 0755 target/release/levcs target/release/levcs-instance /usr/local/bin/
```
Local single-machine tour:
```sh
levcs key generate --label me
levcs init --key me /tmp/demo
cd /tmp/demo
echo "hello" > a.txt
levcs track --all
levcs commit -m "first"
levcs log
```
Self-host: see `deploy/README.md`.
Read the spec: `spec/levcs-spec.pdf` (kept private until the workflow
spec lands; ask the maintainer for a copy).
Read the code: every crate is small and documented. `crates/levcs-core`
is the object model, `crates/levcs-merge` is the cascade,
`crates/levcs-instance` is the server, and `crates/levcs-cli` is the
user-facing tool.
---
*Comments and corrections welcome to the maintainer. The next document
in this series is the workflow spec.*