LeVCS: A Technical Report

Status: v0.1.0. Protocol substrate complete; workflow surface deferred.
Audience: engineers evaluating LeVCS for their own projects, or designing the workflow tooling that will sit on top of it.


TL;DR

  • LeVCS is a distributed version control system in the same lineage as git, fossil, pijul, and sapling — content-addressed objects, signed history, three-way merge.
  • Five things are different by design: identity is in the protocol, federation is a first-class concept, the merge engine is a cascading pipeline of format-aware handlers, hashes are BLAKE3, and releases are signed objects rather than ad-hoc tags.
  • It is a substrate, not a workflow tool. There is no PR review surface, no issue tracker, no CI integration, no web UI today. Those are the next layer up.
  • You can host an instance on a small VPS behind nginx or Caddy. The protocol terminates over HTTP; signing is at the application layer.
  • The codebase is small (~10 crates) and runs cargo test in under a minute on a laptop. 194 tests pass at v0.1.0; baseline benchmarks are in the repo.

1. Why a new VCS?

Git is the dominant DVCS. It is also a tool that grew up in 2005 around a hashing algorithm that had visible cracks (SHA-1) and a federation model that was, fundamentally, "your remote is a URL string." Twenty years on, the world git serves looks different:

  • Identity is no longer optional. Many projects need to know not just who claims to have authored a commit but who is authorized to alter the repo's history.
  • Replication is more complex than push/pull. Mirrors, archives, cold-storage replicas, and read-only forks are all common — and git treats them with the same primitives as the source-of-truth remote.
  • Merge conflicts are still mostly resolved at the line level. JSON, YAML, TOML, source code with semantic structure — the line-diff treatment is wrong for all of these and produces false conflicts on reformats every team has hit.
  • SHA-1 is broken; git's SHA-256 transition has been "in progress" for most of a decade.

LeVCS is an attempt at a clean restart that takes the DAG model and content addressing as obvious wins, and rebuilds identity, federation, merging, and hashing as protocol-level concerns rather than conventions or sidecar tools.


2. The shape of the system

LeVCS is layered:

┌───────────────────────────────────────────────────────┐
│ Workflow tools (TBD: review, issues, web UI)          │
├───────────────────────────────────────────────────────┤
│ CLI: `levcs init / commit / push / merge / release`    │
├───────────────────────────────────────────────────────┤
│ Federation HTTP API (instances, mirrors, releases)     │
├───────────────────────────────────────────────────────┤
│ Object model: Blob / Tree / Commit / Release / Authority │
│ Merge engine: textual → format-aware → tree-sitter     │
│ Trust root: signed authority chain (Ed25519)            │
│ Content addressing: BLAKE3                              │
└───────────────────────────────────────────────────────┘

Five object kinds, all content-addressed by their BLAKE3 digest:

  • Blob: raw file contents.
  • Tree: (name, type, mode, hash) entries; sorted, no duplicates.
  • Commit: tree + parents + authority + author key + message, signed.
  • Release: a first-class artifact (tree + predecessor commit + parent release + label + notes), signed by a maintainer or owner.
  • Authority: the membership document for a repo (who has what role), signed and chained.

A repository is the set of these objects plus a refs/ map (branches/*, releases/*, authority/{genesis,current}) indexing into them. The repo_id is the BLAKE3 of the genesis authority — globally unique by construction, no central registrar needed.
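The content-addressing idea above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard library has no BLAKE3, so BLAKE2b with a 32-byte digest stands in for it, and the byte framing, the tree serialization and the genesis bytes are all hypothetical rather than the LeVCS wire format.

```python
import hashlib

def object_id(kind: str, body: bytes) -> bytes:
    """Return a 32-byte ID for an object.

    Stand-in: BLAKE2b-256 replaces BLAKE3, and the kind + NUL + body
    framing is an assumption, not the LeVCS encoding.
    """
    return hashlib.blake2b(kind.encode() + b"\0" + body, digest_size=32).digest()

def serialize_tree(entries):
    """Serialize (name, type, mode, hash) entries: sorted by name, no duplicates."""
    names = [name for name, _type, _mode, _hash in entries]
    if len(set(names)) != len(names):
        raise ValueError("duplicate tree entry name")
    lines = []
    for name, type_, mode, hash_ in sorted(entries):
        lines.append(f"{mode:o} {type_} {hash_.hex()} {name}".encode())
    return b"\n".join(lines)

blob_id = object_id("blob", b"hello\n")

# The same entries in a different order serialize identically,
# so both trees get the same ID.
tree_a = serialize_tree([("b.txt", "blob", 0o644, blob_id),
                         ("a.txt", "blob", 0o644, blob_id)])
tree_b = serialize_tree([("a.txt", "blob", 0o644, blob_id),
                         ("b.txt", "blob", 0o644, blob_id)])

# Hypothetical genesis authority bytes; the real encoding is defined by the spec.
genesis = b"schema_version=1;members=[(pubkey-of-alice, alice, Owner)]"
repo_id = object_id("authority", genesis)
```

Because the ID depends only on the bytes, any clone that holds the same genesis authority derives the same repo_id without consulting a registrar.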


3. What's different from git

| Axis | git | LeVCS |
|---|---|---|
| Hash | SHA-1 (deprecated, transitioning) | BLAKE3 |
| Identity | Author string in commit | Signed authority object with explicit roles |
| Push authorization | Server-side hook or hosting platform | Protocol-level role check (Reader/Contributor/Maintainer/Owner) |
| Force-push rule | Server policy (off-protocol) | Protocol enforces maintainer-or-owner role |
| Federation | URL-bound remotes | Global repo_id + replicating instances |
| Mirror replication | git fetch --mirror (best-effort) | First-class with three storage modes |
| Tags / releases | Mutable string refs (often) | Signed objects with predecessor + parent_release chain |
| Merge granularity | Line-level (myers / patience) | Cascade: textual → format → tree-sitter → plugin |
| Merge audit | No artifact | .levcs/merge-record TOML, signed with the commit |
| Web UI / issues | Provided by hosting platform | Out of scope for v1 |

The rest of this section unpacks each axis.

3.1 Identity in the protocol, not on top

Git stores author and committer headers in each commit as plain Name <email> strings. There is nothing cryptographic about either. Signed commits are opt-in (GPG signing via commit -S since Git 1.7.9 in 2012, SSH signing since Git 2.34 in 2021), but even when signed they answer "did some key sign this?" rather than "is the signer authorized to write to this repo right now?"

LeVCS makes membership a first-class object. An authority body has:

schema_version  repo_id  previous_authority  version  created_micros
members:        [(public_key, handle, role, added_micros, added_by), ...]
policy:         [(key, value), ...]

Roles are a strict ordering: Reader < Contributor < Maintainer < Owner. Every commit references the authority hash that was current when it was signed. Updating membership is a versioned operation: you write a new authority object, signed by an Owner, with previous_authority pointing at the prior one. The instance walks the chain on push and rejects any push whose author key isn't a current member.

The practical consequence: "give Bob push access" is not a hosting-platform toggle. It is a signed authority update that travels in the repo and is auditable for the lifetime of the project.
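A minimal sketch of the role ordering and the chain walk may make the push check concrete. Several things here are assumptions: the signed_by field stands in for a real Ed25519 signature, the rule that versions increase by exactly one is a guess, and the Contributor minimum for an ordinary push is not stated in this report (the Maintainer minimum for force-push comes from the comparison table).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Role(IntEnum):
    READER = 0
    CONTRIBUTOR = 1
    MAINTAINER = 2
    OWNER = 3

@dataclass
class Authority:
    version: int
    members: dict                 # public_key -> Role
    signed_by: Optional[str]      # stand-in for a verified signature
    previous: Optional["Authority"] = None

def chain_is_valid(current: Authority) -> bool:
    """Walk back to genesis; each update must be signed by an Owner of its predecessor."""
    node = current
    while node.previous is not None:
        prev = node.previous
        if node.version != prev.version + 1:   # assumed versioning rule
            return False
        if prev.members.get(node.signed_by) != Role.OWNER:
            return False
        node = prev
    return True

def may_push(current: Authority, author_key: str, force: bool = False) -> bool:
    """Reject non-members, broken chains, and roles below the required minimum."""
    role = current.members.get(author_key)
    if role is None or not chain_is_valid(current):
        return False
    # CONTRIBUTOR as the minimum for a normal push is an assumption.
    needed = Role.MAINTAINER if force else Role.CONTRIBUTOR
    return role >= needed

genesis = Authority(1, {"alice": Role.OWNER}, signed_by=None)
with_bob = Authority(2, {"alice": Role.OWNER, "bob": Role.CONTRIBUTOR},
                     signed_by="alice", previous=genesis)
forged = Authority(2, {"mallory": Role.OWNER}, signed_by="mallory", previous=genesis)
```

In this sketch bob can push but not force-push, and the forged update is rejected because mallory was not an Owner in the authority it claims to extend.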

3.2 Federation, not "remotes"

A git remote is a URL plus some credentials. There is no fact of the matter about whether two URLs refer to the same repository: git can compare the commits it finds at each, but "same project" is a convention.

LeVCS has a global repo_id. It is the BLAKE3 of the genesis authority object, so two clones of the same project have the same repo_id even if they live on instances on opposite continents. An instance is a federation peer: it serves /levcs/v1/repos/<repo_id>/... endpoints and replicates state from other instances when configured to. Mirroring is the protocol's normal mode, not a git fetch --mirror cron job.

This composes with three storage modes (§4.3 of the spec):

  • Full — every reachable object. The source-of-truth instance.
  • Release — only release objects, their reachable trees and blobs, and the authority chain. Skips inter-release commits. For long-lived archive replicas.
  • Metadata — authority objects, release headers, signed refs only. No content. For "is this project still alive?" pings.

The instance enforces these on push: a release-mode replica refuses pushes that update branches; a metadata-mode replica refuses all pushes (it's populated by mirroring).
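The push rule for the three modes is small enough to write down directly. The sketch below assumes branch refs are recognizable by the refs/branches/ prefix used elsewhere in this report; the function and enum names are illustrative.

```python
from enum import Enum

class StorageMode(Enum):
    FULL = "full"
    RELEASE = "release"
    METADATA = "metadata"

def accepts_push(mode: StorageMode, ref: str) -> bool:
    """Decide whether an instance in the given mode accepts a push to ref."""
    if mode is StorageMode.METADATA:
        # Populated by mirroring only; every push is refused.
        return False
    if mode is StorageMode.RELEASE:
        # Release replicas refuse pushes that update branches.
        return not ref.startswith("refs/branches/")
    return True
```

For example, accepts_push(StorageMode.RELEASE, "refs/releases/v1.0.0") is accepted while the same call with "refs/branches/main" is refused.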

3.3 The merge cascade

This is the technical centerpiece.

A traditional three-way merge — git, mercurial, fossil — works at the line level. It is correct for prose and acceptable for code, but it generates false conflicts on:

  • Reformats (linters, prettifiers, whitespace-policy bumps).
  • Key reorderings in JSON / YAML / TOML.
  • Imports lists in source files that two branches both edited.
  • Markdown files where two contributors modified disjoint sections of the same paragraph.

LeVCS dispatches per-file to a handler cascade ranked by aggressiveness:

rank 0  textual           universal line-level fallback
rank 1  format-aware      json | yaml | toml | xml | markdown | prose
rank 2  tree-sitter       rust | python | js | ts | go | c | cpp |
                          java | ruby | bash
rank 3  plugin            wasm-sandboxed, user-supplied

A repo's .levcs/merge.toml maps glob patterns to handlers. Per-user .levcs/merge.local.toml can demote but never promote, so a distrusted plugin can be locally turned off without a repo edit. Each merged file produces a FileRecord in .levcs/merge-record listing the handler used and its hash; the merge-record blob is committed alongside the resolved tree, so every merge in history is auditable.
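The demote-but-never-promote rule can be sketched as a rank comparison. The rule format here, ordered (glob, tier) pairs with first match winning, is an assumption for illustration; the actual merge.toml schema is not shown in this report.

```python
from fnmatch import fnmatch

# Ranks from the cascade table above.
RANK = {"textual": 0, "format-aware": 1, "tree-sitter": 2, "plugin": 3}

def pick_handler(path, repo_rules, local_rules=()):
    """Return the handler tier for path.

    repo_rules and local_rules are ordered (glob, tier) pairs; the first
    matching glob wins (an assumed convention).
    """
    def first_match(rules):
        for pattern, tier in rules:
            if fnmatch(path, pattern):
                return tier
        return None

    chosen = first_match(repo_rules) or "textual"
    local = first_match(local_rules)
    # A local override may only demote: ignore it unless it ranks lower.
    if local is not None and RANK[local] < RANK[chosen]:
        chosen = local
    return chosen

repo_rules = [("*.json", "format-aware"), ("*.rs", "tree-sitter"), ("*.proto", "plugin")]
```

With these rules, a local entry mapping *.proto to textual turns the plugin off for that user, while a local entry mapping *.json to plugin has no effect.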

Format-aware example: package.json where Alice adds a dependency at the top of dependencies and Bob adds one at the bottom. In a short dependencies block the two edits land in adjacent or overlapping hunks (Bob's addition also has to append a comma to the previous last entry), so git reports a conflict. The JSON handler parses both sides, computes the structural diff, and merges them: both new entries appear in the output, with no conflict.

Tree-sitter example: two contributors add unrelated use statements to a Rust file. Line diff conflicts. Tree-sitter handler treats the use_declaration list as an ordered set, merges both additions, no conflict.

The cascade is fail-safe: a tree-sitter handler that bails on a syntax error falls through to the format-aware handler if applicable, then to textual. The textual handler always merges — it might produce conflicts, but it never fails to produce some output.
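The fall-through behaviour and the package.json case can be sketched together. This is not the LeVCS merge engine: the JSON handler below is a plain key-wise three-way merge, the textual handler is a whole-file stand-in rather than a line-level merge, and having the JSON handler fall through on a genuine conflict (not only on a parse error) is an assumption.

```python
import json

_MISSING = object()   # marks a key absent on one side

class Conflict(Exception):
    pass

def merge_value(base, ours, theirs):
    """Three-way merge of parsed JSON values; raises Conflict on a real clash."""
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    if isinstance(ours, dict) and isinstance(theirs, dict):
        base = base if isinstance(base, dict) else {}
        merged = {}
        for key in list(ours) + [k for k in theirs if k not in ours]:
            value = merge_value(base.get(key, _MISSING),
                                ours.get(key, _MISSING),
                                theirs.get(key, _MISSING))
            if value is not _MISSING:      # drop keys deleted by one side
                merged[key] = value
        return merged
    raise Conflict("both sides changed the same value")

def json_handler(base, ours, theirs):
    """Format-aware tier: merged text, or None to fall through."""
    try:
        merged = merge_value(json.loads(base), json.loads(ours), json.loads(theirs))
    except (ValueError, Conflict):         # JSONDecodeError is a ValueError
        return None
    return json.dumps(merged, indent=2) + "\n"

def textual_handler(base, ours, theirs):
    """Fallback tier: always returns output, possibly with conflict markers."""
    if ours == theirs or theirs == base:
        return ours
    if ours == base:
        return theirs
    return f"<<<<<<< ours\n{ours}\n=======\n{theirs}\n>>>>>>> theirs\n"

def merge_file(base, ours, theirs, handlers=(json_handler, textual_handler)):
    """Try handlers from most to least aggressive; report which one produced output."""
    for handler in handlers:
        result = handler(base, ours, theirs)
        if result is not None:
            return handler.__name__, result

base = '{"dependencies": {"left-pad": "1.0.0"}}'
alice = '{"dependencies": {"alpha": "2.0.0", "left-pad": "1.0.0"}}'
bob = '{"dependencies": {"left-pad": "1.0.0", "zeta": "3.0.0"}}'
used, text = merge_file(base, alice, bob)
```

Here used is "json_handler" and the merged text contains all three dependencies. Passing unparseable input instead makes the JSON tier return None, and the textual tier still produces output.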

3.4 Hashing

Git uses SHA-1. SHAttered (2017) was a practical collision. The SHA-256 transition is still incomplete in 2026 and is unlikely to ever finish for the long tail of git infrastructure.

LeVCS uses BLAKE3 from day one. It is faster than SHA-256 in practice (the benchmarks in bench-results/ show ~5 GiB/s on a laptop for blob serialize+hash), it is tree-hashed internally, and it makes no commitment to a specific length-tag convention. Object IDs are 32 bytes everywhere.

3.5 Releases as objects

Git tags are refs that point to commits — or to tag objects, if you remember to use -a. Either way, they are names, not artifacts. A release in LeVCS is a signed object:

tree            commit's root tree
predecessor     commit being released
parent_release  prior release in the chain (or zero)
authority       authority hash at release time
declarer_key    public key of the signing maintainer/owner
timestamp       Unix micros
label           "v1.0.0" or similar
notes           release notes (UTF-8, up to 4 GiB)

The chain parent_release → parent_release → ... gives you a clean release history independent of branch topology. The replica modes above can replicate just releases (and their trees and authority) for archive instances that don't need the inter-release commit history.
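Walking the release chain is a simple linked-list traversal. The sketch below assumes that "or zero" means an all-zero 32-byte hash and models the object store as a dictionary; only the fields needed for the walk are included.

```python
from dataclasses import dataclass

ZERO = bytes(32)  # assumed encoding of "no parent release"

@dataclass
class Release:
    label: str
    predecessor: bytes      # commit being released
    parent_release: bytes   # prior release in the chain, or ZERO

def release_history(store, head):
    """Yield release labels from newest to oldest by following parent_release."""
    current = head
    while current != ZERO:
        release = store[current]
        yield release.label
        current = release.parent_release

# Hypothetical object IDs, for illustration only.
r1, r2 = b"\x01" * 32, b"\x02" * 32
store = {
    r1: Release("v1.0.0", predecessor=b"\xaa" * 32, parent_release=ZERO),
    r2: Release("v1.1.0", predecessor=b"\xbb" * 32, parent_release=r1),
}
labels = list(release_history(store, r2))
```

The walk never touches commits, which is why a release-mode replica can serve the full release history without the inter-release commit graph.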


4. How you use it

4.1 Bootstrap

levcs key generate --label primary
levcs init --key primary
levcs track --all
levcs commit -m "initial import"

After init, .levcs/ exists alongside your tree. The genesis authority names your key as the sole Owner; the repo_id is fixed forever. After commit, you have one commit on refs/branches/main.

4.2 Branch and merge

levcs branch feature/x
# ... edit files ...
levcs commit -m "wip on x"
levcs branch main          # switch back
levcs merge feature/x

If the merge produces conflicts, drop into the resolution TUI:

levcs merge --resolve

The TUI shows each conflicted file with the ours/base/theirs panes the handler emitted, plus the cascade decision (which handler ran, why it fell through if it did). On accept, it writes the resolved file and a signed .levcs/merge-record entry.

4.3 Release

levcs release v1.0.0 --notes "first release"

Writes a Release object with the current commit as predecessor, signs it with your active key, and adds refs/releases/v1.0.0. If you've cut prior releases, parent_release chains to the most recent one automatically.

4.4 Federation

levcs instance --set https://levcs.example.com/levcs/v1
levcs push refs/branches/main

The first push to a fresh instance auto-inits the repo using your genesis authority. Subsequent pushes are role-checked. Pulls are public-read by default (the public_read policy bit on the genesis authority).

To migrate to a new home:

levcs migrate https://new-host.example.com/levcs/v1 --set-active

migrate re-inits and replays the full history at the destination, then points your local repo at it. The repo_id is unchanged — it's the same project at a new location.


5. Operating an instance

A single binary, levcs-instance, reads a TOML config and listens on HTTP. Production deployments terminate TLS at a reverse proxy; the instance binds to localhost. See deploy/README.md for a full walkthrough — systemd unit, Caddy and nginx examples, firewall, and the laptop-side bootstrap.

The protocol surface is small:

GET  /health
GET  /levcs/v1/instance/info
GET  /levcs/v1/instance/peers
GET  /levcs/v1/repos/<repo_id>/info
GET  /levcs/v1/repos/<repo_id>/refs
GET  /levcs/v1/repos/<repo_id>/objects/<hash>
GET  /levcs/v1/repos/<repo_id>/pack?have=...&want=...
POST /levcs/v1/repos/<repo_id>/init
POST /levcs/v1/repos/<repo_id>/push

That's it. No admin endpoints, no users-and-passwords table, no web UI to firewall. POSTs require a signed LeVCS-Signature header (Ed25519-over-canonical-request, with timestamp and nonce for replay protection). GETs are public unless the genesis authority's policy turned that off.
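The timestamp-and-nonce part of that header can be illustrated separately from the signature. The sketch below covers replay protection only: signature verification is out of scope (Ed25519 is not in the Python standard library), and the class name, the 300-second window and the in-memory nonce table are all hypothetical choices, not values from the spec.

```python
class ReplayGuard:
    """Accept each (timestamp, nonce) pair at most once within a freshness window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # nonce -> timestamp of the request that used it

    def check(self, timestamp, nonce, now):
        """Return True for a fresh, unseen request; False for stale or replayed ones."""
        # Forget nonces whose timestamps could no longer pass the freshness test.
        self.seen = {n: t for n, t in self.seen.items() if now - t <= self.window}
        if abs(now - timestamp) > self.window:
            return False            # too old, or too far in the future
        if nonce in self.seen:
            return False            # replay inside the window
        self.seen[nonce] = timestamp
        return True

guard = ReplayGuard(window_seconds=300)
first = guard.check(timestamp=1000, nonce="n1", now=1000)
replay = guard.check(timestamp=1000, nonce="n1", now=1001)
stale = guard.check(timestamp=1000, nonce="n2", now=2000)
```

Here first is True, replay is False and stale is False. Dropping a nonce is safe once its timestamp has left the window, because a later replay of that request would fail the freshness test anyway.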

Storage is a directory tree. Per-object atomic writes via temp-then-rename, per-repo serializing mutex on push. A consistent backup is just a snapshot of /var/lib/levcs.
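Temp-then-rename is a standard pattern and can be sketched with the standard library. The directory layout below (a two-character fan-out directory under the root) is a hypothetical choice for illustration and is not taken from the LeVCS code; the per-repo mutex is not shown.

```python
import os
import tempfile

def write_object(root, object_id_hex, data):
    """Write data atomically: readers see either no file or the complete file."""
    # Hypothetical layout: root/<first two hex chars>/<remaining hex chars>.
    directory = os.path.join(root, object_id_hex[:2])
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, object_id_hex[2:])

    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())          # push bytes to disk before the rename
        os.replace(tmp_path, final_path)    # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A crash before os.replace leaves at most a stray temp file, never a truncated object, which is what makes a plain directory snapshot a usable backup.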


6. What LeVCS isn't (yet)

The honest list of things you'd want for a full project home that LeVCS does not provide:

  • Code review. No PR object, no review threads, no comments. The workflow spec coming next defines these.
  • Issue tracking. Same — protocol substrate doesn't cover it.
  • CI integration. No webhooks. CI systems would need to poll /refs on a cadence, which is fine but not turnkey.
  • Web UI. No branch browser, no diff view, no blame. These can be built atop the existing GET endpoints; nothing in the protocol is hostile to a UI, but none ship.
  • Search. No git grep equivalent on the server side. Local-only.
  • Submodules / monorepo tooling. No analog yet.

If your use case requires any of the above today, run LeVCS in parallel with your existing platform. Forgejo, GitHub, or Gitea continues to host the workflow; the LeVCS instance acts as a dogfood replica that gets the same commits via a push-both wrapper. When the workflow surface lands, the migration story flips.


7. What is true today (and how we know)

The repo at v0.1.0 has 194 passing tests covering:

  • The full §2-§7 object model and protocol surface.
  • A merge conformance corpus of 14 scenarios, eight of which are cases where git reports a false conflict and the cascade resolves cleanly.
  • Property tests on the pack codec and object parsers (fuzz + structured proptest round-trip).
  • An end-to-end "dogfood" integration test that stands up three instances (source-of-truth, peer, mirror), pushes a chain of commits plus a release, replicates via mirror sync, migrates to the peer, and asserts byte-for-byte object equality across all three.

A baseline microbenchmark suite is checked in (scripts/bench.sh). On a Ryzen 7 laptop:

  • Pack decode of a 10 × 1 MiB pack: ~2.3 ms (4.3 GiB/s).
  • BLAKE3+serialize on 1 MiB blobs: ~190 µs (5.1 GiB/s).
  • Textual three-way merge of a 100 KiB document: ~4.6 ms (~80 MiB/s).
  • Encode is the bottleneck — zstd level 3 at ~380 MiB/s on incompressible data.

Numbers are reproducible via scripts/bench.sh --quick.
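As a sanity check, the throughput figures can be recomputed from the quoted sizes and times. The helper below is just size divided by time; which bytes each benchmark counts is an assumption on the reader's side, since the report does not say.

```python
KiB, MiB, GiB = 1024, 1024**2, 1024**3

def throughput(num_bytes, seconds):
    """Bytes per second for a single timed operation."""
    return num_bytes / seconds

pack_decode = throughput(10 * MiB, 2.3e-3) / GiB     # about 4.2 GiB/s
blob_hash = throughput(1 * MiB, 190e-6) / GiB        # about 5.1 GiB/s
merge_single = throughput(100 * KiB, 4.6e-3) / MiB   # about 21 MiB/s
```

The first two agree with the quoted GiB/s figures to within rounding of the timings. The merge line works out to roughly 21 MiB/s for a single 100 KiB input, so the quoted ~80 MiB/s presumably counts more bytes per merge than one input (for instance all sides of the three-way merge); that reading is a guess, not something this report states.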


8. Where the project goes next

The immediate roadmap, in order:

  1. Workflow spec — the missing layer above. PR/review object, discussion threads, CI hook conventions, web UI design. This is the document the rest of v1 builds toward.
  2. Reference workflow tools — a minimal web UI that reads the federation API and lets you browse, review, and merge. Probably a separate repo and process, not bundled into the instance.
  3. CI conventions — a published webhook protocol so existing CI systems can integrate without polling.
  4. Plugin handler examples — a few real wasm handlers (e.g. protobuf, SQL migrations) to validate the plugin protocol.
  5. Git import — a one-way import path so existing projects can adopt LeVCS without hand-replaying history.

If you're reading this because you might write that workflow spec: the substrate guarantees you have are (a) signed objects with a verifiable authority chain, (b) per-file merge records that travel with each commit, (c) a content-addressed object store that doesn't care what kind of content it stores, and (d) federation as a normal operating mode rather than a special case. Workflow surface is free to use these as building blocks — a "PR" is just an object kind we don't have yet, an "issue" is another, and the storage modes already define how a CI system would replicate the metadata it needs without pulling source.


9. Trying it

Build:

git clone <this repo>
cd <clone directory>    # placeholder: the directory the clone created
cargo build --release
sudo install -m 0755 target/release/levcs target/release/levcs-instance /usr/local/bin/

Local single-machine tour:

levcs key generate --label me
levcs init --key me /tmp/demo
cd /tmp/demo
echo "hello" > a.txt
levcs track --all
levcs commit -m "first"
levcs log

Self-host: see deploy/README.md.

Read the spec: spec/levcs-spec.pdf (kept private until the workflow spec lands; ask the maintainer for a copy).

Read the code: every crate is small and documented. crates/levcs-core is the object model, crates/levcs-merge is the cascade, crates/levcs-instance is the server, crates/levcs-cli is the user-facing tool.


Comments and corrections are welcome; please send them to the maintainer. The next document in this series is the workflow spec.