docs(web/snix/castore): port "why not git" document

Change-Id: I4ac77f264d2704018cd33dbf80746e82d193686e
Reviewed-on: https://cl.snix.dev/c/snix/+/30315
Reviewed-by: Domen Kožar <domen@cachix.org>
Tested-by: besadii
Autosubmit: Florian Klink <flokli@flokli.de>
This commit is contained in:
Florian Klink 2025-04-12 19:38:28 +02:00 committed by clbot
parent b2d2d622e0
commit c9082a586c
3 changed files with 23 additions and 15 deletions

View file

@ -82,7 +82,7 @@ name MUST be unique across all three lists.
[rustdoc-node]: https://snix.dev/rustdoc/snix_castore/enum.Node.html
[rustdoc-pathcomponent]: https://snix.dev/rustdoc/snix_castore/struct.PathComponent.html
[BLAKE3]: https://github.com/BLAKE3-team/BLAKE3
[^why-not-git-trees]: For a detailed comparison with the git model, and what (and why we do differently, see TODO LINK)
[^why-not-git-trees]: For a detailed comparison with the git model, and what (and why we do differently, see [here]({{< relref "why-not-git.md" >}}))
[^directory-digest]: We currently use the [BLAKE3][] digest of the protobuf
serialization of the `proto::Directory` struct to calculate
these digests. While pretty stable across most

View file

@ -0,0 +1,66 @@
---
title: "Why not Git?"
summary: ""
date: 2025-04-04T16:16:37+00:00
lastmod: 2025-04-04T16:16:37+00:00
draft: false
weight: 42
toc: false
---
We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adapted our own [Data Model][castore-data-model].
While castore shares some similarities with the format used in git for trees and
objects, the git one has shown some significant disadvantages:
### The binary encoding itself
#### git trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.
Tree objects are a combination of null-terminated strings, and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of the format/changes are very hard to do right, because parsers are
not aware they might be parsing something different.
The [Snix Castore Data Model][castore-data-model] uses a canonical protobuf
serialization, and uses the [blake3][blake3] hash of that serialization to point
to other `Directory` messages.
It's both compact and with a wide range of libraries for encoders and decoders
in many programming languages.
The choice of protobuf makes it easy to add new fields, and make old clients
aware of some unknown fields being detected [^adding-fields].
#### git blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git specific to request for.
The [Snix Castore Data Model][castore-data-model] simply uses the
[blake3][blake3] hash of the literal contents when referring to a file/blob,
which makes it very easy to ask other data sources for the same data, as no
git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large)blobs by their blake3 hash
over the IPFS network.
In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.
The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] also has been dead for some years now,
and it's unclear what a "blake3" version of this would even look like.
[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[castore-data-model]: {{< relref "data-model.md" >}}
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.