docs(web/snix/castore): port "why not git" document

Change-Id: I4ac77f264d2704018cd33dbf80746e82d193686e
Reviewed-on: https://cl.snix.dev/c/snix/+/30315
Reviewed-by: Domen Kožar <domen@cachix.org>
Tested-by: besadii
Autosubmit: Florian Klink <flokli@flokli.de>
2025-04-12 19:38:28 +02:00
commit c9082a586c
parent b2d2d622e0
3 changed files with 23 additions and 15 deletions
--- a/web/content/docs/components/castore/data-model.md
+++ b/web/content/docs/components/castore/data-model.md
@@ -82,7 +82,7 @@ name MUST be unique across all three lists.
 [rustdoc-node]: https://snix.dev/rustdoc/snix_castore/enum.Node.html
 [rustdoc-pathcomponent]: https://snix.dev/rustdoc/snix_castore/struct.PathComponent.html
 [BLAKE3]: https://github.com/BLAKE3-team/BLAKE3
-[^why-not-git-trees]: For a detailed comparison with the git model, and what (and why we do differently, see TODO LINK)
+[^why-not-git-trees]: For a detailed comparison with the git model, and what we do differently (and why), see [here]({{< relref "why-not-git.md" >}}).
 [^directory-digest]: We currently use the [BLAKE3][] digest of the protobuf
                     serialization of the `proto::Directory` struct to calculate
                     these digests. While pretty stable across most
--- a/web/content/docs/components/castore/why-not-git.md
+++ b/web/content/docs/components/castore/why-not-git.md
@@ -0,0 +1,66 @@
+---
+title: "Why not Git?"
+summary: ""
+date: 2025-04-04T16:16:37+00:00
+lastmod: 2025-04-04T16:16:37+00:00
+draft: false
+weight: 42
+toc: false
+---
+
+We've been experimenting with (some variations of) the git tree and object
+format, but ultimately decided against using it as an internal format, and
+instead adopted our own [Data Model][castore-data-model].
+
+While castore shares some similarities with the format used in git for trees and
+objects, the git one has shown some significant disadvantages:
+
+### The binary encoding itself
+
+#### git trees
+The git tree object format is a terse binary format that is error-prone to
+parse and very much "made-to-be-read-and-written-from-C".
+
+Tree objects are a combination of null-terminated strings and fields of known
+length. References to other tree objects use the literal sha1 hash of another
+tree object in this encoding.
+Extending or changing the format is very hard to do right, because parsers
+have no way to tell that they might be parsing something different.
+
+The [Snix Castore Data Model][castore-data-model] uses a canonical protobuf
+serialization, and uses the [blake3][blake3] hash of that serialization to point
+to other `Directory` messages.
+It's compact, and encoder and decoder libraries are available for many
+programming languages.
+The choice of protobuf makes it easy to add new fields, and lets old clients
+detect that a message contains fields they don't know about [^adding-fields].
+
+#### git blob
+On disk, git blob objects start with a "blob" prefix, then the size of the
+payload, and then the data itself. The hash of a blob is the literal sha1sum
+over all of this, which makes it a very git-specific thing to ask for.
+
+The [Snix Castore Data Model][castore-data-model] simply uses the
+[blake3][blake3] hash of the literal contents when referring to a file/blob,
+which makes it very easy to ask other data sources for the same data, as no
+git-specific payload is included in the hash.
+This also plays very well together with things like [iroh][iroh-discussion],
+which plans to provide a way to substitute (large) blobs by their blake3 hash
+over the IPFS network.
+
+In addition to that, [blake3][blake3] makes it possible to do
+[verified streaming][bao], as already described in other parts of the
+documentation.
+
+The git tree object format uses sha1 both for references to other trees and
+hashes of blobs, which isn't really a hash function to fundamentally base
+everything on in 2023.
+The [migration to sha256][git-sha256] has also been dead for some years now,
+and it's unclear what a "blake3" version of this would even look like.
+
+[bao]: https://github.com/oconnor663/bao
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[castore-data-model]: {{< relref "data-model.md" >}}
+[git-sha256]: https://git-scm.com/docs/hash-function-transition/
+[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
+[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.
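
To make the git encodings described in `why-not-git.md` concrete, here is a minimal sketch in Python using only the standard library. It is not part of the change and not Snix code; the function names `git_blob_id` and `parse_git_tree` are hypothetical. It assumes the loose-object framing `blob <size>\0<payload>` for blobs and the entry layout `<mode> <name>\0<20 raw sha1 bytes>` for (already decompressed) tree bodies.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """SHA-1 over git's framing: b"blob <size>" + NUL, followed by the payload."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def parse_git_tree(body: bytes) -> list[tuple[str, bytes, str]]:
    """Split a raw tree body into (mode, name, sha1 hex) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)       # mode: ASCII octal, ends at a space
        nul = body.index(b"\0", space + 1)  # name: ends at a NUL byte
        digest = body[nul + 1 : nul + 21]   # reference: 20 raw bytes, no delimiter
        if len(digest) != 20:
            raise ValueError("truncated tree entry")
        mode = body[pos:space].decode("ascii")
        name = body[space + 1 : nul]
        entries.append((mode, name, digest.hex()))
        pos = nul + 21
    return entries
```

Calling `git_blob_id(b"hello world\n")` gives the identifier git would use for that file, which differs from `hashlib.sha1(b"hello world\n").hexdigest()` because of the prefix. That is the git-specific part: another data source that only knows a digest of the file contents cannot be asked for the blob by this identifier. Note also that `parse_git_tree` has nothing to check a format version against; it can only assume the layout above.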