docs(web/snix/castore): port castore data model

Also restructure this, explaining the Rust data types, and then
explaining the differences with the proto implementation, which uses
"entry" since cl/30296.

Change-Id: Ie264ab60998f0d891b4a4ea680a2d9dbe1c5929e
Reviewed-on: https://cl.snix.dev/c/snix/+/30314
Autosubmit: Florian Klink <flokli@flokli.de>
Reviewed-by: Domen Kožar <domen@cachix.org>
Tested-by: besadii
Florian Klink 2025-04-12 19:24:31 +02:00 committed by clbot
parent d1990c9a93
commit b2d2d622e0
4 changed files with 102 additions and 51 deletions


@ -21,7 +21,6 @@
- [Store API](./store/api.md)
- [BlobStore Chunking](./castore/blobstore-chunking.md)
- [BlobStore Protocol](./castore/blobstore-protocol.md)
- [Data Model](./castore/data-model.md)
- [Why not git trees?](./castore/why-not-git-trees.md)
# Nix


@ -1,50 +0,0 @@
# Data model
This provides some more notes on the fields used in castore.proto.
See [Store API](../store/api.md) for the full context.
## Directory message
`Directory` messages use the blake3 hash of their canonical protobuf
serialization as their identifier.
A `Directory` message contains three lists, `directories`, `files` and
`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
respectively. They describe all the direct child elements that are contained in
a directory.
All three message types have a `name` field, specifying the (base)name of the
element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..').
For reproducibility reasons, the lists MUST be sorted by that name and the
name MUST be unique across all three lists.
In addition to the `name` field, the various *Node messages have the following
fields:
## DirectoryNode
A `DirectoryNode` message represents a child directory.
It has a `digest` field, which points to the identifier of another `Directory`
message, making a `Directory` a merkle tree (or strictly speaking, a graph, as
two elements pointing to a child directory with the same contents would point
to the same `Directory` message).
There's also a `size` field, containing the (total) number of all child
elements in the referenced `Directory`, which helps for inode calculation.
## FileNode
A `FileNode` message represents a child (regular) file.
Its `digest` field contains the blake3 hash of the file contents. It can be
looked up in the `BlobService`.
The `size` field contains the size of the blob the `digest` field refers to.
The `executable` field specifies whether the file should be marked as
executable or not.
## SymlinkNode
A `SymlinkNode` message represents a child symlink.
In addition to the `name` field, the only additional field is the `target`,
which is a string containing the target of the symlink.


@ -0,0 +1,10 @@
---
title: "Castore"
description: ""
summary: ""
date: 2025-04-04T16:43:14+01:00
lastmod: 2025-04-04T16:43:14+01:00
draft: false
weight: 42
---


@ -0,0 +1,92 @@
---
title: "Data Model"
summary: ""
date: 2025-04-04T16:16:37+00:00
lastmod: 2025-04-04T16:16:37+00:00
draft: false
weight: 41
toc: true
---
This describes the data model used in `snix-castore` to describe file system
trees. Blob / chunk storage is covered by other documents.
For those familiar with it, `snix-castore` uses a concept similar to git tree
objects, which are also a merkle structure. [^why-not-git-trees]
## [Node][rustdoc-node]
`snix-castore` can represent three different types of nodes.
Nodes themselves don't have names; a node gets its name from its entry in a
[Directory](#directory) structure.
### `Node::File`
A (regular) file.
We store the [BLAKE3] digest of the raw file contents, the length of the raw
data, and an executable bit.
### `Node::Symlink`
A symbolic link.
We store the symlink target contents.
### `Node::Directory`
A (child) directory.
We store the digest of the [Directory](#directory) structure describing its
"contents".
We also store a `size` field, containing the (total) number of all child
elements in the referenced `Directory`, which helps for inode calculation.
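The three node kinds above can be sketched as a Rust enum. This is a minimal illustration, not the actual `snix_castore::Node` definition: the field names, the plain `[u8; 32]` digest type (BLAKE3's default output length), the byte-vector symlink target and the `describe` helper are all assumptions made for this example.

```rust
/// Hypothetical, simplified stand-in for the castore node type.
/// Field names and types are assumptions for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Node {
    /// A regular file: digest of the raw contents, their length, and an executable bit.
    File { digest: [u8; 32], size: u64, executable: bool },
    /// A symbolic link: only the target is stored.
    Symlink { target: Vec<u8> },
    /// A child directory: digest of the referenced Directory, plus the
    /// (total) number of child elements in it.
    Directory { digest: [u8; 32], size: u64 },
}

/// Returns a short human-readable description of a node.
/// Note that no name is involved: nodes are unnamed on their own.
fn describe(node: &Node) -> String {
    match node {
        Node::File { size, executable, .. } => {
            format!("file, {size} bytes, executable={executable}")
        }
        Node::Symlink { target } => {
            format!("symlink to {}", String::from_utf8_lossy(target))
        }
        Node::Directory { size, .. } => {
            format!("directory, {size} child elements in total")
        }
    }
}

fn main() {
    let nodes = [
        Node::File { digest: [0u8; 32], size: 5, executable: true },
        Node::Symlink { target: b"../lib".to_vec() },
        Node::Directory { digest: [1u8; 32], size: 3 },
    ];
    for node in &nodes {
        println!("{}", describe(node));
    }
}
```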
## [Directory][rustdoc-directory]
The Directory struct contains all nodes in a single directory (on that level),
along with their (base)names (called [PathComponent](#pathcomponent)).
`.` and `..` are not included.
For the Directory struct, a *Digest* can be calculated[^directory-digest], which
is what the parent `Node::Directory` will use as a reference, to build a merkle
structure.
## [PathComponent][rustdoc-pathcomponent]
This is a stricter version of bytes, reduced to valid path components in a
[Directory](#directory).
It disallows slashes, null bytes, `.`, `..` and the
empty string. It also rejects names that are too long (> 255 bytes).
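The rules above can be written down as a small validation function. This is a sketch under assumptions: `validate_name`, `NameError` and `MAX_NAME_LEN` are hypothetical names chosen for this example and are not the `PathComponent` API, and the order in which the checks run is arbitrary.

```rust
/// Maximum accepted length in bytes (names longer than this are rejected).
const MAX_NAME_LEN: usize = 255;

/// Hypothetical error type listing the reasons a name is rejected.
#[derive(Debug, PartialEq, Eq)]
enum NameError {
    Empty,
    DotOrDotDot,
    ContainsSlash,
    ContainsNull,
    TooLong,
}

/// Checks whether `name` satisfies the constraints described above.
fn validate_name(name: &[u8]) -> Result<(), NameError> {
    match name {
        b"" => return Err(NameError::Empty),
        b"." | b".." => return Err(NameError::DotOrDotDot),
        _ => {}
    }
    if name.contains(&b'/') {
        return Err(NameError::ContainsSlash);
    }
    if name.contains(&0u8) {
        return Err(NameError::ContainsNull);
    }
    if name.len() > MAX_NAME_LEN {
        return Err(NameError::TooLong);
    }
    Ok(())
}

fn main() {
    let candidates: [&[u8]; 4] = [b"hello.txt", b"..", b"a/b", b""];
    for name in candidates {
        println!("{:?} -> {:?}", String::from_utf8_lossy(name), validate_name(name));
    }
}
```

Names are treated as bytes rather than UTF-8 strings here, matching the description of the type as a stricter version of bytes.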
## Merkle DAG
The pointers from `Node::Directory` to `Directory`, with the latter potentially
containing further `Node::Directory` entries, make the whole structure a merkle
tree (or strictly speaking, a graph, as two elements pointing to a child
directory with the same contents would point to the same `Directory`).
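The sharing of identical child directories can be illustrated with a toy content-addressed store. This sketch makes several loud assumptions: it uses the standard library's `DefaultHasher` (a 64-bit, non-cryptographic hash) as a stand-in for the real digest, and `Digest`, `Directory`, `digest_of` and `put` are hypothetical names that do not correspond to the `snix-castore` API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Stand-in digest type. NOT BLAKE3; only used to keep the example std-only.
type Digest = u64;

/// Simplified node type (assumption, see lead-in).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Node {
    File { digest: Digest, size: u64, executable: bool },
    Symlink { target: String },
    Directory { digest: Digest, size: u64 },
}

/// A directory maps names to nodes; BTreeMap keeps the names sorted.
type Directory = BTreeMap<String, Node>;

/// Computes the stand-in digest of a directory from its names and nodes.
fn digest_of(dir: &Directory) -> Digest {
    let mut hasher = DefaultHasher::new();
    dir.hash(&mut hasher);
    hasher.finish()
}

/// Stores a directory under its digest and returns that digest.
/// Storing the same contents twice yields the same key, so it is kept once.
fn put(store: &mut HashMap<Digest, Directory>, dir: Directory) -> Digest {
    let digest = digest_of(&dir);
    store.insert(digest, dir);
    digest
}

fn main() {
    let mut store: HashMap<Digest, Directory> = HashMap::new();

    // A child directory containing a single file (the file digest is made up).
    let child = Directory::from([(
        "hello.txt".to_string(),
        Node::File { digest: 42, size: 5, executable: false },
    )]);
    let child_digest = put(&mut store, child);

    // The root refers to the same child directory under two different names.
    let child_ref = Node::Directory { digest: child_digest, size: 1 };
    let root = Directory::from([
        ("a".to_string(), child_ref.clone()),
        ("b".to_string(), child_ref),
    ]);
    put(&mut store, root);

    // Two distinct directories are stored, although the tree has three.
    println!("directories stored: {}", store.len());
}
```

Because both `a` and `b` carry the same digest, they resolve to a single stored `Directory`, which is why the structure is a graph rather than a strict tree.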
## Protobuf
In addition to the Rust types described above, there's also a protobuf
representation, which differs slightly:
Instead of nodes being unnamed, and `Directory` containing a map from
`PathComponent` to `Node` (and keys being the basenames in that directory),
the `Directory` message contains three lists, `directories`, `files` and
`symlinks`, holding `DirectoryEntry`, `FileEntry` and `SymlinkEntry` messages
respectively.
These contain all fields present in the corresponding `Node` enum variant, as
well as a `name` field, representing the basename in that directory.
For reproducibility reasons, the lists MUST be sorted by that name and the
name MUST be unique across all three lists.
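The difference between the two representations can be sketched as a conversion from a name-to-node map into three lists. Everything here is a simplified assumption for illustration: the struct layouts, `ProtoDirectory`, `to_proto` and `names_sorted_and_unique` are hypothetical and are not the generated protobuf types or the `snix-castore` conversion code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified, unnamed node type (assumption).
enum Node {
    File { digest: [u8; 32], size: u64, executable: bool },
    Symlink { target: Vec<u8> },
    Directory { digest: [u8; 32], size: u64 },
}

// Hypothetical entry types: the fields of the node variant plus a `name`.
struct DirectoryEntry { name: Vec<u8>, digest: [u8; 32], size: u64 }
struct FileEntry { name: Vec<u8>, digest: [u8; 32], size: u64, executable: bool }
struct SymlinkEntry { name: Vec<u8>, target: Vec<u8> }

/// Hypothetical stand-in for the protobuf `Directory` message.
#[derive(Default)]
struct ProtoDirectory {
    directories: Vec<DirectoryEntry>,
    files: Vec<FileEntry>,
    symlinks: Vec<SymlinkEntry>,
}

/// Splits a map into three lists. BTreeMap iterates in key order, so each
/// list ends up sorted by name; map keys are unique, so names are as well.
fn to_proto(nodes: &BTreeMap<Vec<u8>, Node>) -> ProtoDirectory {
    let mut out = ProtoDirectory::default();
    for (name, node) in nodes {
        let name = name.clone();
        match node {
            Node::Directory { digest, size } => out
                .directories
                .push(DirectoryEntry { name, digest: *digest, size: *size }),
            Node::File { digest, size, executable } => out.files.push(FileEntry {
                name,
                digest: *digest,
                size: *size,
                executable: *executable,
            }),
            Node::Symlink { target } => out
                .symlinks
                .push(SymlinkEntry { name, target: target.clone() }),
        }
    }
    out
}

/// Checks that every list is sorted by name and that no name occurs twice
/// across all three lists.
fn names_sorted_and_unique(d: &ProtoDirectory) -> bool {
    fn strictly_sorted(names: &[&[u8]]) -> bool {
        names.windows(2).all(|w| w[0] < w[1])
    }
    let dirs: Vec<&[u8]> = d.directories.iter().map(|e| e.name.as_slice()).collect();
    let files: Vec<&[u8]> = d.files.iter().map(|e| e.name.as_slice()).collect();
    let links: Vec<&[u8]> = d.symlinks.iter().map(|e| e.name.as_slice()).collect();
    let mut seen = BTreeSet::new();
    let unique = dirs.iter().chain(&files).chain(&links).all(|n| seen.insert(*n));
    strictly_sorted(&dirs) && strictly_sorted(&files) && strictly_sorted(&links) && unique
}

fn main() {
    let mut nodes = BTreeMap::new();
    nodes.insert(b"b".to_vec(), Node::File { digest: [0; 32], size: 1, executable: false });
    nodes.insert(b"a".to_vec(), Node::Directory { digest: [1; 32], size: 0 });
    nodes.insert(b"c".to_vec(), Node::Symlink { target: b"b".to_vec() });
    let proto = to_proto(&nodes);
    println!("sorted and unique: {}", names_sorted_and_unique(&proto));
}
```

Starting from a sorted map makes both constraints hold by construction; the separate check is only needed for messages received from elsewhere.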
[rustdoc-directory]: https://snix.dev/rustdoc/snix_castore/struct.Directory.html
[rustdoc-node]: https://snix.dev/rustdoc/snix_castore/enum.Node.html
[rustdoc-pathcomponent]: https://snix.dev/rustdoc/snix_castore/struct.PathComponent.html
[BLAKE3]: https://github.com/BLAKE3-team/BLAKE3
[^why-not-git-trees]: For a detailed comparison with the git model, and what we do differently (and why), see TODO LINK.
[^directory-digest]: We currently use the [BLAKE3][] digest of the protobuf
serialization of the `proto::Directory` struct to calculate
these digests. While pretty stable across most
implementations, there's no guarantee this will always stay
as-is, so we might switch to another serialization with
stronger guarantees on that front in the future.
See [#111](https://git.snix.dev/snix/snix/issues/111) for details.