diff --git a/snix/docs/src/SUMMARY.md b/snix/docs/src/SUMMARY.md
index 2cb82d266..a3e120b75 100644
--- a/snix/docs/src/SUMMARY.md
+++ b/snix/docs/src/SUMMARY.md
@@ -21,7 +21,6 @@
 - [Store API](./store/api.md)
 - [BlobStore Chunking](./castore/blobstore-chunking.md)
 - [BlobStore Protocol](./castore/blobstore-protocol.md)
-- [Data Model](./castore/data-model.md)
 - [Why not git trees?](./castore/why-not-git-trees.md)
 
 # Nix
diff --git a/snix/docs/src/castore/data-model.md b/snix/docs/src/castore/data-model.md
deleted file mode 100644
index 7f7e396a2..000000000
--- a/snix/docs/src/castore/data-model.md
+++ /dev/null
@@ -1,50 +0,0 @@
-# Data model
-
-This provides some more notes on the fields used in castore.proto.
-
-See [Store API](../store/api.md) for the full context.
-
-## Directory message
-`Directory` messages use the blake3 hash of their canonical protobuf
-serialization as its identifier.
-
-A `Directory` message contains three lists, `directories`, `files` and
-`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
-respectively. They describe all the direct child elements that are contained in
-a directory.
-
-All three message types have a `name` field, specifying the (base)name of the
-element (which MUST not contain slashes or null bytes, and MUST not be '.' or '..').
-For reproducibility reasons, the lists MUST be sorted by that name and the
-name MUST be unique across all three lists.
-
-In addition to the `name` field, the various *Node messages have the following
-fields:
-
-## DirectoryNode
-A `DirectoryNode` message represents a child directory.
-
-It has a `digest` field, which points to the identifier of another `Directory`
-message, making a `Directory` a merkle tree (or strictly speaking, a graph, as
-two elements pointing to a child directory with the same contents would point
-to the same `Directory` message).
-
-There's also a `size` field, containing the (total) number of all child
-elements in the referenced `Directory`, which helps for inode calculation.
-
-## FileNode
-A `FileNode` message represents a child (regular) file.
-
-Its `digest` field contains the blake3 hash of the file contents. It can be
-looked up in the `BlobService`.
-
-The `size` field contains the size of the blob the `digest` field refers to.
-
-The `executable` field specifies whether the file should be marked as
-executable or not.
-
-## SymlinkNode
-A `SymlinkNode` message represents a child symlink.
-
-In addition to the `name` field, the only additional field is the `target`,
-which is a string containing the target of the symlink.
diff --git a/web/content/docs/components/castore/_index.md b/web/content/docs/components/castore/_index.md
new file mode 100644
index 000000000..48120d04d
--- /dev/null
+++ b/web/content/docs/components/castore/_index.md
@@ -0,0 +1,10 @@
+---
+title: "Castore"
+description: ""
+summary: ""
+date: 2025-04-04T16:43:14+01:00
+lastmod: 2025-04-04T16:43:14+01:00
+draft: false
+weight: 42
+---
+
diff --git a/web/content/docs/components/castore/data-model.md b/web/content/docs/components/castore/data-model.md
new file mode 100644
index 000000000..0d64fc3f7
--- /dev/null
+++ b/web/content/docs/components/castore/data-model.md
@@ -0,0 +1,92 @@
+---
+title: "Data Model"
+summary: ""
+date: 2025-04-04T16:16:37+00:00
+lastmod: 2025-04-04T16:16:37+00:00
+draft: false
+weight: 41
+toc: true
+---
+
+This describes the data model used in `snix-castore` to describe file system
+trees. Blob / chunk storage is covered by other documents.
+
+For those familiar with git, `snix-castore` uses a concept similar to git tree
+objects, which are also a Merkle structure.[^why-not-git-trees]
+
+## [Node][rustdoc-node]
+`snix-castore` can represent three different node types.
+Nodes themselves don't have names; names are given by being in a
+[Directory](#directory) structure.
+
+### `Node::File`
+A (regular) file.
+We store the [BLAKE3] digest of the raw file contents, the length of the raw
+data, and an executable bit.
+
+### `Node::Symlink`
+A symbolic link.
+We store the symlink target.
+
+### `Node::Directory`
+A (child) directory.
+We store the digest of the [Directory](#directory) structure describing its
+"contents".
+
+We also store a `size` field, containing the (total) number of all child
+elements in the referenced `Directory`, which helps with inode calculation.
+
+
+## [Directory][rustdoc-directory]
+The Directory struct contains all nodes in a single directory (on that level),
+along with their (base)names (each a [PathComponent](#pathcomponent)).
+
+`.` and `..` are not included.
+
+For the Directory struct, a *Digest* can be calculated[^directory-digest], which
+is what the parent `Node::Directory` uses as a reference, to build a Merkle
+structure.
+
+## [PathComponent][rustdoc-pathcomponent]
+This is a stricter version of bytes, restricted to valid path components in a
+[Directory](#directory).
+
+It disallows slashes, null bytes, `.`, `..` and the
+empty string. It also rejects names longer than 255 bytes.
+
+## Merkle DAG
+The pointers from `Node::Directory` to `Directory`, which may in turn contain
+further `Node::Directory` entries, make the whole structure a Merkle tree (or
+strictly speaking, a graph, as two elements pointing to a child directory with
+the same contents would point to the same `Directory` message).
+
+
+## Protobuf
+In addition to the Rust types described above, there's also a protobuf
+representation, which differs slightly:
+
+Instead of nodes being unnamed, and `Directory` containing a map from
+`PathComponent` to `Node` (and keys being the basenames in that directory),
+the `Directory` message contains three lists, `directories`, `files` and
+`symlinks`, holding `DirectoryEntry`, `FileEntry` and `SymlinkEntry` messages
+respectively.
+
+These contain all fields present in the corresponding `Node` enum variant, as well
+as a `name` field, representing the basename in that directory.
+
+For reproducibility reasons, the lists MUST be sorted by that name and the
+name MUST be unique across all three lists.
+
+
+[rustdoc-directory]: https://snix.dev/rustdoc/snix_castore/struct.Directory.html
+[rustdoc-node]: https://snix.dev/rustdoc/snix_castore/enum.Node.html
+[rustdoc-pathcomponent]: https://snix.dev/rustdoc/snix_castore/struct.PathComponent.html
+[BLAKE3]: https://github.com/BLAKE3-team/BLAKE3
+[^why-not-git-trees]: For a detailed comparison with the git model, and what we do differently (and why), see TODO LINK.
+[^directory-digest]: We currently use the [BLAKE3][] digest of the protobuf
+    serialization of the `proto::Directory` struct to calculate
+    these digests. While pretty stable across most
+    implementations, there's no guarantee this will always stay
+    as-is, so we might switch to another serialization with
+    stronger guarantees on that front in the future.
+    See [#111](https://git.snix.dev/snix/snix/issues/111) for details.