chore(tvix/docs): move [ca]store docs to tvix/docs

Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: flokli <flokli@flokli.de>
Florian Klink 2024-06-13 22:04:32 +03:00 committed by clbot
parent adc7353bd1
commit 6947dc4349
7 changed files with 8 additions and 1 deletions


@ -4,6 +4,13 @@
- [Architecture & data flow](./architecture.md)
- [TODOs](./TODO.md)
# Store
- [Store API](./store/api.md)
- [BlobStore Chunking](./castore/blobstore-chunking.md)
- [BlobStore Protocol](./castore/blobstore-protocol.md)
- [Data Model](./castore/data-model.md)
- [Why not git trees?](./castore/why-not-git-trees.md)
# Nix
- [Specification of the Nix Language](./language-spec.md)
- [Nix language version history](./lang-version.md)


@ -0,0 +1,147 @@
# BlobStore: Chunking & Verified Streaming
`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
as hash function.
Returned data is fetched by using the digest as lookup key, and can be verified
to be correct by feeding the received data through the hash function and
ensuring it matches the digest initially used for the lookup.
This means data can also be downloaded from any untrusted third party, as the
received data is validated to match the digest it was originally requested with.
However, for larger blobs of data, having to download the entire blob at once is
wasteful, if we only care about a part of the blob. Think about mounting a
seekable data structure, like loop-mounting an .iso file, or doing partial reads
in a large Parquet file, a column-oriented data format.
> We want to have the possibility to *seek* into a larger file.
This however shouldn't compromise on data integrity properties - we should not
need to trust a peer we're downloading from to be "honest" about the partial
data we're reading. We should be able to verify smaller reads.
Especially when substituting from an untrusted third-party, we want to be able
to detect quickly if that third-party is sending us wrong data, and terminate
the connection early.
## Chunking
In content-addressed systems, this problem has historically been solved by
breaking larger blobs into smaller chunks, which can be fetched individually,
and making a hash of *this listing* the blob digest/identifier.
- BitTorrent for example breaks files up into smaller chunks, and maintains
a list of sha1 digests for each of these chunks. Magnet links contain a
digest over this listing as an identifier. (See [here for more
details][bittorrent-v2]).
With the identifier, a client can fetch the entire list, and then recursively
"unpack the graph" of nodes, until it ends up with a list of individual small
chunks, which can be fetched individually.
- Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
digest of the root node as an identifier.
These approaches solve the problem of being able to fetch smaller chunks in a
trusted fashion. They can also do some deduplication, in case the same leaf
nodes appear in multiple places.
However, they also have a big disadvantage. The chunking parameters, and the
"topology" of the graph structure itself "bleed" into the root hash of the
entire data structure itself.
Depending on the chunking parameters used, there's different representations for
the same data, causing less data sharing/reuse in the overall system, in terms of how
many chunks need to be downloaded vs. are already available locally, as well as
how compact data is stored on-disk.
This can be worked around by agreeing on only a single way of chunking, but it's
not pretty and misses a lot of deduplication potential.
### Chunking in Tvix' Blobstore
tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
disadvantages, while still being content-addressed internally, with the
highlighted benefits.
It uses [blake3] as hash function, and the blake3 digest of **the raw data
itself** as an identifier (rather than some application-specific Merkle DAG that
also embeds some chunking information).
BLAKE3 is a tree hash where all left nodes are fully populated, contrary to
conventional serial hash functions. To be able to validate the hash of a node,
one only needs the hash of the (2) children [^1], if any.
This means one only needs the root digest to validate a construction, and these
constructions can be sent [separately][bao-spec].
This relieves us from the need to encode more granular chunking into
our data model / identifier upfront, and lets us treat it mostly as a
transport/storage concern.
For some more description on the (remote) protocol, check
`./blobstore-protocol.md`.
#### Logical vs. physical chunking
Due to the properties of the BLAKE3 hash function, we have logical blocks of
1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
or are stored on-disk.
The only thing we need to be able to read and verify an arbitrary byte range is
the covering range of aligned 1KiB blocks, and a construction from the root
digest to those 1KiB blocks.
Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
lower parts of the tree while still providing verified streaming - at the cost
of having to fetch larger covering ranges of aligned blocks.
Let's pick an example. We identify each KiB by a number here for illustrative
purposes.
Assuming we omit the last two layers of the hash tree, we end up with logical
4KiB leaf chunks (`bao_shift` of `2`).
For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
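
The arithmetic behind these covering ranges can be sketched as follows. This is only an illustration under stated assumptions: `covering_blocks` and `BLOCK_SIZE` are hypothetical names, not part of tvix-castore or bao-tree, and the function only computes block indices without doing any hashing.

```rust
/// Size of one logical BLAKE3 block in bytes (1 KiB).
const BLOCK_SIZE: u64 = 1024;

/// Hypothetical helper: returns the inclusive range of 1 KiB block indices
/// that must be available to verify a read of `len` bytes at `offset`, for a
/// blob of `blob_size` bytes whose hash tree has its lowest `bao_shift`
/// layers omitted. Returns `None` for empty or out-of-bounds reads.
fn covering_blocks(offset: u64, len: u64, blob_size: u64, bao_shift: u32) -> Option<(u64, u64)> {
    if len == 0 || offset >= blob_size {
        return None;
    }
    // Number of 1 KiB blocks per logical leaf chunk (4 for a shift of 2).
    let group = 1u64 << bao_shift;
    // Index of the last block in the blob; it may be only partially filled.
    let last_block = (blob_size - 1) / BLOCK_SIZE;
    // Last byte of the read, clamped to the end of the blob.
    let end = (offset.saturating_add(len) - 1).min(blob_size - 1);
    // Round the first block down, and the last block up, to group boundaries.
    let first = (offset / BLOCK_SIZE) / group * group;
    let last = ((end / BLOCK_SIZE) / group * group + group - 1).min(last_block);
    Some((first, last))
}
```

For the 14 KiB blob above with a `bao_shift` of `2`, reading the first 5 KiB needs blocks `0..=7`, and any read inside block 12 or 13 needs only `12..=13`, because the last logical chunk is cut short by the end of the blob.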
Assuming the server now informs us about the following physical chunking:
```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
```
If our application now wants to arbitrarily read from 0 until 4 (inclusive):
```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
  |-----------|
```
…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and `[ 7 8 ]`.
`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
interested in.
We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
can assemble `[ 4 5 6 7 ]` to verify both logical chunks:
```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
^       ^           ^     ^
|----4KiB----|------4KiB-----|
```
Each physical chunk fetched can be validated to have the blake3 digest that was
communicated upfront, and can be stored in a client-side cache/storage, so
subsequent / other requests for the same data will be fast(er).
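
Selecting the physical chunks for a given logical range can be sketched as below. This is a minimal sketch, assuming the client knows the size of each physical chunk in blob order; `chunks_for_range` is a hypothetical name and not part of any tvix API.

```rust
/// Hypothetical helper: given the sizes (in bytes) of consecutive physical
/// chunks, returns the indices of all chunks that overlap the half-open byte
/// range `[start, end)`.
fn chunks_for_range(chunk_sizes: &[u64], start: u64, end: u64) -> Vec<usize> {
    let mut selected = Vec::new();
    // Byte offset at which the current chunk begins.
    let mut chunk_start = 0u64;
    for (idx, size) in chunk_sizes.iter().enumerate() {
        let chunk_end = chunk_start + size;
        // Two half-open ranges overlap if each starts before the other ends.
        if chunk_start < end && chunk_end > start {
            selected.push(idx);
        }
        chunk_start = chunk_end;
    }
    selected
}
```

With the first four physical chunks of the example being 2, 4, 1 and 2 KiB large, asking for the bytes of logical blocks `0..=7` (byte range `[0, 8192)`) selects exactly those four chunks and skips the remaining one.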
---
[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree


@ -0,0 +1,104 @@
# BlobStore: Protocol / Composition
This document describes the protocol that BlobStore uses to substitute blobs
from other ("remote") BlobStores.
How to come up with the blake3 digest of the blob to fetch is left to another
layer in the stack.
To put this into the context of Tvix as a Nix alternative, a blob represents an
individual file inside a StorePath.
In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
`root_node` in a `PathInfo` message, or an individual file inside a `Directory`
message) encode a BLAKE3 digest.
However, the whole infrastructure can be applied to other use cases requiring
exchange/storage of, or access into, data of which the blake3 digest is known.
## Protocol and Interfaces
As an RPC protocol, BlobStore currently uses gRPC.
On the Rust side of things, every blob service implements the
[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
gRPC-specific.
This `BlobService` trait provides functionality to check for existence of Blobs,
read from blobs, and write new blobs.
It also provides a method to ask for more granular chunks if they are available.
In addition to some in-memory, on-disk and (soon) object-storage-based
implementations, we also have a `BlobService` implementation that talks to a
gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
service for anything implementing the `BlobService` trait.
This makes it very easy to talk to a remote `BlobService`, which does not even
need to be written in the same language, as long as it speaks the same gRPC
protocol.
It also places very few requirements on someone implementing a new
`BlobService`, or on what its internal storage or chunking algorithm looks like.
The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
Contrary to the `BlobService` trait, it does not have any options for seeking/
ranging, as it's more desirable to provide this through chunking (see also
`./blobstore-chunking.md`).
## Composition
Different `BlobStore`s are meant to be "composed"/"layered" to express
caching and multiple local and remote sources.
The fronting interface can be the same, it'd just be multiple "tiers" that can
respond to requests, depending on where the data resides. [^1]
This makes it very simple for consumers, as they don't need to be aware of the
entire substituter config.
The flexibility of this doesn't need to be exposed to the user in the default
case; in most cases we should be fine with some form of on-disk storage and a
bunch of substituters with different priorities.
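
A two-tier composition can be sketched as follows. This is a simplified illustration and not the real `BlobService` trait: `SimpleBlobStore`, `MemoryStore` and `Tiered` are hypothetical names, the interface is synchronous, and verification of fetched data against the digest is left out.

```rust
use std::collections::HashMap;

/// A 32-byte digest, the size of a BLAKE3 hash.
type Digest = [u8; 32];

/// Hypothetical, simplified interface; NOT the real `BlobService` trait.
trait SimpleBlobStore {
    fn get(&mut self, digest: &Digest) -> Option<Vec<u8>>;
    fn put(&mut self, digest: Digest, data: Vec<u8>);
}

/// In-memory store that also counts reads, so tests can see which tier
/// served a request.
#[derive(Default)]
struct MemoryStore {
    blobs: HashMap<Digest, Vec<u8>>,
    reads: usize,
}

impl SimpleBlobStore for MemoryStore {
    fn get(&mut self, digest: &Digest) -> Option<Vec<u8>> {
        self.reads += 1;
        self.blobs.get(digest).cloned()
    }
    fn put(&mut self, digest: Digest, data: Vec<u8>) {
        self.blobs.insert(digest, data);
    }
}

/// Two tiers behind the same interface: a fast local store that is asked
/// first, and a remote store used as fallback.
struct Tiered<L, R> {
    local: L,
    remote: R,
}

impl<L: SimpleBlobStore, R: SimpleBlobStore> SimpleBlobStore for Tiered<L, R> {
    fn get(&mut self, digest: &Digest) -> Option<Vec<u8>> {
        if let Some(data) = self.local.get(digest) {
            return Some(data);
        }
        // A real implementation would verify `data` against `digest` here,
        // before caching it or handing it to the caller.
        let data = self.remote.get(digest)?;
        self.local.put(*digest, data.clone());
        Some(data)
    }
    fn put(&mut self, digest: Digest, data: Vec<u8>) {
        self.local.put(digest, data);
    }
}
```

Because `Tiered` implements the same trait as its tiers, a consumer only ever sees one store, and tiers can be nested to build longer chains.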
### gRPC Clients
Clients are encouraged to always read blobs in a chunked fashion (asking for a
list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
`BlobService.Read()` as needed), instead of directly reading the entire blob via
`BlobService.Read()`.
In a composition setting, this provides opportunity for caching, and avoids
downloading some chunks if they're already present locally (for example, because
they were already downloaded by reading from a similar blob earlier).
It also removes the need for seeking to be a part of the gRPC protocol
altogether, as chunks are supposed to be "reasonably small" [^2].
There's some further optimization potential: a `BlobService.Stat()` request
could tell the server it's happy with very small blobs just being inlined in
an additional field in the response, which would allow clients to
populate their local chunk store in a single roundtrip.
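
The client-side decision of which chunks to fetch after a `Stat()` call can be sketched as below. The `ChunkMeta` struct is a hypothetical stand-in for one chunk entry in the response, and `plan_fetch` is an illustrative helper; neither mirrors the actual protobuf definitions.

```rust
use std::collections::HashSet;

/// A 32-byte digest, the size of a BLAKE3 hash.
type Digest = [u8; 32];

/// Hypothetical stand-in for one chunk entry returned by a stat call.
struct ChunkMeta {
    digest: Digest,
    size: u64,
}

/// Returns the digests of the chunks that still need to be fetched, in blob
/// order, together with the number of bytes that fetching them will transfer.
fn plan_fetch(chunks: &[ChunkMeta], local: &HashSet<Digest>) -> (Vec<Digest>, u64) {
    let mut seen = HashSet::new();
    let mut missing = Vec::new();
    let mut bytes = 0u64;
    for chunk in chunks {
        // Skip chunks that are already present locally, and chunks that
        // occur more than once within this blob.
        if local.contains(&chunk.digest) || !seen.insert(chunk.digest) {
            continue;
        }
        missing.push(chunk.digest);
        bytes += chunk.size;
    }
    (missing, bytes)
}
```

A client would then issue one read per returned digest and store each chunk locally once its digest has been checked.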
## Verified Streaming
As already described in `./blobstore-chunking.md`, the physical chunk
information sent in a `BlobService.Stat()` response is still sufficient to fetch
in an authenticated fashion.
The exact protocol and formats are still a bit in flux, but here are some notes:
- `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a
[BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling
how detailed (down to the leaf chunks) it should go.
The exact format (and request fields) still need to be defined, edef has some
ideas around omitting some intermediate hash nodes over the wire and
recomputing them, reducing size by another ~50% over [bao-tree].
- `BlobService.Stat()` response gets some bao-related fields (`bao_shift`
field, signalling the actual format/shift level the server replies with, the
actual bao, and maybe some format specifier).
It would be nice to also be compatible with the baos used by [iroh], so we
can provide an implementation using it too.
---
[^1]: We might want to have some backchannel, so it becomes possible to provide
feedback to the user that something is downloaded.
[^2]: Something between 512K-4M, TBD.
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree
[iroh]: https://github.com/n0-computer/iroh


@ -0,0 +1,50 @@
# Data model
This provides some more notes on the fields used in castore.proto.
See `//tvix/store/docs/api.md` for the full context.
## Directory message
`Directory` messages use the blake3 hash of their canonical protobuf
serialization as their identifier.
A `Directory` message contains three lists, `directories`, `files` and
`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
respectively. They describe all the direct child elements that are contained in
a directory.
All three message types have a `name` field, specifying the (base)name of the
element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..').
For reproducibility reasons, the lists MUST be sorted by that name and the
name MUST be unique across all three lists.
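
These rules can be checked mechanically. The following is a minimal sketch, assuming names are raw byte strings compared bytewise; `validate_names` is a hypothetical helper, not the validation code used by tvix-castore.

```rust
use std::collections::HashSet;

/// Hypothetical helper: checks the `name` rules for the three lists of one
/// `Directory` message. Each list must be sorted, names must be unique
/// across all lists, and no name may be ".", "..", or contain '/' or NUL.
fn validate_names(
    directories: &[&[u8]],
    files: &[&[u8]],
    symlinks: &[&[u8]],
) -> Result<(), &'static str> {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    for list in [directories, files, symlinks] {
        for (i, &name) in list.iter().enumerate() {
            if name == b"." || name == b".." {
                return Err("name must not be '.' or '..'");
            }
            if name.contains(&b'/') || name.contains(&0) {
                return Err("name must not contain slashes or null bytes");
            }
            // Within one list, every name must sort after its predecessor.
            if i > 0 && list[i - 1] > name {
                return Err("list is not sorted by name");
            }
            // `insert` returns false if the name was seen in any list.
            if !seen.insert(name) {
                return Err("name is not unique across all three lists");
            }
        }
    }
    Ok(())
}
```

Rejecting invalid messages early keeps the serialization canonical, so two peers describing the same directory end up with the same digest.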
In addition to the `name` field, the various *Node messages have the following
fields:
## DirectoryNode
A `DirectoryNode` message represents a child directory.
It has a `digest` field, which points to the identifier of another `Directory`
message, making a `Directory` a merkle tree (or strictly speaking, a graph, as
two elements pointing to a child directory with the same contents would point
to the same `Directory` message).
There's also a `size` field, containing the (total) number of all child
elements in the referenced `Directory`, which helps for inode calculation.
## FileNode
A `FileNode` message represents a child (regular) file.
Its `digest` field contains the blake3 hash of the file contents. It can be
looked up in the `BlobService`.
The `size` field contains the size of the blob the `digest` field refers to.
The `executable` field specifies whether the file should be marked as
executable or not.
## SymlinkNode
A `SymlinkNode` message represents a child symlink.
In addition to the `name` field, the only additional field is the `target`,
which is a string containing the target of the symlink.
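
The messages described above can be mirrored in plain Rust types for illustration. These are hand-written stand-ins, not the generated protobuf types. The field types are assumptions (names as bytes, the symlink target as a string), and `Directory::size` shows one plausible reading of "(total) number of all child elements": direct children plus everything counted by the `size` fields of child directories.

```rust
/// A 32-byte digest, the size of a BLAKE3 hash.
type Digest = [u8; 32];

/// Child directory: points to another `Directory` by its digest.
struct DirectoryNode {
    name: Vec<u8>,
    digest: Digest,
    /// Total number of elements below the referenced directory.
    size: u64,
}

/// Child regular file: points to a blob by the digest of its contents.
struct FileNode {
    name: Vec<u8>,
    digest: Digest,
    /// Size of the blob in bytes.
    size: u64,
    executable: bool,
}

/// Child symlink: carries its target inline.
struct SymlinkNode {
    name: Vec<u8>,
    target: String,
}

#[derive(Default)]
struct Directory {
    directories: Vec<DirectoryNode>,
    files: Vec<FileNode>,
    symlinks: Vec<SymlinkNode>,
}

impl Directory {
    /// Assumed interpretation: number of direct children, plus the `size`
    /// recorded for every child directory.
    fn size(&self) -> u64 {
        let direct = (self.directories.len() + self.files.len() + self.symlinks.len()) as u64;
        let nested: u64 = self.directories.iter().map(|d| d.size).sum();
        direct + nested
    }
}
```

Under this reading, a parent can compute its own size from its direct children alone, without fetching any child `Directory` message.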


@ -0,0 +1,57 @@
## Why not git tree objects?
We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adopted the one documented in the other documents here.
While the tvix-store API protocol shares some similarities with the format used
in git for trees and objects, the git one has shown some significant
disadvantages:
### The binary encoding itself
#### trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.
Tree objects are a combination of null-terminated strings, and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of the format/changes are very hard to do right, because parsers are
not aware they might be parsing something different.
The tvix-store protocol uses a canonical protobuf serialization, and uses
the [blake3][blake3] hash of that serialization to point to other `Directory`
messages.
It's compact, and encoder and decoder libraries are available
in many programming languages.
The choice of protobuf makes it easy to add new fields, and make old clients
aware of some unknown fields being detected [^adding-fields].
#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git-specific to request.
tvix-store simply uses the [blake3][blake3] hash of the literal contents
when referring to a file/blob, which makes it very easy to ask other data
sources for the same data, as no git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large) blobs by their blake3 hash
over the IPFS network.
In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.
The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] also has been dead for some years now,
and it's unclear what a "blake3" version of this would even look like.
[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.

tvix/docs/src/store/api.md Normal file

@ -0,0 +1,288 @@
tvix-[ca]store API
==============
This document outlines the design of the API exposed by tvix-castore and
tvix-store, as well as other implementations of this store protocol.
This document is meant to be read side-by-side with
[castore.md](../../castore/docs/data-model.md) which describes the data model
in more detail.
The store API has four main consumers:
1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix
case) communicates with the store to:
* Upload files and directories (e.g. from `builtins.path`, or `src = ./path`
Nix expressions).
* Read files from the store where necessary (e.g. when `nixpkgs` is
located in the store, or for IFD).
2. The builder communicates with the store to:
* Upload files and directories after a build, to persist build artifacts in
the store.
3. Tvix clients (such as users that have Tvix installed, or, depending
on perspective, builder environments) expect the store to
"materialise" on disk to provide a directory layout with store
paths.
4. Stores may communicate with other stores, to substitute already built store
paths, i.e. a store acts as a binary cache for other stores.
The store API attempts to reuse parts of its API between these four
consumers by making similarities explicit in the protocol. This leads
to a protocol that is slightly more complex than a simple "file
upload/download"-system, but significantly more efficient, both in terms
of deduplication opportunities and granularity.
## The Store model
Contents inside a tvix-store can be grouped into three different message types:
* Blobs
* Directories
* PathInfo (see further down)
(check `castore.md` for more detailed field descriptions)
### Blobs
A blob object contains the literal file contents of regular (or executable)
files.
### Directory
A directory object describes the direct children of a directory.
It contains:
- name of child (regular or executable) files, and their [blake3][blake3] hash.
- name of child symlinks, and their target (as string)
- name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG)
### Content-addressed Store Model
For example, let's consider a directory layout like this, with some
imaginary hashes of file contents:
```
.
├── file-1.txt hash: 5891b5b522d5df086d0ff0b110fb
└── nested
└── file-2.txt hash: abc6fd595fc079d3114d4b71a4d8
```
A hash for the *directory* `nested` can be created by creating the `Directory`
object:
```json
{
  "directories": [],
  "files": [{
    "name": "file-2.txt",
    "digest": "abc6fd595fc079d3114d4b71a4d8",
    "size": 123
  }],
  "symlinks": []
}
```
And then hashing a serialised form of that data structure. We use the blake3
hash of the canonical protobuf representation. Let's assume the hash was
`ff0029485729bcde993720749232`.
To create the directory object one layer up, we now refer to our `nested`
directory object in `directories`, and to `file-1.txt` in `files`:
```json
{
  "directories": [{
    "name": "nested",
    "digest": "ff0029485729bcde993720749232",
    "size": 1
  }],
  "files": [{
    "name": "file-1.txt",
    "digest": "5891b5b522d5df086d0ff0b110fb",
    "size": 124
  }]
}
```
This Merkle DAG of Directory objects, and flat store of blobs can be used to
describe any file/directory/symlink inside a store path. Due to its
content-addressed nature, it'll automatically deduplicate (re-)used
(sub)directories, and allow substitution from any (untrusted) source.
The thing that's now only missing is the metadata to map/"mount" from the
content-addressed world to a physical path.
### PathInfo
As most paths in the Nix store currently are input-addressed [^input-addressed],
and the `tvix-castore` data model is also not intrinsically using NAR hashes,
we need something mapping from an input-addressed "output path hash" (or a
Nix-specific content-addressed path) to the contents in the `tvix-castore` world.
That's what `PathInfo` provides. It embeds the root node (Directory, File or
Symlink) at a given store path.
The root nodes' `name` field is populated with the (base)name inside
`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`.
The `PathInfo` message also stores references to other store paths, and some
more NARInfo-specific metadata (signatures, narhash, narsize).
## API overview
There are three different services:
### BlobService
`BlobService` can be used to store and retrieve blobs of data, used to host
regular file contents.
It is content-addressed, using [blake3][blake3]
as a hashing function.
As blake3 is a tree hash, there's an opportunity to do
[verified streaming][bao] of parts of the file,
which doesn't need to trust any more information than the root hash itself.
Future extensions of the `BlobService` protocol will enable this.
### DirectoryService
`DirectoryService` allows lookups (and uploads) of `Directory` messages, and
whole reference graphs of them.
### PathInfoService
The PathInfo service provides lookups from a store path hash to a `PathInfo`
message.
## Example flows
Below there are some common use cases of tvix-store, and how the different
services are used.
### Upload files and directories
This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as
well as for uploading build artifacts to a store (B).
The path specified needs to be (recursively, BFS-style) traversed.
* All file contents need to be hashed with blake3, and submitted to the
*BlobService* if not already present.
A reference to them needs to be added to the parent Directory object that's
constructed.
* All symlinks need to be added to the parent directory they reside in.
* Whenever a Directory has been fully traversed, it needs to be uploaded to
the *DirectoryService* and a reference to it needs to be added to the parent
Directory object.
Most of the hashing / directory traversal/uploading can happen in parallel,
as long as Directory objects only refer to Directory objects and Blobs that
have already been uploaded.
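
The ordering constraint (children before parents) can be sketched with a recursive, single-threaded traversal over an in-memory tree. Everything here is a stand-in chosen to keep the sketch dependency-free: `toy_digest` is a 64-bit FNV-1a hash used in place of BLAKE3, the line-based text encoding replaces the canonical protobuf serialization, and `Services` replaces the real *BlobService* and *DirectoryService*.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a filesystem tree.
enum Entry {
    File(Vec<u8>),
    Symlink(String),
    Dir(Vec<(String, Entry)>),
}

/// Toy 64-bit FNV-1a hash. It stands in for BLAKE3 only to keep this sketch
/// free of dependencies, and is NOT suitable for content addressing.
fn toy_digest(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in data {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Stand-ins for the blob and directory services.
#[derive(Default)]
struct Services {
    blobs: HashMap<u64, Vec<u8>>,
    directories: HashMap<u64, String>,
}

/// Uploads all children of a directory first, then the directory itself,
/// and returns the digest of the directory object.
fn upload_dir(entries: &[(String, Entry)], svc: &mut Services) -> u64 {
    let mut lines = Vec::new();
    for (name, entry) in entries {
        let line = match entry {
            Entry::File(contents) => {
                let digest = toy_digest(contents);
                // Only store the blob if it is not already present.
                svc.blobs.entry(digest).or_insert_with(|| contents.clone());
                format!("file {} {:016x} {}", name, digest, contents.len())
            }
            Entry::Symlink(target) => format!("symlink {} {}", name, target),
            Entry::Dir(children) => {
                // The child is fully uploaded before the parent refers to it.
                let digest = upload_dir(children, svc);
                format!("dir {} {:016x}", name, digest)
            }
        };
        lines.push(line);
    }
    // Sort for a deterministic encoding (toy replacement for canonical protobuf).
    lines.sort();
    let serialized = lines.join("\n");
    let digest = toy_digest(serialized.as_bytes());
    svc.directories.insert(digest, serialized);
    digest
}
```

Two subdirectories with identical contents produce the same directory object and are stored only once, which is the deduplication effect described earlier.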
When reaching the root, a `PathInfo` object needs to be constructed.
* In the case of content-addressed paths (A), the name of the root node is
based on the NAR representation of the contents.
It might make sense to be able to offload the NAR calculation to the store,
which can cache it.
* In the case of build artifacts (B), the output path is input-addressed and
known upfront.
Contrary to Nix, this has the advantage of not having to upload a lot of things
to the store that didn't change.
### Reading files from the store from the evaluator
This is the case when `nixpkgs` is located in the store, or IFD in general.
The store client asks the `PathInfoService` for the `PathInfo` of the output
path in the request, and looks at the root node.
If something other than the root of the store path is requested, like for
example `maintainers/maintainer-list.nix`, the root_node Directory is inspected
and potentially a chain of `Directory` objects requested from
*DirectoryService* [^n+1query].
When the desired file is reached, the *BlobService* can be used to read the
contents of this file, and return it back to the evaluator.
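
Walking such a chain of `Directory` objects can be sketched as below. This is a minimal sketch under stated assumptions: `Node`, `Directory` and `resolve` are hypothetical stand-ins, the *DirectoryService* is modelled as a plain map from digest to directory, and symlinks are not followed.

```rust
use std::collections::HashMap;

/// A 32-byte digest, the size of a BLAKE3 hash.
type Digest = [u8; 32];

/// Stand-in for the three node kinds.
enum Node {
    Directory(Digest),
    File(Digest),
    Symlink(String),
}

/// Stand-in for a `Directory` message: child name to node.
type Directory = HashMap<String, Node>;

/// Resolves a slash-separated `path` relative to `root`, looking up one
/// directory object per path component. Returns `None` if a component is
/// missing, or if the path descends into a file or symlink.
fn resolve<'a>(
    root: &'a Node,
    path: &str,
    directories: &'a HashMap<Digest, Directory>,
) -> Option<&'a Node> {
    let mut current = root;
    for component in path.split('/').filter(|c| !c.is_empty()) {
        match current {
            // Each step corresponds to one directory lookup.
            Node::Directory(digest) => {
                current = directories.get(digest)?.get(component)?;
            }
            // Files and symlinks have no children.
            _ => return None,
        }
    }
    Some(current)
}
```

If the result is a `Node::File`, its digest is what the client would then hand to the *BlobService* to read the contents.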
FUTUREWORK: define how importing from symlinks should/does work.
Contrary to Nix, this has the advantage of not having to copy all of the
contents of a store path to the evaluating machine, but really only fetching
the files the evaluator currently cares about.
### Materializing store paths on disk
This is useful for people running a Tvix-only system, or running builds on a
"Tvix remote builder" in its own mount namespace.
In a system with Nix installed, we can't simply manually "extract" things to
`/nix/store`, as Nix assumes to own all writes to this location.
In these use cases, we're probably better off exposing a tvix-store as a local
binary cache (that's what `//tvix/nar-bridge-go` does).
Assuming we are in an environment where we control `/nix/store` exclusively, a
"realize to disk" would either "extract" things from the `tvix-store` to a
filesystem, or expose a `FUSE`/`virtio-fs` filesystem.
The latter is already implemented, and particularly interesting for (remote)
build workloads, as build inputs can be realized on-demand, which saves copying
around a lot of never-accessed files.
In both cases, the API interactions are similar.
* The *PathInfoService* is asked for the `PathInfo` of the requested store path.
* If everything should be "extracted", the *DirectoryService* is asked for all
`Directory` objects in the closure, the file structure is created, all Blobs
are downloaded and placed in their corresponding location and all symlinks
are created accordingly.
* If this is a FUSE filesystem, we can decide to only request a subset,
similar to the "Reading files from the store from the evaluator" use case,
even though it might make sense to keep all Directory objects around.
(See the caveat in "Trust model" though!)
### Stores communicating with other stores
The gRPC API exposed by the tvix-store allows composing multiple stores, and
implementing some caching strategies, that store clients don't need to be aware
of.
* For example, a caching strategy could have a fast local tvix-store, that's
asked first and filled with data from a slower remote tvix-store.
* Multiple stores could be asked for the same data, and whatever store returns
the right data first wins.
## Trust model / Distribution
As already described above, the only non-content-addressed service is the
`PathInfo` service.
This means all other messages (such as `Blob` and `Directory` messages) can be
substituted from many different, untrusted sources/mirrors, which will make
plugging in additional substitution strategies, such as IPFS or local network
neighbors, very simple. That's also why it's living in the `tvix-castore` crate.
As for `PathInfo`, we don't specify an additional signature mechanism yet, but
carry the NAR-based signatures from Nix along.
This means, if we don't trust a remote `PathInfo` object, we currently need to
"stream" the NAR representation to validate these signatures.
However, the slow part is downloading of NAR files, and considering we have
more granularity available, we might only need to download some small blobs,
rather than a whole NAR file.
A future signature mechanism, that is only signing (parts of) the `PathInfo`
message, which only points to content-addressed data will enable verified
partial access into a store path, opening up opportunities for lazy filesystem
access etc.
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao]: https://github.com/oconnor663/bao
[^input-addressed]: Nix hashes the A-Term representation of a .drv, after doing
some replacements on referred Input Derivations to calculate
output paths.
[^n+1query]: This would expose an N+1 query problem. However it's not a problem
in practice, as there's usually always a "local" caching store in
the loop, and *DirectoryService* supports a recursive lookup for
all `Directory` children of a `Directory`