chore(tvix/docs): move [ca]store docs to tvix/docs
Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: flokli <flokli@flokli.de>
Parent: adc7353bd1
Commit: 6947dc4349

7 changed files with 8 additions and 1 deletion
@@ -4,6 +4,13 @@
- [Architecture & data flow](./architecture.md)
- [TODOs](./TODO.md)

# Store
- [Store API](./store/api.md)
- [BlobStore Chunking](./castore/blobstore-chunking.md)
- [BlobStore Protocol](./castore/blobstore-protocol.md)
- [Data Model](./castore/data-model.md)
- [Why not git trees?](./castore/why-not-git-trees.md)

# Nix
- [Specification of the Nix Language](./language-spec.md)
- [Nix language version history](./lang-version.md)

tvix/docs/src/castore/blobstore-chunking.md (new file, 147 lines)
@@ -0,0 +1,147 @@
# BlobStore: Chunking & Verified Streaming

`tvix-castore`'s BlobStore is a content-addressed storage system, using
[blake3] as its hash function.

Returned data is fetched by using the digest as lookup key, and can be verified
to be correct by feeding the received data through the hash function and
ensuring it matches the digest initially used for the lookup.

This means data can also be downloaded from any untrusted third party, as the
received data is validated to match the digest it was originally requested with.
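
A minimal sketch of this lookup-then-verify flow, using the Rust [blake3]
crate (illustration only, not tvix code; `fetch_blob_from_peer` is a
hypothetical stand-in for whatever transport is used):

```rust
/// Hypothetical transport helper: download raw bytes from an untrusted peer.
fn fetch_blob_from_peer(_digest: &[u8; 32]) -> Vec<u8> {
    b"hello world".to_vec()
}

/// Fetch a blob by digest and only hand it out if it verifies.
fn fetch_and_verify(expected_digest: &[u8; 32]) -> Result<Vec<u8>, String> {
    let data = fetch_blob_from_peer(expected_digest);

    // Re-hash the received bytes and compare against the digest that was
    // used for the lookup.
    let got = blake3::hash(&data);
    if got.as_bytes() == expected_digest {
        Ok(data)
    } else {
        Err(format!("digest mismatch: got {}", got.to_hex()))
    }
}
```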

However, for larger blobs of data, having to download the entire blob at once
is wasteful if we only care about a part of the blob. Think about mounting a
seekable data structure, like loop-mounting an .iso file, or doing partial
reads in a large Parquet file, a column-oriented data format.

> We want to have the possibility to *seek* into a larger file.

This, however, shouldn't compromise on data integrity properties - we should
not need to trust a peer we're downloading from to be "honest" about the
partial data we're reading. We should be able to verify smaller reads.

Especially when substituting from an untrusted third party, we want to be able
to detect quickly if that third party is sending us wrong data, and terminate
the connection early.

## Chunking
In content-addressed systems, this problem has historically been solved by
breaking larger blobs into smaller chunks, which can be fetched individually,
and making a hash of *this listing* the blob digest/identifier.

 - BitTorrent for example breaks files up into smaller chunks, and maintains
   a list of sha1 digests for each of these chunks. Magnet links contain a
   digest over this listing as an identifier. (See [here for more
   details][bittorrent-v2].)
   With the identifier, a client can fetch the entire list, and then
   recursively "unpack the graph" of nodes, until it ends up with a list of
   individual small chunks, which can be fetched individually.
 - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
   digest of the root node as an identifier.

These approaches solve the problem of being able to fetch smaller chunks in a
trusted fashion. They can also do some deduplication, in case the same leaf
nodes appear in multiple places.

However, they also have a big disadvantage: the chunking parameters, and the
"topology" of the graph structure itself, "bleed" into the root hash of the
entire data structure.

Depending on the chunking parameters used, there are different representations
of the same data, causing less data sharing/reuse in the overall system - both
in terms of how many chunks need to be downloaded vs. are already available
locally, and in how compactly data is stored on disk.

This can be worked around by agreeing on only a single way of chunking, but
it's not pretty and misses a lot of deduplication potential.

### Chunking in Tvix' Blobstore
tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
disadvantages, while still being content-addressed internally, with the
highlighted benefits.

It uses [blake3] as its hash function, and the blake3 digest of **the raw data
itself** as an identifier (rather than some application-specific Merkle DAG
that also embeds some chunking information).

BLAKE3 is a tree hash in which all left nodes are fully populated, contrary to
conventional serial hash functions. To be able to validate the hash of a node,
one only needs the hashes of its (2) children [^1], if any.

This means one only needs the root digest to validate a construction, and these
constructions can be sent [separately][bao-spec].

This relieves us from the need to encode more granular chunking into our data
model / identifier upfront, and makes this mostly a transport/storage concern.

For some more description on the (remote) protocol, check
`./blobstore-protocol.md`.

#### Logical vs. physical chunking

Due to the properties of the BLAKE3 hash function, we have logical blocks of
1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
or stored on disk.

The only thing we need in order to read and verify an arbitrary byte range is
the covering range of aligned 1K blocks, and a construction from the root
digest to those 1K blocks.

Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
lower parts of the tree while still providing verified streaming - at the cost
of having to fetch larger covering ranges of aligned blocks.

Let's pick an example. We identify each KiB by a number here for illustration
purposes.

Assuming we omit the last two layers of the hash tree, we end up with logical
4KiB leaf chunks (a `bao_shift` of `2`).

For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:

`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`

Assuming the server now informs us about the following physical chunking:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
```

If our application now wants to arbitrarily read from 0 until 4 (inclusive):

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
  |------------|
```

…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and
`[ 7 8 ]`.

`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, as they contain the data we're
interested in.

We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
can assemble `[ 4 5 6 7 ]` to verify both logical chunks:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
  ^         ^ ^             ^
  |--4KiB---| |----4KiB-----|
```

Each physical chunk fetched can be validated to have the blake3 digest that was
communicated upfront, and can be stored in a client-side cache/storage, so
subsequent / other requests for the same data will be fast(er).
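
As a rough illustration of the block arithmetic (not tvix code; the
`covering_blocks` helper and `BASE_BLOCK_SIZE` constant are hypothetical), a
function computing which logical blocks cover a requested byte range for a
given `bao_shift` might look like this:

```rust
/// BLAKE3 operates on 1 KiB chunks; a `bao_shift` of `n` groups 2^n of them
/// into one logical block (so a shift of 2 gives 4KiB blocks).
const BASE_BLOCK_SIZE: u64 = 1024;

/// Returns the inclusive range of logical block indices covering
/// `offset..offset + len`.
fn covering_blocks(offset: u64, len: u64, bao_shift: u8) -> (u64, u64) {
    let block_size = BASE_BLOCK_SIZE << bao_shift;
    let first = offset / block_size;
    let last = (offset + len.max(1) - 1) / block_size;
    (first, last)
}

fn main() {
    // Reading KiB 0 until 4 (inclusive), i.e. bytes 0..5120, with a
    // `bao_shift` of 2 needs logical blocks 0 and 1 - the 4KiB chunks
    // covering KiB [0..=3] and [4..=7] in the example above.
    assert_eq!(covering_blocks(0, 5 * 1024, 2), (0, 1));
}
```

Which physical chunks then have to be fetched to assemble those logical blocks
depends entirely on the chunk list the server advertises, as shown above.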

---

[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree

[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree

tvix/docs/src/castore/blobstore-protocol.md (new file, 104 lines)
@@ -0,0 +1,104 @@
# BlobStore: Protocol / Composition

This document describes the protocol that BlobStore uses to substitute blobs
from other ("remote") BlobStores.

How to come up with the blake3 digest of the blob to fetch is left to another
layer in the stack.

To put this into the context of Tvix as a Nix alternative, a blob represents an
individual file inside a StorePath.
In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
`root_node` in a `PathInfo` message, or an individual file inside a `Directory`
message) encode a BLAKE3 digest.

However, the whole infrastructure can be applied to other use cases requiring
exchange/storage of, or access into, data of which the blake3 digest is known.

## Protocol and Interfaces
As an RPC protocol, BlobStore currently uses gRPC.

On the Rust side of things, every blob service implements the
[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
gRPC-specific.

This `BlobService` trait provides functionality to check for the existence of
blobs, read from blobs, and write new blobs.
It also provides a method to ask for more granular chunks, if they are
available.
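
A simplified sketch of such a trait is shown below. This is an illustration
only - the actual trait in `tvix-castore` differs in its digest, reader/writer
and chunk-metadata types:

```rust
use std::io;

use tokio::io::{AsyncRead, AsyncWrite};

/// Simplified stand-in for the castore blake3 digest type.
type B3Digest = [u8; 32];

#[async_trait::async_trait]
pub trait BlobService: Send + Sync {
    /// Check if a blob with the given digest exists.
    async fn has(&self, digest: &B3Digest) -> io::Result<bool>;

    /// Open a blob for reading, if it exists.
    async fn open_read(
        &self,
        digest: &B3Digest,
    ) -> io::Result<Option<Box<dyn AsyncRead + Send + Unpin>>>;

    /// Start writing a new blob. How the final digest is communicated back
    /// (e.g. by closing a dedicated writer type) is omitted here.
    async fn open_write(&self) -> Box<dyn AsyncWrite + Send + Unpin>;

    /// Ask for more granular chunks of a blob, if available, as a list of
    /// (chunk digest, chunk size) pairs.
    async fn chunks(&self, digest: &B3Digest)
        -> io::Result<Option<Vec<(B3Digest, u64)>>>;
}
```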

In addition to some in-memory, on-disk and (soon) object-storage-based
implementations, we also have a `BlobService` implementation that talks to a
gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
service for anything implementing the `BlobService` trait.

This makes it very easy to talk to a remote `BlobService`, which does not even
need to be written in the same language, as long as it speaks the same gRPC
protocol.

It also puts very few requirements on someone implementing a new
`BlobService`, or on what its internal storage or chunking algorithm looks
like.

The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
Contrary to the `BlobService` trait, it does not have any options for seeking/
ranging, as it's more desirable to provide this through chunking (see also
`./blobstore-chunking.md`).

## Composition
Different `BlobStore`s are supposed to be "composed"/"layered" to express
caching, or multiple local and remote sources.

The fronting interface can stay the same; it'd just be multiple "tiers" that
can respond to requests, depending on where the data resides. [^1]

This makes it very simple for consumers, as they don't need to be aware of the
entire substitutor config.

The flexibility of this doesn't need to be exposed to the user in the default
case; in most cases we should be fine with some form of on-disk storage and a
bunch of substituters with different priorities.
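
A sketch of the composition idea, in terms of the simplified `BlobService`
trait from above (again an illustration, not the actual combinators in
tvix-castore; the `Combined` type is hypothetical): a combined service that
tries a fast local tier first, then a remote one.

```rust
use std::io;
use std::sync::Arc;

use tokio::io::AsyncRead;

/// Hypothetical two-tier combinator over the `BlobService` sketch above.
pub struct Combined<L, R> {
    local: Arc<L>,
    remote: Arc<R>,
}

impl<L: BlobService, R: BlobService> Combined<L, R> {
    pub async fn open_read(
        &self,
        digest: &B3Digest,
    ) -> io::Result<Option<Box<dyn AsyncRead + Send + Unpin>>> {
        // Try the local tier first ...
        if let Some(reader) = self.local.open_read(digest).await? {
            return Ok(Some(reader));
        }
        // ... and fall back to the remote tier. A real implementation would
        // likely also populate the local tier with whatever it fetched.
        self.remote.open_read(digest).await
    }
}
```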

### gRPC Clients
Clients are encouraged to always read blobs in a chunked fashion (asking for a
list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
`BlobService.Read()` as needed), instead of directly reading the entire blob
via `BlobService.Read()`.

In a composition setting, this provides an opportunity for caching, and avoids
downloading some chunks if they're already present locally (for example,
because they were already downloaded by reading from a similar blob earlier).
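
A sketch of this chunk-aware read strategy, again in terms of the simplified
trait from above rather than the actual gRPC stubs (illustration only; the
`read_chunked` helper is hypothetical):

```rust
use tokio::io::AsyncReadExt;

/// Read a blob chunk-by-chunk, preferring chunks a local store already has,
/// and only fetching the missing ones from the remote.
async fn read_chunked(
    local: &impl BlobService,
    remote: &impl BlobService,
    blob_digest: &B3Digest,
) -> std::io::Result<Option<Vec<u8>>> {
    // Equivalent of `BlobService.Stat()`: ask for the chunk list.
    let Some(chunks) = remote.chunks(blob_digest).await? else {
        return Ok(None); // the remote doesn't know this blob
    };

    let mut out = Vec::new();
    for (chunk_digest, _size) in chunks {
        // Prefer chunks already present locally; otherwise do the equivalent
        // of a `BlobService.Read()` for that single chunk.
        let mut reader = match local.open_read(&chunk_digest).await? {
            Some(r) => r,
            None => remote.open_read(&chunk_digest).await?.ok_or_else(|| {
                std::io::Error::new(
                    std::io::ErrorKind::NotFound,
                    "remote advertised a chunk it can't serve",
                )
            })?,
        };
        reader.read_to_end(&mut out).await?;
    }
    Ok(Some(out))
}
```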

It also removes the need for seeking to be a part of the gRPC protocol
altogether, as chunks are supposed to be "reasonably small" [^2].

There's some further optimization potential: a `BlobService.Stat()` request
could tell the server it's happy with very small blobs just being inlined in
an additional field in the response, which would allow clients to populate
their local chunk store in a single roundtrip.

## Verified Streaming
As already described in `./blobstore-chunking.md`, the physical chunk
information sent in a `BlobService.Stat()` response is still sufficient to
fetch in an authenticated fashion.

The exact protocol and formats are still a bit in flux, but here are some
notes:

 - The `BlobService.Stat()` request gets a `send_bao` field (bool), signalling
   that a [BAO][bao-spec] should be sent. It could also be a `bao_shift`
   integer, signalling how detailed (down to the leaf chunks) it should go.
   The exact format (and request fields) still need to be defined; edef has
   some ideas around omitting some intermediate hash nodes over the wire and
   recomputing them, reducing size by another ~50% over [bao-tree].
 - The `BlobService.Stat()` response gets some bao-related fields (a
   `bao_shift` field signalling the actual format/shift level the server
   replies with, the actual bao, and maybe some format specifier).
   It would be nice to also be compatible with the baos used by [iroh], so we
   can provide an implementation using it too.

---

[^1]: We might want to have some backchannel, so it becomes possible to
      provide feedback to the user that something is being downloaded.
[^2]: Something between 512K-4M, TBD.

[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree
[iroh]: https://github.com/n0-computer/iroh

tvix/docs/src/castore/data-model.md (new file, 50 lines)
@@ -0,0 +1,50 @@
# Data model

This provides some more notes on the fields used in castore.proto.

See `//tvix/store/docs/api.md` for the full context.

## Directory message
`Directory` messages use the blake3 hash of their canonical protobuf
serialization as their identifier.

A `Directory` message contains three lists, `directories`, `files` and
`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
respectively. They describe all the direct child elements that are contained
in a directory.

All three message types have a `name` field, specifying the (base)name of the
element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or
'..').
For reproducibility reasons, the lists MUST be sorted by that name and the
names MUST be unique across all three lists.
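
A sketch of these validity rules (illustration only, not the actual validation
code in tvix-castore; the `valid_name`/`validate_names` helpers are
hypothetical and plain byte slices stand in for the protobuf node types):

```rust
use std::collections::BTreeSet;

/// A name must be non-empty, not "." or "..", and contain no slashes or
/// null bytes.
fn valid_name(name: &[u8]) -> bool {
    !name.is_empty()
        && name != b"."
        && name != b".."
        && !name.contains(&b'/')
        && !name.contains(&b'\0')
}

/// Check the three name lists of a Directory message: each list sorted by
/// name, every name valid, and no name repeated across the lists.
fn validate_names(
    directories: &[&[u8]],
    files: &[&[u8]],
    symlinks: &[&[u8]],
) -> Result<(), String> {
    let mut seen = BTreeSet::new();
    for list in [directories, files, symlinks] {
        if !list.windows(2).all(|w| w[0] < w[1]) {
            return Err("list is not sorted by name (or has duplicates)".into());
        }
        for &name in list {
            if !valid_name(name) {
                return Err(format!("invalid name: {:?}", name));
            }
            if !seen.insert(name.to_vec()) {
                return Err(format!("name used more than once: {:?}", name));
            }
        }
    }
    Ok(())
}
```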

In addition to the `name` field, the various *Node messages have the following
fields:

## DirectoryNode
A `DirectoryNode` message represents a child directory.

It has a `digest` field, which points to the identifier of another `Directory`
message, making a `Directory` a Merkle tree (or strictly speaking, a graph, as
two elements pointing to a child directory with the same contents would point
to the same `Directory` message).

There's also a `size` field, containing the (total) number of all child
elements in the referenced `Directory`, which helps with inode calculation.

## FileNode
A `FileNode` message represents a child (regular) file.

Its `digest` field contains the blake3 hash of the file contents. It can be
looked up in the `BlobService`.

The `size` field contains the size of the blob the `digest` field refers to.

The `executable` field specifies whether the file should be marked as
executable or not.

## SymlinkNode
A `SymlinkNode` message represents a child symlink.

In addition to the `name` field, the only additional field is the `target`,
which is a string containing the target of the symlink.

tvix/docs/src/castore/why-not-git-trees.md (new file, 57 lines)
@@ -0,0 +1,57 @@
## Why not git tree objects?

We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adapted the one documented in the other documents here.

While the tvix-store API protocol shares some similarities with the format used
in git for trees and objects, the git one has shown some significant
disadvantages:

### The binary encoding itself

#### trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.

Tree objects are a combination of null-terminated strings and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of/changes to the format are very hard to do right, because parsers
are not aware they might be parsing something different.

The tvix-store protocol uses a canonical protobuf serialization, and uses
the [blake3][blake3] hash of that serialization to point to other `Directory`
messages.
It's both compact and has a wide range of encoder and decoder libraries in
many programming languages.
The choice of protobuf makes it easy to add new fields, and makes old clients
aware when unknown fields are detected [^adding-fields].

#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git-specific to request.

tvix-store simply uses the [blake3][blake3] hash of the literal contents
when referring to a file/blob, which makes it very easy to ask other data
sources for the same data, as no git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large) blobs by their blake3 hash
over the IPFS network.

In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.

The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] has also been dead for some years now,
and it's unclear what a "blake3" version of this would even look like.

[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197

[^adding-fields]: Obviously, adding new fields will change hashes, but it's
                  something that's easy to detect.

tvix/docs/src/store/api.md (new file, 288 lines)
@@ -0,0 +1,288 @@
tvix-[ca]store API
==============

This document outlines the design of the API exposed by tvix-castore and
tvix-store, as well as other implementations of this store protocol.

This document is meant to be read side-by-side with
[castore.md](../../castore/docs/data-model.md) which describes the data model
in more detail.

The store API has four main consumers:

1. The evaluator (or more correctly, the CLI/coordinator, in the Tvix
   case) communicates with the store to:

   * Upload files and directories (e.g. from `builtins.path`, or `src = ./path`
     Nix expressions).
   * Read files from the store where necessary (e.g. when `nixpkgs` is
     located in the store, or for IFD).

2. The builder communicates with the store to:

   * Upload files and directories after a build, to persist build artifacts in
     the store.

3. Tvix clients (such as users that have Tvix installed, or, depending
   on perspective, builder environments) expect the store to
   "materialise" on disk to provide a directory layout with store
   paths.

4. Stores may communicate with other stores, to substitute already built store
   paths, i.e. a store acts as a binary cache for other stores.

The store API attempts to reuse parts of its API between these consumers by
making similarities explicit in the protocol. This leads to a protocol that is
slightly more complex than a simple "file upload/download" system, but one
that is significantly more efficient, both in terms of deduplication
opportunities and granularity.

## The Store model

Contents inside a tvix-store can be grouped into three different message types:

 * Blobs
 * Directories
 * PathInfo (see further down)

(check `castore.md` for more detailed field descriptions)

### Blobs
A blob object contains the literal file contents of regular (or executable)
files.

### Directory
A directory object describes the direct children of a directory.

It contains:
 - the names of child (regular or executable) files, and their [blake3][blake3] hash
 - the names of child symlinks, and their targets (as strings)
 - the names of child directories, and their [blake3][blake3] hash (forming a Merkle DAG)

### Content-addressed Store Model
For example, let's consider a directory layout like this, with some
imaginary hashes of file contents:

```
.
├── file-1.txt        hash: 5891b5b522d5df086d0ff0b110fb
└── nested
    └── file-2.txt    hash: abc6fd595fc079d3114d4b71a4d8
```

A hash for the *directory* `nested` can be created by creating the `Directory`
object:

```json
{
  "directories": [],
  "files": [{
    "name": "file-2.txt",
    "digest": "abc6fd595fc079d3114d4b71a4d8",
    "size": 123
  }],
  "symlinks": []
}
```

And then hashing a serialised form of that data structure. We use the blake3
hash of the canonical protobuf representation. Let's assume the hash was
`ff0029485729bcde993720749232`.
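
A minimal sketch of how such an identifier can be derived in Rust, assuming a
prost-generated `Directory` message built according to the canonical ordering
rules (illustration only, not the actual tvix-store code; `message_digest` is
a hypothetical helper):

```rust
use prost::Message;

/// Hash the protobuf serialization of a (canonically ordered) message with
/// blake3 to obtain its identifier.
fn message_digest(msg: &impl Message) -> blake3::Hash {
    blake3::hash(&msg.encode_to_vec())
}
```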

To create the directory object one layer up, we now refer to our `nested`
directory object in `directories`, and to `file-1.txt` in `files`:

```json
{
  "directories": [{
    "name": "nested",
    "digest": "ff0029485729bcde993720749232",
    "size": 1
  }],
  "files": [{
    "name": "file-1.txt",
    "digest": "5891b5b522d5df086d0ff0b110fb",
    "size": 124
  }]
}
```

This Merkle DAG of Directory objects, and the flat store of blobs, can be used
to describe any file/directory/symlink inside a store path. Due to its
content-addressed nature, it'll automatically deduplicate (re-)used
(sub)directories, and allow substitution from any (untrusted) source.

The only thing that's now missing is the metadata to map/"mount" from the
content-addressed world to a physical path.

### PathInfo
As most paths in the Nix store currently are input-addressed
[^input-addressed], and the `tvix-castore` data model is also not
intrinsically using NAR hashes, we need something mapping from an
input-addressed "output path hash" (or a Nix-specific content-addressed path)
to the contents in the `tvix-castore` world.

That's what `PathInfo` provides. It embeds the root node (Directory, File or
Symlink) at a given store path.

The root node's `name` field is populated with the (base)name inside
`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`.

The `PathInfo` message also stores references to other store paths, and some
more NARInfo-specific metadata (signatures, narhash, narsize).

## API overview

There are three different services:

### BlobService
`BlobService` can be used to store and retrieve blobs of data, used to host
regular file contents.

It is content-addressed, using [blake3][blake3]
as a hashing function.

As blake3 is a tree hash, there's an opportunity to do
[verified streaming][bao] of parts of the file,
which doesn't require trusting any more information than the root hash itself.
Future extensions of the `BlobService` protocol will enable this.

### DirectoryService
`DirectoryService` allows lookups (and uploads) of `Directory` messages, and
whole reference graphs of them.


### PathInfoService
The PathInfo service provides lookups from a store path hash to a `PathInfo`
message.

## Example flows

Below are some common use cases of tvix-store, and how the different
services are used.

### Upload files and directories
This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as
well as for uploading build artifacts to a store (B).

The path specified needs to be (recursively, BFS-style) traversed.
 * All file contents need to be hashed with blake3, and submitted to the
   *BlobService* if not already present.
   A reference to them needs to be added to the parent Directory object that's
   constructed.
 * All symlinks need to be added to the parent directory they reside in.
 * Whenever a Directory has been fully traversed, it needs to be uploaded to
   the *DirectoryService* and a reference to it needs to be added to the parent
   Directory object.

Most of the hashing / directory traversal/uploading can happen in parallel,
as long as Directory objects only refer to Directory objects and Blobs that
have already been uploaded.
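
A sequential sketch of this children-before-parent walk (illustration only,
not the actual ingestion code; `blob_service_put`, `directory_service_put`,
`is_executable` and the `Node` enum are hypothetical stand-ins, and
parallelism and error details are glossed over):

```rust
use std::ffi::OsString;
use std::fs;
use std::path::{Path, PathBuf};

enum Node {
    Directory { digest: [u8; 32] },
    File { digest: [u8; 32], size: u64, executable: bool },
    Symlink { target: PathBuf },
}

/// Hypothetical stand-in: upload file contents to the BlobService.
fn blob_service_put(contents: &[u8]) -> [u8; 32] {
    *blake3::hash(contents).as_bytes()
}

/// Hypothetical stand-in: upload a Directory message, return its digest.
fn directory_service_put(_children: Vec<(OsString, Node)>) -> [u8; 32] {
    [0; 32]
}

fn is_executable(meta: &fs::Metadata) -> bool {
    use std::os::unix::fs::PermissionsExt;
    meta.permissions().mode() & 0o100 != 0
}

fn ingest_path(path: &Path) -> std::io::Result<Node> {
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        Ok(Node::Symlink { target: fs::read_link(path)? })
    } else if meta.is_file() {
        // Hash and upload the contents, then reference them from the parent.
        let contents = fs::read(path)?;
        Ok(Node::File {
            digest: blob_service_put(&contents),
            size: contents.len() as u64,
            executable: is_executable(&meta),
        })
    } else {
        // Ingest all children first, then upload the Directory message and
        // refer to it by its digest from the parent.
        let mut children = Vec::new();
        for entry in fs::read_dir(path)? {
            let entry = entry?;
            children.push((entry.file_name(), ingest_path(&entry.path())?));
        }
        Ok(Node::Directory { digest: directory_service_put(children) })
    }
}
```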

When reaching the root, a `PathInfo` object needs to be constructed.

 * In the case of content-addressed paths (A), the name of the root node is
   based on the NAR representation of the contents.
   It might make sense to be able to offload the NAR calculation to the store,
   which can cache it.
 * In the case of build artifacts (B), the output path is input-addressed and
   known upfront.

Contrary to Nix, this has the advantage of not having to upload a lot of things
to the store that didn't change.

### Reading files from the store from the evaluator
This is the case when `nixpkgs` is located in the store, or for IFD in general.

The store client asks the `PathInfoService` for the `PathInfo` of the output
path in the request, and looks at the root node.

If something other than the root of the store path is requested, like for
example `maintainers/maintainer-list.nix`, the root_node Directory is inspected
and potentially a chain of `Directory` objects requested from the
*DirectoryService* [^n+1query].

When the desired file is reached, the *BlobService* can be used to read the
contents of this file and return them to the evaluator.
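
A sketch of that lookup chain (illustration only; the `*Like` traits and the
`Node` enum are hypothetical simplifications of the real services and castore
types, and error handling is reduced to `Option`):

```rust
type Digest = [u8; 32];

enum Node {
    Directory { digest: Digest },
    File { digest: Digest },
    Symlink { target: String },
}

trait PathInfoServiceLike {
    /// Look up the root node for a store path hash.
    fn root_node(&self, store_path_hash: &[u8; 20]) -> Option<Node>;
}

trait DirectoryServiceLike {
    /// Fetch a Directory message and return its children by name.
    fn get(&self, digest: &Digest) -> Option<Vec<(String, Node)>>;
}

trait BlobServiceLike {
    fn read(&self, digest: &Digest) -> Option<Vec<u8>>;
}

fn read_file_in_store_path(
    path_infos: &impl PathInfoServiceLike,
    directories: &impl DirectoryServiceLike,
    blobs: &impl BlobServiceLike,
    store_path_hash: &[u8; 20],
    relative_path: &[&str], // e.g. ["maintainers", "maintainer-list.nix"]
) -> Option<Vec<u8>> {
    // 1. PathInfoService: store path hash -> root node.
    let mut node = path_infos.root_node(store_path_hash)?;

    // 2. DirectoryService: walk one Directory message per path component.
    for component in relative_path {
        let Node::Directory { digest } = node else { return None };
        let children = directories.get(&digest)?;
        node = children.into_iter().find(|(name, _)| name == component)?.1;
    }

    // 3. BlobService: read the file contents of the final FileNode.
    match node {
        Node::File { digest } => blobs.read(&digest),
        _ => None, // symlinks along the path are not handled in this sketch
    }
}
```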

FUTUREWORK: define how importing from symlinks should/does work.

Contrary to Nix, this has the advantage of not having to copy all of the
contents of a store path to the evaluating machine, but really only fetching
the files the evaluator currently cares about.

### Materializing store paths on disk
This is useful for people running a Tvix-only system, or running builds on a
"Tvix remote builder" in its own mount namespace.

In a system with Nix installed, we can't simply manually "extract" things to
`/nix/store`, as Nix assumes it owns all writes to this location.
In these use cases, we're probably better off exposing a tvix-store as a local
binary cache (that's what `//tvix/nar-bridge-go` does).

Assuming we are in an environment where we control `/nix/store` exclusively, a
"realize to disk" would either "extract" things from the `tvix-store` to a
filesystem, or expose a `FUSE`/`virtio-fs` filesystem.

The latter is already implemented, and particularly interesting for (remote)
build workloads, as build inputs can be realized on-demand, which saves copying
around a lot of never-accessed files.

In both cases, the API interactions are similar.
 * The *PathInfoService* is asked for the `PathInfo` of the requested store path.
 * If everything should be "extracted", the *DirectoryService* is asked for all
   `Directory` objects in the closure, the file structure is created, all Blobs
   are downloaded and placed in their corresponding location and all symlinks
   are created accordingly.
 * If this is a FUSE filesystem, we can decide to only request a subset,
   similar to the "Reading files from the store from the evaluator" use case,
   even though it might make sense to keep all Directory objects around.
   (See the caveat in "Trust model" though!)

### Stores communicating with other stores
The gRPC API exposed by the tvix-store allows composing multiple stores, and
implementing caching strategies that store clients don't need to be aware of.

 * For example, a caching strategy could have a fast local tvix-store that's
   asked first and filled with data from a slower remote tvix-store.

 * Multiple stores could be asked for the same data, and whatever store returns
   the right data first wins.


## Trust model / Distribution
As already described above, the only non-content-addressed service is the
`PathInfo` service.

This means all other messages (such as `Blob` and `Directory` messages) can be
substituted from many different, untrusted sources/mirrors, which makes
plugging in additional substitution strategies, like IPFS or local network
neighbors, super simple. That's also why it lives in the `tvix-castore` crate.

As for `PathInfo`, we don't specify an additional signature mechanism yet, but
carry the NAR-based signatures from Nix along.

This means that if we don't trust a remote `PathInfo` object, we currently need
to "stream" the NAR representation to validate these signatures.

However, the slow part is the downloading of NAR files, and considering we have
more granularity available, we might only need to download some small blobs,
rather than a whole NAR file.

A future signature mechanism that only signs (parts of) the `PathInfo` message,
which itself only points to content-addressed data, will enable verified
partial access into a store path, opening up opportunities for lazy filesystem
access etc.


[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao]: https://github.com/oconnor663/bao

[^input-addressed]: Nix hashes the A-Term representation of a .drv, after doing
                    some replacements on referred Input Derivations, to
                    calculate output paths.
[^n+1query]: This would expose an N+1 query problem. However, it's not a
             problem in practice, as there's usually always a "local" caching
             store in the loop, and *DirectoryService* supports a recursive
             lookup for all `Directory` children of a `Directory`.