tvix-store API
==============

This document outlines the design of the API exposed by tvix-store, as
well as other implementations of this store protocol.

This document is meant to be read side-by-side with [castore.md](./castore.md),
which describes the data model in more detail.

The store API has four main consumers:

1. The evaluator (or more correctly, the CLI/coordinator in the Tvix
   case) communicates with the store to:

   * Upload files and directories (e.g. from `builtins.path`, or `src = ./path`
     in Nix expressions).
   * Read files from the store where necessary (e.g. when `nixpkgs` is
     located in the store, or for IFD).

2. The builder communicates with the store to:

   * Upload files and directories after a build, to persist build artifacts in
     the store.

3. Tvix clients (such as users that have Tvix installed, or, depending
   on perspective, builder environments) expect the store to
   "materialise" on disk to provide a directory layout with store
   paths.

4. Stores may communicate with other stores to substitute already built store
   paths, i.e. a store acts as a binary cache for other stores.

The store API attempts to reuse parts of its API between these consumers
by making similarities explicit in the protocol. This leads to a protocol
that is slightly more complex than a simple "file upload/download" system,
but significantly more efficient, both in terms of deduplication
opportunities and granularity.

## The Store model

Contents inside a tvix-store can be grouped into three different message types:

 * Blobs
 * Directories
 * PathInfo (see further down)

(check `castore.md` for more detailed field descriptions)

### Blobs
A blob object contains the literal file contents of regular (or executable)
files.

### Directory
A directory object describes the direct children of a directory.

It contains:

 - the name of child (regular or executable) files, and their [blake3][blake3] hash
 - the name of child symlinks, and their target (as a string)
 - the name of child directories, and their [blake3][blake3] hash (forming a Merkle DAG)
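
To make the shape of these messages concrete, below is a minimal sketch of the
data model in Rust. The type and field names follow the description above but
are illustrative only; the authoritative definitions are the protobuf messages
described in `castore.md`.

```rust
// Illustrative sketch only; the authoritative schema lives in the
// castore protobuf definitions (see castore.md).

/// A blake3 digest, as raw bytes.
type B3Digest = [u8; 32];

/// A child regular (or executable) file.
struct FileNode {
    name: String,
    digest: B3Digest, // blake3 hash of the file contents (the Blob)
    size: u64,        // size of the blob, in bytes
    executable: bool,
}

/// A child symlink, stored by name and target only.
struct SymlinkNode {
    name: String,
    target: String,
}

/// A child directory, referred to by the digest of its own serialised
/// Directory message. This reference-by-digest is what forms the Merkle DAG.
struct DirectoryNode {
    name: String,
    digest: B3Digest,
    size: u64, // number of elements underneath (cf. the JSON examples below)
}

/// A Directory message: the direct children of one directory.
struct Directory {
    directories: Vec<DirectoryNode>,
    files: Vec<FileNode>,
    symlinks: Vec<SymlinkNode>,
}
```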

### Content-addressed Store Model
For example, let's consider a directory layout like this, with some
imaginary hashes of file contents:

```
.
├── file-1.txt        hash: 5891b5b522d5df086d0ff0b110fb
└── nested
    └── file-2.txt    hash: abc6fd595fc079d3114d4b71a4d8
```

A hash for the *directory* `nested` can be created by creating the `Directory`
object:

```json
{
  "directories": [],
  "files": [{
    "name": "file-2.txt",
    "digest": "abc6fd595fc079d3114d4b71a4d8",
    "size": 123
  }],
  "symlinks": []
}
```

And then hashing a serialised form of that data structure. We use the blake3
hash of the canonical protobuf representation. Let's assume the hash was
`ff0029485729bcde993720749232`.
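
As a sketch, computing such a digest with the `blake3` crate could look like
the code below, reusing the `Directory` sketch from above. `encode_canonical`
is a hypothetical stand-in for whatever deterministic protobuf serialisation
is actually used, not a real tvix API.

```rust
/// Hypothetical: serialise a Directory message to its canonical
/// protobuf form (a deterministic byte encoding).
fn encode_canonical(_directory: &Directory) -> Vec<u8> {
    unimplemented!("stand-in for canonical protobuf encoding")
}

/// The digest of a Directory is the blake3 hash of its canonical
/// protobuf serialisation (computed here with the `blake3` crate).
fn directory_digest(directory: &Directory) -> blake3::Hash {
    blake3::hash(&encode_canonical(directory))
}

// A parent Directory then refers to `nested` by this digest:
// DirectoryNode { name: "nested".into(), digest: *digest.as_bytes(), size: 1 }
```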

To create the directory object one layer up, we now refer to our `nested`
directory object in `directories`, and to `file-1.txt` in `files`:

```json
{
  "directories": [{
    "name": "nested",
    "digest": "ff0029485729bcde993720749232",
    "size": 1
  }],
  "files": [{
    "name": "file-1.txt",
    "digest": "5891b5b522d5df086d0ff0b110fb",
    "size": 124
  }]
}
```

This Merkle DAG of Directory objects, and a flat store of blobs, can be used to
describe any file/directory/symlink inside a store path. Due to its
content-addressed nature, it'll automatically deduplicate (re-)used
(sub)directories, and allow substitution from any (untrusted) source.

The only thing still missing is the metadata to map/"mount" from the
content-addressed world to a physical path.

### PathInfo
As most paths in the Nix store currently are input-addressed [^input-addressed],
we need something mapping from an input-addressed "output path hash" to the
contents in the content-addressed world.

That's what `PathInfo` provides. It embeds the root node (Directory, File or
Symlink) at a given store path.

The root node's `name` field is populated with the (base)name inside
`/nix/store`, so `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3`.

The `PathInfo` message also stores references to other store paths, and some
more NARInfo-specific metadata (signatures, narhash, narsize).
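
Sketched in the same style as the castore types above, a `PathInfo` message
might carry fields like these. The field names are guesses for illustration;
the real message is defined in the tvix-store protobuf.

```rust
// Illustrative sketch of PathInfo; reuses the DirectoryNode/FileNode/
// SymlinkNode sketches from the castore section above.

/// The root node of a store path: a directory, file or symlink.
enum Node {
    Directory(DirectoryNode),
    File(FileNode),
    Symlink(SymlinkNode),
}

struct PathInfo {
    /// Root node; its `name` is the store path basename,
    /// e.g. "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-pname-1.2.3".
    node: Node,
    /// Output path hashes of other store paths this one references.
    references: Vec<Vec<u8>>,
    /// NARInfo-specific metadata carried along from Nix.
    narinfo: Option<NarInfo>,
}

struct NarInfo {
    nar_size: u64,
    nar_sha256: [u8; 32],
    signatures: Vec<String>, // e.g. "cache.nixos.org-1:..."
}
```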

## API overview

There are three different services:

### BlobService
`BlobService` can be used to store and retrieve blobs of data, used to host
regular file contents.

It is content-addressed, using [blake3][blake3] as a hashing function.

As blake3 is a tree hash, there's an opportunity to do
[verified streaming][bao] of parts of the file, which doesn't require
trusting any more information than the root hash itself.
Future extensions of the `BlobService` protocol will enable this.
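
A minimal sketch of what the `BlobService` contract boils down to. The actual
protocol is gRPC-based and streams chunks rather than passing whole buffers
around; this reduced version just shows the content-addressed get/put shape.

```rust
/// Reduced sketch of the BlobService contract: content-addressed
/// storage of raw file contents, keyed by blake3 digest.
trait BlobService {
    /// Check whether a blob with the given blake3 digest exists.
    fn has(&self, digest: &[u8; 32]) -> bool;

    /// Retrieve the contents of a blob by its blake3 digest.
    fn get(&self, digest: &[u8; 32]) -> Option<Vec<u8>>;

    /// Store a blob, returning the blake3 digest it can be
    /// retrieved under.
    fn put(&mut self, data: &[u8]) -> [u8; 32];
}
```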

### DirectoryService
`DirectoryService` allows lookups (and uploads) of `Directory` messages, and
whole reference graphs of them.

### PathInfoService
The PathInfo service provides lookups from a store path hash to a `PathInfo`
message.
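
In the same reduced style, the other two services might look like the sketch
below. Again, the real services are gRPC, and the recursive `Directory` lookup
is a server-side stream; method names here are illustrative.

```rust
/// Reduced sketch of DirectoryService and PathInfoService;
/// reuses the Directory/PathInfo sketches from above.
trait DirectoryService {
    /// Look up a single Directory message by its blake3 digest.
    fn get(&self, digest: &[u8; 32]) -> Option<Directory>;

    /// Look up a Directory and all Directories it transitively
    /// refers to (the whole reference graph).
    fn get_recursive(&self, root_digest: &[u8; 32]) -> Vec<Directory>;

    /// Upload a Directory; its children must already be present.
    fn put(&mut self, directory: Directory) -> [u8; 32];
}

trait PathInfoService {
    /// Look up a PathInfo message by the store path hash
    /// (the digest part of the store path basename).
    fn get(&self, store_path_hash: &[u8; 20]) -> Option<PathInfo>;

    /// Upload a PathInfo message.
    fn put(&mut self, path_info: PathInfo);
}
```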

## Example flows

Below are some common use cases of tvix-store, and how the different
services are used.

### Upload files and directories
This is needed for `builtins.path` or `src = ./path` in Nix expressions (A), as
well as for uploading build artifacts to a store (B).

The path specified needs to be (recursively, BFS-style) traversed.

 * All file contents need to be hashed with blake3 and submitted to the
   *BlobService*, if not already present.
   A reference to them needs to be added to the parent Directory object that's
   being constructed.
 * All symlinks need to be added to the parent directory they reside in.
 * Whenever a Directory has been fully traversed, it needs to be uploaded to
   the *DirectoryService*, and a reference to it needs to be added to the
   parent Directory object.

Most of the hashing / directory traversal / uploading can happen in parallel,
as long as Directory objects only refer to Directory objects and Blobs that
have already been uploaded.

When reaching the root, a `PathInfo` object needs to be constructed (a sketch
of the whole ingestion flow is shown after this list).

 * In the case of content-addressed paths (A), the name of the root node is
   based on the NAR representation of the contents.
   It might make sense to be able to offload the NAR calculation to the store,
   which can cache it.
 * In the case of build artifacts (B), the output path is input-addressed and
   known upfront.
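
A rough sketch of this ingestion flow, reusing the hypothetical types and
service traits from above. It is shown as depth-first recursion for brevity;
any traversal order works, as long as Blobs and child Directories are uploaded
before the Directory referring to them. Error handling and parallelism are
omitted.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch: ingest a local path into the store, returning the root
/// Directory. The caller uploads that Directory and constructs a
/// PathInfo pointing at it.
fn ingest_path(
    blobs: &mut dyn BlobService,
    dirs: &mut dyn DirectoryService,
    path: &Path,
) -> io::Result<Directory> {
    let mut directory = Directory {
        directories: vec![],
        files: vec![],
        symlinks: vec![],
    };

    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        let file_type = entry.file_type()?;

        if file_type.is_dir() {
            // Recurse first, upload the child Directory, then record
            // a reference to it in the parent.
            let child = ingest_path(blobs, dirs, &entry.path())?;
            let size = (child.directories.len()
                + child.files.len()
                + child.symlinks.len()) as u64; // size semantics hand-waved
            let digest = dirs.put(child);
            directory.directories.push(DirectoryNode { name, digest, size });
        } else if file_type.is_symlink() {
            // Symlinks are recorded by name and target only.
            let target = fs::read_link(entry.path())?
                .to_string_lossy()
                .into_owned();
            directory.symlinks.push(SymlinkNode { name, target });
        } else {
            // Regular file: submit the contents to the BlobService.
            let contents = fs::read(entry.path())?;
            let size = contents.len() as u64;
            let digest = blobs.put(&contents);
            let executable = false; // a real implementation checks permission bits
            directory.files.push(FileNode { name, digest, size, executable });
        }
    }

    Ok(directory)
}
```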

In contrast to Nix, this has the advantage of not having to upload a lot of
things to the store that didn't change.

### Reading files from the store from the evaluator
This is the case when `nixpkgs` is located in the store, or for IFD in general.

The store client asks the `PathInfoService` for the `PathInfo` of the output
path in the request, and looks at the root node.

If something other than the root of the store path is requested, like for
example `maintainers/maintainer-list.nix`, the root_node Directory is inspected
and potentially a chain of `Directory` objects is requested from the
*DirectoryService* [^n+1query].

When the desired file is reached, the *BlobService* can be used to read the
contents of this file, and return it to the evaluator.
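
A sketch of this lookup, again on top of the reduced service traits from
above. Error handling is hand-waved and symlink traversal is left out (see the
FUTUREWORK note below).

```rust
/// Sketch: resolve `sub_path` (e.g. "maintainers/maintainer-list.nix")
/// inside a store path, and read the file's contents.
fn read_file_at(
    path_infos: &dyn PathInfoService,
    dirs: &dyn DirectoryService,
    blobs: &dyn BlobService,
    store_path_hash: &[u8; 20],
    sub_path: &str,
) -> Option<Vec<u8>> {
    let path_info = path_infos.get(store_path_hash)?;

    // Start at the root node; it must be a directory to descend into it.
    let Node::Directory(root) = path_info.node else { return None };
    let mut current = dirs.get(&root.digest)?;

    let mut components = sub_path.split('/').peekable();
    while let Some(component) = components.next() {
        if components.peek().is_some() {
            // Intermediate component: must be a subdirectory, so fetch
            // the next Directory object in the chain.
            let child = current
                .directories
                .iter()
                .find(|d| d.name == component)?;
            current = dirs.get(&child.digest)?;
        } else {
            // Last component: look it up as a file and fetch the blob.
            let file = current.files.iter().find(|f| f.name == component)?;
            return blobs.get(&file.digest);
        }
    }
    None
}
```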

FUTUREWORK: define how importing from symlinks should/does work.

In contrast to Nix, this has the advantage of not having to copy all of the
contents of a store path to the evaluating machine, but only fetching the
files the evaluator currently cares about.

### Materializing store paths on disk
This is useful for people running a Tvix-only system, or running builds on a
"Tvix remote builder" in its own mount namespace.

In a system with Nix installed, we can't simply manually "extract" things to
`/nix/store`, as Nix assumes it owns all writes to this location.
In these use cases, we're probably better off exposing a tvix-store as a local
binary cache (that's what nar-bridge does).

Assuming we are in an environment where we control `/nix/store` exclusively, a
"realize to disk" would either "extract" things from the tvix-store to a
filesystem, or expose a FUSE filesystem. The latter would be particularly
interesting for remote build workloads, as build inputs can be realized
on-demand, which saves copying around a lot of never-accessed files.

In both cases, the API interactions are similar (a sketch of the "extract"
variant follows this list).

 * The *PathInfoService* is asked for the `PathInfo` of the requested store path.
 * If everything should be "extracted", the *DirectoryService* is asked for all
   `Directory` objects in the closure, the file structure is created, all Blobs
   are downloaded and placed in their corresponding location, and all symlinks
   are created accordingly.
 * If this is a FUSE filesystem, we can decide to only request a subset,
   similar to the "Reading files from the store from the evaluator" use case,
   even though it might make sense to keep all Directory objects around.
   (See the caveat in "Trust model" though!)
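
For the "extract everything" case, a sketch could look like this, again built
on the reduced traits from above, with error handling and the executable bit
glossed over.

```rust
use std::fs;
use std::path::Path;

/// Sketch: recursively write a Directory (and everything below it)
/// to `target` on disk.
fn extract(
    dirs: &dyn DirectoryService,
    blobs: &dyn BlobService,
    directory: &Directory,
    target: &Path,
) -> std::io::Result<()> {
    fs::create_dir_all(target)?;

    for file in &directory.files {
        // Fetch each blob and place it in its corresponding location.
        let contents = blobs.get(&file.digest).expect("missing blob");
        fs::write(target.join(&file.name), contents)?;
        // A real implementation would also set the executable bit here.
    }

    #[cfg(unix)]
    for symlink in &directory.symlinks {
        std::os::unix::fs::symlink(&symlink.target, target.join(&symlink.name))?;
    }

    for child in &directory.directories {
        // Descend into each child Directory of the closure.
        let child_dir = dirs.get(&child.digest).expect("missing Directory");
        extract(dirs, blobs, &child_dir, &target.join(&child.name))?;
    }

    Ok(())
}
```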

### Stores communicating with other stores
The gRPC API exposed by tvix-store allows composing multiple stores, and
implementing caching strategies that store clients don't need to be aware of
(one such composition is sketched below).

 * For example, a caching strategy could have a fast local tvix-store that's
   asked first and filled with data from a slower remote tvix-store.
 * Multiple stores could be asked for the same data, and whichever store
   returns the right data first wins.
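
As a sketch of the first strategy, a caching combinator over two `BlobService`
implementations (local-first, falling back to remote) could look like the
following. It uses the reduced `BlobService` trait from above; clients see a
single `BlobService` and stay unaware of the layering.

```rust
/// Sketch: a combined BlobService that tries a fast local store first,
/// and falls back to a slower remote store.
struct CachingBlobService<L: BlobService, R: BlobService> {
    local: L,
    remote: R,
}

impl<L: BlobService, R: BlobService> BlobService for CachingBlobService<L, R> {
    fn has(&self, digest: &[u8; 32]) -> bool {
        self.local.has(digest) || self.remote.has(digest)
    }

    fn get(&self, digest: &[u8; 32]) -> Option<Vec<u8>> {
        // Ask the fast local store first…
        if let Some(data) = self.local.get(digest) {
            return Some(data);
        }
        // …then fall back to the remote one. (A mutable handle would be
        // needed to also fill the local store with the result here.)
        self.remote.get(digest)
    }

    fn put(&mut self, data: &[u8]) -> [u8; 32] {
        // Writes go to the local store; whether to also upload to the
        // remote store is a policy decision left out of this sketch.
        self.local.put(data)
    }
}
```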

## Trust model / Distribution

As already described above, the only non-content-addressed service is the
`PathInfo` service.

This means all other messages (such as `Blob` and `Directory` messages) can be
substituted from many different, untrusted sources/mirrors, which makes
plugging in additional substitution strategies, like IPFS or local network
neighbors, very simple.

As for `PathInfo`, we don't specify an additional signature mechanism yet, but
carry the NAR-based signatures from Nix along.

This means that if we don't trust a remote `PathInfo` object, we currently need
to "stream" the NAR representation to validate these signatures (a sketch of
this check is shown below).
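
A sketch of that check, with the NAR rendering behind a hypothetical
`render_nar` helper (a NAR writer is out of scope here), comparing against the
`NarInfo` metadata sketched earlier. Verifying the ed25519 signatures over the
NARInfo fingerprint is elided.

```rust
use sha2::{Digest, Sha256};

/// Hypothetical: render the NAR serialisation of the given root node,
/// fetching Directory and Blob data through the services as needed.
fn render_nar(
    _dirs: &dyn DirectoryService,
    _blobs: &dyn BlobService,
    _node: &Node,
) -> Vec<u8> {
    unimplemented!("stand-in for a NAR writer")
}

/// Sketch: validate an untrusted PathInfo by re-rendering the NAR and
/// comparing size and sha256 against the carried NARInfo metadata.
fn validate_path_info(
    dirs: &dyn DirectoryService,
    blobs: &dyn BlobService,
    path_info: &PathInfo,
) -> bool {
    let Some(narinfo) = &path_info.narinfo else { return false };

    let nar = render_nar(dirs, blobs, &path_info.node);
    if nar.len() as u64 != narinfo.nar_size {
        return false;
    }

    let digest: [u8; 32] = Sha256::digest(&nar).into();
    digest == narinfo.nar_sha256
}
```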

However, the slow part is downloading the NAR files, and considering we have
more granularity available, we might only need to download some small blobs,
rather than a whole NAR file.

A future signature mechanism that signs only (parts of) the `PathInfo` message,
which in turn points only to content-addressed data, will enable verified
partial access into a store path, opening up opportunities for lazy filesystem
access, which is very useful in remote builder scenarios.

[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao]: https://github.com/oconnor663/bao
[^input-addressed]: Nix hashes the ATerm representation of a .drv, after doing
                    some replacements on referenced input derivations, to
                    calculate output paths.
[^n+1query]: This would expose an N+1 query problem. However, it's not a problem
             in practice, as there's usually a "local" caching store in the
             loop, and *DirectoryService* supports a recursive lookup of all
             `Directory` children of a `Directory`.