chore(tvix/docs): move [ca]store docs to tvix/docs
Change-Id: Idd78ffae34b6ea7b93d13de73b98c61a348869fb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11808
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: flokli <flokli@flokli.de>
parent adc7353bd1
commit 6947dc4349
7 changed files with 8 additions and 1 deletion
tvix/docs/src/castore/blobstore-chunking.md (new file, 147 lines)
@@ -0,0 +1,147 @@
# BlobStore: Chunking & Verified Streaming

`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
as hash function.

Returned data is fetched by using the digest as lookup key, and can be verified
to be correct by feeding the received data through the hash function and
ensuring it matches the digest initially used for the lookup.

This means data can be downloaded by any untrusted third-party as well, as the
received data is validated to match the digest it was originally requested with.
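
As a minimal sketch of that check, using the [blake3] Rust crate (the helper
name is illustrative, not part of the tvix-castore API):

```rust
/// Verify that `data` received from an untrusted peer matches the blake3
/// digest it was originally requested with.
fn verify_blob(expected: &blake3::Hash, data: &[u8]) -> bool {
    // Re-hash the received bytes; blake3::Hash compares in constant time.
    blake3::hash(data) == *expected
}
```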

However, for larger blobs of data, having to download the entire blob at once is
wasteful if we only care about a part of the blob. Think about mounting a
seekable data structure, like loop-mounting an .iso file, or doing partial reads
in a large Parquet file, a column-oriented data format.

> We want to have the possibility to *seek* into a larger file.

This however shouldn't compromise on data integrity properties - we should not
need to trust a peer we're downloading from to be "honest" about the partial
data we're reading. We should be able to verify smaller reads.

Especially when substituting from an untrusted third-party, we want to be able
to detect quickly if that third-party is sending us wrong data, and terminate
the connection early.

## Chunking
In content-addressed systems, this problem has historically been solved by
breaking larger blobs into smaller chunks, which can be fetched individually,
and making a hash of *this listing* the blob digest/identifier.

 - BitTorrent for example breaks files up into smaller chunks, and maintains
   a list of sha1 digests for each of these chunks. Magnet links contain a
   digest over this listing as an identifier. (See [here for more
   details][bittorrent-v2].)
   With the identifier, a client can fetch the entire list, and then recursively
   "unpack the graph" of nodes, until it ends up with a list of individual small
   chunks, which can be fetched individually.
 - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
   digest of the root node as an identifier.

These approaches solve the problem of being able to fetch smaller chunks in a
trusted fashion. They can also do some deduplication, in case the same leaf
nodes appear in multiple places.

However, they also have a big disadvantage: the chunking parameters, and the
"topology" of the graph structure itself, "bleed" into the root hash of the
entire data structure.

Depending on the chunking parameters used, there are different representations
for the same data, causing less data sharing/reuse in the overall system, in
terms of how many chunks need to be downloaded vs. are already available
locally, as well as how compact data is stored on-disk.

This can be worked around by agreeing on only a single way of chunking, but it's
not pretty and misses a lot of deduplication potential.
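
A small sketch of the problem (`listing_root` is a hypothetical helper, not
tvix code): an identifier built by hashing the *listing of chunk hashes*
changes with the chunking parameters, even though the underlying data is
identical.

```rust
/// Hash each chunk, then hash the concatenation of the chunk digests -
/// the "listing-based" identifier used by the systems described above.
fn listing_root(data: &[u8], chunk_size: usize) -> blake3::Hash {
    let mut listing = blake3::Hasher::new();
    for chunk in data.chunks(chunk_size) {
        listing.update(blake3::hash(chunk).as_bytes());
    }
    listing.finalize()
}

fn main() {
    let data = vec![42u8; 8192];
    // Same data, different chunk sizes -> different root identifiers,
    // while blake3::hash(&data) would be the same either way.
    assert_ne!(listing_root(&data, 1024), listing_root(&data, 2048));
}
```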

### Chunking in Tvix' Blobstore
tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
disadvantages, while still being content-addressed internally, with the
highlighted benefits.

It uses [blake3] as hash function, and the blake3 digest of **the raw data
itself** as an identifier (rather than some application-specific Merkle DAG that
also embeds some chunking information).

BLAKE3 is a tree hash where all left nodes are fully populated, contrary to
conventional serial hash functions. To be able to validate the hash of a node,
one only needs the hash of its (2) children [^1], if any.

This means one only needs the root digest to validate a construction, and these
constructions can be sent [separately][bao-spec].
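
For a feel of the mechanics, here is a sketch using the Rust [bao][bao-spec]
crate's slice API (sizes and offsets are arbitrary): a peer extracts a slice of
the encoding, and the client verifies it against nothing but the root digest.

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    // The root hash is the plain blake3 digest of the raw input,
    // independent of any slice parameters.
    let input = vec![0u8; 1_000_000];
    let (encoded, hash) = bao::encode::encode(&input);

    // A peer holding the encoding extracts a slice covering a byte range...
    let (slice_start, slice_len) = (65536, 8192);
    let cursor = std::io::Cursor::new(&encoded);
    let mut extractor = bao::encode::SliceExtractor::new(cursor, slice_start, slice_len);
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // ...and the client decodes it, verifying every chunk against the root
    // digest it already knows. Corrupted bytes make the read fail.
    let mut decoder = bao::decode::SliceDecoder::new(&*slice, &hash, slice_start, slice_len);
    let mut decoded = Vec::new();
    decoder.read_to_end(&mut decoded)?;
    assert_eq!(&decoded[..], &input[65536..65536 + 8192]);
    Ok(())
}
```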

This relieves us from the need to encode more granular chunking into our
data model / identifier upfront, and makes this mostly a transport/storage
concern.

For some more description on the (remote) protocol, check
`./blobstore-protocol.md`.

#### Logical vs. physical chunking

Due to the properties of the BLAKE3 hash function, we have logical blocks of
1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
or are stored on-disk.

The only thing we need to be able to read and verify an arbitrary byte range is
the covering range of aligned 1KiB blocks, and a construction from the root
digest down to those blocks.

Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
lower parts of the tree while still providing verified streaming - at the cost
of having to fetch larger covering ranges of aligned blocks.

Let's pick an example. We identify each KiB by a number here for illustration
purposes.

Assuming we omit the last two layers of the hash tree, we end up with logical
4KiB leaf chunks (`bao_shift` of `2`).

For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:

`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
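
Mapping a byte range to its covering logical chunks is plain integer
arithmetic; a sketch (the helper name is illustrative):

```rust
/// Compute which logical chunks cover an (inclusive) byte range.
/// With 1KiB base blocks and a `bao_shift` of 2, logical chunks are 4KiB.
fn covering_logical_chunks(
    start: u64,
    end_inclusive: u64,
    bao_shift: u32,
) -> std::ops::RangeInclusive<u64> {
    let chunk_size = 1024u64 << bao_shift; // 4096 bytes for bao_shift = 2
    // Reading bytes 0..=4096 yields chunks 0..=1, i.e. the logical blocks
    // [0..=3] and [4..=7] in the KiB numbering used above.
    (start / chunk_size)..=(end_inclusive / chunk_size)
}
```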

Assuming the server now informs us about the following physical chunking:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
```

If our application now wants to arbitrarily read from 0 until 4 (inclusive):

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
|-------------|
```

…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]`, `[ 6 ]` and `[ 7 8 ]`.

`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, as they contain the data we're
interested in.

We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
can assemble `[ 4 5 6 7 ]` to verify both logical chunks:

```
[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 ]
|----4KiB----|------4KiB-----|
```

Each physical chunk fetched can be validated to have the blake3 digest that was
communicated upfront, and can be stored in a client-side cache/storage, so
subsequent / other requests for the same data will be fast(er).

---

[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree

[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree
tvix/docs/src/castore/blobstore-protocol.md (new file, 104 lines)
@@ -0,0 +1,104 @@
# BlobStore: Protocol / Composition

This document describes the protocol that BlobStore uses to substitute blobs
from other ("remote") BlobStores.

How to come up with the blake3 digest of the blob to fetch is left to another
layer in the stack.

To put this into the context of Tvix as a Nix alternative, a blob represents an
individual file inside a StorePath.
In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
`root_node` in a `PathInfo` message, or an individual file inside a `Directory`
message) encode a BLAKE3 digest.

However, the whole infrastructure can be applied to other use cases requiring
exchange/storage of, or access into, data of which the blake3 digest is known.

## Protocol and Interfaces
As an RPC protocol, BlobStore currently uses gRPC.

On the Rust side of things, every blob service implements the
[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
gRPC-specific.

This `BlobService` trait provides functionality to check for existence of Blobs,
read from blobs, and write new blobs.
It also provides a method to ask for more granular chunks if they are available.
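
Roughly, the shape of that trait looks like the following sketch (names and
signatures are approximations, not the real definitions, which live in
`../src/blobservice/mod.rs`):

```rust
use std::io;

/// Illustrative description of a chunk: its own blake3 digest and its size.
pub struct ChunkMeta {
    pub digest: [u8; 32],
    pub size: u64,
}

#[async_trait::async_trait]
pub trait BlobService {
    /// Check whether a blob with the given blake3 digest exists.
    async fn has(&self, digest: &[u8; 32]) -> io::Result<bool>;

    /// Open a blob for reading, if it exists.
    async fn open_read(
        &self,
        digest: &[u8; 32],
    ) -> io::Result<Option<Box<dyn tokio::io::AsyncRead + Send + Unpin>>>;

    /// Open a writer for a new blob; finalizing it yields the blob's digest.
    async fn open_write(&self) -> Box<dyn tokio::io::AsyncWrite + Send + Unpin>;

    /// Ask for more granular chunks of a blob, if the store has them.
    async fn chunks(&self, digest: &[u8; 32]) -> io::Result<Option<Vec<ChunkMeta>>>;
}
```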

In addition to some in-memory, on-disk and (soon) object-storage-based
implementations, we also have a `BlobService` implementation that talks to a
gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
service for anything implementing the `BlobService` trait.

This makes it very easy to talk to a remote `BlobService`, which does not even
need to be written in the same language, as long as it speaks the same gRPC
protocol.

It also puts very few requirements on someone implementing a new
`BlobService`, or on how its internal storage or chunking algorithm looks.

The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
Contrary to the `BlobService` trait, it does not have any options for seeking/
ranging, as it's more desirable to provide this through chunking (see also
`./blobstore-chunking.md`).

## Composition
Different `BlobService`s are supposed to be "composed"/"layered" to express
caching, multiple local and remote sources.

The fronting interface can be the same; it'd just be multiple "tiers" that can
respond to requests, depending on where the data resides. [^1]

This makes it very simple for consumers, as they don't need to be aware of the
entire substitutor config.

The flexibility of this doesn't need to be exposed to the user in the default
case; in most cases we should be fine with some form of on-disk storage and a
bunch of substituters with different priorities.
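
As a sketch of the layering idea (hypothetical `Layered` type, building on the
illustrative `BlobService` trait sketched above):

```rust
/// A two-tier store that answers from a fast local tier first and only
/// falls back to a remote substitutor on a miss.
struct Layered<L, R> {
    local: L,  // fast tier, e.g. on-disk
    remote: R, // substitutor, e.g. a gRPC client
}

impl<L: BlobService, R: BlobService> Layered<L, R> {
    async fn has(&self, digest: &[u8; 32]) -> std::io::Result<bool> {
        // Ask the fast tier first; only consult the remote tier on a miss.
        if self.local.has(digest).await? {
            return Ok(true);
        }
        self.remote.has(digest).await
    }
}
```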

### gRPC Clients
Clients are encouraged to always read blobs in a chunked fashion (asking for a
list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
`BlobService.Read()` as needed), instead of directly reading the entire blob via
`BlobService.Read()`.

In a composition setting, this provides opportunity for caching, and avoids
downloading some chunks if they're already present locally (for example, because
they were already downloaded by reading from a similar blob earlier).
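
A self-contained sketch of that flow (`fetch_blob`, `stat`, `read` and the
`HashMap` cache are illustrative stand-ins for the gRPC calls and a real chunk
store; `ChunkMeta` is the illustrative struct from the trait sketch above):

```rust
use std::collections::HashMap;

/// Fetch a blob chunk-by-chunk: Stat() for the chunk list, then Read() only
/// the chunks missing from the local cache, verifying each one.
fn fetch_blob(
    stat: impl Fn(&[u8; 32]) -> Vec<ChunkMeta>,
    read: impl Fn(&[u8; 32]) -> Vec<u8>,
    local: &mut HashMap<[u8; 32], Vec<u8>>,
    blob_digest: &[u8; 32],
) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in stat(blob_digest) {
        let data = local.entry(chunk.digest).or_insert_with(|| {
            // Only fetch chunks we don't have yet, and verify each one
            // against the digest communicated upfront.
            let data = read(&chunk.digest);
            assert_eq!(blake3::hash(&data), blake3::Hash::from(chunk.digest));
            data
        });
        out.extend_from_slice(data);
    }
    out
}
```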

It also removes the need for seeking to be a part of the gRPC protocol
altogether, as chunks are supposed to be "reasonably small" [^2].

There's some further optimization potential: a `BlobService.Stat()` request
could tell the server it's happy with very small blobs just being inlined in
an additional field in the response, which would allow clients to populate
their local chunk store in a single roundtrip.

## Verified Streaming
As already described in `./blobstore-chunking.md`, the physical chunk
information sent in a `BlobService.Stat()` response is still sufficient to fetch
in an authenticated fashion.

The exact protocol and formats are still a bit in flux, but here are some notes:

 - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling that
   a [BAO][bao-spec] should be sent. Could also be a `bao_shift` integer,
   signalling how detailed (down to the leaf chunks) it should go.
   The exact format (and request fields) still need to be defined; edef has some
   ideas around omitting some intermediate hash nodes over the wire and
   recomputing them, reducing size by another ~50% over [bao-tree].
 - `BlobService.Stat()` response gets some bao-related fields (a `bao_shift`
   field, signalling the actual format/shift level the server replies with, the
   actual bao, and maybe some format specifier).
   It would be nice to also be compatible with the baos used by [iroh], so we
   can provide an implementation using it too.

---

[^1]: We might want to have some backchannel, so it becomes possible to provide
      feedback to the user that something is downloaded.
[^2]: Something between 512K-4M, TBD.

[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
[bao-tree]: https://github.com/n0-computer/bao-tree
[iroh]: https://github.com/n0-computer/iroh
tvix/docs/src/castore/data-model.md (new file, 50 lines)
@@ -0,0 +1,50 @@
# Data model

This provides some more notes on the fields used in castore.proto.

See `//tvix/store/docs/api.md` for the full context.

## Directory message
`Directory` messages use the blake3 hash of their canonical protobuf
serialization as their identifier.
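
A sketch of computing that identifier (via prost's `encode_to_vec`; note the
real implementation requires a *canonical* serialization, which is stricter
than what a plain protobuf encoder guarantees):

```rust
/// The identifier of a `Directory` message is the blake3 hash of its
/// (canonical) protobuf serialization.
fn directory_digest<M: prost::Message>(directory: &M) -> blake3::Hash {
    blake3::hash(&directory.encode_to_vec())
}
```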

A `Directory` message contains three lists, `directories`, `files` and
`symlinks`, holding `DirectoryNode`, `FileNode` and `SymlinkNode` messages
respectively. They describe all the direct child elements that are contained in
a directory.

All three message types have a `name` field, specifying the (base)name of the
element (which MUST NOT contain slashes or null bytes, and MUST NOT be '.' or '..').
For reproducibility reasons, the lists MUST be sorted by that name and the
name MUST be unique across all three lists.

In addition to the `name` field, the various *Node messages have the following
fields:

## DirectoryNode
A `DirectoryNode` message represents a child directory.

It has a `digest` field, which points to the identifier of another `Directory`
message, making a `Directory` a Merkle tree (or strictly speaking, a graph, as
two elements pointing to a child directory with the same contents would point
to the same `Directory` message).

There's also a `size` field, containing the (total) number of all child
elements in the referenced `Directory`, which helps with inode calculation.

## FileNode
A `FileNode` message represents a child (regular) file.

Its `digest` field contains the blake3 hash of the file contents. It can be
looked up in the `BlobService`.

The `size` field contains the size of the blob the `digest` field refers to.

The `executable` field specifies whether the file should be marked as
executable or not.

## SymlinkNode
A `SymlinkNode` message represents a child symlink.

In addition to the `name` field, the only other field is `target`, a string
containing the target of the symlink.
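
To summarize the above, an illustrative Rust mirror of the message shapes (the
canonical definitions live in castore.proto; field types are approximations):

```rust
/// Identified by the blake3 hash of its canonical protobuf serialization.
struct Directory {
    directories: Vec<DirectoryNode>, // all three lists sorted by name,
    files: Vec<FileNode>,            // names unique across the lists
    symlinks: Vec<SymlinkNode>,
}

struct DirectoryNode {
    name: Vec<u8>,    // basename; no slashes or null bytes, not "." or ".."
    digest: [u8; 32], // identifier of the referenced Directory message
    size: u64,        // total number of child elements in that Directory
}

struct FileNode {
    name: Vec<u8>,
    digest: [u8; 32], // blake3 hash of the file contents (a blob)
    size: u64,        // size of that blob in bytes
    executable: bool,
}

struct SymlinkNode {
    name: Vec<u8>,
    target: Vec<u8>, // symlink target
}
```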
tvix/docs/src/castore/why-not-git-trees.md (new file, 57 lines)
@@ -0,0 +1,57 @@
## Why not git tree objects?

We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adapted the one documented in the other documents here.

While the tvix-store API protocol shares some similarities with the format used
in git for trees and objects, the git one has shown some significant
disadvantages:

### The binary encoding itself

#### trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.

Tree objects are a combination of null-terminated strings, and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of the format/changes are very hard to do right, because parsers are
not aware they might be parsing something different.

The tvix-store protocol uses a canonical protobuf serialization, and uses
the [blake3][blake3] hash of that serialization to point to other `Directory`
messages.
It's both compact and has a wide range of encoder and decoder libraries
available in many programming languages.
The choice of protobuf makes it easy to add new fields, and makes old clients
aware when unknown fields are detected [^adding-fields].

#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git-specific to request.

tvix-store simply uses the [blake3][blake3] hash of the literal contents
when referring to a file/blob, which makes it very easy to ask other data
sources for the same data, as no git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large) blobs by their blake3 hash
over the IPFS network.
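
The contrast in a sketch (using the RustCrypto `sha1` crate and the `blake3`
crate; helper names are illustrative):

```rust
use sha1::{Digest, Sha1};

/// How git addresses a blob: sha1 over a "blob <size>\0" header plus the
/// data, so the identifier is inherently git-specific.
fn git_blob_id(data: &[u8]) -> [u8; 20] {
    let mut h = Sha1::new();
    h.update(format!("blob {}\0", data.len()).as_bytes());
    h.update(data);
    h.finalize().into()
}

/// How tvix-store addresses a blob: blake3 over the literal contents only.
fn tvix_blob_id(data: &[u8]) -> [u8; 32] {
    *blake3::hash(data).as_bytes()
}
```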

In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.

The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] also has been dead for some years now,
and it's unclear what a "blake3" version of this would even look like.

[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197

[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.