This came up recently again, it makes sense to document the reasoning behind the decision. Change-Id: Ic51d5bc7998c70e8b070b6f42877d8e88613935b Reviewed-on: https://cl.tvl.fyi/c/depot/+/8223 Reviewed-by: raitobezarius <tvl@lahfa.xyz> Tested-by: BuildkiteCI Autosubmit: flokli <flokli@flokli.de>
		
			
				
	
	
		
			57 lines
		
	
	
		
			No EOL
		
	
	
		
			2.7 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			57 lines
		
	
	
		
			No EOL
		
	
	
		
			2.7 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| ## Why not git tree objects?
 | |
| 
 | |
| We've been experimenting with (some variations of) the git tree and object
 | |
| format, and ultimately decided against using it as an internal format, and
 | |
| instead adapted the one documented in the other documents here.
 | |
| 
 | |
| While the tvix-store API protocol shares some similarities with the format used
 | |
| in git for trees and objects, the git one has shown some significant
 | |
| disadvantages:
 | |
| 
 | |
| ### The binary encoding itself
 | |
| 
 | |
| #### trees
 | |
| The git tree object format is a very binary, error-prone and
 | |
| "made-to-be-read-and-written-from-C" format.
 | |
| 
 | |
| Tree objects are a combination of null-terminated strings, and fields of known
 | |
| length. References to other tree objects use the literal sha1 hash of another
 | |
| tree object in this encoding.
 | |
| Extensions of the format/changes are very hard to do right, because parsers are
 | |
| not aware they might be parsing something different.
 | |
| 
 | |
| The tvix-store protocol uses a canonical protobuf serialization, and uses
 | |
| the [blake3][blake3] hash of that serialization to point to other `Directory`
 | |
| messages.
 | |
| It's both compact and with a wide range of libraries for encoders and decoders
 | |
| in many programming languages.
 | |
| The choice of protobuf makes it easy to add new fields, and make old clients
 | |
| aware of some unknown fields being detected [^adding-fields].
 | |
| 
 | |
| #### blob
 | |
| On disk, git blob objects start with a "blob" prefix, then the size of the
 | |
| payload, and then the data itself. The hash of a blob is the literal sha1sum
 | |
| over all of this - which makes it something very git specific to request for.
 | |
| 
 | |
| tvix-store simply uses the [blake3][blake3] hash of the literal contents
 | |
| when referring to a file/blob, which makes it very easy to ask other data
 | |
| sources for the same data, as no git-specific payload is included in the hash.
 | |
| This also plays very well together with things like [iroh][iroh-discussion],
 | |
| which plans to provide a way to substitute (large)blobs by their blake3 hash
 | |
| over the IPFS network.
 | |
| 
 | |
| In addition to that, [blake3][blake3] makes it possible to do
 | |
| [verified streaming][bao], as already described in other parts of the
 | |
| documentation.
 | |
| 
 | |
| The git tree object format uses sha1 both for references to other trees and
 | |
| hashes of blobs, which isn't really a hash function to fundamentally base
 | |
| everything on in 2023.
 | |
| The [migration to sha256][git-sha256] also has been dead for some years now,
 | |
| and it's unclear how a "blake3" version of this would even look like.
 | |
| 
 | |
| [bao]: https://github.com/oconnor663/bao
 | |
| [blake3]: https://github.com/BLAKE3-team/BLAKE3
 | |
| [git-sha256]: https://git-scm.com/docs/hash-function-transition/
 | |
| [iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
 | |
| [^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect. |