350 lines
		
	
	
	
		
			15 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
			
		
		
	
	
			350 lines
		
	
	
	
		
			15 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
Git Commit Graph Design Notes
 | 
						|
=============================
 | 
						|
 | 
						|
Git walks the commit graph for many reasons, including:
 | 
						|
 | 
						|
1. Listing and filtering commit history.
 | 
						|
2. Computing merge bases.
 | 
						|
 | 
						|
These operations can become slow as the commit count grows. The merge
 | 
						|
base calculation shows up in many user-facing commands, such as 'merge-base'
 | 
						|
or 'status' and can take minutes to compute depending on history shape.
 | 
						|
 | 
						|
There are two main costs here:
 | 
						|
 | 
						|
1. Decompressing and parsing commits.
 | 
						|
2. Walking the entire graph to satisfy topological order constraints.
 | 
						|
 | 
						|
The commit-graph file is a supplemental data structure that accelerates
 | 
						|
commit graph walks. If a user downgrades or disables the 'core.commitGraph'
 | 
						|
config setting, then the existing ODB is sufficient. The file is stored
 | 
						|
as "commit-graph" either in the .git/objects/info directory or in the info
 | 
						|
directory of an alternate.
 | 
						|
 | 
						|
The commit-graph file stores the commit graph structure along with some
 | 
						|
extra metadata to speed up graph walks. By listing commit OIDs in lexi-
 | 
						|
cographic order, we can identify an integer position for each commit and
 | 
						|
refer to the parents of a commit using those integer positions. We use
 | 
						|
binary search to find initial commits and then use the integer positions
 | 
						|
for fast lookups during the walk.
 | 
						|
 | 
						|
A consumer may load the following info for a commit from the graph:
 | 
						|
 | 
						|
1. The commit OID.
 | 
						|
2. The list of parents, along with their integer position.
 | 
						|
3. The commit date.
 | 
						|
4. The root tree OID.
 | 
						|
5. The generation number (see definition below).
 | 
						|
 | 
						|
Values 1-4 satisfy the requirements of parse_commit_gently().
 | 
						|
 | 
						|
Define the "generation number" of a commit recursively as follows:
 | 
						|
 | 
						|
 * A commit with no parents (a root commit) has generation number one.
 | 
						|
 | 
						|
 * A commit with at least one parent has generation number one more than
 | 
						|
   the largest generation number among its parents.
 | 
						|
 | 
						|
Equivalently, the generation number of a commit A is one more than the
 | 
						|
length of a longest path from A to a root commit. The recursive definition
 | 
						|
is easier to use for computation and observing the following property:
 | 
						|
 | 
						|
    If A and B are commits with generation numbers N and M, respectively,
 | 
						|
    and N <= M, then A cannot reach B. That is, we know without searching
 | 
						|
    that B is not an ancestor of A because it is further from a root commit
 | 
						|
    than A.
 | 
						|
 | 
						|
    Conversely, when checking if A is an ancestor of B, then we only need
 | 
						|
    to walk commits until all commits on the walk boundary have generation
 | 
						|
    number at most N. If we walk commits using a priority queue seeded by
 | 
						|
    generation numbers, then we always expand the boundary commit with highest
 | 
						|
    generation number and can easily detect the stopping condition.
 | 
						|
 | 
						|
This property can be used to significantly reduce the time it takes to
 | 
						|
walk commits and determine topological relationships. Without generation
 | 
						|
numbers, the general heuristic is the following:
 | 
						|
 | 
						|
    If A and B are commits with commit time X and Y, respectively, and
 | 
						|
    X < Y, then A _probably_ cannot reach B.
 | 
						|
 | 
						|
This heuristic is currently used whenever the computation is allowed to
 | 
						|
violate topological relationships due to clock skew (such as "git log"
 | 
						|
with default order), but is not used when the topological order is
 | 
						|
required (such as merge base calculations, "git log --graph").
 | 
						|
 | 
						|
In practice, we expect some commits to be created recently and not stored
 | 
						|
in the commit graph. We can treat these commits as having "infinite"
 | 
						|
generation number and walk until reaching commits with known generation
 | 
						|
number.
 | 
						|
 | 
						|
We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
 | 
						|
in the commit-graph file. If a commit-graph file was written by a version
 | 
						|
of Git that did not compute generation numbers, then those commits will
 | 
						|
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
 | 
						|
 | 
						|
Since the commit-graph file is closed under reachability, we can guarantee
 | 
						|
the following weaker condition on all commits:
 | 
						|
 | 
						|
    If A and B are commits with generation numbers N amd M, respectively,
 | 
						|
    and N < M, then A cannot reach B.
 | 
						|
 | 
						|
Note how the strict inequality differs from the inequality when we have
 | 
						|
fully-computed generation numbers. Using strict inequality may result in
 | 
						|
walking a few extra commits, but the simplicity in dealing with commits
 | 
						|
with generation number *_INFINITY or *_ZERO is valuable.
 | 
						|
 | 
						|
We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
 | 
						|
generation numbers are computed to be at least this value. We limit at
 | 
						|
this value since it is the largest value that can be stored in the
 | 
						|
commit-graph file using the 30 bits available to generation numbers. This
 | 
						|
presents another case where a commit can have generation number equal to
 | 
						|
that of a parent.
 | 
						|
 | 
						|
Design Details
 | 
						|
--------------
 | 
						|
 | 
						|
- The commit-graph file is stored in a file named 'commit-graph' in the
 | 
						|
  .git/objects/info directory. This could be stored in the info directory
 | 
						|
  of an alternate.
 | 
						|
 | 
						|
- The core.commitGraph config setting must be on to consume graph files.
 | 
						|
 | 
						|
- The file format includes parameters for the object ID hash function,
 | 
						|
  so a future change of hash algorithm does not require a change in format.
 | 
						|
 | 
						|
- Commit grafts and replace objects can change the shape of the commit
 | 
						|
  history. The latter can also be enabled/disabled on the fly using
 | 
						|
  `--no-replace-objects`. This leads to difficultly storing both possible
 | 
						|
  interpretations of a commit id, especially when computing generation
 | 
						|
  numbers. The commit-graph will not be read or written when
 | 
						|
  replace-objects or grafts are present.
 | 
						|
 | 
						|
- Shallow clones create grafts of commits by dropping their parents. This
 | 
						|
  leads the commit-graph to think those commits have generation number 1.
 | 
						|
  If and when those commits are made unshallow, those generation numbers
 | 
						|
  become invalid. Since shallow clones are intended to restrict the commit
 | 
						|
  history to a very small set of commits, the commit-graph feature is less
 | 
						|
  helpful for these clones, anyway. The commit-graph will not be read or
 | 
						|
  written when shallow commits are present.
 | 
						|
 | 
						|
Commit Graphs Chains
 | 
						|
--------------------
 | 
						|
 | 
						|
Typically, repos grow with near-constant velocity (commits per day). Over time,
 | 
						|
the number of commits added by a fetch operation is much smaller than the
 | 
						|
number of commits in the full history. By creating a "chain" of commit-graphs,
 | 
						|
we enable fast writes of new commit data without rewriting the entire commit
 | 
						|
history -- at least, most of the time.
 | 
						|
 | 
						|
## File Layout
 | 
						|
 | 
						|
A commit-graph chain uses multiple files, and we use a fixed naming convention
 | 
						|
to organize these files. Each commit-graph file has a name
 | 
						|
`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
 | 
						|
valued hash stored in the footer of that file (which is a hash of the file's
 | 
						|
contents before that hash). For a chain of commit-graph files, a plain-text
 | 
						|
file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
 | 
						|
hashes for the files in order from "lowest" to "highest".
 | 
						|
 | 
						|
For example, if the `commit-graph-chain` file contains the lines
 | 
						|
 | 
						|
```
 | 
						|
	{hash0}
 | 
						|
	{hash1}
 | 
						|
	{hash2}
 | 
						|
```
 | 
						|
 | 
						|
then the commit-graph chain looks like the following diagram:
 | 
						|
 | 
						|
 +-----------------------+
 | 
						|
 |  graph-{hash2}.graph  |
 | 
						|
 +-----------------------+
 | 
						|
	  |
 | 
						|
 +-----------------------+
 | 
						|
 |                       |
 | 
						|
 |  graph-{hash1}.graph  |
 | 
						|
 |                       |
 | 
						|
 +-----------------------+
 | 
						|
	  |
 | 
						|
 +-----------------------+
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |  graph-{hash0}.graph  |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 +-----------------------+
 | 
						|
 | 
						|
Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
 | 
						|
commits in `graph-{hash1}.graph`, and X2 be the number of commits in
 | 
						|
`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
 | 
						|
then we interpret this as being the commit in position (X0 + X1 + i), and that
 | 
						|
will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
 | 
						|
positions to refer to their parents, which may be in `graph-{hash1}.graph` or
 | 
						|
`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
 | 
						|
its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
 | 
						|
X2).
 | 
						|
 | 
						|
Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
 | 
						|
specifying the hashes of all files in the lower layers. In the above example,
 | 
						|
`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
 | 
						|
`{hash0}` and `{hash1}`.
 | 
						|
 | 
						|
## Merging commit-graph files
 | 
						|
 | 
						|
If we only added a new commit-graph file on every write, we would run into a
 | 
						|
linear search problem through many commit-graph files.  Instead, we use a merge
 | 
						|
strategy to decide when the stack should collapse some number of levels.
 | 
						|
 | 
						|
The diagram below shows such a collapse. As a set of new commits are added, it
 | 
						|
is determined by the merge strategy that the files should collapse to
 | 
						|
`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
 | 
						|
the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
 | 
						|
file.
 | 
						|
 | 
						|
			    +---------------------+
 | 
						|
			    |                     |
 | 
						|
			    |    (new commits)    |
 | 
						|
			    |                     |
 | 
						|
			    +---------------------+
 | 
						|
			    |                     |
 | 
						|
 +-----------------------+  +---------------------+
 | 
						|
 |  graph-{hash2} |->|                     |
 | 
						|
 +-----------------------+  +---------------------+
 | 
						|
	  |                 |                     |
 | 
						|
 +-----------------------+  +---------------------+
 | 
						|
 |                       |  |                     |
 | 
						|
 |  graph-{hash1} |->|                     |
 | 
						|
 |                       |  |                     |
 | 
						|
 +-----------------------+  +---------------------+
 | 
						|
	  |                  tmp_graphXXX
 | 
						|
 +-----------------------+
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |  graph-{hash0} |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 |                       |
 | 
						|
 +-----------------------+
 | 
						|
 | 
						|
During this process, the commits to write are combined, sorted and we write the
 | 
						|
contents to a temporary file, all while holding a `commit-graph-chain.lock`
 | 
						|
lock-file.  When the file is flushed, we rename it to `graph-{hash3}`
 | 
						|
according to the computed `{hash3}`. Finally, we write the new chain data to
 | 
						|
`commit-graph-chain.lock`:
 | 
						|
 | 
						|
```
 | 
						|
	{hash3}
 | 
						|
	{hash0}
 | 
						|
```
 | 
						|
 | 
						|
We then close the lock-file.
 | 
						|
 | 
						|
## Merge Strategy
 | 
						|
 | 
						|
When writing a set of commits that do not exist in the commit-graph stack of
 | 
						|
height N, we default to creating a new file at level N + 1. We then decide to
 | 
						|
merge with the Nth level if one of two conditions hold:
 | 
						|
 | 
						|
  1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
 | 
						|
     level N is less than X times the number of commits in level N + 1.
 | 
						|
 | 
						|
  2. `--max-commits=<C>` is specified with non-zero C and the number of commits
 | 
						|
     in level N + 1 is more than C commits.
 | 
						|
 | 
						|
This decision cascades down the levels: when we merge a level we create a new
 | 
						|
set of commits that then compares to the next level.
 | 
						|
 | 
						|
The first condition bounds the number of levels to be logarithmic in the total
 | 
						|
number of commits.  The second condition bounds the total number of commits in
 | 
						|
a `graph-{hashN}` file and not in the `commit-graph` file, preventing
 | 
						|
significant performance issues when the stack merges and another process only
 | 
						|
partially reads the previous stack.
 | 
						|
 | 
						|
The merge strategy values (2 for the size multiple, 64,000 for the maximum
 | 
						|
number of commits) could be extracted into config settings for full
 | 
						|
flexibility.
 | 
						|
 | 
						|
## Deleting graph-{hash} files
 | 
						|
 | 
						|
After a new tip file is written, some `graph-{hash}` files may no longer
 | 
						|
be part of a chain. It is important to remove these files from disk, eventually.
 | 
						|
The main reason to delay removal is that another process could read the
 | 
						|
`commit-graph-chain` file before it is rewritten, but then look for the
 | 
						|
`graph-{hash}` files after they are deleted.
 | 
						|
 | 
						|
To allow holding old split commit-graphs for a while after they are unreferenced,
 | 
						|
we update the modified times of the files when they become unreferenced. Then,
 | 
						|
we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
 | 
						|
files whose modified times are older than a given expiry window. This window
 | 
						|
defaults to zero, but can be changed using command-line arguments or a config
 | 
						|
setting.
 | 
						|
 | 
						|
## Chains across multiple object directories
 | 
						|
 | 
						|
In a repo with alternates, we look for the `commit-graph-chain` file starting
 | 
						|
in the local object directory and then in each alternate. The first file that
 | 
						|
exists defines our chain. As we look for the `graph-{hash}` files for
 | 
						|
each `{hash}` in the chain file, we follow the same pattern for the host
 | 
						|
directories.
 | 
						|
 | 
						|
This allows commit-graphs to be split across multiple forks in a fork network.
 | 
						|
The typical case is a large "base" repo with many smaller forks.
 | 
						|
 | 
						|
As the base repo advances, it will likely update and merge its commit-graph
 | 
						|
chain more frequently than the forks. If a fork updates their commit-graph after
 | 
						|
the base repo, then it should "reparent" the commit-graph chain onto the new
 | 
						|
chain in the base repo. When reading each `graph-{hash}` file, we track
 | 
						|
the object directory containing it. During a write of a new commit-graph file,
 | 
						|
we check for any changes in the source object directory and read the
 | 
						|
`commit-graph-chain` file for that source and create a new file based on those
 | 
						|
files. During this "reparent" operation, we necessarily need to collapse all
 | 
						|
levels in the fork, as all of the files are invalid against the new base file.
 | 
						|
 | 
						|
It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
 | 
						|
files in this scenario. It falls to the user to define the proper settings for
 | 
						|
their custom environment:
 | 
						|
 | 
						|
 1. When merging levels in the base repo, the unreferenced files may still be
 | 
						|
    referenced by chains from fork repos.
 | 
						|
 | 
						|
 2. The expiry time should be set to a length of time such that every fork has
 | 
						|
    time to recompute their commit-graph chain to "reparent" onto the new base
 | 
						|
    file(s).
 | 
						|
 | 
						|
 3. If the commit-graph chain is updated in the base, the fork will not have
 | 
						|
    access to the new chain until its chain is updated to reference those files.
 | 
						|
    (This may change in the future [5].)
 | 
						|
 | 
						|
Related Links
 | 
						|
-------------
 | 
						|
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
 | 
						|
    Chromium work item for: Serialized Commit Graph
 | 
						|
 | 
						|
[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
 | 
						|
    An abandoned patch that introduced generation numbers.
 | 
						|
 | 
						|
[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
 | 
						|
    Discussion about generation numbers on commits and how they interact
 | 
						|
    with fsck.
 | 
						|
 | 
						|
[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
 | 
						|
    More discussion about generation numbers and not storing them inside
 | 
						|
    commit objects. A valuable quote:
 | 
						|
 | 
						|
    "I think we should be moving more in the direction of keeping
 | 
						|
     repo-local caches for optimizations. Reachability bitmaps have been
 | 
						|
     a big performance win. I think we should be doing the same with our
 | 
						|
     properties of commits. Not just generation numbers, but making it
 | 
						|
     cheap to access the graph structure without zlib-inflating whole
 | 
						|
     commit objects (i.e., packv4 or something like the "metapacks" I
 | 
						|
     proposed a few years ago)."
 | 
						|
 | 
						|
[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
 | 
						|
    A patch to remove the ahead-behind calculation from 'status'.
 | 
						|
 | 
						|
[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
 | 
						|
    A discussion of a "two-dimensional graph position" that can allow reading
 | 
						|
    multiple commit-graph chains at the same time.
 |