levinasavion

Git for Computer Scientists

A brief introduction to the internals of Git, for people who are not intimidated by terms like directed acyclic graph.

July 23, 2022


Introduction

A brief introduction to the internals of Git, for people who are not intimidated by terms like directed acyclic graph.

storage

In simple terms, the Git object store is "just" a DAG (directed acyclic graph) of objects, with a handful of different object types. They are all stored compressed and identified by a SHA-1 hash (note that this is not the SHA-1 of the file content they represent, but of their representation in Git).


blob: The simplest object, just a bunch of bytes. Usually a file, but it can also be a symlink or pretty much anything else. The object pointing to the blob determines its semantics.
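The remark that an object's ID is the hash of its Git representation, not of the raw file content, can be made concrete with a blob. This is a minimal sketch in Python, assuming a SHA-1 repository: Git hashes a blob as the header "blob <size>" plus a NUL byte, followed by the content bytes. The function names git_blob_id and plain_sha1 are made up for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git gives a blob with this content (SHA-1 repositories)."""
    # Git hashes "blob <size>\0" + content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """SHA-1 of the raw bytes, for comparison."""
    return hashlib.sha1(content).hexdigest()


data = b"hello world\n"
print(git_blob_id(data))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(git_blob_id(data) == plain_sha1(data))  # False
```

If git is installed, the first value should match what git hash-object prints for a file with the same content.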


tree: Directories are represented by tree objects. They point to blobs holding the file contents (file names, access modes, etc. are all stored in the tree) and to other trees for subdirectories.

When a node points to another node in the DAG, it depends on the other node: it cannot exist without it. A node that nothing points to becomes garbage and can be collected by Git's garbage collection (git gc), or rescued with git fsck --lost-found, much like a filesystem inode that no file name points to.
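The idea that unreferenced nodes become garbage can be sketched as a reachability walk. This is a toy model, not how Git implements it: objects are a dict mapping a made-up ID to the IDs it points to, refs map a name to an ID, and the helper names reachable and garbage are hypothetical.

```python
def reachable(objects, roots):
    """Return every object ID that can be reached from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])  # follow the pointers of this node
    return seen


def garbage(objects, refs):
    """Objects that no ref reaches, directly or indirectly."""
    return set(objects) - reachable(objects, refs.values())


# Toy object store: commit "c2" -> commit "c1" and tree "t2", and so on.
objects = {
    "c2": ["c1", "t2"], "c1": ["t1"],
    "t1": ["b1"], "t2": ["b1", "b2"],
    "b1": [], "b2": [],
    "old": ["t_old"], "t_old": ["b1"],  # nothing points to "old"
}
refs = {"master": "c2"}
print(sorted(garbage(objects, refs)))  # ['old', 't_old']
```

Note that blob "b1" survives even though the abandoned tree also points to it, because a live tree still references it.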


commit: A commit points to a tree representing the state of the files at the time of the commit. It also points to 0 to N parent commits. More than one parent means the commit is a merge, and no parents means it is an initial commit. Interestingly, there can be more than one initial commit, which usually means that two independent projects were merged. The body of the commit object is the commit message.
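The shape of a commit described above can be sketched as follows. The text layout (a tree line, zero or more parent lines, author and committer lines, a blank line, then the message) is what git cat-file -p prints for a commit. The object IDs below are shortened placeholders, and parse_commit and kind are hypothetical helper names.

```python
def parse_commit(body: str) -> dict:
    """Split a commit's text into its tree, its parents and its message."""
    headers, _, message = body.partition("\n\n")
    tree = None
    parents = []
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
    return {"tree": tree, "parents": parents, "message": message}


def kind(commit: dict) -> str:
    """Classify a commit by its number of parents."""
    count = len(commit["parents"])
    if count == 0:
        return "initial"
    return "merge" if count > 1 else "ordinary"


example = (
    "tree t1\n"
    "parent p1\n"
    "parent p2\n"
    "author A <a@example.com> 0 +0000\n"
    "committer A <a@example.com> 0 +0000\n"
    "\n"
    "Merge remote branch\n"
)
print(kind(parse_commit(example)))  # merge
```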


refs: References, heads, or branches are like post-it notes stuck on nodes in the DAG. Whereas the DAG is only ever added to and existing nodes cannot be changed, the post-its can be moved around freely. They are not stored in the history, and they are not transferred directly between repositories. They act as a sort of bookmark that says "I'm working here".

(Annotation): I find HEAD a fairly vivid term: picture it as the spot on the computer screen that your head is facing.

git commit adds a new node to the DAG, and moves the reference of the current branch to the new node.

What's special about the HEAD ref is that it actually points to another ref. It is a pointer to the currently active branch. Normal refs actually live in a namespace heads/XXX, but you can usually skip the heads/ part.
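How HEAD, branches and git commit interact can be sketched with two dicts. This is a simplified model under stated assumptions: commits have made-up IDs, HEAD is always symbolic (stored as "ref: " followed by a branch name, the same textual form Git uses in .git/HEAD), and the helper names resolve and commit are hypothetical.

```python
def resolve(refs, name="HEAD"):
    """Follow symbolic refs until a commit ID is reached."""
    value = refs[name]
    while value.startswith("ref: "):
        value = refs[value[len("ref: "):]]
    return value


def commit(dag, refs, new_id):
    """Add a node on top of the current branch and move that branch to it."""
    branch = refs["HEAD"][len("ref: "):]  # assumes HEAD points to a branch
    dag[new_id] = [refs[branch]]          # the new node points to its parent
    refs[branch] = new_id                 # only the post-it moves


dag = {"a": []}
refs = {"HEAD": "ref: refs/heads/master", "refs/heads/master": "a"}
commit(dag, refs, "b")
print(resolve(refs))  # b
print(refs["HEAD"])   # ref: refs/heads/master
```

HEAD itself is unchanged by the commit; it still names the branch, and the branch is what moved.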


remote refs: Remote refs are post-its of a different color. They differ from ordinary refs in that they live in a different namespace and are essentially controlled by the remote server. git fetch updates them.


tag: A tag is both a node in the DAG and a post-it (of yet another color). A tag points to a commit and includes an optional message and GPG signature.

The post-it is just a quick way to access the tag; if it is lost, the tag can be recovered from the DAG alone with git fsck --lost-found.

Nodes in the DAG can be moved from one repository to another, can be stored in a more efficient form (packs), and unused nodes can be garbage collected. But at the end of the day, a git repository is always just a DAG and post-its.

history

So, armed with the knowledge of how git stores version history, how do we visualize operations like merges, and how does git differ from tools that try to manage history as linear changes per branch?


This is the simplest repository. We cloned a remote repo with one commit in it.


Here, we've fetched the remote and received a new commit from the remote, but haven't merged it yet.


The situation after git merge remotes/MYSERVER/master. Since the merge was a fast-forward (that is, we had no new commits in our local branch), the only thing that happened was moving our post-it and changing the files in our working directory accordingly.

Annotation: When merging, Git can usually combine the changes automatically; when it cannot, it reports a conflict.


One local git commit and one git fetch later. We have a new local commit and a new remote commit. Clearly, a merge is needed.


Result of git merge remotes/MYSERVER/master. Since we had new local commits, this was not a fast-forward; an actual new commit node was created in the DAG. Note that it has two parent commits.
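The difference between a fast-forward and a real merge comes down to an ancestry check on the DAG. This is a minimal sketch under stated assumptions: the DAG is a dict from a made-up commit ID to its parent IDs, file contents and conflicts are ignored entirely, and the helper names ancestors and merge are hypothetical.

```python
def ancestors(dag, start):
    """All commits reachable from start through parent links, start included."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen


def merge(dag, ours, theirs, merge_id):
    """Return (new branch tip, what happened) for merging theirs into ours."""
    if theirs in ancestors(dag, ours):
        return ours, "already up to date"
    if ours in ancestors(dag, theirs):
        return theirs, "fast-forward"      # just move the post-it
    dag[merge_id] = [ours, theirs]         # new node with two parents
    return merge_id, "merge commit"


# No local commits: the remote is simply ahead of us.
print(merge({"a": [], "b": ["a"]}, "a", "b", "m"))  # ('b', 'fast-forward')

# One local commit "c" and one remote commit "b", both on top of "a".
diverged = {"a": [], "b": ["a"], "c": ["a"]}
print(merge(diverged, "c", "b", "m"))  # ('m', 'merge commit')
```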


This is what the tree looks like after some commits on both branches and another merge. See the "stitching" pattern emerging? The git DAG records exactly what actions were taken.


The "stitching" pattern is somewhat tedious to read. If you have not yet published your branch, or have made it clear that others should not base their work on it, you have an alternative. You can rebase your branch: instead of merging, your commit is replaced by another commit with a different parent, and your branch is moved there.

Your old commits will remain in the DAG until garbage collection. Ignore them for now, but just know that if you screw up completely, there's a way out. If you have extra sticky notes pointing to your old commit, they will continue to point to it and keep your old commit alive indefinitely. However, this can be quite confusing.

Don't rebase a branch that others have created new commits on top of. It is possible to recover from that, and it's not difficult, but the extra work required can be frustrating.


The situation after garbage collection (or just ignoring the unreachable commits) and creating new commits on top of the rebased branch.


rebase also knows how to rebase multiple commits with one command.
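What rebase does to the DAG can be sketched as replaying a chain of commits onto a new parent. This is a toy model under stated assumptions: history is linear, each commit is a dict with a parent and a change label, new IDs are formed by appending an apostrophe to the old ID, and actually reapplying changes (with possible conflicts) is left out. The helper name rebase is hypothetical.

```python
def rebase(commits, branch_tip, fork_point, new_base):
    """Copy the commits after fork_point onto new_base; return the new tip."""
    # Walk back from the branch tip to the fork point (exclusive).
    chain = []
    node = branch_tip
    while node != fork_point:
        chain.append(node)
        node = commits[node]["parent"]

    # Replay oldest first. Each copy is a new node with a different parent;
    # the old commits are left untouched in the DAG.
    parent = new_base
    for old in reversed(chain):
        new_id = old + "'"
        commits[new_id] = {"parent": parent, "change": commits[old]["change"]}
        parent = new_id
    return parent


commits = {
    "a": {"parent": None, "change": "initial"},
    "r1": {"parent": "a", "change": "remote work"},
    "l1": {"parent": "a", "change": "local work 1"},
    "l2": {"parent": "l1", "change": "local work 2"},
}
new_tip = rebase(commits, "l2", "a", "r1")
print(new_tip)  # l2'
```

After the call, l1 and l2 still exist with their original parents, which is why the old commits stay recoverable until garbage collection.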

That concludes our brief introduction to git for people who aren't intimidated by computer science. Hope it helps!

CC BY-NC-ND 2.0
