How Git Really Works

A look at Git internals and understanding what happens when you add, commit, push, and more

Maroun Maroun
Better Programming

Futuristic wooden tunnel
Photo By Note Thanun On Unsplash.

In this article, we’ll dive into Git internals by going through a real example. If you don’t have your terminal open already, do so, fasten your seatbelts, and let’s go!

Initializing a Git Repository

You have probably already initialized an empty Git project using git init, but have you ever wondered what this command does?

Let’s create an empty folder and initialize an empty Git project. This is how the official Git documentation describes git init:

“This command creates an empty Git repository — basically a .git directory with subdirectories for objects, refs/heads, refs/tags, and template files. An initial HEAD file that references the HEAD of the master branch is also created.”

If we inspect the folder’s content, we’ll see the following structure:

$ tree -L 1 .git/
.git/
├── HEAD
├── config
├── description
├── hooks
├── info
├── objects
└── refs

We’ll look at the objects directory in detail below.
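To make the quoted description concrete, here is a minimal sketch that creates a scratch repository and prints the initial HEAD file. The temporary directory and the variable names are illustrative assumptions, and the branch name stored in HEAD depends on your Git version and the init.defaultBranch setting.

```shell
# Hypothetical scratch location; any empty folder works.
repo=$(mktemp -d)
cd "$repo"

# -q suppresses the informational output of git init.
git init -q

# HEAD is a plain text file containing a symbolic reference
# to the branch that the first commit will create.
head_content=$(cat .git/HEAD)
echo "$head_content"

# The object database starts out with only its bookkeeping folders.
ls .git/objects
```

No commit exists yet, so the branch that HEAD names has no file under .git/refs/heads until the first commit is made.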

Git Is a Key-Value Datastore

At its core, Git is a content-addressable file system. Huh? OK, Git is simply a key-value database. You can insert any kind of content into a Git repository, and Git gives you back a unique identifier (a key) that you can later use to retrieve that content.

Git uses the hash-object command to store values into the database:

“Computes the object ID value for an object with specified type with the contents of the named file (which can be outside of the work tree), and optionally writes the resulting object into the object database. Reports its object ID to its standard output. When <type> is not specified, it defaults to ‘blob.’”

A “blob” is nothing but a sequence of bytes. A Git blob contains exactly the same data as a file, but it’s stored in the Git key-value datastore, while the “actual” file is stored on the file system.

Let’s create a blob:

$ echo hello | git hash-object --stdin -w
ce013625030ba8dba906f756967f9e9ca394464a

The --stdin flag tells hash-object to read the content from standard input instead of from a file. The -w flag actually writes the object into the object database; without it, the command only computes and prints the object ID.

The content we piped in (“hello” plus the newline that echo appends) is the “value” in the Git datastore, and the hash printed by hash-object is our key. We can now do the opposite operation and read the value by its key using the git cat-file command:

$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

We can check its type using the -t flag:

$ git cat-file -t ce013625030ba8dba906f756967f9e9ca394464a
blob
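The store-and-retrieve round trip can be sketched as a small script. This is only an illustration under stated assumptions: the scratch repository and all variable names are hypothetical, and it uses git cat-file -s (which prints an object's size) in addition to the flags shown above.

```shell
repo=$(mktemp -d)   # hypothetical scratch repository
cd "$repo"
git init -q

# Store a value; Git answers with the key.
key=$(echo hello | git hash-object --stdin -w)

# Storing identical content again yields the identical key,
# because the key is derived from the content itself.
key_again=$(echo hello | git hash-object --stdin -w)

# Look the value up by key, along with its type and size in bytes.
value=$(git cat-file -p "$key")
obj_type=$(git cat-file -t "$key")
size=$(git cat-file -s "$key")

echo "$key -> $value ($obj_type, $size bytes)"
```

The size is 6 rather than 5 because echo appends a newline to “hello.”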

git hash-object stores the data in the .git/objects/ folder (aka the object database). Let’s verify:

$ tree .git/objects/
.git/objects/
├── ce
│ └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

The file name under the “ce” directory is the key we got back from hash-object, minus its first two characters. Why? Because the parent folder name holds those first two characters. Why split the key at all? Because some file systems limit, or slow down with, the number of entries a single directory can hold. Spreading the objects across subdirectories mitigates that problem.
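The mapping from a key to its location on disk can be sketched in a few lines of shell. The variable names are illustrative assumptions, and the sketch only covers loose objects like the ones created in this article.

```shell
key=ce013625030ba8dba906f756967f9e9ca394464a

# The first two characters name the subdirectory...
dir=$(printf '%s' "$key" | cut -c1-2)
# ...and the remaining 38 characters name the file inside it.
file=$(printf '%s' "$key" | cut -c3-)

path=".git/objects/$dir/$file"
echo "$path"
```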

Let’s save another object:

$ echo world | git hash-object --stdin -w
cc628ccd10742baea8241c5924df992b5c019f71

As expected, we now have two directories under .git/objects/:

$ tree .git/objects/
.git/objects/
├── cc
│ └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│ └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

And again, the cc folder is named after the first two characters of the key, and the file inside it is named after the rest of the key.

Tree Objects

The next Git object type we’ll investigate is the tree. A blob stores only content, not a filename. A tree solves that problem by recording the names, and it lets us store a group of files together.

A tree object contains entries. Each entry holds the SHA-1 of a blob or subtree along with its mode, type, and filename. Let’s check the documentation’s definition of git-mktree:

“Reads standard input in non-recursive ls-tree output format, and creates a tree object. The order of the tree entries is normalized by mktree so pre-sorting the input is not required. The object name of the tree object built is written to the standard output.”

If you’re wondering about ls-tree’s output format, it looks like this:

<mode> SP <type> SP <object> TAB <file>

Let’s now associate the two blobs above:

$ printf '%s %s %s\t%s\n' \
    100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt \
    100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt |
    git mktree
88e38705fdbd3608cddbe904b67c731f3234c45b

mktree returns a key for the newly created tree object.

At this point, we can visualize our tree as follows:

   88e38705fdbd3608cddbe904b67c731f3234c45b
                      |
          +-----------+-----------+
          |                       |
      hello.txt               world.txt
     ce013625030b            cc628ccd1074

Let’s view the tree’s content:

$ git cat-file -p 88e38705fdbd3608cddbe904b67c731f3234c45b
100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt
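Because mktree reads the format that ls-tree writes, listing a tree and feeding the listing back into mktree should reproduce the same key. The following is a minimal sketch of that round trip, assuming a hypothetical scratch repository and illustrative variable names.

```shell
repo=$(mktemp -d)   # hypothetical scratch repository
cd "$repo"
git init -q

hello=$(echo hello | git hash-object --stdin -w)
world=$(echo world | git hash-object --stdin -w)

# Build the tree from two "<mode> SP <type> SP <object> TAB <file>" lines.
tree=$(printf '100644 blob %s\t%s\n' \
    "$hello" hello.txt \
    "$world" world.txt | git mktree)

# ls-tree prints the entries in the same format, so piping them
# back into mktree rebuilds an identical tree object.
tree_again=$(git ls-tree "$tree" | git mktree)

echo "$tree"
echo "$tree_again"
```

Identical entries always produce the identical tree key, just as identical content produces the identical blob key.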

And of course, the .git/objects directory was updated accordingly:

$ tree .git/objects/
.git/objects/
├── 88
│ └── e38705fdbd3608cddbe904b67c731f3234c45b
├── cc
│ └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│ └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

So far, we have not updated our index. To do so, we use the git-read-tree command:

“Reads the tree information given by <tree-ish> into the index, but does not actually update any of the files it ‘caches.’ (see: git-checkout-index[1])”

$ git read-tree 88e38705fdbd3608cddbe904b67c731f3234c45b
$ git ls-files -s
100644 ce013625030ba8dba906f756967f9e9ca394464a 0 hello.txt
100644 cc628ccd10742baea8241c5924df992b5c019f71 0 world.txt

Note, however, that we still don’t have the files on our file system since we’ve been writing values directly to the Git datastore. In order to “check out” the files, we’ll use the git checkout-index command, which copies files from the index to the working tree:

$ git checkout-index -a

The -a stands for “all.” Now we should be able to see our files:

$ ls
hello.txt world.txt
$ cat hello.txt
hello
$ cat world.txt
world

Bonus

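git hash-object reports a different hash than openssl sha1 does for the same content. Why? Because Git does not hash the content alone. It first prepends a short header made of the object type and the content length in bytes (filesize below), followed by a null byte: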

sha1("blob " + filesize + "\0" + data)

So in order to get the same SHA-1 from openssl, we need to prepend that header ourselves:

# without outputting a newline
$ echo -n hello | git hash-object --stdin -w
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
$ printf 'blob 5\0hello' | openssl sha1
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
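The formula can be wrapped in a small helper that computes a blob key for any file without calling Git. This is a sketch, not Git's own implementation: blob_id is a hypothetical name, and the script assumes sha1sum from GNU coreutils is available (openssl sha1 could be substituted, with the output parsing adjusted).

```shell
# blob_id FILE: hypothetical helper that prints the key
# git hash-object would report for FILE, without calling Git.
blob_id() {
  # Content length in bytes; tr strips the padding some wc versions add.
  size=$(wc -c < "$1" | tr -d ' ')
  # Hash the header "blob <size>\0" followed by the raw file content.
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

sample=$(mktemp)            # hypothetical scratch file
printf 'hello\n' > "$sample"
id=$(blob_id "$sample")
echo "$id"
rm -f "$sample"
```

The sample file contains “hello” followed by a newline, so the result matches the key of the very first blob we created.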

Summary

In this article, we stored two blobs directly in the Git datastore, without any files existing in our working directory. We then created a tree that associates the two blobs with filenames, read that tree into the index, and finally brought the files into our working directory using the git checkout-index command.
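For reference, the whole walk-through can be condensed into one script. It is a sketch under the same assumptions as the article, run in a hypothetical scratch repository, with variable names chosen purely for illustration.

```shell
repo=$(mktemp -d)   # hypothetical scratch repository
cd "$repo"
git init -q

# 1. Store two blobs directly in the object database.
hello=$(echo hello | git hash-object --stdin -w)
world=$(echo world | git hash-object --stdin -w)

# 2. Tie them to filenames with a tree object.
tree=$(printf '100644 blob %s\t%s\n' \
    "$hello" hello.txt \
    "$world" world.txt | git mktree)

# 3. Load the tree into the index; the working directory is still empty.
git read-tree "$tree"
before=$(ls)

# 4. Copy every indexed file into the working directory.
git checkout-index -a
after=$(ls)

echo "before: [$before]"
echo "after:  [$after]"
```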
