Git Guide: Technical


Welcome to this (relatively) short explanation of the technical "behind-the-scenes" side of Git. Git is an open-source Source Code Management (SCM) tool that scales a lot better than most alternatives, although its interface is somewhat hard to understand. It was originally developed for the Linux kernel because managing that code had gotten out of hand, but it turned out that many people needed better source control, so now it is the most widely used SCM in the world.

Besides Git, there are tools like GitHub, GitLab and Bitbucket, which are (online) services for hosting Git repositories. In addition to hosting them, they often have many extra features like issues/bug lists, pull requests (PRs, also called merge requests), reviewing, task management boards and sometimes entire social media platforms. Note that (confusingly) some features that you use a lot (like PRs) are GitHub features and not Git features. With that out of the way, I'm going to get started on how Git works!

I will have some "Try yourself" sections where you can try examples on your own Git repo, and "Git command" sections that explain what a high-level command does (e.g. git commit or git reset).

Blobs, Trees and Hashes

Before thinking about commits and different versions of the code, just imagine there is only one version of your code (the one you have in the directory of your project). If you have one file with source code, then Git will store this source code in a "blob" (not the file, only the contents of the file). Git stores the blob in the directory .git/objects and identifies it by a SHA-1 hash of its contents, such as 7fd015... etc. (Such a hash is not strictly guaranteed to be unique, but the chance of two different contents getting the same hash is negligible in practice.) Git usually displays only the first 7 or so hex digits, and it accepts any prefix that is unambiguous within the repository.
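
The hash is not computed over the raw file contents alone: Git first prepends a short header. As a minimal sketch, assuming a repository that uses the default SHA-1 object format, the ID of a blob can be reproduced in Python like this (the function names git_blob_hash and loose_object_path are made up for this illustration):

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with these contents."""
    # Git hashes "blob <size in bytes>\0" followed by the contents.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return where a loose (unpacked) object with this ID is stored."""
    # The first two hex digits become a directory, the rest the file name.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


blob_id = git_blob_hash(b"hello world\n")
print(blob_id)                     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(loose_object_path(blob_id))  # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

For the same bytes, this should match what git hash-object prints for a file.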

Of course your repo does not consist of one file with source code, so there are also "trees". A tree is kind of like a directory, where each entry has a name such as src or main.ts, which in turn refers to another object (a tree or blob). For a tree, Git also calculates a unique(-ish) hash.
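
To make the "directory of hashes" idea concrete, here is a rough sketch of how a tree's hash can be computed, again assuming the SHA-1 object format. Each entry is serialized as its mode, its name and the 20 raw bytes of the hash it refers to. The function name git_tree_hash is hypothetical, and the two hashes in the example are simply the ones from the cat-file output shown later in this guide:

```python
import hashlib


def git_tree_hash(entries):
    """Hash a tree given (mode, name, object_id_hex) entries.

    Modes are written as Git stores them: "100644" for a normal file and
    "40000" for a subdirectory (git cat-file -p pads the latter to 040000).
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as "name/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, object_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(object_id)  # 20 raw bytes, not 40 hex characters

    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


entries = [
    ("100644", "README.md", "3e354001a908625dd3fe5a7cd3fda1b283f2d953"),
    ("40000", "src", "90c6615d02ebf8c948a5fdd01b6bb3ca012f4f9f"),
]
print(git_tree_hash(entries))
print(git_tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

Because the hash of every child is part of the tree's contents, changing a single file changes the hash of its tree, of the parent tree, and so on up to the root.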

Even if you have 10000 different commits of a file that are all only 1 character different, it will store each version of this file's content in a separate blob object, meaning you have 10000 copies that are almost identical, since each version will have a completely different hash. Fortunately, Git compresses its objects and can pack similar objects together (storing only the differences between them in so-called packfiles), so this does not take up 10000 times as much disk space, but conceptually you should think of them as different copies.

Try yourself: go to any git repository and then use the following command in a terminal:

$ tree /F .git/objects
# if Bash: ls -R .git/objects

You might see something like this:

├───00
│       b7741d13a2a8edf0b8aa457be25c1b05bb1de2
│
├───04
│       c5e626e6abe8006e052fa26b645344a89f6c36
│
├───07
│       723145fc8d316195072cd20237428759c2e07e
│       a58ac8d78fd0adcbf6a80f8a23a1156d795f5a
etc.

Now just pick any entry, enter the first 6 digits including the ones in the directory (for instance, 00b774 or 04c5e6) and run these commands:

$ git cat-file -t 04c5e6
tree

$ git cat-file -p 04c5e6
100644 blob 3e354001a908625dd3fe5a7cd3fda1b283f2d953    README.md
040000 tree 90c6615d02ebf8c948a5fdd01b6bb3ca012f4f9f    src
(etc.)

Some of them will give you trees, others will give blobs, others will give commits. Doing git cat-file -p will show you the data that is in a Git object, while using -t shows if it is a blob, tree or commit.
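
As a rough illustration of what git cat-file does for these files, the sketch below reads a loose object straight from disk. It assumes the object is stored as an individual file (objects that Git has moved into a packfile under .git/objects/pack are not handled) and that you pass the full 40-character hash; the function name read_loose_object is made up:

```python
import zlib
from pathlib import Path


def read_loose_object(git_dir, object_id):
    """Return (type, content) of a loose object, given its full hex ID."""
    path = Path(git_dir) / "objects" / object_id[:2] / object_id[2:]
    # Loose objects are zlib-compressed: "<type> <size>\0<content>".
    raw = zlib.decompress(path.read_bytes())
    header, _, content = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("corrupt object: size in header does not match content")
    return kind.decode("ascii"), content


# Usage (hypothetical ID; replace it with a full hash from your own repo):
# kind, content = read_loose_object(".git", "<full 40-character hash>")
# print(kind)  # "blob", "tree", "commit" or "tag", like git cat-file -t
```

For a blob or commit the content is readable text; for a tree it is the binary list of entries that git cat-file -p pretty-prints.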

Commits

In my opinion, "commit" is not a very good name; "snapshot" probably describes better how it works. Each commit is a complete snapshot of everything in your project.

Implementation-wise, a commit is really just the hash of a single tree object, namely the tree object that describes the corresponding version of your project directory. It also has a bit of extra data, such as the commit message and author of the commit.

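Finally, a commit also has a parent (except if it's the first commit ever, then it has no parent). This is the hash of the commit object that came before this commit. A merge commit has more than one parent, but for simplicity I will mostly pretend every commit has exactly one.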

In the .git folder you will also find a file called "HEAD". If you open it, it will show you a reference to some branch or to a commit (if it's a commit, you might see something like "Detached HEAD" in a Git client). HEAD refers to the commit that you are currently on.
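
A small sketch of how HEAD can be resolved by hand, assuming the branch is stored as a plain file under .git/refs (Git may also keep refs in a single .git/packed-refs file, which this sketch does not read). The function name resolve_head is made up:

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name_or_None, commit_hash_or_None) for HEAD."""
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file contains a commit hash directly.
        return None, head
    ref_name = head[len("ref: "):]          # e.g. "refs/heads/main"
    ref_file = Path(git_dir) / ref_name
    # The ref file is missing for packed refs or a branch without commits.
    commit = ref_file.read_text().strip() if ref_file.exists() else None
    return ref_name, commit


# Usage, from the root of a repository:
# print(resolve_head(".git"))
```

On a normal branch the first element is something like refs/heads/main; in detached HEAD state it is None and only the hash is returned.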

Git command log: if you use git log --oneline or view the history of commits in GitHub Desktop, it essentially walks the chain of parents, starting at HEAD:

currentCommit = HEAD
print(currentCommit.hash, currentCommit.message)
while (currentCommit.parent exists) {
    currentCommit = currentCommit.parent
    print(currentCommit.hash, currentCommit.message)
}

Tags and Branches

A tag is essentially just a nice name for the hash of an object (usually a commit). A branch is the same thing. The only difference is that when you use commands like git commit or git reset, git will automatically update branches for you, while you have to update tags yourself.

Git stores tags in .git/refs/tags, while it stores branches in .git/refs/heads.

Try yourself: run the following in any git repo:

# Create branch or tag called "example":
$ git branch example       # or git tag example

$ cat .git/refs/heads/example   # or .git/refs/tags/example
# This is the hash of a commit:
b4aa4d1fb12f770e90d5fc27fea64e21e364368d

$ git cat-file -p b4aa4d   # first 6 digits
tree 927c472a22118dcb365ace4232c0137d5484e833
parent e2b52132f4f122bf054bddb314b502b04bfc3806
author (SOME INFO)
committer (SOME INFO)

Message of last commit you made

So you can see that a branch/tag is really just an alias for the hash of a commit object.
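
The text of a commit object is simple enough to parse by hand. The following is only a sketch: it picks out the headers shown above and ignores anything else (for example the multi-line signature header of signed commits). The function name parse_commit is made up, and the sample text reuses the hashes from the example above:

```python
def parse_commit(text):
    """Split the text of a commit object into its headers and its message."""
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)   # merge commits have several
        elif key in ("author", "committer"):
            commit[key] = value
    return commit


sample = (
    "tree 927c472a22118dcb365ace4232c0137d5484e833\n"
    "parent e2b52132f4f122bf054bddb314b502b04bfc3806\n"
    "author (SOME INFO)\n"
    "committer (SOME INFO)\n"
    "\n"
    "Message of last commit you made\n"
)
commit = parse_commit(sample)
print(commit["tree"])     # 927c472a22118dcb365ace4232c0137d5484e833
print(commit["parents"])  # ['e2b52132f4f122bf054bddb314b502b04bfc3806']
```

Following the parents list from one commit to the next is exactly the walk that git log performs.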

Git command branch/tag: when you use git branch branch-name, it essentially does the following:

hash = HEAD.hash
createFile(".git/refs/heads/(branch-name)", content=hash)

Note that HEAD does not change, so you stay on the branch you were on; to create a branch and switch to it in one go, use git checkout -b branch-name or git switch -c branch-name. The git tag tag-name command does the same thing, except that it creates the file in .git/refs/tags. You can use git branch --list or git tag --list to see the branches/tags you created.

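Git command commit: when you use git commit, it essentially adds a node at the end of the commit graph and sets the parent of this new commit to the commit that HEAD currently refers to (the tree of the new commit is built from the changes you staged with git add):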

commit = new Commit()
commit.tree = currentTree.hash
commit.parent = HEAD.hash
storeGitObject(commit)
if (HEAD refers to some branch B) {   # i.e. not in detached HEAD state   
    make B refer to commit.hash
    make HEAD refer to B
} else {
    make HEAD refer to commit.hash
}

Git command checkout: when you use git checkout branch-name or git switch branch-name, or you switch to a different branch in GitHub Desktop, it sets HEAD = "refs/heads/branch-name" and updates the files in your project folder to match the commit of that branch. If you had uncommitted changes, Git carries them over to the new branch, or refuses to switch if they would be overwritten.

Additionally, you can do git checkout aa1e3a (or with any other hash) to make the HEAD refer to that commit. Git or a Git client will then tell you that you are in "detached HEAD state", meaning that your HEAD does not refer to a branch, but to a "loose" commit.

Git command reset: you can instead use git reset to move back to another commit (I don't think there is an equivalent for this in GitHub Desktop). Instead of moving HEAD to another commit, this moves the branch that you are on to another commit (assuming your HEAD is not detached). For instance, you can use git reset --hard HEAD~2 to move the branch you are on back to the grandparent (~2) of the HEAD. (Using --soft instead of --hard keeps the changes from the commits you moved past in your project folder as staged changes, while --hard removes them from your project folder, together with any other uncommitted changes.)
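
To see how commit, checkout and reset differ in what they move, here is a toy in-memory model. It is a sketch only: commit IDs are made-up strings such as c1 rather than real hashes, every commit has at most one parent, and the project folder and staging area are ignored entirely. The class ToyRepo and its methods are hypothetical names:

```python
import itertools


class ToyRepo:
    """Models only branches, HEAD and parent links."""

    def __init__(self):
        self.parents = {}                # commit id -> parent id (None for the root)
        self.branches = {"main": None}   # branch name -> commit id
        self.head = ("branch", "main")   # or ("detached", commit id)
        self._ids = itertools.count(1)

    def head_commit(self):
        kind, value = self.head
        return self.branches[value] if kind == "branch" else value

    def commit(self):
        new_id = f"c{next(self._ids)}"
        self.parents[new_id] = self.head_commit()
        self._move_current(new_id)
        return new_id

    def checkout(self, target):
        # A branch name attaches HEAD to that branch; a commit id detaches it.
        self.head = ("branch", target) if target in self.branches else ("detached", target)

    def reset(self, commit_id):
        # Moves the branch HEAD is attached to, not HEAD itself.
        self._move_current(commit_id)

    def ancestor(self, commit_id, steps):
        # Roughly what commit~steps means for single-parent history.
        for _ in range(steps):
            commit_id = self.parents[commit_id]
        return commit_id

    def _move_current(self, commit_id):
        kind, value = self.head
        if kind == "branch":
            self.branches[value] = commit_id
        else:
            self.head = ("detached", commit_id)


repo = ToyRepo()
c1, c2, c3 = repo.commit(), repo.commit(), repo.commit()
repo.reset(repo.ancestor(repo.head_commit(), 2))  # like git reset HEAD~2
print(repo.branches["main"])                      # c1
repo.checkout(c3)                                 # detached HEAD at c3
c4 = repo.commit()
print(repo.branches["main"], repo.head)           # c1 ('detached', 'c4')
```

Note that c2 and c3 still exist after the reset; no branch points to them anymore, which is why the detached checkout of c3 still works in this model.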

Rewriting History

Now finally, there are some cool things that you can do with Git which will surely be useful at some point. Note that when you rewrite history, your local clone of the repo and the remote repository will disagree about the history of the branch (assuming you are working with a remote), so a plain git push is rejected and you have to run git push -f instead. If other people are working on the same branch as you, this will mess up their history. Therefore I strongly suggest you never force-push to the main branch unless you are working on your own.

Git command commit --amend: you can use git commit --amend to modify the current (HEAD) commit instead of adding a new commit on top of it. After you have staged your changes with git add or a client like GitHub Desktop, this flag replaces the HEAD commit with a new one that has the same parent, which is why it counts as rewriting history. It essentially does the following:

commit = new Commit()
commit.tree = currentTree.hash
commit.parent = HEAD.parent.hash      # NOTE only difference: HEAD.parent 
storeGitObject(commit)
if (HEAD refers to some branch B) {   # i.e. not in detached HEAD state   
    make B refer to commit.hash
    make HEAD refer to B
} else {
    make HEAD refer to commit.hash
}

Git command rebase: imagine you are working on two different branches, main and feature1. Both branches share a common history up to some commit, after which main and feature1 each gained commits of their own, so the commit graph forks at that commit.

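Assuming your HEAD refers to feature1, if you use git rebase main (or use GitHub Desktop for this), Git finds the lowest common ancestor of the two branches (the "merge base", i.e. the latest commit that both branches originate from) and re-applies the commits that feature1 made since then on top of main. These are new commits with new hashes, and feature1 now points to the last of them, so the graph no longer forks: feature1 is simply main plus a few extra commits.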


Now that you have rebased feature1 onto main, it is really easy to make a pull request for the feature1 branch; after all, Git can simply do the following (a "fast-forward"):

main = feature1

So that after merging, main and feature1 point to the same commit and the history is a single straight line.
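
The rebase described above can be sketched on a toy commit graph. This is a simplification under stated assumptions: commits are single letters with one parent each, "re-applying" a commit is modelled by creating a copy with a prime mark instead of re-applying a diff, and conflicts do not exist. The function names are made up:

```python
def history(parents, commit):
    """List commits from `commit` back to the root (single-parent history)."""
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = parents[commit]
    return chain


def merge_base(parents, a, b):
    """Return the latest commit that is reachable from both a and b."""
    reachable_from_a = set(history(parents, a))
    for commit in history(parents, b):
        if commit in reachable_from_a:
            return commit
    return None


def rebase(parents, branch_tip, onto):
    """Replay the commits unique to branch_tip on top of onto; return the new tip."""
    base = merge_base(parents, branch_tip, onto)
    to_replay = []
    for commit in history(parents, branch_tip):
        if commit == base:
            break
        to_replay.append(commit)
    new_tip = onto
    for commit in reversed(to_replay):   # oldest commit first
        new_commit = commit + "'"        # rewritten commits get a new identity
        parents[new_commit] = new_tip
        new_tip = new_commit
    return new_tip


# main: A - B - C, feature1: A - X - Y (forked at A)
parents = {"A": None, "B": "A", "C": "B", "X": "A", "Y": "X"}
main, feature1 = "C", "Y"
feature1 = rebase(parents, feature1, main)
print(history(parents, feature1))   # ["Y'", "X'", 'C', 'B', 'A']
print(merge_base(parents, main, feature1) == main)   # True: main can fast-forward
```

The old commits X and Y are still in the graph, but no branch refers to them anymore.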

Git command rebase -i: There is also something called "interactive rebase", which also rewrites history, but its name is misleading in the sense that it does not necessarily change the base of the current branch. You can also just see it as a general history rewriting tool. When you do something like git rebase -i HEAD~4, it opens a text editor in which you choose what to do with each of the last 4 commits (e.g. "squash", "drop", "reword", "edit" and others [2]). When you close the text editor, it goes over the commits in order and stops whenever it needs your input. When you are finished with editing at such a stop, you can use git rebase --continue to continue, or you can go to GitHub Desktop and click on the Continue button. You can always do git rebase --abort if it turns out you messed something up.

Finally, there are other history rewriting operations you can do, like git cherry-pick, but you should refer to the Git documentation for this [1].

Remotes

Now for a completely different topic: if you work with GitHub, the repository is hosted somewhere else, because often you want more than one person working on the same repository. When you have another copy of the repository somewhere else, it is called a remote. Usually this remote is called "origin" and it just points to the URL of the repo hosted on GitHub/GitLab/etc.

Another feature of branches is that they can "track" a branch on the remote. So for instance, main will probably track origin/main.

Git command pull: when you use git pull origin while you are on main (and main tracks origin/main), it roughly does as follows:

# git fetch simply gets all git objects from the remote that didn't exist locally yet
git fetch   

ancestor = leastCommonAncestor(main, origin/main)
if (ancestor == main) {  # if origin/main is directly "ahead" by >= 0 commits
    main = origin/main
} else {                 # if branches diverge
    error("main and origin/main diverge by X and Y commits respectively")
}

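Thus it will "fast-forward" the local branch to the remote branch, but only if origin/main is a descendant of main (i.e. origin/main contains every commit of main, plus possibly some newer ones). If the branches diverge, Git has to merge or rebase instead; depending on your Git version and configuration, it does one of these automatically or stops with an error like the one above.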
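
The fast-forward check boils down to an ancestry test on the commit graph. Below is a sketch on a toy graph in which each commit maps to a list of parents (so merge commits are allowed); the commit names and the functions is_ancestor and fast_forward are made up. Unlike the pseudocode above, this version treats "local is ahead of the remote" as nothing to do rather than as an error:

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` can be reached from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def fast_forward(parents, local, remote):
    """Return the commit the local branch should point to after a pull."""
    if is_ancestor(parents, local, remote):
        return remote     # remote is ahead (or equal): move local forward
    if is_ancestor(parents, remote, local):
        return local      # local is ahead: already up to date
    raise ValueError("local and remote branches have diverged")


# A - B - C is shared history; D continues from C, X forks off at B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "X": ["B"]}
print(fast_forward(parents, local="C", remote="D"))   # D
print(fast_forward(parents, local="D", remote="C"))   # D
```

Calling fast_forward(parents, local="X", remote="D") raises, because neither commit is an ancestor of the other. In a real repository, git merge-base --is-ancestor performs the same kind of test.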

Git command push: in contrast, git push origin kind of does the reverse: first it transfers the Git objects (blobs, trees, commits) that do not already exist on the remote; after that, if your local branch main is a descendant of the remote's main (i.e. main contains every commit of the branch it is tracking, plus possibly some newer ones), it sets the remote's main, and with it your origin/main, to point to the same commit as your local branch main. Otherwise the push is rejected, unless you force it.

See here for commands for changing properties of the remote [3].

Example Workflow

This is one example of how you can collaborate using Git. Typically there will be one main branch and every collaborator can branch off of that one. That means to start a new feature, you do as follows:

  1. Create a new branch and switch to it: git checkout -b new-feature
  2. Write the code for a commit
  3. Use git add <filenames> to select changes for your commit and then use git commit -m "message"
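  4. Use git push to push your changes to the remote (the first time, use git push -u origin new-feature so that your branch tracks the new remote branch)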
  5. Repeat from 2. until done

Then regularly, you should do the following (because if you do not, you will diverge from the main branch and get a lot of merge conflicts when you issue a PR):

  1. git checkout main (if you have uncommitted changes, you can use git stash push)
  2. git pull to get the newest version of main
  3. git checkout new-feature
  4. git rebase main to base HEAD (i.e. new-feature) off of main; when it shows merge conflicts, you should fix those, use git add name-of-file.ext to mark that you are done and finally use git rebase --continue to keep going
  5. When rebasing is finished and you are satisfied with the rebase (i.e. your code still works as intended), your local branch is now up-to-date with the latest version of main, but the remote branch is not; if nobody else was working on the same branch as you, you can use git push -f to update the remote as well

Then when you are done, you can create a PR and you should have no merge conflicts.