Learn Git Branches with your ML Project

Published in

Towards Data Science

9 min read · Aug 24, 2021

Let me start this article with a hypothesis: you know the basics of Git. You use git init, git clone, git add, git commit, git push, git fetch, git pull. But let me test you. What does the following command do?

$ git switch -c development

To be honest, before I started my machine learning project a month ago, I didn’t know what this line does. I stumbled across the command while learning branching to round out my Git version control skills. This article summarises what I learned; it is the guide I wish I had had, so that I could have learned branching as effectively and quickly as possible. If you are in a similar place, I encourage you to get comfortable with branching by reading this article to the end.

To make this learning effective, we first look under the hood of Git to really understand branching. This includes pointers, blobs, and commit objects. Then I show you the commands I use on a regular basis, so that you can keep multiple versions of your project in different branches of your Git repository.

How does Branching work under the hood?

What happens when you type git commit -m "update"? Git creates a commit object. It contains the author’s name and email address, the message you typed for this commit (“update”), a pointer to the commit that came directly before it, and a pointer to a snapshot of the files you staged. Okay, that was a little fast. Let’s start with staging.

Let’s assume you created two files

README
HelloWorld.py

and staged them with:

$ git add README HelloWorld.py

Staging computes a checksum for each file and stores that version of the file in the Git repository. Git calls the stored version of a single file a blob; each blob represents the content of one file. When we commit, each subdirectory that contains staged files is checksummed as well, and the result is a tree object. A tree object stores the file names together with pointers to the corresponding blobs, that is, their SHA-1 hashes. Don’t forget the commit object I mentioned earlier: it points to the previous commit (if there is one) and to the root tree, so that Git can recreate this version when needed. We now have four Git objects (two blobs, one tree, and one commit):

drawing by author, schematic of Git files
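To make the checksumming step concrete, here is a minimal Python sketch of how a blob's SHA-1 can be reproduced outside of Git. It relies on the fact that Git hashes a short header ("blob", the content length, and a null byte) followed by the file's bytes. The function name git_blob_sha1 and the file content are assumptions made for illustration.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical content of HelloWorld.py, used only for illustration.
hello_world = b'print("Hello World")\n'
blob_id = git_blob_sha1(hello_world)

print(blob_id)       # a 40-character hexadecimal string
print(len(blob_id))  # 40
```

For a real file, git hash-object <file> should print the same value as this function does for the file's bytes. Two files with identical content get the same blob, whatever their names are, because the file name is stored in the tree object and not in the blob.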

I noted that a commit object points to the previous commit object, also known as its parent. Let’s visualise that as well.

commit objects that point to their tree objects

Now you are ready to hear what branches really are!

A branch is basically a movable pointer to one of the commit objects.

Thus creating a branch is just creating a pointer, which makes it very lightweight. On disk, a branch is a small file that contains the 40-character SHA-1 hash of the commit it points to. Creating a branch is as easy as writing 41 bytes to a file (40 characters plus 1 newline character).
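The 41 bytes can be checked with a few lines of Python. This is only a sketch: it writes a stand-in hash into a throwaway directory that mirrors the refs/heads/<branch> layout, it does not touch a real repository, and it assumes the branch is stored as its own file (Git can also pack many branches into a single file).

```python
import hashlib
import os
import tempfile

# Stand-in for a real commit hash: every SHA-1 hex digest has 40 characters.
commit_sha = hashlib.sha1(b"pretend this is a commit object").hexdigest()

with tempfile.TemporaryDirectory() as fake_git_dir:
    # Mirrors .git/refs/heads/<branch>; the directory is deleted afterwards.
    heads = os.path.join(fake_git_dir, "refs", "heads")
    os.makedirs(heads)
    branch_file = os.path.join(heads, "testing")
    # newline="\n" keeps the line ending at one byte on every platform.
    with open(branch_file, "w", newline="\n") as f:
        f.write(commit_sha + "\n")
    branch_size = os.path.getsize(branch_file)

print(branch_size)  # 41
```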

How branching actually works in practice

The command to create a branch called “testing” looks like this:

$ git branch testing

But you might wonder where this branch branches off. To understand this, we have to introduce another Git essential: the HEAD pointer.

The HEAD Pointer

Git knows where you are in the tree-like structure by maintaining a HEAD pointer. HEAD points to the branch you are currently on. So if you type git branch testing, the new branch testing points to the same commit as the branch that HEAD is pointing to. But git branch does not move HEAD, so it stays on master.

I drew a picture to make this clearer. One branch called master existed before we created a new branch called testing, and since git branch does not change HEAD’s location, HEAD still points to master after testing is created.

a simple commit history, with two branches and the HEAD pointer

This means that if you want more control over where to branch off, you must first change where HEAD is pointing. This is done with git checkout <branch>, which expects a branch that already exists. Note that checking out another branch changes the files in your working directory to the version saved in that branch.

Remember git switch -c from the beginning? git switch -c testing combines git branch testing and git checkout testing and lets you do both at once: you create a branch (a pointer) and let HEAD point to this new branch.

HEAD determines to which branch new commits are added.

If we write git checkout testing and then commit something to our repository with git commit -m "another commit", our situation looks like this:

In this situation the history doesn’t look like branches as we know them from nature. But if we move back to master with git checkout master and then commit while on master, the two branches share a common ancestor, a node from which they diverge. This looks a bit more like actual branches.

testing branch is still 1 commit ahead, but now also 1 commit behind
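The pointer mechanics described so far can be replayed with a toy model in Python. This is a simplified sketch, not how Git is implemented: commits are plain strings, and the function names branch, checkout and commit as well as the commit ids c1 to c4 are made up for illustration.

```python
# Toy model: commit id -> parent id (None for the very first commit).
parents = {"c1": None, "c2": "c1"}
branches = {"master": "c2"}  # branch name -> commit it points to
head = "master"              # HEAD names the branch we are currently on


def branch(name):
    """Like `git branch <name>`: new pointer at the current commit, HEAD unchanged."""
    branches[name] = branches[head]


def checkout(name):
    """Like `git checkout <name>`: HEAD now names another, existing branch."""
    global head
    if name not in branches:
        raise KeyError(f"no such branch: {name}")
    head = name


def commit(new_id):
    """Like `git commit`: the parent is the current tip; only the current branch moves."""
    parents[new_id] = branches[head]
    branches[head] = new_id


branch("testing")    # testing -> c2, HEAD still on master
checkout("testing")
commit("c3")         # testing -> c3, master stays at c2
checkout("master")
commit("c4")         # master -> c4, the branches now diverge at c2

print(branches)  # {'master': 'c4', 'testing': 'c3'}
print(head)      # master
```

Both c3 and c4 have c2 as their parent, which is exactly the common ancestor from the picture.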

We have to talk about merging

So what’s the use of having branches? When you have two or more paths of development, you probably want the best of both worlds, for example by merging the new functionality from one branch into the main line of development.

To understand merging, let us consider two possible merging mechanisms.

Fast Forward Merge

With a fast-forward merge, you simply move the branch that is behind (master) forward to the branch that is ahead (testing). Here is the earlier situation again. A fast-forward merge works if the branch that is behind (master) has no commits that are missing from the branch that is ahead (testing).

Before:

a situation that is designed for a Fast Forward Merge

After:

To achieve this you first move the HEAD pointer to master and then merge :

$ git checkout master
$ git merge testing

If you don’t need the testing pointer anymore, you can delete it by writing

$ git branch -d testing
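The fast-forward condition can be expressed as a small check: master can be fast-forwarded if its commit is an ancestor of the commit testing points to. The following Python sketch uses a toy commit graph; the commit ids and function names are assumptions for illustration, not Git's internals.

```python
# Toy commit graph: commit id -> list of parent ids.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
branches = {"master": "c2", "testing": "c3"}


def ancestors(commit_id):
    """All commits reachable from commit_id, including commit_id itself."""
    seen, stack = set(), [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def can_fast_forward(target, source):
    """True if target has no commits that are missing from source."""
    return branches[target] in ancestors(branches[source])


def merge_fast_forward(target, source):
    if not can_fast_forward(target, source):
        raise ValueError("histories have diverged; a merge commit is needed")
    branches[target] = branches[source]  # just move the pointer forward


can_ff = can_fast_forward("master", "testing")
merge_fast_forward("master", "testing")

print(can_ff)              # True
print(branches["master"])  # c3
```

No new commit is created in this case; the only thing that changes is the commit that master points to.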

Basic Merging

The other kind of merge happens when two diverging paths exist, as the image below shows

a slightly more difficult job to merge, testing is 2 commits ahead and 1 behind

In this situation Git has to perform a three-way merge, using the tips of the two branches and their common ancestor. Instead of just moving the branch pointer forward, Git creates a new snapshot of the files and a new commit that points to it. This commit has a special name: it’s called a merge commit, because it has more than one parent.

a commit object called merge commit was created to take changes from both branches with respect to the common ancestor f30ab
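The structure of such a merge can be sketched with the same kind of toy graph. The sketch below only models the pointers: it finds the common ancestor and records a merge commit with two parents, but it does not combine any file contents. Apart from f30ab, all commit ids and all function names are made up for illustration.

```python
# Toy commit graph: commit id -> list of parent ids.
parents = {
    "a1": [],
    "f30ab": ["a1"],
    "m1": ["f30ab"],  # one commit on master after the split
    "t1": ["f30ab"],  # two commits on testing after the split
    "t2": ["t1"],
}
branches = {"master": "m1", "testing": "t2"}


def ancestors(commit_id):
    """All commits reachable from commit_id, including commit_id itself."""
    seen, stack = set(), [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_base(a, b):
    """Common ancestor that is not an ancestor of any other common ancestor."""
    common = ancestors(a) & ancestors(b)
    best = [c for c in common
            if not any(c in ancestors(other) for other in common if other != c)]
    return best[0]  # a simple history like this one has exactly one


def merge(target, source, merge_id):
    """Record a merge commit with two parents and move target to it."""
    base = merge_base(branches[target], branches[source])
    parents[merge_id] = [branches[target], branches[source]]
    branches[target] = merge_id
    return base


base = merge("master", "testing", "merge1")

print(base)               # f30ab
print(parents["merge1"])  # ['m1', 't2']
print(branches)           # {'master': 'merge1', 'testing': 't2'}
```

Note that testing does not move: only the branch you merge into (here master) points to the new merge commit.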

Creating Remote Branches

If you want to share your branches with others, so that multiple coders can work on your project or so that you can show it on GitHub, you need to push the branches to a remote repository, which creates remote branches.

The syntax looks like the following:

$ git push <remote> <branch> #for example
$ git push origin master

This command creates two things, and both are important:

It creates a remote branch that others can use and build on
It creates a remote-tracking branch in your local repository

Think of remote-tracking branches as references to the state of the remote branch. Their names have the structure remotes/<remoteName>/<branchName>. You still have the local version of the branch, master, which means your local Git repository has the following branches:

master
remotes/origin/master

Note that remote-tracking branches of the form remotes/<remoteName>/<branchName> are local references, but you cannot move them yourself, for example with git merge. Instead, when you have made changes on master, you push them to the server, and the remote-tracking branch remotes/origin/master is updated. As long as you stay out of contact with the remote repository, the remote-tracking branch does not move. If one of your colleagues has already cloned the repository and wants to get the latest changes, they use:

$ git fetch origin

The remote-tracking branches are moved to the latest commits on the corresponding remote branches. Note, however, that when you fetch and receive new remote-tracking branches, you do not automatically get local copies that you can edit. If issue were a new branch, it would look like this:

master
testing
remotes/origin/master
remotes/origin/testing
remotes/origin/issue

You would only have the pointer remotes/origin/issue, which you cannot modify. To create a local, editable copy, use the following command:

$ git checkout -b issue origin/issue
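The difference between local branches and remote-tracking branches can be illustrated with a toy model in Python. It is a strong simplification: a real fetch also downloads the commit objects, while this sketch only updates pointers. The commit ids and the function names fetch and checkout_tracking are assumptions for illustration.

```python
# State on the server: branch name -> commit id.
remote_branches = {"master": "c5", "testing": "c3", "issue": "c6"}

# Pointers in the local repository before fetching.
local_refs = {
    "master": "c4",
    "testing": "c3",
    "remotes/origin/master": "c4",
    "remotes/origin/testing": "c3",
}


def fetch(remote_name, remote_state, refs):
    """Update remotes/<remote_name>/* to match the server; local branches stay put."""
    for branch_name, commit_id in remote_state.items():
        refs[f"remotes/{remote_name}/{branch_name}"] = commit_id


def checkout_tracking(branch_name, remote_name, refs):
    """Like `git checkout -b <branch> <remote>/<branch>`: create an editable local branch."""
    refs[branch_name] = refs[f"remotes/{remote_name}/{branch_name}"]


fetch("origin", remote_branches, local_refs)
has_local_issue_after_fetch = "issue" in local_refs
checkout_tracking("issue", "origin", local_refs)

print(has_local_issue_after_fetch)          # False
print(local_refs["remotes/origin/master"])  # c5
print(local_refs["master"])                 # c4
print(local_refs["issue"])                  # c6
```

After the fetch, the local master is still at its old commit; only the remote-tracking branch has moved, and merging is a separate step.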

git pull

It’s also possible to fetch and merge in one step by writing

$ git pull <remote>

Know What your Team is up to

Losing the overview? Don’t worry. I will round this article off with a command that shows how many commits your branches are ahead of or behind their remote counterparts. But first make sure to run git fetch, so that your local repo is up to date and knows about the latest commits from your colleagues.

$ git branch -vv

The result could look like the following

  testing   7e424c3 [origin/testing: ahead 2, behind 1] change abc
  master    1ae2a45 [origin/master] Deploy index fix
* issue     f8674d9 [origin/issue: behind 1] should do it
  cart      5ea463a Try something new

I want to point out several things from this example:

on the left are the names of the branches; the star indicates where HEAD is pointing
next to them are the abbreviated SHA-1 hashes of the last commit on each branch
the remote-tracking branch is shown in square brackets, together with the difference in commits. For example, ahead 2 means that I committed twice to the local branch testing and have not pushed this work to the remote yet. behind 1 means that our local branch is not fully up to date: someone on the team has pushed to the remote and we haven’t merged that work into our local branch. A branch without square brackets, like cart, is not tracking any remote branch
on the far right are the commit messages
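The ahead and behind numbers can be understood as set differences on the commit graph: ahead counts the commits reachable only from the local branch, behind counts the commits reachable only from the remote-tracking branch. Here is a minimal Python sketch on a toy graph; the commit ids and function names are made up for illustration.

```python
# Toy commit graph: commit id -> list of parent ids.
parents = {
    "c1": [],
    "c2": ["c1"],
    "r1": ["c2"],  # pushed by a colleague, known locally after git fetch
    "l1": ["c2"],  # two local commits that are not pushed yet
    "l2": ["l1"],
}


def ancestors(commit_id):
    """All commits reachable from commit_id, including commit_id itself."""
    seen, stack = set(), [commit_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def ahead_behind(local_tip, upstream_tip):
    """Return (ahead, behind) of the local branch relative to its upstream."""
    local, upstream = ancestors(local_tip), ancestors(upstream_tip)
    return len(local - upstream), len(upstream - local)


result = ahead_behind("l2", "r1")
print(result)  # (2, 1)
```

This corresponds to the testing line in the output above: ahead 2, behind 1.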

Deleting (remote) branches

Finally, if you don’t need the branches anymore, you can delete them

remotely,

$ git push <remoteName> --delete <branchName>

locally,

$ git branch -d <branchName>

In the beginning, it can be annoying to remember to delete in both places, but now you know how, and you can return to this article whenever you feel shaky about one of the commands.

Summary

You learned about commit, tree, and blob objects. You know that branches are really just pointers to commit objects. It’s also vital to remember the HEAD pointer, since it determines where your new branch is created when you run git branch. You are now familiar with merging branches and with deleting them once you no longer need them. Please remember the difference between a Fast Forward Merge and a Basic Merge.

If you enjoyed this article, you might enjoy some of the other articles I wrote on Medium.

Working with JSON data in python

Writing to JSON files and reading from JSON files, explained and illustrated with examples in Python.

Skip-Gram Neural Network for Graphs

This article will go into more details of node embeddings. If you lack intuition and understanding of node embeddings…

Graph Coloring with networkx

The solution to the graph coloring problem is conceptually easy but powerful in its application. This tutorial shows…


Written by Yves Boutellier