GIT — Basic building blocks

There are hundreds of articles and videos about it, but not many talks about its building blocks. They would be overly articulating about the usage and commands. Even if you go through the official documentation, you can only get the hang of usage but not the internals. Internals are fun to know and it will boost your confidence while you use the product. So in PART 1, I explain the internals in simple terms as usual. Let me see if I can do it.

GIT is Distributed Version Control System — What that means?

The creator of GIT is none other than Linus Torvalds. Hope you know his first product, the “LINUX Kernel” which is ruling the world. What else do you expect out of him? Just one more product to rule the world.

And first thing first! Git is just a “stupid content tracking tool” as expressed by the creator himself. How it was made as a “Distributed Version Control System” is sheer brilliance. It means, it does the following:

It tracks the different versions of the evolution of a file.
If needed, it allows us to get back to a particular version quickly.
Tracking back and forth is not just for a single file but for the entire repository as a revision.(*After all, the entire repository is just Trees and blobs).
You don’t have to connect to a server for switching between versions/revisions.
All the above-mentioned works in a distributed way i.e., everything is brought in your local system. The server contents are a mere copy of what you have locally.

Centralized Vs Distributed

In a centralized repository, multiple versions of files is maintained in a central server. When people sync for the recent changes at some point, it downloads the very recent snapshot from the server. Later, if you want to go to any other previous version, you can fetch that from Central Server and replace it locally with the one you have. That means, at any point in time, just only one snapshot is present in your local machine.

In contrast, once you clone/pull, Git stores everything in its entirety in your local storage. Let’s assume you have 20 released versions of your repository, Git stores all those 20 snapshots in your local system and you can go back and forth without connecting to any server. I know you might be asking me like — “Wait! Let’s say my repo size is 5 GB and it stores 20 versions of it in my local repository, so it sweeps off 5*20 = 100 GB of my disc space?”. Actually no! That’s the beauty of Git. How Git stores and visualizes files really matters and that unique mechanism gives the power of distributed nature of it.

Hash & DAG — Keep these in mind.

Before getting into the Git object model, you must know the following about Hash:

Hashing is different from encryption and it is one way i.e, you cannot get the hashed content back with the help of hash.
Hashing is a mechanism of producing “digest” for a given stream of characters.
A digest is a unique standard size of bits.
SHA-1 used in GIT is one of the famous Hashing mechanisms, which produces 160 bits of “digest” for any stream of characters as input.
A “digest” is usually represented in Hexadecimal values (4 bits), thus an SHA-1 digest would look like 40 (160/4) hexadecimal characters.
There is no size or word limitation on the input stream of characters.
A trillion characters content or a single character content, both result in a Hash of the same size i.e., 40 Hex Characters.
Hash keys produced by SHA-1 are unique and it is humanly impossible to collide for 2 different contents. Let’s not debate on that.

Another concept to remember is DAG. DAG (“Directed Acyclic Graph”) is one type of graph representation without cycles. Tracking evolution is one of its use cases.

Git’s Object Model — Quick glance

On your git repository, you will obviously see the current working snapshot of your files. The rest all files of the multiple snapshots of your repository would be consolidated and represented in a hidden directory called “.git”.

Let me give a very quick overview of “.git" storage structure straight away.

Each Unique file (I mean the file contents) would be hash computed by Git.
If 2 or more files have “ditto” the same contents in your repository, then only one hash is computed.
The hash “digest” is the file name in “.git” and actual file contents will be compressed inside that file.
Every file of your repository is constructed like above when “added” to the git.
When you “commit” those changes (Let’s say C1), it will create a new commit index using its contents (commit text+username+timestamp+etc) as a new special hash object. Thus, it MUST be unique.
This special hash object (commit hash) will be connected to all those file content hashes like a DAG.
Later, whenever a file is changed and “added”, GIT creates a new hash file “only” for that file object.
When “committed” again (Let’s say C2), it will create a committed index connecting to that changed file. For the rest of the file objects, it connects them using a pointer to that sub-part of DAG.
Git visualizes the whole repository’s file system in its own style and stores them as commits, blobs (file contents), and trees with connections.

All those mentioned are consolidated in that hidden “.git” directory present in your repository as I mentioned. That’s it! That’s the whole product. As I promised, I will not go further and this level of understanding is good enough to work with the product.

Some important “Reference Pointers” of GIT

Now we learned, if we just track the “commit” index, we can easily reconstruct a sub-tree of your DAG and that is your revision. How would you track it? It is again simple! Just with reference pointers. A reference pointer tracking a committed index is called “branch”. You can create as many branches as you can like “JIRA-123”, “stupidIdea”, “RTC-567” etc.

Irrespective of you creating a reference pointer, 2 reference pointers that you must know is HEAD & master. You are free to rename “master” as “main”, “development”, “virgin” or whatever. But “HEAD” remains “HEAD”.

master — It is a reference to the latest commit in the repository. This is also the first branch of a repository created by the git itself. Thus, it keeps tracking the latest snapshot in the repository.

HEAD — This is a “movable” reference pointer to a reference pointer. Mostly, it would be referencing “master” branch until switched. When “switch/checkout” is called, it will reference that particular branch.

In the below diagram, I have abstracted commit as single entity (it will be pointing to multiple blobs and trees internally) to illustrate how master and HEAD moves.

Just by using this simple idea, you can track the entire versions of your GIT repository.

Some GIT plumbing command examples

Actually, there is no need for you to know the plumbing commands. They are just fun to dig and re-verify what I spoke about just now. I am giving it here, feel free to omit this section.

git hash-object

It produces SHA-1 hash for a given text. The same is used by git internally. You can’t directly use this command, but can pipe it through the “echo” command as below:

The “objects” folder of git will find a new entry. git creates the first 2 “hex” as a folder name. The remaining 38 “hex” would be the file name. It is for easy indexing. It will appear as below:

2. git cat-file

If you open that file in a notepad, the content would be compressed. Don’t worry, you can unzip it with git cat-file as below:

Conclusion

Now I believe you have understood the basic building blocks of Git. It is nothing but hash-named blobs, trees, and commits connected as DAG. It had reference pointers at multiple places to reconstruct a snapshot. Thus, Git visualizes the file system in its own way to reconstruct it quickly. This magic happens just locally in a distributed way. That’s pretty much about this article. There is PART 2 cooking and it is on the way. In that, I promise, I will talk only from the user perspective and you can use Git confidently thereon. Catch you, until then C’ya!