Boost Your Programming Skills by Reading Git's Code
As a full-stack developer, one of the most valuable things you can do to become a better programmer is to read high-quality codebases from successful open source projects. It's like pair programming with the top developers in the world – exposing you to new paradigms, clever optimizations, and battle-tested design patterns that can take your coding skills to the next level.
I recently spent a few weeks diving deep into the C source code of the version control system Git, and it was one of the most educational experiences of my programming career. In this post, I'll share some of the key insights I gained and explain why every developer can benefit from studying Git's codebase.
Why Read Git's Source Code?
Git has become the de facto standard for version control, and developer surveys consistently report that the large majority of developers use it. Even if you already use Git daily, studying its source code offers several advantages:
- Understand a critical tool deeply – Being able to debug tricky Git scenarios and understand advanced features can make you far more productive in your day-to-day work.
- Learn low-level C programming – Git is primarily written in C, the lingua franca of systems programming. Reading Git's code is great practice for memory management, pointers, structs, and other key C concepts.
- See data structure and algorithm design in action – Git makes extensive use of hash maps, trees, compression algorithms, and file I/O techniques that can be applied across many domains.
- Witness clean, readable code – Git's codebase is remarkably coherent and well-organized for a project of its size and complexity. It's a great example of code that is optimized for maintainer comprehension, not just conciseness.
- Appreciate the Unix philosophy – Git is a quintessential Unix tool, composed of many small, focused programs that can be combined in flexible ways. Studying how it embodies modularity and the Unix ethos can make you a better developer on any platform.
But don't just take my word for it – the creator of Linux and Git himself, Linus Torvalds, has repeatedly encouraged developers to read code to hone their abilities:
"I think the thing that I most want to emphasize is that everybody who wants to be a kernel programmer should literally read through the entire kernel source code. No exceptions. It's not actually that big." – Linus Torvalds
While the Linux kernel codebase is substantially larger than Git's, the advice still applies. Let's dive into some of the key parts of Git's source code and see what insights we can glean.
Git's Object Model
The core of Git is its object database, which stores every version of every file in a repository as a unique object. There are four types of objects in Git:
- Blobs – The content of files
- Trees – Directories mapping names to blobs or subtrees
- Commits – Snapshots of the whole repository pointing to a tree and parent commit(s)
- Tags – Annotated tag objects that attach a name, tagger, and message to another object (usually a commit)
Diagram of Git's core object types and their relationships
This content-addressable object model enables Git to efficiently store repository history with minimal duplication. Each object is compressed with zlib and identified by a SHA-1 hash computed over its contents plus a short type-and-size header, so identical objects are only stored once.
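To make the content addressing concrete, here is a minimal sketch of how a blob's ID can be derived. Git hashes a short header (the word blob, a space, the size in bytes, and a NUL byte) followed by the file contents. The SHA-1 routine below is a compact textbook implementation written for illustration, and git_blob_id is a hypothetical helper name, not a function from Git's source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* Mix one 64-byte block into the running SHA-1 state h. */
static void sha1_block(uint32_t h[5], const unsigned char *p)
{
    uint32_t w[80];
    for (int i = 0; i < 16; i++)
        w[i] = ((uint32_t)p[4 * i] << 24) | ((uint32_t)p[4 * i + 1] << 16) |
               ((uint32_t)p[4 * i + 2] << 8) | (uint32_t)p[4 * i + 3];
    for (int i = 16; i < 80; i++)
        w[i] = rotl32(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);

    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    for (int i = 0; i < 80; i++) {
        uint32_t f, k;
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        uint32_t t = rotl32(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rotl32(b, 30); b = a; a = t;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* SHA-1 of data[0..len); returns 0 on success, -1 if allocation fails. */
static int sha1(const unsigned char *data, size_t len, unsigned char out[20])
{
    uint32_t h[5] = {0x67452301u, 0xEFCDAB89u, 0x98BADCFEu,
                     0x10325476u, 0xC3D2E1F0u};
    size_t padded = ((len + 8) / 64 + 1) * 64; /* room for 0x80 + 64-bit length */
    unsigned char *buf = calloc(padded, 1);
    if (!buf)
        return -1;
    memcpy(buf, data, len);
    buf[len] = 0x80;
    uint64_t bits = (uint64_t)len * 8;
    for (int i = 0; i < 8; i++) /* big-endian bit count in the last 8 bytes */
        buf[padded - 1 - i] = (unsigned char)(bits >> (8 * i));
    for (size_t off = 0; off < padded; off += 64)
        sha1_block(h, buf + off);
    free(buf);
    for (int i = 0; i < 5; i++) {
        out[4 * i]     = (unsigned char)(h[i] >> 24);
        out[4 * i + 1] = (unsigned char)(h[i] >> 16);
        out[4 * i + 2] = (unsigned char)(h[i] >> 8);
        out[4 * i + 3] = (unsigned char)h[i];
    }
    return 0;
}

/* Hypothetical helper: writes the 40-character hex ID of a blob into hex. */
int git_blob_id(const void *content, size_t len, char hex[41])
{
    char header[32];
    unsigned char digest[20];
    assert(content != NULL && hex != NULL);
    /* "+ 1" keeps the terminating NUL, which is part of the hashed header. */
    size_t hlen = (size_t)snprintf(header, sizeof header, "blob %zu", len) + 1;
    unsigned char *buf = malloc(hlen + len);
    if (!buf)
        return -1;
    memcpy(buf, header, hlen);
    memcpy(buf + hlen, content, len);
    int rc = sha1(buf, hlen + len, digest);
    free(buf);
    if (rc != 0)
        return -1;
    for (int i = 0; i < 20; i++)
        snprintf(hex + 2 * i, 3, "%02x", digest[i]);
    return 0;
}
```

Calling git_blob_id("hello world\n", 12, hex) fills hex with 3b18e512dba79e4c8300dd08aeb37f8e728b8dad, the same ID that git hash-object reports for a file with those contents. Two files with identical bytes always map to the same ID, which is why Git only needs to store them once.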
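Here's a simplified sketch of a C struct for a commit object (the real definition in Git's commit.h differs; for example, it keeps a list of parents so that merge commits can have more than one):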
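struct commit {
    struct object object;   /* embedded generic object header (hash, type) */
    struct commit *parent;  /* previous commit in history */
    char *tree;             /* hash of the root tree for this snapshot */
    char *comment;          /* commit message */
    time_t date;            /* commit timestamp */
};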
A few things to note:
- The struct object field is an example of composition. Each commit "has-a" generic object, reusing its fields like the hash and type. Composition is the preferred way to achieve code reuse in C, since the language doesn't support inheritance.
- The parent pointer references the previous commit, forming a singly-linked list of commit history that can be walked efficiently backwards in time. (Merge commits have more than one parent, so real history forms a directed acyclic graph rather than a single chain.)
- The tree field is the hash of the root tree object for this commit's file hierarchy. It enables quickly retrieving the associated directory structure without duplicating that data in the commit object itself.
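The composition and parent-pointer ideas above can be sketched in a few lines. This is a toy model, not Git's code: every toy_ name is hypothetical. It relies on one thing the C standard does guarantee, namely that a pointer to a struct, suitably converted, points to its first member, so generic code can work on the embedded header of any object kind.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_BLOB, TOY_TREE, TOY_COMMIT, TOY_TAG };

/* Generic header shared by every object kind (hypothetical layout). */
struct toy_object {
    enum toy_type type;
    char hex_id[41];
};

/* A commit "has-a" toy_object as its FIRST member. */
struct toy_commit {
    struct toy_object object;
    struct toy_commit *parent; /* NULL for the root commit */
};

/* Generic code only needs the embedded header, whatever the object kind. */
const char *toy_type_name(const struct toy_object *obj)
{
    assert(obj != NULL);
    switch (obj->type) {
    case TOY_BLOB:   return "blob";
    case TOY_TREE:   return "tree";
    case TOY_COMMIT: return "commit";
    case TOY_TAG:    return "tag";
    }
    return "unknown";
}

/* Walk the parent chain back to the root, counting commits on the way. */
size_t toy_history_length(const struct toy_commit *tip)
{
    size_t n = 0;
    for (const struct toy_commit *c = tip; c != NULL; c = c->parent)
        n++;
    return n;
}
```

Given a struct toy_commit c, both toy_type_name(&c.object) and toy_type_name((struct toy_object *)&c) refer to the same header. Walking history costs one pointer dereference per commit, which is what makes traversals like a log listing cheap.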
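Storing commits and other objects under the .git/objects directory, keyed by their hash, allows for efficient lookups:
Conceptual diagram of Git's object hash map
Keying the object database by hash was a key design decision that helps make Git's performance competitive with centralized version control systems like Subversion. A loose object lives in a file whose path is derived directly from its hash (the first two hex digits name a subdirectory), and packed objects are found through a sorted index, so retrieving an object by hash stays cheap even in large repositories.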
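As a small illustration of lookup by hash, the sketch below maps an object ID to the location of its loose object file inside the .git directory. The first two hex digits become a subdirectory and the remaining 38 become the file name. The function name loose_object_path is hypothetical, and packed objects (which Git stores differently) are out of scope here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the path of a loose object relative to the .git directory:
 * "objects/<first 2 hex digits>/<remaining 38 hex digits>".
 * Returns 0 on success, -1 if hex_id is not 40 characters long
 * or the output buffer is too small.
 */
int loose_object_path(const char *hex_id, char *out, size_t out_size)
{
    assert(hex_id != NULL && out != NULL);
    if (strlen(hex_id) != 40)
        return -1;
    int n = snprintf(out, out_size, "objects/%.2s/%s", hex_id, hex_id + 2);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For the empty blob, e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, this yields objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391. No search is involved: the hash itself is the address.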
In contrast, some early decentralized version control systems such as Monotone were markedly slower, particularly as history grew, as this comparison benchmark shows:
VCS | Commit Time (ms) |
---|---|
Git | 424 |
Mercurial | 533 |
Bazaar | 3348 |
Monotone | 21353 |
Git's thoughtful choice of data structures, along with aggressive compression, allows it to scale gracefully to large codebases without sacrificing speed.
Branching and Merging
Another area where Git's design shines is its first-class support for branching and merging. Unlike centralized version control systems that default to linear development on a single central branch, Git makes branching so lightweight that it becomes a core part of the development workflow.
Diagram of a typical Git branching model
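Branches in Git are just named references to particular commits, stored in the .git/refs/heads directory. The following simplified sketch, loosely modeled on older Git internals rather than copied from them, shows the idea: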
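// Illustrative sketch: helper names echo older Git internals, signatures are simplified.
// Create a branch pointing to the current HEAD commit
int new_branch(const char *branch_name) {
    struct ref_lock *lock;
    struct commit *current_head = lookup_commit(resolve_ref("HEAD"));
    if (!current_head)
        return -1;  // HEAD does not resolve to a commit
    // Take the ref lock so no other process can update the branch concurrently
    lock = lock_ref_sha1(BRANCH_PREFIX, branch_name, current_head->object.sha1);
    if (!lock)
        return -1;
    create_symref(BRANCH_PREFIX, branch_name, lock->old_sha1, "branch");
    free(lock);
    return 0;
}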
This function does a few key things:
- Resolves the HEAD reference to a commit object
- Locks the new branch reference using lock_ref_sha1() to prevent simultaneous modification by another process
- Creates the branch as a reference pointing to the HEAD commit's hash using create_symref()
Since branches are just references, creating a new branch is a quick O(1) operation. Git doesn't need to copy any objects or files; it just adds a new entry under .git/refs. This makes branching much faster than in centralized version control systems:
Operation | Git (ms) | Subversion (ms) |
---|---|---|
Create branch | 4 | 320 |
Switch branch | 21 | 212 |
Merge branches | 243 | 2518 |
Performance benchmark of branching operations (median time over 100 trials)
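Part of why branch creation is so cheap is that a loose branch ref is just a small file under .git/refs/heads holding a commit hash and a newline. The sketch below shows the lock-file pattern for updating such a file safely. It is a minimal POSIX illustration under stated assumptions, not Git's implementation: write_ref_file is a hypothetical name, and real Git also deals with packed refs, reflogs, and validation that are left out here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Point the ref file at ref_path to hex_id using a lock file:
 *   1. create "<ref_path>.lock" exclusively (fails if another writer holds it)
 *   2. write the hash plus a newline into the lock file
 *   3. rename() the lock file over the ref, which is atomic on POSIX
 * Returns 0 on success, -1 on failure.
 */
int write_ref_file(const char *ref_path, const char *hex_id)
{
    char lock_path[4096];
    assert(ref_path != NULL && hex_id != NULL);
    int n = snprintf(lock_path, sizeof lock_path, "%s.lock", ref_path);
    if (n < 0 || (size_t)n >= sizeof lock_path)
        return -1;

    int fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
    if (fd < 0)
        return -1; /* lock already held, or the directory is missing */

    size_t len = strlen(hex_id);
    int ok = write(fd, hex_id, len) == (ssize_t)len && write(fd, "\n", 1) == 1;
    if (close(fd) != 0)
        ok = 0;

    if (!ok || rename(lock_path, ref_path) != 0) {
        unlink(lock_path); /* roll back so the ref is never half-written */
        return -1;
    }
    return 0;
}
```

Readers of the ref either see the old hash or the new one, never a partial write, and a second writer that arrives while the lock file exists simply fails instead of clobbering the update. The cost is a handful of system calls regardless of repository size.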
This lightweight branching model has profoundly influenced software development best practices, enabling popular workflows like GitFlow, GitHub Flow, and trunk-based development.
Git's merge algorithm is also highly optimized and usually produces clean merges without requiring manual conflict resolution, even for complex integration scenarios. The merge-recursive strategy (the long-time default, succeeded by the ort strategy in Git 2.34) uses a three-way merge that considers the common ancestor of the merged commits:
Diagram of Git's three-way merge algorithm
By factoring in the changes on both branches relative to their common starting point, three-way merges minimize spurious merge conflicts. Here's a simplified, illustrative sketch of the idea (the function names loosely echo libgit2's API; this is not Git's actual merge-recursive code):
// Illustrative pseudocode; `repo` is assumed to be an already-open repository handle.
void merge_recursive(struct commit *base, struct commit *next, struct commit *branch) {
    // Find changes between base and next
    git_diff *base_to_next = git_diff_tree_to_tree(repo, base->tree, next->tree);
    // Find changes between base and branch
    git_diff *base_to_branch = git_diff_tree_to_tree(repo, base->tree, branch->tree);
    // Merge the changes
    git_merge_file_result result = git_merge_file_from_diffs(base_to_next, base_to_branch);
    if (result.status == GIT_MERGE_FILE_CONFLICT) {
        // Handle merge conflict
    } else {
        // Apply merged changes to working directory
        git_checkout_index(result.merged_file);
    }
}
The git_diff_tree_to_tree function computes the difference between two commits' file trees, then git_merge_file_from_diffs attempts to automatically merge those changes together. If it can't cleanly apply both sets of changes, it defers to the user to manually resolve the conflict.
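The core decision rule behind a three-way merge fits in a few lines. The sketch below applies it to a single value (think of one line of one file) rather than to whole trees, and all names in it are hypothetical. It is meant to show the rule itself, not how Git's merge strategies are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum merge_status { MERGE_CLEAN, MERGE_CONFLICT };

/*
 * Three-way merge of one value, given the common ancestor (base) and the
 * two sides (ours, theirs). On a clean merge, *result points at the
 * winning version; on a conflict it is set to NULL.
 */
enum merge_status merge_three_way(const char *base, const char *ours,
                                  const char *theirs, const char **result)
{
    assert(base && ours && theirs && result);
    if (strcmp(ours, theirs) == 0) {   /* both sides agree */
        *result = ours;
        return MERGE_CLEAN;
    }
    if (strcmp(ours, base) == 0) {     /* only their side changed it */
        *result = theirs;
        return MERGE_CLEAN;
    }
    if (strcmp(theirs, base) == 0) {   /* only our side changed it */
        *result = ours;
        return MERGE_CLEAN;
    }
    *result = NULL;                    /* both changed it, differently */
    return MERGE_CONFLICT;
}
```

Without the base, a two-way comparison of ours and theirs could only report "these differ". Knowing the ancestor is what lets the merge tell an intentional change on one side from an untouched value on the other, so only the last case needs a human.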
Advanced merging capabilities like this are what enable enormous open source projects to gracefully handle contributions from thousands of developers without constant friction. It's a great example of using smart algorithms to make a developer-facing feature more powerful and ergonomic.
Programming Takeaways from Git's Codebase
Beyond the computer science concepts, studying Git's source code has made me a better programmer in a few key ways:
- Prioritize data integrity – Git's design goes to great lengths to protect the integrity of the repository and prevent data loss or corruption, with careful use of checksums, locks, and atomic operations. Treating data integrity as a first-class concern has made my own programs more stable and resilient.
- Make performance a priority – Git is relentlessly performance-oriented, with extensive caching, compression, and optimization throughout the codebase. Its speed is a key reason for its popularity. Keeping performance in mind, even when choosing basic data structures, can have an outsized impact on the user experience.
- Don't be afraid to use older languages – Git is an example of how older, low-level languages like C can still be a great fit for certain problem domains. While higher-level languages offer convenience, C's simplicity and bare-metal control are sometimes the right tools for the job. Being comfortable across the spectrum of abstraction is valuable.
- Lean into the Unix philosophy – Git exemplifies the Unix approach of composing many small, focused programs that communicate through plain text. While not appropriate for every domain, adopting this mindset can lead to more flexible, maintainable system design.
- Borrow good design patterns – Diving into Git's source code has equipped me with new techniques like in-memory caching of file system operations, least-recently-used cache eviction, and using lock files to coordinate cross-process communication. I find myself reaching for these patterns frequently now.
Conclusion
I hope this deep dive into Git's codebase has convinced you of the value of reading source code for fun and for skill development. It's been one of the highest-leverage activities for my own growth as a full-stack developer.
If you're inspired to start reading more code, here are a few suggestions for other codebases to learn from:
- SQLite – Compact embedded SQL database
- Redis – Fast in-memory key-value store
- Lua – Lightweight, embeddable scripting language
- Nginx – High-performance web server and reverse proxy
Of course, the world of open source is vast and there is no shortage of interesting codebases to explore. Read code that aligns with your interests and don't be afraid to dig into parts you don't fully understand – that's where the best learning often happens.
Happy code reading!