Most developers interact with Git through a small set of memorized commands. Add, commit, push, pull. The workflow becomes muscle memory, and as long as nothing unexpected happens, that muscle memory suffices. But when something does go wrong, when a merge conflicts in unexpected ways or a rebase seems to have swallowed work that was definitely committed, Git suddenly feels less like a tool and more like an obstacle. The commands that worked reliably now produce outputs that make little sense. The documentation assumes knowledge that was never acquired. The feeling of being lost inside a system used every day is uniquely frustrating.
The alternative to memorizing commands is understanding what Git actually does when those commands run. This is not about learning theory for its own sake or becoming a contributor to the Git project itself. It is about developing enough of a mental model of the internal structures that Git's behavior becomes predictable rather than mysterious. If you are completely new to version control, starting with an introduction to Git will provide helpful context before diving into internals. When a command produces unexpected results, the surprise comes from a gap between what was assumed and what actually occurred. Closing that gap does not require knowing every implementation detail. It requires understanding the fundamental objects Git uses to represent history and the references it uses to navigate through that history. Once those pieces click into place, many common problems stop feeling like emergencies and start feeling like solvable puzzles.
What Git Actually Stores and Why That Matters
The most common misconception about Git is that it stores changes between versions of files. This assumption comes naturally because most version control interfaces present history as a series of differences. Each commit appears as lines added and lines removed. But that is a view computed on demand, not what Git records. Conceptually, every commit is a complete snapshot of the entire project: when a new commit is created, Git records the full state of every tracked file at that moment. (Git does delta-compress objects when it packs them, but that is a storage optimization beneath the model, not the model itself.) Storing full snapshots sounds inefficient, and it would be if Git did not have mechanisms to avoid duplicating identical content across commits.
The efficiency comes from how Git organizes these snapshots internally. Rather than storing a complete copy of the project for every commit, Git breaks the snapshot into discrete objects that can be reused when content does not change. If a file is identical across multiple commits, Git stores that file's content exactly once and points to that single stored copy from each commit that contains it. This explains why committing large repositories remains fast even after hundreds or thousands of commits. Git does not duplicate the entire project each time. It only creates new objects for content that actually changed and references existing objects for everything that stayed the same. For a deeper understanding of the fundamental concepts that make this possible, the guide to Git core concepts provides a helpful foundation. Understanding this snapshot model rather than the diff model changes how operations like merge and rebase are understood. A merge is not combining two sets of changes. It is combining two snapshots and their histories into a new snapshot that incorporates both.
The Three Objects That Form Git's Foundation
Git builds a project's history from three fundamental object types. (A fourth type, the annotated tag, also exists, but it is not needed to follow the model described here.) These are not abstract concepts described in documentation but actual data stored in the object database inside the hidden .git directory of every repository, either as individual files or bundled into packfiles. The first and simplest object type is the blob. A blob represents the content of a single file and nothing else. It does not store the filename. It does not store permissions or timestamps or any other metadata. It stores only the raw bytes that make up the file content. This separation of content from identity has important consequences. If two files have identical content, even if they have different names or live in different directories, Git stores that content as a single blob and both files point to it. Renaming a file does not create a new blob because the content did not change. Copying a file to another location adds no new blob either, because Git recognizes that the content already exists.
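The content-only identity of a blob can be sketched in a few lines of Python. This is a minimal illustration, assuming a repository that uses the default SHA-1 object format; the function name blob_id and the sample byte strings are made up for the example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Two "files" with different names but identical bytes get the same ID,
# because the filename is never part of what is hashed.
readme = b"hello world\n"
copy_of_readme = b"hello world\n"
changed = b"hello world!\n"

print(blob_id(readme) == blob_id(copy_of_readme))  # True
print(blob_id(readme) == blob_id(changed))         # False
```

In a real repository, running git hash-object on a file should report the same ID that this function computes for the file's bytes.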
The second object type is the tree. A tree represents a directory and solves the problem that blobs intentionally ignore. Trees map filenames to specific blobs for files and to other trees for subdirectories. When Git checks out a commit and reconstructs the working directory, it starts at the root tree and walks down through the hierarchy, creating directories and populating them with files whose content comes from the blobs referenced along the way. A tree object contains entries that pair a name with an object reference and also store basic metadata like whether the entry represents a file or another directory. The tree structure is what allows Git to recreate the exact directory layout that existed when the commit was made.
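As a rough sketch of how a tree pairs names with object references, the Python below serializes directory entries and hashes them the way a SHA-1 repository does. The helper names and file names are hypothetical, and the sort is simplified: real Git orders entries as if directory names ended with a slash, which this example ignores.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash an object as Git does: a '<kind> <size>' header, a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(entries) -> str:
    """entries: iterable of (mode, name, hex_object_id) tuples for one directory."""
    body = b""
    # Simplification: plain sort by name (real Git treats directory names specially).
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        # Each entry is "<mode> <name>", a NUL byte, then the raw 20-byte object ID.
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)
    return object_id("tree", body)


blob = object_id("blob", b"hello world\n")

# The same blob under two different names yields two different trees.
original = tree_id([("100644", "notes.txt", blob)])
renamed = tree_id([("100644", "journal.txt", blob)])

# A root tree can reference a file (mode 100644) and a subdirectory (mode 40000).
root = tree_id([("100644", "README", blob), ("40000", "docs", original)])

print(original != renamed)  # True: the name lives in the tree, not in the blob
```

The rename changes the tree's ID while the blob ID stays the same, which is the division of labor between the two object types.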
The third object type is the commit itself. A commit object ties everything together into a coherent point in history. It contains a reference to exactly one root tree that represents the top level directory of the project at that moment. It contains metadata about who created the commit and when. It contains a message describing why the commit exists. And crucially, it contains references to parent commits. For a normal commit, there is one parent. For a merge commit, there are two or more parents. The initial commit in a repository has zero parents. These parent references form the chain of history that Git traverses when showing logs or determining what changed between two points in time. Commits are immutable once created. Their content never changes. The hash that identifies a commit is computed from all of its contents, so any modification would produce a different hash and therefore a different commit. This object model is explored further in the Git internals overview.
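To see why any change to a commit produces a different commit, consider this simplified Python sketch of how a commit's ID is derived. It assumes a SHA-1 repository, uses the same identity for author and committer, fixes the timezone to UTC, and the tree and parent values are placeholder strings rather than real object IDs.

```python
import hashlib


def commit_id(tree, parents, author, timestamp, message) -> str:
    """Compute an ID from everything the commit contains (simplified layout)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]  # zero, one, or several
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


who = "Dev <dev@example.com>"  # hypothetical identity

root = commit_id("tree-a", [], who, 1700000000, "initial commit")
child = commit_id("tree-b", [root], who, 1700000100, "second commit")

# Same tree, same message, same time, but a different parent: a different commit.
other_parent = commit_id("tree-b", ["some-other-parent"], who, 1700000100, "second commit")

print(child != other_parent)  # True
```

Because the parent ID is part of the hashed content, a commit's ID indirectly depends on its entire ancestry.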
Walking Through What Happens During a Commit
Understanding the objects individually is helpful, but seeing how they work together during a commit makes the model concrete. When a commit command runs, Git does not simply capture a diff and append it somewhere. The process begins with the staging area, which holds a representation of what the next commit should contain. Git examines the staging area and identifies which files have content that does not already exist in the object database. For any new or modified content, Git creates new blob objects. Content that matches existing blobs is not duplicated. The references will simply point to the blobs that already exist.
Next, Git builds new tree objects that reflect the directory structure implied by the staged files. If a file changed, the tree that contains it must be updated to reference the new blob. If that tree is referenced by a parent tree, that parent tree must also be updated, and so on up to the root tree. Trees that reference only unchanged blobs and unchanged subtrees can themselves be reused. This cascading update ensures that the new commit accurately reflects the state of the project while minimizing the creation of new objects. The basic Git workflow guide explains how these concepts translate into daily development practice.
Finally, Git creates a new commit object. It sets the root tree reference to the newly created or updated root tree. It records the current timestamp and the configured author information. It sets the parent commit reference to whatever commit was checked out when the commit command ran. It stores the commit message. Then it writes this commit object to the object database and updates the current branch reference to point to this new commit. The entire process happens locally within the repository. No network communication occurs. This is why committing is fast regardless of repository size and why the same content committed at different times produces different commit hashes. The content might match an earlier commit, but the timestamp and parent reference differ, so the hash differs.
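The whole sequence can be modeled with a small in-memory object store. The sketch below is an illustration under simplifying assumptions, not Git's implementation: the snapshot is passed in as a nested dictionary instead of being read from a staging area, every file gets mode 100644, and the author identity and file contents are invented.

```python
import hashlib

store = {}  # object ID -> stored bytes; stands in for the object database


def put(kind: str, body: bytes) -> str:
    data = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data  # writing identical content again changes nothing
    return oid


def write_tree(directory: dict) -> str:
    body = b""
    for name in sorted(directory):
        value = directory[name]
        if isinstance(value, dict):  # subdirectory: recurse into a subtree
            mode, oid = "40000", write_tree(value)
        else:                        # file: store its bytes as a blob
            mode, oid = "100644", put("blob", value)
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(oid)
    return put("tree", body)


def commit(snapshot: dict, parent, message: str, timestamp: int) -> str:
    lines = [f"tree {write_tree(snapshot)}"]
    if parent:
        lines.append(f"parent {parent}")
    lines.append(f"author Dev <dev@example.com> {timestamp} +0000")
    lines.append(f"committer Dev <dev@example.com> {timestamp} +0000")
    return put("commit", ("\n".join(lines) + f"\n\n{message}\n").encode("utf-8"))


first = commit(
    {"README": b"v1\n", "docs": {"guide.txt": b"guide\n"}, "src": {"app.py": b"print(1)\n"}},
    None, "initial", 1700000000,
)
count_after_first = len(store)  # 3 blobs + 3 trees + 1 commit

second = commit(
    {"README": b"v1\n", "docs": {"guide.txt": b"guide\n"}, "src": {"app.py": b"print(2)\n"}},
    first, "change app", 1700000100,
)
new_objects = len(store) - count_after_first

print(count_after_first, new_objects)  # 7 4
```

Changing one file adds four objects: the new blob, the new src tree, the new root tree, and the commit. The README blob, the guide blob, and the docs tree are reused untouched.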
The Staging Area as Intentional Control
The staging area sometimes confuses developers who come from version control systems that commit all changes automatically. Its purpose becomes clearer when understood through the object model. The staging area is a separate structure that holds a proposed tree state. It allows selective inclusion of changes in the next commit independent of what exists in the working directory.
A common scenario illustrates the value. A developer is working on a feature and discovers an unrelated issue that needs fixing. The working directory now contains changes for both the feature and the fix. Committing everything together would create a messy history that combines unrelated work. The staging area provides a solution. The developer can stage only the files containing the fix, commit that separately with a clear message, and then continue working on the feature. Internally, Git simply ignores unstaged changes when building the commit from the staging area. The working directory remains dirty, but the commit contains exactly what was staged. Without the staging area, commits would always reflect the entire working directory, making clean history and logical separation of changes far more difficult to achieve. For scenarios where changes need to be temporarily set aside, the guide on stashing changes provides additional workflow flexibility.
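A toy model makes the separation concrete. In this Python sketch the working directory, the staging area, and the last commit are plain dictionaries mapping paths to content; the file names and contents are hypothetical, and real Git can also stage individual hunks within a file, which the sketch does not attempt.

```python
# Last committed snapshot and the current, dirty working directory.
head_snapshot = {"feature.py": "original", "utils.py": "buggy"}
working_dir = {"feature.py": "half-finished feature", "utils.py": "bug fixed"}

# The staging area starts out matching the last commit.
index = dict(head_snapshot)


def stage(path: str) -> None:
    """Copy one path's current content from the working directory into the index."""
    index[path] = working_dir[path]


def commit_from_index() -> dict:
    """A commit snapshots the index, not the working directory."""
    return dict(index)


stage("utils.py")                 # stage only the fix
new_snapshot = commit_from_index()

# Paths whose working-directory content still differs from what was committed.
unstaged = sorted(path for path in working_dir if working_dir[path] != index[path])

print(new_snapshot)  # {'feature.py': 'original', 'utils.py': 'bug fixed'}
print(unstaged)      # ['feature.py']
```

The commit contains the fix and the old version of the feature file, while the half-finished feature work remains only in the working directory.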
Branches Are Pointers, Not Copies
The word branch suggests divergence and duplication, which creates confusion about what branches actually are in Git. A branch is nothing more than a named reference to a specific commit. It is a pointer stored in a file that contains a commit hash. When a new branch is created, Git does not copy any files or duplicate any history. It simply creates a new pointer file that contains the same commit hash as the branch that was checked out when the branch was created. The entire operation completes almost instantly regardless of repository size because no data is copied. Nothing about the commit history changes. A new name now points to an existing commit. The Git branching fundamentals guide explores this topic in greater depth.
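In code terms, a branch behaves like one entry in a dictionary of names to commit IDs. The following Python sketch uses short made-up IDs such as c3 in place of real hashes, and the function names are illustrative only.

```python
# Reference name -> commit ID. The IDs here are invented placeholders.
refs = {"refs/heads/main": "c3"}


def create_branch(name: str, start: str) -> None:
    """Creating a branch copies one commit ID, never any files or history."""
    refs[f"refs/heads/{name}"] = refs[start]


def commit_on(branch: str, new_commit: str) -> None:
    """A new commit advances only the branch it was made on."""
    refs[branch] = new_commit


create_branch("feature", "refs/heads/main")
same_commit = refs["refs/heads/feature"] == refs["refs/heads/main"]

commit_on("refs/heads/feature", "c4")

print(same_commit)  # True
print(refs)         # {'refs/heads/main': 'c3', 'refs/heads/feature': 'c4'}
```

Creating the branch cost one dictionary entry, and committing on it left main pointing exactly where it was.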
This pointer model explains why switching branches is fast and why deleting a branch does not immediately delete the commits it pointed to. The commits remain in the object database as long as some other reference still points to them or as long as the reflog retains a record of the branch having pointed to them. Git eventually runs garbage collection to remove objects that have become completely unreachable, but this is a deliberate background process, not an immediate consequence of branch deletion. The pointer model also clarifies what it means for two branches to have diverged. Divergence simply means the two branch references point to different commits that share some common ancestor. The commits themselves are just objects in the database. The branches are just names that make those commits easy to find and use. When branches need to be combined, the guide on merging branches explains how Git integrates changes from different lines of development.
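Reachability can be illustrated with a small graph walk. This Python sketch is a simplified model, not Git's garbage collector: commits are single letters, the parent map and the reflog list are invented, and the walk follows parent links from every reference that is still alive.

```python
# Commit -> list of parents. B is the common ancestor where C and D diverge.
commits = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
refs = {"main": "C", "feature": "D"}
reflog_tips = ["D"]  # the reflog remembers that a reference once pointed at D


def reachable(tips) -> set:
    """Collect every commit found by following parent links from the given tips."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit])
    return seen


common_ancestors = reachable(["C"]) & reachable(["D"])

del refs["feature"]  # deleting the branch removes a name, not a commit

unreachable_from_refs = set(commits) - reachable(refs.values())
collectable = set(commits) - reachable(list(refs.values()) + reflog_tips)

print(sorted(common_ancestors))       # ['A', 'B']
print(sorted(unreachable_from_refs))  # ['D']
print(sorted(collectable))            # []
```

After the branch is deleted, D is no longer reachable from any branch, yet nothing is collectable while the reflog still records it.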
Rebasing Creates New History
Rebasing is often described as rewriting history. That description is accurate, but it only makes full sense in light of the object model. A rebase operation takes a series of commits and creates new commits that apply similar changes but have different parent relationships. Even when the file content changes are identical to the original commits, the new commits have different hashes because the parent reference is part of what makes a commit unique. The rebasing in Git guide provides a comprehensive walkthrough of this operation.
This explains the most common source of confusion and conflict when collaborating with others. If a developer rebases a branch that has already been shared with teammates, those teammates have local copies of the original commits referenced by their own branch pointers. After the rebase, the shared branch reference points to the new commits, but teammates still have references to the old commits. From the perspective of their local repositories, history has diverged in incompatible ways. The commits they have are not ancestors of the commits now on the shared branch. Resolving this situation requires deliberate intervention on both sides. The developer who rebased must force push to replace the shared branch, and each teammate must reset their local branch to the new commits or rebase their own work onto them. The object model makes clear why this happens. The rebased commits are entirely new objects with new hashes. The old objects still exist in each local repository until garbage collection removes them, but they are no longer reachable from the updated branch reference.
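The effect of replaying commits onto a new base can be sketched as follows. This Python model is deliberately simplified: the commit body omits author and committer lines, the tree names and base names are placeholders, and the trees are kept identical across both replays even though a real rebase usually produces different trees as well.

```python
import hashlib


def commit_id(tree: str, parent: str, message: str) -> str:
    # Simplified commit body: real commits also carry author and committer lines.
    body = f"tree {tree}\nparent {parent}\n\n{message}\n".encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def replay(changes, base: str) -> list:
    """Recreate each (tree, message) pair on top of base, chaining the parents."""
    new_ids, parent = [], base
    for tree, message in changes:
        parent = commit_id(tree, parent, message)
        new_ids.append(parent)
    return new_ids


work = [("tree-1", "add parser"), ("tree-2", "add tests")]

original = replay(work, "old-base")  # what teammates already fetched
rebased = replay(work, "new-base")   # what the branch points to after the rebase

print(set(original) & set(rebased))  # set()
```

Not a single ID survives the replay. A teammate holding the original commits has nothing in common with the rebased branch above the base, which is why Git sees the two histories as diverged.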
Understanding HEAD and Detached States
HEAD is a special reference that tells Git what is currently checked out in the working directory. In normal operation, HEAD points to a branch reference, and that branch reference points to a commit. This indirection allows Git to know both what commit is active and what branch will advance when a new commit is created. When HEAD points directly to a commit rather than to a branch, the repository is said to be in a detached HEAD state. This sounds alarming but represents a normal and useful mode of operation.
A practical example involves investigating an issue reported in production. The developer checks out the commit that was deployed at the time the issue occurred. HEAD now points directly to that commit. The working directory reflects the exact state of the code when the issue happened. The developer can examine the code, run tests, and even make experimental changes. If those experimental changes should be preserved, the developer creates a new branch at the current commit. This attaches a name to the current state and transitions from detached HEAD back to normal operation where HEAD points to the new branch. If the investigation concludes without needing to preserve changes, the developer simply checks out an existing branch. The detached HEAD state resolves. Uncommitted changes are carried over when they do not conflict with the target branch, and Git refuses to switch when they would be overwritten. Any commits made while detached and never given a branch name remain findable only through the reflog. Nothing was broken. Nothing needs repair. The repository was simply in a state where HEAD pointed directly to a commit instead of indirectly through a branch.
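The difference between the two states comes down to what HEAD contains. In the Python sketch below, HEAD is a string that either names a branch using the "ref: " prefix, which is how the HEAD file is written in a repository, or holds a commit ID directly. The commit IDs and function names are invented for illustration.

```python
refs = {"refs/heads/main": "c3"}
head = "ref: refs/heads/main"  # attached: HEAD names a branch


def is_detached(head: str) -> bool:
    return not head.startswith("ref: ")


def resolve_head(head: str, refs: dict) -> str:
    """Return the commit that is currently checked out."""
    return head if is_detached(head) else refs[head[len("ref: "):]]


def record_commit(head: str, refs: dict, new_commit: str) -> str:
    """Advance the current branch if attached; otherwise only HEAD moves."""
    if is_detached(head):
        return new_commit
    refs[head[len("ref: "):]] = new_commit
    return head


head = record_commit(head, refs, "c4")  # attached: main advances to c4

head = "c1"                             # check out an old commit directly
head = record_commit(head, refs, "x1")  # detached: no branch moves

# Preserve the experiment by creating a branch and attaching HEAD to it.
refs["refs/heads/experiment"] = resolve_head(head, refs)
head = "ref: refs/heads/experiment"

print(refs)  # {'refs/heads/main': 'c4', 'refs/heads/experiment': 'x1'}
```

The experimental commit x1 only became part of a named line of history once a branch was created to point at it.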
Why Git Rarely Loses Data Permanently
Commands that appear destructive in Git typically operate by moving references rather than deleting objects. When a commit seems to disappear after a reset operation, what actually happened is that a branch reference was moved to point to a different commit. The original commit still exists in the object database. It has simply become unreachable from any named reference. The reflog exists precisely because Git records where references used to point before they moved. This record of reference movement provides a safety net that allows recovery from many operations that initially appear to have destroyed work. The guide on undoing changes in Git covers various recovery scenarios in detail.
The reflog is local to each repository and keeps entries for a limited time. By default, entries are kept for 90 days, or 30 days when the commit they record is no longer reachable from the current tip of the reference. After that time expires, entries are pruned. If no other references point to a commit and the reflog entry has expired, the commit becomes eligible for garbage collection. Even then, garbage collection is not instantaneous. The objects remain on disk until the collection process runs and removes them. This conservative approach to data deletion reflects a design philosophy that prioritizes safety over aggressive cleanup. Understanding this philosophy changes how mistakes feel. A wrong command does not immediately destroy work. It moves a reference to a place that was not intended. Finding the right reference and moving it back often restores the expected state. For understanding how Git synchronizes with remote repositories during recovery, the guide on fetching and pulling changes clarifies the distinction between downloading and integrating remote work.
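Recovery through the reflog amounts to looking up where a reference used to point and moving it back. This Python sketch models that idea with a list of recorded moves; the commit letters and reason strings are hypothetical, and expiry is left out entirely.

```python
branch = {"main": "C"}
reflog = []  # one (old_commit, new_commit, reason) entry per reference move


def move(ref: str, new_commit: str, reason: str) -> None:
    """Move a reference and record where it pointed before."""
    reflog.append((branch[ref], new_commit, reason))
    branch[ref] = new_commit


# An overly aggressive reset: main jumps back from C to A.
move("main", "A", "reset: moving to A")
after_mistake = branch["main"]

# The commit that seemed lost is recorded as the old value of the last move.
previous_commit = reflog[-1][0]
move("main", previous_commit, "reset: moving to C")

print(after_mistake)   # A
print(branch["main"])  # C
```

No commit was created or destroyed at any point. Both the mistake and the recovery were reference moves, and the reflog recorded each one.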
Predictability Through Understanding
The practical value of understanding Git's internal model is not academic. It manifests in the ability to reason about what a command will do before running it and in the confidence to recover when something goes wrong. When a merge produces unexpected conflicts, the object model explains that two trees are being combined and that files with different content at the same paths require resolution. When a rebase seems to have lost commits, the object model explains that new commits were created and the old ones still exist but are no longer referenced by the current branch. When the working directory state feels confusing, the distinction between the working directory, the staging area, and the commit history provides a framework for understanding what changed where.
Developers who think in terms of objects and references stop seeing Git as a collection of magical incantations and start seeing it as a predictable system with consistent rules. Commands that previously felt risky become tools whose effects can be anticipated. Mistakes that previously induced panic become situations that can be diagnosed and resolved. The goal is not to memorize more commands or to understand every implementation detail of the Git source code. The goal is to develop a mental model accurate enough that Git's behavior feels like a consequence of that model rather than an arbitrary set of responses to memorized inputs. When that shift happens, version control becomes a reliable partner in development rather than a source of occasional mystery and frustration. For those ready to explore more advanced operations, the guide on cherry picking commits demonstrates how selective application of specific changes works within this object model.
