Git Internals: How Git Stores Data Under the Hood

Git uses blobs, trees, and commits to store snapshots efficiently in its internal structure.

Git Internals Overview

Most developers use Git every day with commands like add, commit, and push without ever wondering what happens behind the scenes. But beneath these simple commands lies an elegant and powerful internal system that makes Git fast, reliable, and remarkably efficient. Understanding Git internals transforms you from someone who merely uses Git into someone who truly understands it—giving you the ability to debug complex issues, recover seemingly lost work, and use advanced features with confidence.

At its heart, Git is fundamentally a content-addressable file system with a version control interface built on top. Instead of tracking files by their names or paths, Git tracks content using unique identifiers generated from the data itself. Every file, every directory, and every commit is stored as an object inside the hidden .git directory, forming a structured database that represents the complete history of your project. Once you grasp this concept, Git's behavior stops feeling like magic and starts making perfect sense.

How Git Stores Data

One of the most important differences between Git and older version control systems is how they store data. Traditional systems like SVN store changes—they record the differences between one version and the next. Git takes a completely different approach: it stores snapshots of the entire project at each commit.

Each snapshot represents the complete state of your project at that moment. If a file hasn't changed between commits, Git does not store it again. Instead, the new snapshot's tree points to the same blob object the previous snapshot used. Only files that actually changed produce new stored content. This combination of full snapshots and deduplication makes Git both fast at retrieving historical versions and surprisingly storage-efficient.

The snapshot model:
Commit 1:  [File A v1] [File B v1]
Commit 2:  [File A v1] [File B v2]   ← Only File B changed, File A referenced
Commit 3:  [File A v2] [File B v2]   ← Only File A changed, File B referenced

This snapshot-based model is why Git performs so well even in large projects with years of history. To understand how these snapshots are created in practice, review the basic Git workflow.
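The reuse of unchanged content can be observed directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file names, contents, and identity settings are illustrative only.

```shell
# Throwaway repository (all names here are illustrative)
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "A v1" > a.txt
echo "B v1" > b.txt
git add a.txt b.txt
git commit -q -m "Commit 1"

echo "B v2" > b.txt              # only b.txt changes
git add b.txt
git commit -q -m "Commit 2"

# Blob IDs recorded for each file in the two snapshots
a_first=$(git rev-parse HEAD~1:a.txt)
a_second=$(git rev-parse HEAD:a.txt)
b_first=$(git rev-parse HEAD~1:b.txt)
b_second=$(git rev-parse HEAD:b.txt)

echo "a.txt: $a_first -> $a_second"   # same ID: the blob is reused
echo "b.txt: $b_first -> $b_second"   # different ID: a new blob was stored
```

Both commits describe the whole project, yet the unchanged file costs nothing extra because both trees refer to one blob.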

Git Object Types

Everything in Git's database is stored as one of four object types. These objects form the building blocks of your repository, linking together to create the complete history of your project.

  • Blob (Binary Large Object): Stores the actual content of a file. A blob contains only the file data—no file name, no permissions, no metadata. This separation allows Git to deduplicate identical file content across different locations and names.
  • Tree: Represents a directory. A tree object contains references to blobs (files) and other trees (subdirectories), along with their names and file permissions. Trees give structure to the blobs.
  • Commit: Represents a snapshot of the project. A commit object points to a tree (the root directory of your project at that moment) and contains metadata including the author, committer, timestamp, commit message, and references to parent commits.
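  • Tag: An annotated tag object points to another object, usually a commit, and is often used to mark release points. It stores metadata like the tagger name, email, date, and a tagging message. Lightweight tags are not objects at all; they are plain references stored under refs/tags.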

These four object types work together to form a complete, verifiable history graph. Understanding these building blocks becomes essential when working with advanced topics like rebasing or cherry-picking.
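To see how the four types link together, the sketch below creates one commit and one annotated tag in a throwaway repository and then asks Git for the type of each object. The path src/greeting.txt and the tag name v1.0 are made up for the example.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

mkdir src
echo "hello" > src/greeting.txt
git add src/greeting.txt
git commit -q -m "Add greeting"
git tag -a v1.0 -m "First release"      # annotated tag creates a tag object

commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')  # root tree of the commit
blob_id=$(git rev-parse HEAD:src/greeting.txt)
tag_id=$(git rev-parse v1.0)            # resolves to the tag object itself

git cat-file -t "$commit_id"   # commit
git cat-file -t "$tree_id"     # tree
git cat-file -t "$blob_id"     # blob
git cat-file -t "$tag_id"      # tag

git cat-file -p "$commit_id"   # tree ID, author, committer, message
git cat-file -p "$tree_id"     # one entry for the src subdirectory
git cat-file -p "$blob_id"     # hello
```

Note that the file name appears only in the tree output, never in the blob.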

SHA-1 Hashing

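Every object in Git is identified by a SHA-1 hash, a 40-character string of hexadecimal digits. Git generates this hash by running a short header (the object's type and size) followed by the object's content through a cryptographic hash function. The same content always produces the same hash, which is why Git is called content-addressable. Newer Git versions can also create repositories that use SHA-256, but SHA-1 is still the format most repositories use.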

Example SHA-1 hash:
e4d909c290d0fb1ca068ffaddf22cbd0b3c6c9e5

This hashing mechanism serves two critical purposes. First, it ensures data integrity—if any part of a file or commit changes, even by a single character, its hash changes completely, making tampering immediately detectable. Second, it enables deduplication—identical content across different commits or even different files shares the same hash, so Git only stores it once. This combination of integrity verification and efficient storage is why Git history is considered tamper-evident and highly reliable.
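A blob's ID can be reproduced without Git at all, which makes the content-addressing idea concrete. This sketch assumes a SHA-1 repository and the sha1sum utility from GNU coreutils (on macOS, shasum is the usual substitute). Git hashes the header "blob <size>" plus a NUL byte, followed by the content; the string "Hello Git" with its trailing newline is 10 bytes.

```shell
# Work inside a fresh throwaway repository so the hash format is the default one
repo=$(mktemp -d) && cd "$repo" && git init -q

# Hash header + content by hand: "blob 10", a NUL byte, then the 10 content bytes
manual=$( { printf 'blob 10\0'; printf 'Hello Git\n'; } | sha1sum | cut -d ' ' -f 1 )

# Ask Git for the ID of the same content (no -w, so nothing is stored)
from_git=$(printf 'Hello Git\n' | git hash-object --stdin)

echo "manual:   $manual"
echo "from git: $from_git"
```

The two printed values should be identical, since both are the hash of the same bytes.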

The .git Directory Structure

The .git folder is the complete repository. It contains all the data, metadata, configuration, and history required for version control. If you delete everything except the .git folder, you can restore every committed file with full history; only uncommitted changes are lost. This folder is what you are really working with when you use Git.

Important directories inside .git:
.git/
├── objects/     # All Git objects (blobs, trees, commits, tags) stored here
│   ├── ab/      # Objects stored in folders named by first two hash characters
│   └── ...
├── refs/        # References to branches and tags (pointers to commits)
│   ├── heads/   # Local branches
│   └── tags/    # Tags
├── HEAD         # A file pointing to the current branch reference
├── config       # Repository-specific configuration settings
├── index        # The staging area (also called the cache)
└── logs/        # The reflog—a history of where references have pointed

Understanding this structure is invaluable when troubleshooting issues or recovering lost commits. The reflog, in particular, is your safety net for undoing mistakes, as covered in undoing changes in Git.
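The layout can be explored with ordinary shell tools. The sketch below assumes the default "files" reference storage and a freshly created repository whose objects are still loose (not yet packed); the branch name main and the file name are illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch explicitly
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "Initial commit"

cat .git/HEAD                # ref: refs/heads/main
cat .git/refs/heads/main     # the commit ID the branch points to

# A loose object lives at objects/<first two hex chars>/<remaining chars>
blob_id=$(git rev-parse HEAD:file.txt)
dir=$(printf '%s' "$blob_id" | cut -c 1-2)
rest=$(printf '%s' "$blob_id" | cut -c 3-)
ls -l ".git/objects/$dir/$rest"

cat .git/logs/HEAD           # reflog entries recorded for HEAD
```

The object file itself is zlib-compressed, so it is not readable with cat; git cat-file decodes it.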

The Staging Area (Index)

One of Git's unique features is the staging area, also known as the index. It acts as an intermediate buffer between your working directory (where you edit files) and the repository (where history is permanently stored). This three-stage architecture gives you precise control over what gets committed.

When you run git add, Git writes the current content of each file to the object database as a blob and records that blob in the staging area. When you run git commit, Git creates a snapshot of whatever is in the staging area at that moment, not necessarily what is in your working directory.

Three-stage workflow:
Working Directory → Staging Area (index) → Repository (objects)
     (edit files)      (git add)              (git commit)

This design lets you craft commits carefully. You can work on multiple changes simultaneously but commit them as separate, logical units—staging only the files related to each commit. This practice of creating clean, focused commits is a hallmark of professional Git usage, as discussed in Git best practices.
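The difference between what is staged and what is on disk can be shown in a few commands. This is a sketch in a throwaway repository with an illustrative file name: the file is staged at one version, edited again, and then committed.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "v2" > notes.txt
git add notes.txt              # the index now records the v2 blob
echo "v3" > notes.txt          # further edit, not staged

git ls-files --stage           # mode, blob ID, stage number, path
git cat-file -p :notes.txt     # v2 (content held by the index)
cat notes.txt                  # v3 (content in the working directory)

git commit -q -m "Update notes"
git show HEAD:notes.txt        # v2: the commit captured the index, not the disk
git status --short             # notes.txt is still reported as modified
```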

References and HEAD

Git uses references, or refs, to point to commits. Branches and tags are both types of references. Instead of forcing you to remember long, unreadable SHA-1 hashes, Git lets you use friendly names like main or feature/payment-integration.

The HEAD pointer is a special reference that indicates where you are currently working. In most cases, HEAD points to a branch reference, which in turn points to a commit. When you switch branches with git checkout or git switch, Git updates HEAD to point to the new branch.

How HEAD and branches relate:
HEAD → main → commit-abc123
When you checkout feature/login:
HEAD → feature/login → commit-def456

Understanding HEAD is essential when working with commands like git reset, git rebase, and git cherry-pick. It is the reference point Git uses to determine where you are and where new commits should attach.
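The chain from HEAD to a branch to a commit can be inspected as follows. This sketch assumes the default "files" reference storage; the branch names are taken from the example above, and the final step shows the less common detached state, where HEAD holds a commit ID directly instead of a branch name.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

head_on_main=$(git symbolic-ref HEAD)
echo "$head_on_main"                   # refs/heads/main

git checkout -q -b feature/login       # create and switch to a new branch
head_on_feature=$(git symbolic-ref HEAD)
echo "$head_on_feature"                # refs/heads/feature/login

# Both branch names still resolve to the same commit
git rev-parse main feature/login

# Detached HEAD: .git/HEAD now contains a raw commit ID
git checkout -q --detach
cat .git/HEAD
```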

Packfiles and Performance Optimization

Over time, storing every object individually would become inefficient. Git addresses this through packfiles—compressed archives that combine multiple objects into single files. Packfiles use delta compression to store only the differences between similar objects, dramatically reducing disk usage.

Packfiles are created during operations like cloning and fetching, and whenever garbage collection runs. Git also triggers garbage collection on its own (git gc --auto) after certain commands once enough loose objects have accumulated, which is why even repositories with years of history and thousands of commits remain fast and compact.

Maintenance commands:
# Clean unnecessary files and optimize repository storage
git gc

# Verify repository integrity (checks objects for corruption)
git fsck

# Manually repack objects into packfiles
git repack -a -d

These internal optimizations are why Git can handle massive repositories like the Linux kernel—with over a million commits—without becoming unusably slow or consuming terabytes of disk space.
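The effect of packing can be measured with git count-objects. The sketch below builds a small throwaway history and compares the number of loose objects before and after git gc; the file name and commit count are arbitrary.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Five commits, each producing a new blob, tree, and commit object
for i in 1 2 3 4 5; do
  echo "line $i" >> log.txt
  git add log.txt
  git commit -q -m "Commit $i"
done

git count-objects -v       # "count" is loose objects, "in-pack" is packed objects
loose_before=$(git count-objects -v | sed -n 's/^count: //p')

git gc -q                  # pack reachable objects and remove the loose copies

git count-objects -v
loose_after=$(git count-objects -v | sed -n 's/^count: //p')
packed_after=$(git count-objects -v | sed -n 's/^in-pack: //p')
echo "loose: $loose_before -> $loose_after, packed: $packed_after"
```

After git gc, the resulting pack and its index can be found under .git/objects/pack.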

Plumbing vs Porcelain Commands

Git commands are traditionally divided into two categories: porcelain and plumbing. The metaphor comes from bathroom fixtures: the porcelain is the polished surface you interact with, while the plumbing is the pipework hidden behind it. This distinction reflects Git's design as a toolset built on low-level primitives with a user-friendly layer on top.

  • Porcelain commands: The user-friendly commands you use daily—git commit, git status, git push, git log. These are designed to be intuitive and hide internal complexity.
  • Plumbing commands: Low-level commands that interact directly with Git's internal data structures—git hash-object, git cat-file, git update-index, git write-tree. These are rarely used directly but power the porcelain commands.

Plumbing commands give you direct access to Git's object database. While you may never need them for daily work, they are invaluable for advanced debugging, scripting custom Git operations, and truly understanding how Git works internally.

Example plumbing commands in action:
# Manually create a blob object from content. The -w flag writes it to the
# object database, and the command prints the new object's 40-character ID.
blob=$(echo "Hello Git" | git hash-object -w --stdin)
echo "$blob"

# View the content of any object by its hash
git cat-file -p "$blob"
# Output: Hello Git

# View the type of an object
git cat-file -t "$blob"
# Output: blob
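Going one step further, an entire commit can be assembled from plumbing alone, which is roughly what git add and git commit do on your behalf. This is a sketch rather than a recommended workflow; the file name hello.txt and the commit message are made up for illustration.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# 1. Store file content as a blob
blob=$(echo "Hello Git" | git hash-object -w --stdin)

# 2. Register the blob in the index under a path and file mode
git update-index --add --cacheinfo 100644,"$blob",hello.txt

# 3. Write the index out as a tree object
tree=$(git write-tree)

# 4. Wrap the tree in a commit object (message is read from stdin)
commit=$(echo "Commit built from plumbing" | git commit-tree "$tree")

# 5. Point the current branch at the new commit
git update-ref HEAD "$commit"

git log --oneline
git show HEAD:hello.txt       # Hello Git
```

The working directory is never touched here, so hello.txt exists in the commit but not on disk until it is checked out.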

How Commits Form a Graph

Git history is not a simple linear list. It is a directed acyclic graph where each commit points to its parent commit (or parents, in the case of merge commits). This graph structure is what makes branching and merging work so naturally.

When you create a branch, Git simply creates a new reference pointing to an existing commit. As you make new commits on that branch, the branch pointer moves forward, but the original branch remains unchanged. This allows multiple lines of development to proceed independently while remaining part of the same history graph.

Visual representation of a commit graph:
main:   A ← B ← C
              \
feature:        D ← E ← F

This graph-based architecture is what enables powerful features like interactive rebasing, cherry-picking across branches, and visualizing your project's history as a network of contributions. To see how this graph structure supports everyday development, explore Git branching fundamentals and merging branches.
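A small graph with a fork and a merge can be built and inspected like this. The sketch uses empty commits labelled with single letters so that only the graph shape matters; the branch names are illustrative.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"
git checkout -q -b feature            # new pointer at B
git commit -q --allow-empty -m "D"
git checkout -q main
git commit -q --allow-empty -m "C"
git merge -q -m "Merge feature" feature

git log --graph --oneline --all       # draws the fork and the merge

# A merge commit records two parents
git rev-list --parents -n 1 HEAD      # merge ID followed by both parent IDs
first_parent=$(git log -1 --format=%s 'HEAD^1')    # C
second_parent=$(git log -1 --format=%s 'HEAD^2')   # D
fork_point=$(git log -1 --format=%s "$(git merge-base 'HEAD^1' 'HEAD^2')")   # B
echo "parents: $first_parent and $second_parent, forked at $fork_point"
```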

Frequently Asked Questions

  1. Do I need to learn Git internals for everyday development?
    Not at all. Most developers work productively for years using only porcelain commands. However, understanding internals becomes incredibly valuable when things go wrong—when you need to recover lost commits, debug unexpected behavior, or understand why Git is behaving a certain way. It also opens the door to advanced workflows that rely on deeper Git knowledge.
  2. Is Git really storing full copies of files with every commit?
    Yes and no. Git stores full snapshots conceptually, but storage-wise it uses clever optimizations. Unchanged files are referenced rather than duplicated, and packfiles compress similar objects together. The result is that Git repositories are often smaller than you might expect, even with extensive history.
  3. Can I inspect Git objects directly?
    Absolutely. Plumbing commands like git cat-file -p <hash> let you view any object's content. You can also explore the .git/objects directory directly, though the objects are compressed and stored in a specific structure that plumbing commands decode for you.
  4. What happens if the .git folder is deleted?
    Deleting the .git folder removes all version control information—the complete history, all branches, tags, and configuration. Your current working files remain, but they become untracked files in a plain directory. Without a backup, the history is permanently lost. This is why pushing to remote repositories is essential for backup.
  5. Why are SHA-1 hashes so important?
    Hashes serve two critical purposes: they uniquely identify every piece of content, and they guarantee integrity. Because the hash is derived from content, any corruption or tampering changes the hash, making it detectable. This is why Git history is considered tamper-evident: you cannot change a commit without changing its hash, and with it the hash of every commit that descends from it.
  6. What is the difference between a soft, mixed, and hard reset?
    These operations manipulate different parts of Git's three-stage architecture. A soft reset moves only the branch pointer, leaving the staging area and working directory unchanged. A mixed reset (the default) moves the branch pointer and resets the staging area, leaving working directory changes. A hard reset moves the branch pointer, resets the staging area, and discards working directory changes. Understanding Git internals makes these distinctions intuitive rather than confusing.
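The three reset modes from the last answer can be compared side by side. This sketch uses a throwaway repository with two commits and an illustrative file name, and records what the index and the working directory contain after each kind of reset back to the first commit.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > file.txt && git add file.txt && git commit -q -m "first"
echo "v2" > file.txt && git add file.txt && git commit -q -m "second"
first=$(git rev-parse HEAD~1)
second=$(git rev-parse HEAD)

# Soft: only the branch pointer moves
git reset -q --soft "$first"
soft_state="index=$(git cat-file -p :file.txt) worktree=$(cat file.txt)"
git reset -q --hard "$second"          # restore the starting point

# Mixed (default): branch pointer and index move
git reset -q --mixed "$first"
mixed_state="index=$(git cat-file -p :file.txt) worktree=$(cat file.txt)"
git reset -q --hard "$second"

# Hard: branch pointer, index, and working directory all move
git reset -q --hard "$first"
hard_state="index=$(git cat-file -p :file.txt) worktree=$(cat file.txt)"

echo "soft:  $soft_state"    # index=v2 worktree=v2
echo "mixed: $mixed_state"   # index=v1 worktree=v2
echo "hard:  $hard_state"    # index=v1 worktree=v1
```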

Conclusion

Git's internal architecture—the snapshot-based storage, the object database, the three-stage workflow, the graph-based history—is what makes Git powerful, fast, and reliable. These design choices, made during Git's creation to support the Linux kernel's massive development scale, scale down just as elegantly to projects of any size. Understanding these fundamentals transforms Git from a set of commands you memorize into a system you truly comprehend.

Armed with this knowledge, you can approach advanced topics with confidence. Explore rebasing to understand how Git rewrites history, cherry-picking to selectively apply commits, and structured Git workflows to see how teams organize development. The more you understand Git internally, the more effectively you can use it—and the more confidently you can recover when things don't go as planned.