Git is not suitable for managing versions of Excel and Word files
Managing Excel and Word Files Inside a Git Repository — A Practical Exploration
The author explored how Git handles Microsoft Excel (.xlsx) files inside a project repository that primarily contains PowerShell scripts and text files. At the author's request, Microsoft Word (.docx) files are covered in this summary as well, since the same considerations apply to them.
How Git Stores Excel and Word Files
Git treats .xlsx and .docx files as binary blobs, not as structured documents. Unlike text files:
- Git cannot show meaningful line-by-line diffs.
- Git cannot intelligently merge concurrent changes.
- Each modification typically results in a new stored blob object.
- Git compresses objects internally, so storage growth is often smaller than the raw file size might suggest, although formats that are already compressed, such as .xlsx and .docx, benefit less from this.
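To make the "new blob per modification" point concrete, here is a minimal Python sketch of how Git derives a blob's object ID. It assumes a repository using the default SHA-1 object format. The function name and the placeholder byte strings are illustrative only and do not come from the author's repository.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical stand-ins for two saves of the same workbook.
version_1 = b"PK\x03\x04 placeholder bytes for a workbook, first save"
version_2 = b"PK\x03\x04 placeholder bytes for a workbook, second save"

print(git_blob_id(version_1))
print(git_blob_id(version_2))  # a different ID, so Git stores a second blob
```

Because the ID depends on every byte of the file, any saved change to a workbook or document produces a distinct blob.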
Although both formats are technically ZIP archives containing XML files, Git does not analyze or diff their internal structure by default.
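The ZIP-of-XML structure can be inspected with Python's standard `zipfile` module. The sketch below builds a tiny in-memory archive with part names typical of a workbook and lists them. The archive is a hypothetical stand-in, not a valid .xlsx file, and the function names are illustrative.

```python
import io
import zipfile


def list_package_parts(data: bytes) -> list[str]:
    """Return the names of the parts stored inside a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny ZIP with workbook-like part names (not a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("xl/workbook.xml", "<workbook/>")
        package.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return buffer.getvalue()


demo = build_demo_package()
print(demo[:2])  # ZIP archives begin with the bytes "PK"
print(list_package_parts(demo))
```

Git sees only the outer compressed byte stream, which is why a small edit inside one XML part still looks like an opaque binary change.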
Measuring Historical Storage Footprint
To understand how much space Excel files were consuming across repository history, a PowerShell script was created that:
- Scans the entire Git history
- Identifies all .xlsx blob objects
- Counts historical versions
- Calculates total uncompressed storage
- Provides per-file breakdown
[Ravi: This script is named Get-XlsxBlobStorage.ps1 and is available in the above-mentioned public repo.]
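The steps listed above can be approximated in a few lines of Python. This is not the author's Get-XlsxBlobStorage.ps1, only a sketch of a similar approach. It assumes `git` is on the PATH and relies on `git rev-list --objects --all` piped into `git cat-file --batch-check`. The function names and the sample lines are hypothetical.

```python
import subprocess
from collections import defaultdict

# %(objectsize) is the uncompressed size; %(rest) carries the path through.
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def list_objects(repo_dir: str) -> list[str]:
    """List every object reachable from any ref as 'type id size path' lines."""
    rev_list = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    batch = subprocess.run(
        ["git", "cat-file", f"--batch-check={BATCH_FORMAT}"],
        cwd=repo_dir, input=rev_list.stdout,
        capture_output=True, text=True, check=True,
    )
    return batch.stdout.splitlines()


def summarize_blobs(lines, extensions=(".xlsx",)):
    """Map each matching path to (number of blob versions, total bytes)."""
    blob_ids = defaultdict(set)
    total_bytes = defaultdict(int)
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue  # skip commits, trees, and malformed lines
        _, object_id, size, path = parts
        if not path.lower().endswith(tuple(extensions)):
            continue
        if object_id not in blob_ids[path]:
            blob_ids[path].add(object_id)
            total_bytes[path] += int(size)
    return {path: (len(ids), total_bytes[path]) for path, ids in blob_ids.items()}


# Hypothetical sample lines, so the sketch runs without a repository.
sample_lines = [
    "commit 1111111 210 ",
    "blob 2222222 53000 docs/plan.xlsx",
    "blob 3333333 1200 scripts/run.ps1",
]
print(summarize_blobs(sample_lines))
```

Against a real repository, `summarize_blobs(list_objects("."), (".xlsx", ".docx"))` would cover both formats. Note that `git rev-list --objects` reports each object once, so a blob shared by several paths is attributed to only one of them.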
When executed in the repository, the results showed:
- Only one historical version
- Total storage: ~53 KB
This confirmed that the spreadsheet’s footprint was negligible.
The same measurement approach would apply equally to .docx files.
Removing All Historical Versions in the Future
The discussion examined whether it is possible to completely remove an Excel or Word file from Git history at a later date.
It is possible with tools such as git filter-repo, which:
- Rewrites commit history
- Removes all references to the file
- Permanently changes commit hashes
- Requires force-pushing if a remote repository exists
This process does not corrupt other files, but it is invasive and must be handled carefully because it rewrites the repository’s commit graph.
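As a rough illustration, the Python sketch below assembles a `git filter-repo` command that drops every path matching the given extensions. It assumes git filter-repo is installed separately and uses its `--invert-paths`, `--path-glob`, and `--force` options. The helper names are hypothetical, and the glob is expected to match at any directory depth, which is worth verifying on a throwaway clone first.

```python
import subprocess


def build_filter_repo_command(extensions, force=False):
    """Build a git filter-repo command that removes files by extension."""
    # --invert-paths turns the path selection into "remove these paths".
    command = ["git", "filter-repo", "--invert-paths"]
    for ext in extensions:
        command += ["--path-glob", f"*.{ext.lstrip('.')}"]
    if force:
        # filter-repo normally refuses to run outside a fresh clone.
        command.append("--force")
    return command


def rewrite_history(repo_dir, extensions, force=False):
    """Run the rewrite; this changes commit hashes and cannot be undone in place."""
    subprocess.run(
        build_filter_repo_command(extensions, force), cwd=repo_dir, check=True
    )


# Only print the command here; nothing is rewritten by this sketch as written.
print(" ".join(build_filter_repo_command([".xlsx", ".docx"])))
```

Running the rewrite on a fresh clone and keeping the original as a backup leaves a way back if the result is not what was intended.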
Adopted Strategy Going Forward
After evaluating the complexity of history rewriting, the author chose a simpler architectural boundary:
- Excel and Word files will be removed from the repository.
- .xlsx and .docx extensions will be added to .gitignore.
- These documents will be maintained outside Git.
- Regular system backups will handle version retention.
- Older document versions will be discarded according to normal backup lifecycle policies (for example, after several years).
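The ignore step from the list above could be automated along these lines. This is a minimal sketch that assumes the patterns `*.xlsx` and `*.docx` and a `.gitignore` at the repository root. The function name is hypothetical, and the demo runs against a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path


def ensure_ignored(repo_dir, patterns=("*.xlsx", "*.docx")):
    """Append any missing patterns to .gitignore and return the ones added."""
    gitignore = Path(repo_dir) / ".gitignore"
    existing = []
    if gitignore.exists():
        existing = gitignore.read_text(encoding="utf-8").splitlines()
    missing = [pattern for pattern in patterns if pattern not in existing]
    if missing:
        # Rewrite the file with the original lines followed by the new ones.
        gitignore.write_text("\n".join(existing + missing) + "\n", encoding="utf-8")
    return missing


with tempfile.TemporaryDirectory() as demo_repo:
    print(ensure_ignored(demo_repo))  # patterns added on the first call
    print(ensure_ignored(demo_repo))  # nothing left to add on the second call
    print((Path(demo_repo) / ".gitignore").read_text(encoding="utf-8"))
```

Ignore rules only affect untracked files. A document that is already tracked also has to be removed from the index, for example with `git rm --cached <path>` followed by a commit, before the new rules take effect for it.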
Given a single-user workflow and relatively small document sizes, this approach:
- Avoids future history rewriting
- Keeps the repository focused on source code and text artifacts
- Reduces operational complexity
- Delegates archival lifecycle management to external backups
Key Takeaways
- Git stores .xlsx and .docx files as binary blobs.
- Storage growth for small Office documents is usually trivial.
- Full historical removal is possible but involves rewriting repository history.
- Binary working documents are often better managed outside Git.
- Git is version control — not a long-term archival management system.
This exploration provided a practical understanding of Git’s object model and reinforced the importance of aligning tooling choices with the nature of the artifacts being managed.