Git is not suitable for managing versions of Excel .xlsx and Word .docx files but handles .csv and .md (from .docx) files well
Managing Excel and Word Files Inside a Git Repository — A Practical Exploration
The author explored how Git handles Microsoft Excel (.xlsx) files inside a project repository that primarily contains PowerShell scripts and text files. The author later asked for Microsoft Word (.docx) files to be included in this summary, since the same considerations apply to them.
How Git Stores Excel and Word Files
Git treats .xlsx and .docx files as binary blobs, not as structured documents. Unlike text files:
- Git cannot show meaningful line-by-line diffs.
- Git cannot intelligently merge concurrent changes.
- Each modification typically results in a new stored blob object.
- Git compresses objects internally, but .xlsx and .docx content is already ZIP-compressed, so the additional saving is usually small.
Although both formats are technically ZIP archives containing XML files, Git does not analyze or diff their internal structure by default.
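The ZIP-of-XML structure can be seen with a few lines of Python. The sketch below is illustrative only: it builds a tiny stand-in archive in memory, using part names typical of a .docx file, rather than reading a real document. The helper names `list_archive_members` and `build_stand_in_package` are made up for this example.

```python
import io
import zipfile


def list_archive_members(data):
    """Return the member names of a ZIP-based Office file given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def build_stand_in_package():
    """Build a tiny ZIP with part names typical of a .docx.

    This is a stand-in for demonstration, not a valid Word document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


package = build_stand_in_package()
# Git sees only these opaque bytes; it never looks at the XML parts inside.
print(zipfile.is_zipfile(io.BytesIO(package)))
print(list_archive_members(package))
```

For a real file, the same helper could be called with the bytes of the document, for example `list_archive_members(open("report.docx", "rb").read())`, where `report.docx` is a hypothetical file name.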
Measuring Historical Storage Footprint
To understand how much space Excel files were consuming across repository history, a PowerShell script was created that:
- Scans the entire Git history
- Identifies all .xlsx blob objects
- Counts historical versions
- Calculates total uncompressed storage
- Provides per-file breakdown
[Ravi: This script is named Get-XlsxBlobStorage.ps1 and is available in the above-mentioned public repo.]
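The author's PowerShell script is not reproduced here. As a rough illustration of the same idea, the Python sketch below shows one way such a measurement could be made. It assumes `git` is on the PATH and relies on two standard commands: `git rev-list --objects --all` to enumerate every object reachable in history, and `git cat-file --batch-check` to report each object's type and uncompressed size. The function names are hypothetical and this is not a port of Get-XlsxBlobStorage.ps1.

```python
import subprocess


def parse_batch_check(output, extension=".xlsx"):
    """Parse lines of the form '<type> <sha> <size> <path>'.

    Returns {path: [size_of_each_historical_blob]} for blobs whose
    path ends with the given extension.
    """
    sizes = {}
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # malformed or incomplete line
        obj_type, _sha, size, path = parts
        if obj_type == "blob" and path.lower().endswith(extension):
            sizes.setdefault(path, []).append(int(size))
    return sizes


def summarize(sizes):
    """Return (path, version_count, total_bytes) rows, largest total first."""
    rows = [(path, len(v), sum(v)) for path, v in sizes.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def collect_blob_sizes(repo=".", extension=".xlsx"):
    """Scan all reachable history of `repo` for blobs with `extension`."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    # %(rest) echoes the path that rev-list printed after each object id.
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_batch_check(checked, extension)
```

Calling `summarize(collect_blob_sizes("."))` from a repository root would give a per-file breakdown. Passing `extension=".docx"` applies the same scan to Word files.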
When executed in the repository, the results showed:
- Only one historical version
- Total storage: ~53 KB
This confirmed that the spreadsheet’s footprint was negligible.
The same measurement approach would apply equally to .docx files.
Removing All Historical Versions in the Future
The discussion examined whether it is possible to completely remove an Excel or Word file from Git history at a later date.
Yes, it is possible using tools such as git filter-repo, which:
- Rewrites commit history
- Removes all references to the file
- Permanently changes commit hashes
- Requires force-pushing if a remote repository exists
This process does not corrupt other files, but it is invasive and must be handled carefully because it rewrites the repository’s commit graph.
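For reference, the sketch below assembles the kind of command line such a removal might use. It relies on git filter-repo's `--path-glob` and `--invert-paths` options, which together drop every matching path from history. The helper `build_filter_repo_command` is hypothetical and only builds and prints the command; it does not run it. The author did not perform this step, so treat it as an illustration rather than a tested procedure.

```python
import shlex


def build_filter_repo_command(globs):
    """Return the argv that removes every path matching `globs` from history."""
    if not globs:
        raise ValueError("at least one glob is required")
    # --invert-paths turns the path selection into a removal list.
    argv = ["git", "filter-repo", "--invert-paths"]
    for glob in globs:
        argv += ["--path-glob", glob]
    return argv


command = build_filter_repo_command(["*.xlsx", "*.docx"])
# Print a shell-quoted form that can be reviewed before running it by hand.
print(shlex.join(command))
```

git filter-repo is normally run on a fresh clone, which leaves the original repository untouched as a fallback if the rewrite goes wrong.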
Adopted Strategy Going Forward
After evaluating the complexity of history rewriting, the author chose a simpler architectural boundary:
- Excel and Word files will be removed from the repository.
- .xlsx and .docx extensions will be added to .gitignore.
- These documents will be maintained outside Git.
- Regular system backups will handle version retention.
- Older document versions will be discarded according to normal backup lifecycle policies (for example, after several years).
Given a single-user workflow and relatively small document sizes, this approach:
- Avoids future history rewriting
- Keeps the repository focused on source code and text artifacts
- Reduces operational complexity
- Delegates archival lifecycle management to external backups
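The .gitignore step of this strategy amounts to two pattern lines, `*.xlsx` and `*.docx`. The sketch below adds them only if they are not already present, so it can be rerun safely. The helper name `ensure_ignored` is made up for this example, and the demonstration writes to a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path


def ensure_ignored(gitignore_path, patterns):
    """Append any missing patterns to a .gitignore file.

    Returns the list of patterns that were actually added.
    """
    path = Path(gitignore_path)
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    missing = [p for p in patterns if p not in existing]
    if missing:
        path.write_text("\n".join(existing + missing) + "\n", encoding="utf-8")
    return missing


# Demonstration in a throwaway directory; a second run adds nothing.
with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / ".gitignore"
    first = ensure_ignored(target, ["*.xlsx", "*.docx"])
    second = ensure_ignored(target, ["*.xlsx", "*.docx"])
    content = target.read_text(encoding="utf-8")

print(first, second)
```

Ignore patterns affect only untracked files. A document that is already tracked stays tracked until it is removed from the index, for example with `git rm --cached`.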
Key Takeaways
- Git stores .xlsx and .docx files as binary blobs.
- Storage growth for small Office documents is usually trivial.
- Full historical removal is possible but involves rewriting repository history.
- Binary working documents are often better managed outside Git.
- Git is version control — not a long-term archival management system.
This exploration provided a practical understanding of Git’s object model and reinforced the importance of aligning tooling choices with the nature of the artifacts being managed.
Using Excel CSV Files for Git-Friendly Versioning (ChatGPT Summary)
An Excel workbook (.xlsx) can often be made Git-friendly by saving each worksheet as a separate CSV (.csv) file. Since CSV files are plain text, Git can track them effectively. This allows:
- Line-by-line diffs
- Meaningful version history
- Easier merging of changes
In contrast, Excel (.xlsx) files are treated by Git as binary blobs, so changes cannot be easily inspected or merged.
However, CSV files store only tabular data. They do not preserve Excel features such as formatting, formulas, charts, pivot tables, or multiple sheets within one file.
Therefore, CSV works best when Excel is used mainly for data tables, while Excel workbooks containing richer spreadsheet features are usually better kept outside Git.
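To illustrate why CSV suits line-based tooling, the sketch below compares two versions of a small, made-up table using Python's `difflib`. It produces the same kind of unified diff that Git shows for text files. The file name `stock.csv` and the data are invented for the example.

```python
import difflib

# Two hypothetical versions of one worksheet saved as CSV.
old_csv = "item,quantity\nwidget,10\ngadget,4\n"
new_csv = "item,quantity\nwidget,12\ngadget,4\n"


def csv_diff(old, new):
    """Return a unified diff of two CSV texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="a/stock.csv",
            tofile="b/stock.csv",
            lineterm="",
        )
    )


diff_lines = csv_diff(old_csv, new_csv)
print("\n".join(diff_lines))
```

Only the changed row appears as a removed and an added line, which is exactly the information that is lost when the same edit is made inside an .xlsx file.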
Using Markdown Instead of Word Documents for Git Versioning (Edited ChatGPT Summary)
Microsoft Word documents (.docx) are treated by Git as binary files, similar to Excel .xlsx files. This means Git cannot show meaningful diffs or merge changes when the document is modified. For better version control, a practical alternative is to convert Word documents to Markdown (.md), which is a plain-text format that Git can track effectively.
Markdown preserves the key structural formatting used in many documents, such as headings, bold and italic text, bullet lists, numbered lists, and links. This makes Git history easy to read and review. However, complex Word features such as page layout, fonts, advanced tables, and embedded objects may not be preserved during conversion.
Since Microsoft Word does not currently provide a native “Save as Markdown” option, external tools are typically used for the conversion. A widely used free command-line tool for this purpose is Pandoc, which can convert a .docx file to Markdown with a simple command such as:
pandoc document.docx -o document.md
For occasional conversions, an online tool such as word2md.com can also be used to convert .docx files into Markdown without installing additional software.
Using Markdown as the version-controlled format allows Git to track document changes clearly, while Word can still be used when richer document formatting or publishing layouts are required.