PowerShell script to do word-level comparison using git; useful for minified files

Quick-Info

PowerShell script that compares two text files for word-level differences using git, designed for minified files: gitworddiff.ps1 GIST

This script uses `git diff --no-index --word-diff=porcelain` to compare two local text files at the word level.
It provides two outputs:
  1. A raw diff at the level of contiguous blocks of words, which shows granular changes with clear line breaks.
  2. A final count of the deleted and inserted contiguous blocks of words found.
This comparison is effective for finding subtle edits in minified files (e.g., HTML, CSS, JS). It is also useful for text files with reasonable line lengths, as it can pinpoint multiple small changes on a single line, which a regular line-by-line diff would report only as one changed line. The porcelain format groups contiguous changes: any contiguous block of added or deleted words, no matter how long, is reported on a single line. A modified word, or a contiguous block of modified words, is reported as both a deletion and an insertion.

Since the PowerShell script itself is short and the main work is done by git diff, I think it should be quite easy to port it to other operating systems such as Linux.

For global git diff colour setting:
  • git config --global color.diff.old "red yellow"
  • git config --global color.diff.new "green reverse"
For project level git diff colour setting:
  • git config color.diff.old "red yellow"
  • git config color.diff.new "green reverse"
...
After the above colour settings are done (project level or global level), the command:
  • git diff --no-index --word-diff=porcelain testcompare.html testoutput1.html 2>$null
gives an easy-to-read visual representation of the differences.

To get a count of deleted contiguous blocks of words:
  • git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\-" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
To get a count of inserted contiguous blocks of words:
  • git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\+" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
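The two counting pipelines above can be mirrored outside PowerShell. What follows is a minimal Python sketch, not the gitworddiff.ps1 code: the function name count_blocks and the sample text are illustrative assumptions, and the sample is hand-written in the shape of porcelain output rather than captured from a real run. Like the Select-Object -Skip 1 step, it drops the first match for each marker, which is the --- or +++ file header line.

```python
def count_blocks(porcelain_text):
    """Return (deleted, inserted) block counts from word-diff porcelain text.

    Mirrors the PowerShell pipelines: match lines that start with '-' or '+',
    then skip the first match of each, which is the '---' / '+++' file header.
    """
    lines = porcelain_text.splitlines()
    minus_lines = [line for line in lines if line.startswith("-")]
    plus_lines = [line for line in lines if line.startswith("+")]
    # max(..., 0) covers identical files, where git prints nothing at all.
    deleted = max(len(minus_lines) - 1, 0)
    inserted = max(len(plus_lines) - 1, 0)
    return deleted, inserted


# Hand-written sample in the shape of porcelain output (an assumption for
# illustration, not captured from a real git run).
SAMPLE = "\n".join([
    "diff --git a/file1 b/file2",
    "--- a/file1",
    "+++ b/file2",
    "@@ -1 +1 @@",
    " <p>Hello ",
    "-world",
    "+there",
    " and welcome ",
    "+to my blog",
    " </p>",
    "~",
])

deleted, inserted = count_blocks(SAMPLE)
print(f"{deleted} deleted block(s), {inserted} inserted block(s)")
```

In the sample, "world" replaced by "there" contributes one deletion and one insertion, and "to my blog" contributes a single insertion even though it is three words long.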
=======================================

Details

I was exploring a way to compare minified HTML content from a blog post with an edited version of it, both of which typically have one very long line or a few very long lines of HTML code.

Given below are relevant edited extracts from a chat today with Gemini, "Node.js Blogger Scraper Analysis".

Me: The compare files feature of both VSCode and WinMerge does not handle the minified POST HTML file comparison (from post and from our program) well. They are geared more towards comparing files with line breaks. ... Is there some comparison VSCode extension or utility that handles minified HTML file (long content on single line) comparison well?

[One option was to make a copy of these two files, add line breaks using VSCode extension Prettier and then do the comparison in VSCode or WinMerge. But I was exploring direct comparison possibilities which are easy to visually check.]

...
Me: Is there a Windows command line utility that will report the diffs between the two minified files but as strings within lines and not whole lines?

Gem: ...

The most effective tool for this task is a built-in feature of Git. If you have Git for Windows installed, you can use git diff to perform a "word-diff," which can be configured to act as a character-diff.

  1. Run the command: git diff --no-index --word-diff=color --word-diff-regex=. file1.html file2.html

...

Me: Interesting! Tried it. It seems to work, though when the difference is small, like space/ in Blogger POST HTML, the git diff output shows only the / in red foreground colour against the whole file contents. So one can spot it, but can we have a different background colour too for the differing characters?

[Key info. from Gem with some inputs from me:
For global git setting:
  • git config --global color.diff.old "red yellow"
  • git config --global color.diff.new "green reverse"
For project level git setting:
  • git config color.diff.old "red yellow"
  • git config color.diff.new "green reverse"
After that, command to run: git diff --no-index --word-diff=color --word-diff-regex=. testcompare.html testoutput1.html ]

Me: That works like a charm! The space/ is shown with yellow background and / is shown in red foreground. So it is easy to spot against the black and white Windows terminal display of other characters in the file. Great solution you have provided without me having to use some new utility command or VSCode extension. ... Now taking it further, can this git command give me a count of the differences in characters or character strings (not lines)?

[After a lot of interactions and experimentation...]

Me: That worked ...
test> git diff --no-index --word-diff=porcelain testcompare.html testoutput1.html 2>$null | Select-String -Pattern "^\-" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
14
test> 

[git diff --no-index --word-diff=porcelain testcompare.html testoutput1.html 2>$null
gives a better visual representation of the differences than the earlier --word-diff=color git diff command.]

Me: Now can you provide me a Powershell script (I use a Windows PC) named gitworddiff.ps1 that:
1) Takes in two files (file1 and file2) to compare as arguments
2) Runs the command git diff --no-index --word-diff=porcelain file1 file2 2>$null
3) After above command has finished it prints the number of word differences found like:
   x word differences were found.
   For getting this count it uses the command below, modified to print the above string instead of only the count:
   git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\-" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
   
Next I would like an enhancement: another flag argument -c which, when provided, means step 2 above is omitted so that the user sees only the count string printed when the command has finished.
You can do the enhancement together with the earlier functionality as a single step, or do it later after you provide me the earlier functionality. That's your choice.
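
The request above can also be sketched in Python, which may help when porting. This is not the script Gem produced: the names build_parser, run_word_diff, summarize and main are hypothetical, the long option --count-only is an addition, and the summary wording follows step 3 as requested at this point in the chat (deleted blocks only). It assumes git is on the PATH.

```python
import argparse
import subprocess


def build_parser():
    parser = argparse.ArgumentParser(
        description="Word-level diff of two text files via git.")
    parser.add_argument("file1")
    parser.add_argument("file2")
    parser.add_argument("-c", "--count-only", action="store_true",
                        help="print only the summary line, not the diff itself")
    return parser


def run_word_diff(file1, file2):
    """Run git and return its stdout; stderr is discarded like 2>$null."""
    result = subprocess.run(
        ["git", "diff", "--no-index", "--word-diff=porcelain", file1, file2],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
        encoding="utf-8", errors="replace",
    )
    # Exit code 1 only means "differences found", so it is not treated as an error.
    return result.stdout


def summarize(porcelain_text):
    """Count '-' lines, skipping the first ('--- a/...' header), as in step 3."""
    matches = [line for line in porcelain_text.splitlines() if line.startswith("-")]
    count = max(len(matches) - 1, 0)
    return f"{count} word differences were found."


def main(argv):
    args = build_parser().parse_args(argv)
    output = run_word_diff(args.file1, args.file2)
    if not args.count_only:          # step 2 is skipped when -c is given
        print(output, end="")
    print(summarize(output))
    return 0
```

To run it as a command, one would add `import sys` and a final line such as `raise SystemExit(main(sys.argv[1:]))`, then call it with two file names and, optionally, -c.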

[Gem provided the code. We went through quite a few iterations ...]

Me: Now about whether we should refer to this script as:
A script to compare two HTML files and report on the word differences.
----------
1) Shouldn't we say minified HTML files and explain minified in short?
2) Should we limit ourselves to HTML files? The script should work on any text files and so perhaps there may be other minified text files where the script could be useful.
3) Just curious about this: If two files are not minified i.e. have reasonable line lengths (say < 100 or 200), would there be some scenarios where this command could be useful?

Gem: That's an insightful set of questions. Here are the answers to your points.

1) Minified vs. Any Text Files

The script will work on any two text files, not just HTML. The git diff command is a general-purpose tool for comparing text files. The --word-diff and porcelain options are simply a way to format the output of those differences, which can be applied to any plain text. So, you're right, the help text can be more general.

We can update the help text to be more accurate and useful.

Minified Text: A minified file is a version of a code file (like HTML, CSS, or JavaScript) that has had all unnecessary characters removed, such as whitespace, comments, and line breaks. The purpose is to reduce the file size and improve loading speed.

2) The Usefulness of the Script on Non-Minified Files

Yes, the script is still useful for non-minified files, particularly in these scenarios:

  • Comparing Different Versions of a Document: Imagine you have two versions of a long document, such as a legal brief, a technical manual, or a report. A regular line-by-line diff might show a huge block of text as "changed" if just a single word was altered. Your script, however, will highlight the specific word-level changes, making it much easier to spot the exact edits.

  • Checking for Content Drift: In a collaborative environment, if multiple people are editing a text file, your script can quickly show you exactly what words or phrases were added or removed, even if they were on the same line.

  • Code Reviews: While not a replacement for a full code review tool, your script can be a quick way to see word-level changes in configuration files, scripts, or any other text-based code where you want to focus on the content of the changes rather than just the line they occurred on.

The git diff --word-diff=porcelain command's ability to focus on the words themselves, rather than the lines, is a great way to find the needle in the haystack of a large document change.

...

Me: ... Coming back to the scope of this script, I tried git diff without the --word-diff option. That gave me the same large two lines issue that I face with VSCode and WinMerge. With git it was actually worse, as the entire large line was coloured.
... It seems to me that the script's word diff feature (using git's word diff feature) will help even for files with regular length lines, in that it may highlight multiple word differences in the same line separately. This could be useful in some scenarios. What do u say?

Gem: You've hit on a very important and powerful aspect of the git diff --word-diff command!

You are absolutely correct. Your observation that a regular line-by-line diff (git diff without --word-diff) is often unhelpful for large, single-line changes is spot on. It simply highlights the entire line as changed, which is what you saw with your HTML files, VSCode, and WinMerge. The git diff output was even worse because it colored the entire long line, making it unreadable.

This is precisely where the word-diff feature shines, and it's why your script is useful even for non-minified files. The key benefit is its ability to:

  • Highlight granular changes: Instead of marking an entire line as "changed," it identifies and highlights only the specific words that have been added or removed within that line. This makes it far more useful for scenarios where you're editing a paragraph, a sentence, or a long line of code where only a few characters or words have been altered.

  • Improve readability: By focusing on the word-level changes, you get a much cleaner and more useful diff. This helps you quickly see the exact edits, which is crucial for things like proofreading text, reviewing documentation, or spotting subtle changes in a configuration file.

So, to answer your question directly: yes, the script's word-diff feature is useful and indeed a recommended approach for comparing files with regular line lengths in scenarios where you want to see a granular, word-by-word comparison rather than a line-by-line one.
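
The contrast described above can be reproduced in miniature. The sketch below uses Python's difflib rather than git, so it only illustrates the idea and does not reproduce git's algorithm or output. The two sample strings and the function names are made up for the illustration, and words are simply split on whitespace.

```python
import difflib


def line_level_changes(old_text, new_text):
    """Return the changed lines as a line-by-line diff would report them."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # Keep only removed/added lines; drop ndiff's '? ' hint lines.
    return [entry for entry in diff if entry.startswith(("- ", "+ "))]


def word_level_changes(old_text, new_text):
    """Return (operation, old_words, new_words) for each changed run of words."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "server { listen 80; root /var/www; index index.html; }"
new = "server { listen 8080; root /var/www; index home.html; }"

print(line_level_changes(old, new))   # the whole line, once removed and once added
print(word_level_changes(old, new))   # only the two edited words
```

The line-level result repeats the entire line twice, while the word-level result isolates the two edits on that same line, which is the behaviour that makes the word diff useful even for files with ordinary line lengths.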

...

[Lot of interaction for documentation changes.]

Me: So we count deleted words and modified words (once is fine for modification). But we don't count insertion alone (not part of modified). Will that be too complex to do?

[Gem provided the modified code for handling insertion.]

...

Me: I fixed a couple of bugs in the code. Please update your memory with the uploaded updated script.
Now I need to dig into what a 'word' is to git diff. Is a new line with 'hi there' one word? That's how git diff seems to view it. Your comments pl.

...

Gem: You've discovered a very important nuance about the porcelain format. Your observation is absolutely correct, and it highlights a key difference between how Git's core word-diff engine works and how the porcelain format reports the changes.

Here's a breakdown of what's happening:

  • Underlying Engine: Internally, git diff --word-diff is indeed processing the line as a series of separate words, based on its regular expression [a-zA-Z0-9_]+|[^a-zA-Z0-9_ \t\n]+.

  [Note: the git documentation says that, unless --word-diff-regex is given, words are delimited by whitespace, i.e. runs of non-whitespace are treated as words. The regular expression quoted by Gem above is therefore not git's documented default.]

  • --word-diff=porcelain Output: The porcelain format is designed to be simple and machine-readable. It groups contiguous additions or deletions into single lines of output. In your case, the entire string of "hi there hi there..." is a single, continuous addition. Therefore, the porcelain format represents this entire addition on a single line, and your script's Select-String command correctly counts it as a single match for an addition (+).

  • The Other Case: When you inserted hi and there before and after another word, they were not contiguous. They were separated by a word that was not part of the addition. Because of this, the porcelain output showed two separate lines of additions, one for each word.

So, the "word" definition in porcelain is a bit different from the engine's internal definition. While the engine correctly identifies the individual words, the porcelain format condenses a continuous stream of additions or deletions into a single line. This is why you correctly observed the single-line output and the single count in your script.
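
The grouping behaviour Gem describes can be imitated with a small sketch. It uses Python's difflib on whitespace-separated words, which is an assumption made for illustration and not how git computes or prints its diff; the sample sentences and the function name are invented. It shows the counting consequence: one contiguous run of inserted words is one block, the same words inserted at two separate places make two blocks, and a modified run counts as both a deletion and an insertion.

```python
import difflib


def count_change_blocks(old_text, new_text):
    """Count contiguous deleted and inserted runs of words, one per run."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    deleted = inserted = 0
    for tag, _i1, _i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted += 1      # a modified run counts as a deletion ...
        if tag in ("insert", "replace"):
            inserted += 1     # ... and also as an insertion
    return deleted, inserted


base = "alpha beta gamma"
print(count_change_blocks(base, "alpha hi there hi there beta gamma"))  # one contiguous insertion
print(count_change_blocks(base, "alpha hi beta there gamma"))           # two separate insertions
print(count_change_blocks(base, "alpha BETA gamma"))                    # one modified word
```

The first call adds four words but yields a single inserted block, which is why a count taken from this kind of grouped output is a count of blocks and not of individual words.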

[Some more discussion and also some sites which claimed to count word differences.]

Me: https://gist.github.com/MilesCranmer/5c7d86c8740219355d2dfdb184910711 makes a claim of reporting number of words changed. I think as that also uses porcelain, it will have the same issue that we face. What do u say?

[Gem agreed with me.]

Me: Same is the case for this: https://john.colagioia.net/blog/2021/10/13/words-changed.html#count-the-words-changed ... Isn't it?

[Gem agreed with me.]

Me: I would like messages in our script to be accurate. So I think in our script report messages, we should say: number of 'word' additions and deletions as reported by git diff --word-diff=porcelain, and then add another line stating that "porcelain format is designed to group contiguous changes into single lines; any continuous block of added or deleted words, no matter how long, will be reported on a single line." This should also be included in our script documentation. This way we will be precise and not give a slightly inaccurate view, which is what the links I shared with you seem to be giving. What do u say?

[Gem agreed with me.]

[Had some interactions on polishing the script messages and documentation, and about sharing it as a (public) GIST.]
