PowerShell script to do word-level comparison using Git; useful for minified files
Quick-Info
- A raw diff at the level of contiguous blocks of words, showing granular changes with clear line breaks.
- A final count of the deleted and inserted contiguous blocks of words found.
- git config --global color.diff.old "red yellow"
- git config --global color.diff.new "green reverse"
- git config color.diff.old "red yellow"
- git config color.diff.new "green reverse"
- git diff --no-index --word-diff=porcelain testcompare.html testoutput1.html 2>$null
- git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\-" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
- git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\+" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
Details
The most effective tool for this task is a built-in feature of Git. If you have Git for Windows installed, you can use git diff to perform a "word-diff," which can be configured to act as a character-diff.
Run the command: git diff --no-index --word-diff=color --word-diff-regex=. file1.html file2.html
- git config --global color.diff.old "red yellow"
- git config --global color.diff.new "green reverse"
- git config color.diff.old "red yellow"
- git config color.diff.new "green reverse"
2) Runs the command git diff --no-index --word-diff=porcelain file1 file2 2>$null
3) After the above command has finished, it prints the number of word differences found, like:
x word differences were found.
To get this count, it uses the command below, modified to print the above string instead of only the count:
git diff --no-index --word-diff=porcelain file1 file2 2>$null | Select-String -Pattern "^\-" | Select-Object -Skip 1 | Measure-Object -Line | Select-Object -ExpandProperty Lines
Next, I would like an enhancement: another flag argument, -c, which when provided means step 2 above is omitted, so that the user sees only the count string when the command has finished.
You can do the enhancement with the earlier functionality as a single step or you could do the enhancement later after you provide me the earlier functionality. That's your choice.
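The request above could be met with something like the following minimal sketch. It is not the script Gem produced; the function names Compare-WordDiff and Get-WordDiffCount, the parameter names, and the -CountOnly switch (aliased to -c) are assumptions made for illustration. The git command and the skip-the-first-match counting approach are the ones quoted above.

```powershell
# Hypothetical helper (not from the original script): counts '-' or '+' blocks
# in porcelain word-diff output that has already been captured as lines.
function Get-WordDiffCount {
    param(
        [string[]]$PorcelainLines = @(),
        [ValidateSet('-', '+')]
        [string]$Marker = '-'
    )
    $pattern = '^' + [regex]::Escape($Marker)
    # The first match is the '--- a/...' or '+++ b/...' file header, so skip it.
    @($PorcelainLines | Where-Object { $_ -match $pattern } | Select-Object -Skip 1).Count
}

# Hypothetical wrapper: prints the porcelain diff (unless -c is given)
# and then the count string.
function Compare-WordDiff {
    param(
        [Parameter(Mandatory)][string]$File1,
        [Parameter(Mandatory)][string]$File2,
        [Alias('c')][switch]$CountOnly
    )
    # Same command as in the text; requires git on the PATH.
    $lines = git diff --no-index --word-diff=porcelain $File1 $File2 2>$null
    if (-not $CountOnly) { $lines }   # step 2 is omitted when -c is supplied
    $count = Get-WordDiffCount -PorcelainLines $lines -Marker '-'
    "$count word differences were found."
}
```

With these assumptions, Compare-WordDiff -File1 file1 -File2 file2 prints the diff followed by the count string, and adding -c prints only the count string.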
----------
1) Shouldn't we say minified HTML files and explain minified in short?
2) Should we limit ourselves to HTML files? The script should work on any text files and so perhaps there may be other minified text files where the script could be useful.
3) Just curious about this: If two files are not minified i.e. have reasonable line lengths (say < 100 or 200), would there be some scenarios where this command could be useful?
1) Minified vs. Any Text Files
The script will work on any two text files, not just HTML. The git diff command is a general-purpose tool for comparing text files. The --word-diff and porcelain options are simply a way to format the output of those differences, which can be applied to any plain text. So, you're right, the help text can be more general.
We can update the help text to be more accurate and useful.
Minified Text: A minified file is a version of a code file (like HTML, CSS, or JavaScript) that has had all unnecessary characters removed, such as whitespace, comments, and line breaks. The purpose is to reduce the file size and improve loading speed.
2) The Usefulness of the Script on Non-Minified Files
Yes, the script is still useful for non-minified files, particularly in these scenarios:
Comparing Different Versions of a Document: Imagine you have two versions of a long document, such as a legal brief, a technical manual, or a report. A regular line-by-line diff might show a huge block of text as "changed" if just a single word was altered. Your script, however, will highlight the specific word-level changes, making it much easier to spot the exact edits.
Checking for Content Drift: In a collaborative environment, if multiple people are editing a text file, your script can quickly show you exactly what words or phrases were added or removed, even if they were on the same line.
Code Reviews: While not a replacement for a full code review tool, your script can be a quick way to see word-level changes in configuration files, scripts, or any other text-based code where you want to focus on the content of the changes rather than just the line they occurred on.
The git diff --word-diff=porcelain command's ability to focus on the words themselves, rather than the lines, is a great way to find the needle in the haystack of a large document change.
...
.... It seems to me that the script's word diff feature (using git's word diff feature), will help even for files with regular length lines in that it may highlight multiple word differences in the same line separately. This could be useful in some scenarios. What do u say?
git diff --word-diff command! You are absolutely correct. Your observation that a regular line-by-line diff (git diff without --word-diff) is often unhelpful for large, single-line changes is spot on. It simply highlights the entire line as changed, which is what you saw with your HTML files, VSCode, and WinMerge. The git diff output was even worse because it colored the entire long line, making it unreadable.
This is precisely where the word-diff feature shines, and it's why your script is useful even for non-minified files. The key benefit is its ability to:
Highlight granular changes: Instead of marking an entire line as "changed," it identifies and highlights only the specific words that have been added or removed within that line. This makes it far more useful for scenarios where you're editing a paragraph, a sentence, or a long line of code where only a few characters or words have been altered.
Improve readability: By focusing on the word-level changes, you get a much cleaner and more useful diff. This helps you quickly see the exact edits, which is crucial for things like proofreading text, reviewing documentation, or spotting subtle changes in a configuration file.
So, to answer your question directly: yes, the script's word-diff feature is useful and indeed a recommended approach for comparing files with regular line lengths in scenarios where you want to see a granular, word-by-word comparison rather than a line-by-line one.
...
[Lot of interaction for documentation changes.]
Me: So we count deleted words and modified words (once is fine for modification). But we don't count insertion alone (not part of modified). Will that be too complex to do?
[Gem provided the modified code for handling insertion.]
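The counting rule discussed here can be sketched roughly as follows. This is not Gem's code; Get-WordDiffSummary is a hypothetical name, and the sketch assumes that in porcelain output a modification appears as a '-' line immediately followed by a '+' line, while a '+' line with no preceding '-' line is a pure insertion.

```powershell
# Hypothetical sketch: classify porcelain word-diff blocks as deleted,
# modified, or inserted, given the output already captured as lines.
function Get-WordDiffSummary {
    param([string[]]$PorcelainLines = @())
    $deleted = 0; $modified = 0; $inserted = 0
    $inHunk = $false
    $previousWasRemoval = $false
    foreach ($line in $PorcelainLines) {
        # File headers ('---', '+++') come before the first '@@' hunk header.
        if ($line -match '^@@') { $inHunk = $true; $previousWasRemoval = $false; continue }
        if (-not $inHunk) { continue }
        if ($line -match '^-') {
            $deleted++                      # provisionally a pure deletion
            $previousWasRemoval = $true
        }
        elseif ($line -match '^\+') {
            if ($previousWasRemoval) {
                $deleted--; $modified++     # '-' then '+' is one modification
            }
            else {
                $inserted++                 # '+' on its own is a pure insertion
            }
            $previousWasRemoval = $false
        }
        else {
            $previousWasRemoval = $false    # context line or '~' newline marker
        }
    }
    [pscustomobject]@{ Deleted = $deleted; Modified = $modified; Inserted = $inserted }
}
```

Under the rule in the question, the reported figure would be Deleted plus Modified, with Inserted left out.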
...
...
Gem: You've discovered a very important nuance about the porcelain format. Your observation is absolutely correct, and it highlights a key difference between how Git's core word-diff engine works and how the porcelain format reports the changes.
Here's a breakdown of what's happening:
Underlying Engine: Internally, git diff --word-diff is indeed processing the line as a series of separate words, based on its regular expression [a-zA-Z0-9_]+|[^a-zA-Z0-9_ \t\n]+.
--word-diff=porcelain Output: The porcelain format is designed to be simple and machine-readable. It groups contiguous additions or deletions into single lines of output. In your case, the entire string of "hi there hi there..." is a single, continuous addition. Therefore, the porcelain format represents this entire addition on a single line, and your script's Select-String command correctly counts it as a single match for an addition (+).
The Other Case: When you inserted hi and there before and after another word, they were not contiguous. They were separated by a word that was not part of the addition. Because of this, the porcelain output showed two separate lines of additions, one for each word.
So, the "word" definition in porcelain is a bit different from the engine's internal definition. While the engine correctly identifies the individual words, the porcelain format condenses a continuous stream of additions or deletions into a single line. This is why you correctly observed the single-line output and the single count in your script.
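To make the difference between blocks and words concrete, here is a minimal sketch that reports both for captured porcelain output. Get-BlockAndWordCount is a hypothetical name, and the word count assumes words are separated by whitespace, which may not match the word regex Git actually applied.

```powershell
# Hypothetical sketch: count contiguous blocks and whitespace-separated words
# for one marker ('+' or '-') in captured porcelain word-diff lines.
function Get-BlockAndWordCount {
    param(
        [string[]]$PorcelainLines = @(),
        [ValidateSet('+', '-')]
        [string]$Marker = '+'
    )
    $pattern = '^' + [regex]::Escape($Marker)
    $blocks = 0; $words = 0; $inHunk = $false
    foreach ($line in $PorcelainLines) {
        # Ignore everything before the first '@@' hunk header (file headers).
        if ($line -match '^@@') { $inHunk = $true; continue }
        if (-not $inHunk) { continue }
        if (-not ($line -match $pattern)) { continue }
        $blocks++
        # Drop the marker character, split on whitespace, ignore empty pieces.
        $words += @($line.Substring(1) -split '\s+' | Where-Object { $_ }).Count
    }
    [pscustomobject]@{ Blocks = $blocks; Words = $words }
}
```

For a single porcelain line such as "+hi there hi there", this sketch reports one block and four words, which mirrors the behaviour described above: the line-based count sees one addition even though several words were inserted.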
[Some more discussion and also some sites which claimed to count word differences.]
Me: https://gist.github.com/MilesCranmer/5c7d86c8740219355d2dfdb184910711 makes a claim of reporting number of words changed. I think as that also uses porcelain, it will have the same issue that we face. What do u say?
[Gem agreed with me.]
Me: Same is the case for this: https://john.colagioia.net/blog/2021/10/13/words-changed.html#count-the-words-changed ... Isn't it?
[Gem agreed with me.]