Fixing Gemini/ChatGPT chat to Blogger Compose copy-paste causing up to 1.5 MB of post size bloat due to unnecessary CSS
Summary
- The Problem: Copy-pasting content from Gemini Chat into the Blogger Compose view was injecting thousands of lines of hidden CSS and browser extension variables (like Dark Reader styles). This issue also affected ChatGPT chats, but with smaller bloat.
- The Impact: Single blog posts ballooned to up to 1.5 MB, causing the Blogger feed to truncate results and breaking the BloggerFeedToBook incremental backup utility.
- The Solution: A sanitization process using PrettyHTML and VSCode to strip unnecessary tag attributes while preserving headers, bolding, and lists. This solution works for both Gemini and ChatGPT copy-paste CSS bloat issues. The steps in the process are given below in this Summary itself.
- The Result: The 1.5 MB post size was reduced by 96% (from 1,491 KB down to ~50-60 KB) without losing any visible text or core formatting. For other posts with the Gemini copy-paste issue, the saving was of a similar order, but for ChatGPT-only and Dark-Reader-extension-only cases, the saving was lower.
- I use a 300 KB full post size, including Blogger theme content (retrieved using PowerShell Invoke-WebRequest through the postsize.ps1 script), as the threshold for checking whether a post has CSS bloat. In the tests given in the 9 Mar 2026 entry of the Details section, the Blogger theme overhead is around 125–130 KB, so for a 300 KB full post size (total payload), the threshold for the user-created content in the Blogger Dashboard "Edit HTML" view becomes 170–175 KB.
- For posts crossing the above threshold (300 KB full post size), I invest time in the sanitization process only if the PrettyHTML cleanup offers a 100 KB+ reduction in size.
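The threshold arithmetic above can be sketched as follows (a minimal Python illustration; the 128 KB theme overhead is an assumed midpoint of the 125–130 KB range measured in the Details section):

```python
# Bloat-check thresholds used in this post (values specific to my blog theme).
FULL_POST_THRESHOLD_KB = 300   # full payload: user content + Blogger theme
THEME_OVERHEAD_KB = 128        # assumed midpoint of the measured 125-130 KB

def user_content_threshold_kb():
    """Threshold for the 'Edit HTML' user-created content alone."""
    return FULL_POST_THRESHOLD_KB - THEME_OVERHEAD_KB

def is_bloat_suspect(full_post_size_kb):
    """Flag a post for the sanitization check if its full payload
    crosses the 300 KB threshold."""
    return full_post_size_kb > FULL_POST_THRESHOLD_KB

print(user_content_threshold_kb())  # 172, i.e. within the 170-175 KB band
print(is_bloat_suspect(1491))       # True for the 1,491 KB Part 1 post
```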
Sanitization Process
- Copy HTML of original blog post using 'Edit HTML' view into a file: post-orig.html. Use this as a backup copy. Don't modify it.
- Open https://prettyhtml.com/. Use the 'Open Doc or HTML' button in the top part of the right side of its page to load post-orig.html. Note the size shown just below that button.
- In prettyhtml site:
- Click 'Inline styles'
- Click 'Classes & IDs'
- Click 'Tag attributes'
- See the size now. If the size has reduced by over 100 KB, proceed to the next step (downloading the cleaned HTML). For up to 100 KB of size reduction, it is not worth spending time on the sanitization process, so typically sanitization should not be done.
- Download cleaned HTML from prettyhtml site as post-pretty.html
- Prettyhtml site cleanup drops <pre> elements. See checkPostBloat.md, https://github.com/ravisiyer/blogger-ps-utils/blob/main/checkPostBloat.md and GColab/prompts.md on how Google Colab python script helps to reduce the manual work involved in the fix.
- Prettyhtml site cleanup (specifically, 'Classes & IDs' cleanup option in it) also breaks local anchor IDs which I re-inserted manually when I encountered the issue in one post.
- Have a quick view at post-pretty.html rendered in Chrome and confirm it is OK.
- Copy-paste the text of the original blog post rendered in Chrome into a text file: post-orig.txt. You may use Ctrl-A followed by Ctrl-C to copy the whole page to the clipboard, then open post-orig.txt in VSCode and paste there.
- Copy-paste the text of post-pretty.html rendered in Chrome into a text file: post-pretty.txt, using the same copy-paste procedure as in the above point.
- Compare post-orig.txt and post-pretty.txt in VSCode and using the Compare-TextIgnoreBlanks.ps1 script.
- Barring expected blog header and footer differences, there should be only blank lines kind of differences - no other text differences should be there.
- Use VSCode to do a quick top-level compare.
- Use Compare-TextIgnoreBlanks.ps1 which gives a diff output that ignores blank lines, to confirm that only expected differences like blog header and footer are listed. An alias can help for short commands:
- function ctib { & ".\Compare-TextIgnoreBlanks.ps1" @args }
- ctib .\post-orig.txt .\post-pretty.txt
- A WinMerge comparison with the "ignore blank lines" setting helps to easily check the ctib difference output lines, some of which differ only in tabs.
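The blank-line-ignoring comparison can be sketched in Python as below (an illustrative stand-in only; Compare-TextIgnoreBlanks.ps1 itself is a PowerShell script and may differ in detail):

```python
import difflib

def nonblank_lines(path):
    """Read a file and drop blank (whitespace-only) lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def diff_ignoring_blanks(orig_path, pretty_path):
    """Unified diff of the two text dumps, blank lines ignored.
    An empty result (barring header/footer noise) means the
    sanitized post kept all visible text."""
    return list(difflib.unified_diff(
        nonblank_lines(orig_path),
        nonblank_lines(pretty_path),
        fromfile=orig_path, tofile=pretty_path, lineterm=""))

# Usage:
# for line in diff_ignoring_blanks("post-orig.txt", "post-pretty.txt"):
#     print(line)
```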
- [If you want to use the Synchronize Tab Scrolling extension covered in a later step, open the old/original post in a Chrome tab before updating the HTML of the post with post-pretty.html contents.] Replace the HTML of the original blog post with post-pretty.html contents.
- In Blogger post editor itself, switch to Compose view and confirm that blog post is being shown correctly. Make any minor changes that are needed - spacing, right alignment, table column spacing etc. Update/publish the post and confirm that the published post is being shown correctly.
- The Synchronize Tab Scrolling Chrome extension is useful for a synchronized-scrolling check of the updated page versus the original page, but one has to safely keep the original page open in a Chrome tab without refreshing it after the update. Unfortunately, the extension does not work for local files, so it could not be used to check the post-orig.html and post-pretty.html local files, whether served by the Live Server VSCode extension or opened directly in Chrome as local files.
- Copy the Blogger post editor's 'Edit HTML' content into a file on the PC. Confirm that this file's size is comparable to post-pretty.html's size. Note that the Blogger server seems to filter the post content it delivers, so this 'Edit HTML' way of measuring user-created/user-edited post content seems to be the most accurate. The checkPostBloat.ps1 script (see checkPostBloat.md, https://github.com/ravisiyer/blogger-ps-utils/blob/main/checkPostBloat.md) reports the HTML file size and also a count of some common bloat signatures in it.
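The bloat-signature counting can be sketched in Python along these lines (a hypothetical re-implementation for illustration; the actual signatures checkPostBloat.ps1 counts may differ):

```python
import re

# Common bloat signatures seen in this post's affected HTML.
BLOAT_SIGNATURES = {
    "dark_reader": r"data-darkreader-inline-[a-z]+",
    "inline_style": r'\sstyle="[^"]*"',
    "data_attr": r'\sdata-[a-z0-9-]+="[^"]*"',
}

def bloat_report(html_text):
    """Return (size in KB, count per bloat signature) for an HTML string."""
    counts = {name: len(re.findall(pat, html_text))
              for name, pat in BLOAT_SIGNATURES.items()}
    size_kb = len(html_text.encode("utf-8")) / 1024
    return round(size_kb, 1), counts
```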
Related post
Details
I faced issues with getting the blog feed for backup for the last 5 days ...
Gemini chat: Blogger Feed Request Issue
Me: I think this Blogger feed request is returning fewer posts than it should (3 or 4 instead of perhaps over 8). Can you check?
[Gemini seems to have not been able to execute above feed request. So I created a file having that feed data which had to be uploaded twice as first upload seems to have got truncated.]
Me: I have uploaded the blogger-feed2.xml file. It is 3068 KB in size on my PC. Tell me the size of the file that got uploaded to you. [This file has the Blogger feed output for the above request.]
Gem: The file you uploaded, blogger-feed2.xml, was successfully received and is not truncated this time.
The file size I received is approximately 3.01 MB (3,158,541 bytes). This matches the 3068 KB size you reported from your PC.
Feed Content Analysis
Now that I have the full file, I can verify the post count for you:
- Total Results Returned: The feed header confirms that Blogger found 11 total posts matching your query.
- Actual Entries in File: I have verified that all 11 <entry> tags are present in the file.
- Time Range: The request correctly applied the filter for posts updated between February 23, 2026, and February 28, 2026.
Why you saw fewer posts in Chrome
The reason you only saw "3 or 4" posts in your browser is likely due to the extreme size of the XML data.
- Post Content: Several of your posts (specifically Part 4 and Part 5 of your "Google AI Pro plan" series) contain massive amounts of inline CSS and HTML formatting.
- Browser Rendering: Chrome's built-in XML viewer often struggles to render or fully display very large XML files, especially when they contain thousands of lines of nested HTML within the <content> tags.
Summary: The Blogger feed request is working correctly and returning all 11 expected posts. There is no error in the feed; it was simply a display issue in the browser.
Me: When I searched in the file using Notepad++, I found only 3 <entry> tags. Please confirm that you found 11 <entry> tags.
Gem: I can confirm that the file blogger-feed2.xml you uploaded contains exactly 11 <entry> tags.
The reason your Notepad++ search only showed 3 is likely due to the extreme length of individual lines in this file. Blogger feeds often include the entire HTML and CSS content of a post within the <content> tag, and in your file, this has resulted in some single lines of text being over 1 million characters long.
Why Notepad++ might be showing fewer results:
- Line Length Limits: While Notepad++ can open large files, its search engine sometimes struggles to correctly index or highlight multiple matches that occur within a single, ultra-long line.
- Rendering Issues: When a line is exceptionally long, text editors may truncate the display or fail to process every match on that line during a standard "Find" operation.
How to verify all 11 entries in Notepad++:
To see all of them, you can use the "Find All in Current Document" feature. Instead of just looking at the count, check the Search results window at the bottom:
- It should list each occurrence.
- You will notice that many of these entries are on the same "Line" number because of the missing line breaks in the XML data.
The XML header in your file also explicitly confirms this count:
<openSearch:totalResults>11</openSearch:totalResults>.
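Counting the <entry> tags programmatically sidesteps the editor's long-line issue entirely, since a substring count is immune to line length. A minimal Python sketch (file name taken from this chat):

```python
def count_entries(path):
    """Count opening <entry> tags in a Blogger Atom feed file.
    Works regardless of how long individual lines are, which is
    where GUI editor searches can undercount."""
    with open(path, encoding="utf-8") as f:
        return f.read().count("<entry")

# Usage:
# count_entries("blogger-feed2.xml")
```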
Me: The size of the HTML has become a problem. I think I will need to fix this issue. You see, I copy-pasted contents of a previous chat with you from your chat window to Blogger. That resulted in huge HTML content going into the Blogger post. Note that I have done this with ChatGPT and there I don't face this problem.
So now I need to figure out how to copy-paste content from your chat window to Blogger but without having this huge HTML content. Any thoughts?
...
Me: I don't want to lose the basic formatting of section headers, bold, italics, numbering and bullets. If I copy as text, I lose that. Give me a midway solution.
Gem: Since you want to keep the structural formatting (headers, bold, bullets) while stripping the massive "Dark Reader" and Gemini-specific CSS bloat, the best approach is to use a content sanitizer.
Here are three ways to achieve that "midway" result:
1. The "Google Docs" Filter (Easiest for Most)
Google Docs has a very "aggressive" but clean way of handling clipboard data.
- Step 1: Copy from the chat window.
- Step 2: Paste into a blank Google Doc.
- Step 3: Highlight the text in Google Docs and copy it again.
- Step 4: Paste into Blogger.
- Why it works: Google Docs automatically strips out the complex span style and data- attributes that Gemini uses, but it maps the headers, bolding, and lists into clean, standard HTML.
2. The "Gmail" Trick
If you don't want to open a document:
- Step 1: Paste the content into a New Message window in Gmail.
- Step 2: Highlight it and copy it back out.
- Step 3: Paste into Blogger.
- Why it works: Much like Google Docs, the Gmail composer is designed to "flatten" complex HTML into a simplified version that is safe for email, which happens to be exactly what Blogger likes.
3. Use an Online "HTML Cleaner"
If you want to be 100% sure the code is lean:
- Go to a tool like HTML Washer or HTML Cleanup.
- Paste your rich text into the editor.
- These tools have settings to "Strip CSS/Styles" but "Keep Structure" (like <b>, <i>, <ul>, <li>, and <h>).
- Copy the "washed" result and paste it into Blogger's HTML View.
Why the bloat is happening (for your awareness)
In the file you uploaded, I noticed thousands of instances of tags like:
<span style="color: #e8e6e3;">
This happens because you likely have a Dark Mode extension (like Dark Reader) active in your browser. When you copy text, the extension "bakes" its temporary visual overrides into the HTML code itself. By using the "Google Docs" or "Gmail" middleman, those temporary extension styles are discarded while the actual structural tags are preserved.
Me: Noted the Dark Reader issue. Yes, I do use it. What I will do is disable it the next time I have to copy-paste from Gemini to Blogger. But note that with ChatGPT I have not had this huge bloat even though the Dark Reader Chrome extension was active then.
Gem: It makes complete sense that you didn't see this with ChatGPT. The difference isn't just about Dark Reader—it's about how each platform's "Clipboard" is programmed.
Why ChatGPT is "Cleaner" than Gemini
When you copy from ChatGPT, their system is designed to provide a very specific "Flavor" of HTML to your clipboard. It focuses on the basics (headers, lists, bold) and ignores the background UI styling.
Gemini, however, uses a "Rich Text" copy method that captures almost everything currently rendered in the browser window. This includes:
- Dynamic Variables: All the --darkreader-inline-color and --gemini-variable code you saw.
- Invisible Markers: Hidden "Zero-Width" characters used for tracking cursor positions.
- Layout CSS: Inline styles that tell the text how to wrap, which Blogger then tries to save permanently.
The "Midway" Solution: Copying as Markdown
The most effective way to keep your formatting (Bold, Headers, Lists) without the CSS bloat is to Copy as Markdown instead of using a standard highlight-and-copy.
Since you are a developer, this is likely your best path:
- The Built-in "Copy" Button: At the bottom of every Gemini response, there is a small "Copy" icon (it looks like two overlapping squares).
- The Benefit: This button copies the Markdown source of the response.
- The "Markdown to Blogger" Workflow: Blogger's "Compose" view doesn't understand Markdown directly, but you can use a "Converter" to turn that clean Markdown into clean HTML.
- Recommended Tool: Use StackEdit or Dillinger.
- The Process:
- Click the "Copy" button in Gemini.
- Paste into StackEdit (on the left side).
- Copy the rendered text from the right side (the preview side) and paste it into Blogger.
- Why it works: This strips 100% of the CSS variables but keeps <h1>, <b>, <ul>, and <li> perfectly intact.
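To illustrate why the Markdown route stays clean, here is a deliberately tiny Markdown-to-HTML sketch. This is not any of the tools named above (StackEdit, Dillinger); real converters handle far more. It just shows that clean Markdown maps to clean, attribute-free HTML:

```python
import re

def md_to_basic_html(md_text):
    """Convert a small Markdown subset (#-headers, **bold**, *italic*,
    '- ' bullets, plain paragraphs) to minimal HTML with no attributes."""
    html_lines, in_list = [], False
    for line in md_text.splitlines():
        line = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", line)  # bold first
        line = re.sub(r"\*(.+?)\*", r"<i>\1</i>", line)      # then italic
        m = re.match(r"(#{1,6})\s+(.*)", line)
        if m:  # header line
            if in_list:
                html_lines.append("</ul>"); in_list = False
            level = len(m.group(1))
            html_lines.append(f"<h{level}>{m.group(2)}</h{level}>")
        elif line.startswith("- "):  # bullet item
            if not in_list:
                html_lines.append("<ul>"); in_list = True
            html_lines.append(f"<li>{line[2:]}</li>")
        else:  # plain paragraph (blank lines dropped)
            if in_list:
                html_lines.append("</ul>"); in_list = False
            if line.strip():
                html_lines.append(f"<p>{line}</p>")
    if in_list:
        html_lines.append("</ul>")
    return "\n".join(html_lines)
```

Note that the output contains no style attributes, classes, or data- attributes at all, which is exactly the "midway" property this workflow is after.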
Quick Tip for Now
If you want to fix a post that is already in Blogger and causing bloat:
- Go to the Blogger editor.
- Highlight the problematic text.
- Press Ctrl + \ (the shortcut for Clear Formatting).
- Blogger will strip the complex CSS but usually retains the basic Bold and Italic tags. You may only need to re-apply the Headings.
Would you like me to take one of the long "Part 5" posts from your XML file and provide a "cleaned" HTML version of it for you to see the difference?
Me: What I did is I used the Gemini copy button but pasted it into Blogger Compose. That resulted in this bloat.
Gem: That explains exactly why this happened. When you use the Gemini copy button, Gemini puts two versions of the content onto your clipboard simultaneously:
- Plain Text/Markdown
- Rich Text (HTML)
When you paste into Blogger's "Compose" view, the editor is programmed to prioritize the Rich Text/HTML version if it is available. Because Gemini’s HTML version includes all the background CSS variables (and those added by Dark Reader), Blogger captures every single one of those styles.
The solution for next time
Since you are already using the Gemini copy button, you are halfway there. To avoid the bloat but keep your formatting, use one of these two methods:
Method A: Paste into Blogger's "HTML View"
- Click the Copy button in Gemini.
- In Blogger, switch from "Compose View" to "HTML View" (using the pencil/bracket icon top-left).
- Paste the content there.
- Switch back to "Compose View".
- Why this works: When you paste into the HTML view, Blogger ignores the "Rich Text" clipboard data and only takes the raw Markdown/Text. Blogger's editor is smart enough to then render that Markdown into clean, simple HTML when you switch back to Compose mode.
Method B: The "Ctrl + Shift + V" habit
- Click the Copy button in Gemini.
- In Blogger "Compose View", press Ctrl + Shift + V (instead of just Ctrl + V).
- The tradeoff: This is the cleanest possible paste, but it will strip everything, including bold and headers. You would have to spend a minute re-bolding and setting your headers.
My Recommendation for your Blogger Workflow
If you want the "midway" solution (keeping Bold, Lists, and Headers without the CSS bloat), Method A is your best bet.
By pasting the Gemini "Copy Button" output directly into Blogger's HTML View, you force the browser to ignore the millions of characters of hidden CSS and only process the structural text. Your Blogger posts will shrink from 3MB back down to a few kilobytes, making your feeds fast and readable again.
Me: Good inputs. The Markdown point gives many possible solutions. I could use an online markdown to HTML converter if Blogger Edit HTML editor gets confused with Markdown being pasted into an existing HTML document.
...
Me: Now I would like to start with Part 1 of my 5 post series which has lot of Gemini copy-paste HTML and see if unwanted styles can be stripped out of it but retaining the basic styles of headings, bold, italics, numbering, bullets etc.
Do you think you will be able to do it, if I provide you the HTML (Edit HTML window contents) of the blog post?
...
Me: Uploaded part-1.html which is a copy-paste of Edit HTML view contents from Blogger. The file size on my PC is 1491 KB. Please confirm that you have got the whole file and not a truncated version.
...
Me: Will you be able to fit the converted code in a "code block"? If not can you provide me a file download?
[Gem gave me the converted output as a code-block and claimed, "Verification: The original file was 1,527,159 bytes. This cleaned version is approximately 7,000 bytes. You have reduced the weight of this post by over 99.5%, which will completely resolve your Blogger feed issues for this post."]
Me: You have dropped a lot of content from the source - not just styles. I don't have confidence in your capability to do this conversion properly. Can you point me to a free online site that provides such a functionality?
Gem: I understand your concern. Because your source file contains a vast amount of nested <div> and <span> tags alongside standard paragraphs, a generalized cleaning approach can sometimes accidentally strip the "invisible" containers that were being used as actual line breaks or spacers in your Blogger layout.
For a reliable, manual way to do this where you have full control over what is kept, I recommend the following free online tools. These are designed specifically to handle the "bloated HTML" generated by Word or modern web editors:
1. HTML Cleaner (html-cleaner.com)
This is likely the best tool for your specific needs because it has checkbox-style toggles for exactly what you want to remove.
- How to use it: Paste your code into the "Source" editor (right side).
- Settings to check: Check: Remove inline styles (strips the DarkReader/Gemini CSS).
- Check: Remove classes and IDs.
- Uncheck: Remove span tags (keep this unchecked if your headers or bolds are wrapped in spans).
- Uncheck: Remove empty tags (keep this unchecked if you use empty divs for spacing).
- Result: It will leave your <b>, <i>, <ul>, and <li> tags untouched while deleting all the style attributes.
2. HTML Washer (htmlwasher.com)
This is a more "aggressive" but very clean tool. It is perfect if you want to force the post into a very strict, standard format.
- What it does: It "washes" the HTML to only allow a "whitelist" of safe tags (headings, lists, bold, links).
- Benefit: It is excellent at flattening deep nesting (div inside div inside div) into simple paragraphs.
3. Word to HTML (wordtohtml.net)
Despite the name, it works perfectly for pasting raw HTML from any source.
- Pro Tip: It has a "Cleanup" tab where you can specifically choose "Remove all tag attributes except href and src." This is a "magic button" for your situation because it preserves the links but kills the DarkReader colors.
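The "remove all tag attributes except href and src" idea can be sketched with Python's standard-library HTML parser. This is an illustrative stand-in for what such online cleaners do, not their actual implementation:

```python
from html.parser import HTMLParser

KEEP_ATTRS = {"href", "src"}  # mirrors the "except href and src" cleanup option

class AttributeStripper(HTMLParser):
    """Re-emit HTML with every tag attribute dropped except KEEP_ATTRS.
    Sketch only: void elements and edge cases are handled naively
    compared to a real cleaner."""
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
    def handle_starttag(self, tag, attrs):
        kept = "".join(f' {k}="{v}"' for k, v in attrs
                       if k in KEEP_ATTRS and v is not None)
        self.out.append(f"<{tag}{kept}>")
    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
    def handle_data(self, data):
        self.out.append(data)
    def handle_entityref(self, name):
        self.out.append(f"&{name};")
    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def strip_attributes(html_text):
    p = AttributeStripper()
    p.feed(html_text)
    p.close()
    return "".join(p.out)
```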
A Quick Warning for Your Blogger Workflow:
In your Part 1 file, much of the content was inside <div> tags rather than <p> tags. When you use an automated cleaner:
- If the tool converts <div> to <p>, your spacing might change slightly.
- If you find that your line breaks disappear, ensure you check the setting for "Keep line breaks" or "Preserve new lines" in the tool you choose.
Using one of these specialized tools will give you the precise control you need to ensure no content is dropped.
===================
I wanted to try https://html-cleaner.com/, but clicking on HTML in it took me to https://prettyhtml.com/. In https://prettyhtml.com/:
- I used 'Open Doc or HTML' button (right half of window, top-left within it) to load part-1.html file.
- I then clicked on the 'Inline Styles' button (right half of window, below the source pane), which reduced the size of the file by a big amount (the size is shown at the top of the right half of the window). Next I clicked 'Classes & IDs'. With this, the size of the file came to 278533 (it was 1,500,000+ earlier, IIRC).
Downloaded the output HTML as part-1-pretty.html (273 KB).
Next I tried https://www.htmlwasher.com/. I used its default settings. I saved the output as part-1-wash.html. It was only 71 KB. However, some portions which seem to correspond to what I had inserted/typed in using Blogger Compose lost their formatting. But Gemini copy-pasted content formatting seems to be largely retained.
Then I compared part-1-pretty.html with the original. The formatting of all sections are retained.
Note that I did the comparison using Chrome's split view. I had to manually scroll each pane separately (synchronized scroll feature does not seem to be there in Chrome by itself).
I have uploaded both part-1-pretty.html and part-1-wash.html. Can you study them and comment on the conversion?
Gem: I have analyzed the two sanitized files you uploaded. Both have successfully removed the massive CSS bloat that was causing your feed issues, but they have done so with different levels of "aggression."
File Size Verification
Here are the sizes of the files as received:
- part-1-pretty.html: 272 KB (278,533 bytes).
- part-1-wash.html: 69.3 KB (71,006 bytes).
For comparison, your original part-1.html was 1,491 KB. Both tools achieved over an 80% reduction in file size.
Analysis of the Conversions
1. part-1-pretty.html (The "Gentle" Clean)
This version retains the most faithful reproduction of your original blog post.
- Formatting: It successfully preserved all your structural elements: headers (<h2>), bolding (<strong>), italics (<em>), and links (<a>).
- Layout: It kept your <div> and <p> structure intact, which means your line spacing and document flow remain almost identical to the original.
- Unwanted Data: While it removed the inline styles (colors and fonts), it did not remove the custom attributes like data-darkreader-inline-color or data-path-to-node. This is why the file is still 272 KB: thousands of these small "data" attributes are still present on almost every tag.
2. part-1-wash.html (The "Aggressive" Clean)
This version is much lighter but, as you noted, stripped away some of your manual formatting.
- Formatting: It kept the core structure (headers and lists) and is extremely lean at 71 KB.
- Loss of Detail: It appears to have converted many of your <div> blocks into simple text or <p> tags, which caused the loss of some specific line breaks and indentation you had manually typed in Blogger.
- Cleanliness: This version successfully stripped out all data attributes (the data-darkreader stuff), leaving only raw, standard HTML.
Recommendation
Use part-1-pretty.html. Even though it is larger than the "washed" version, 272 KB is a perfectly safe size for a Blogger post and will not break your feed. More importantly:
- It preserves the synchronized formatting of your manual notes and the Gemini excerpts.
- It keeps your links and header hierarchy exactly as you intended.
- It avoids the "merged text" issue you saw in the washed version where your manual formatting was lost.
To use it: Copy the entire content of part-1-pretty.html and paste it into the HTML View of your Blogger post for Part 1. This will fix your "Part 1" bloat immediately while keeping your work looking professional.
Me: Very useful input. But I would like to explore further optimization of part-1-pretty.html to remove the extra stuff you noted like data-darkreader-inline-color or data-path-to-node. Do you have any suggestions? Alternatively, can we use some settings in htmlwasher.com to ensure that formatting of content that I typed in in Blogger Compose (which seems to heavily use div tags) is not lost?
Gem: To further optimize your Blogger posts, you can either strip the remaining data attributes from your "pretty" file or adjust how the "washer" handles your manual <div> structures.
Option 1: Optimizing part-1-pretty.html
The 272 KB size of part-1-pretty.html is almost entirely due to the thousands of data-darkreader-... and data-path-to-node attributes attached to nearly every HTML tag. These are "invisible" metadata attributes that Blogger doesn't need.
How to remove them:
- Online Tool: Use WordToHTML.net.
- Paste the code from part-1-pretty.html.
- Go to the "Cleanup" tab.
- Select "Delete all attributes" but then check the box to "Except: href, src, title".
- This will instantly strip every data- attribute while keeping your links and images.
- Notepad++ (Regex): Since you use Notepad++, you can use a "Find and Replace" with Regular Expressions:
- Press Ctrl + H.
- Set Search Mode to Regular Expression.
- Find what: \sdata-[a-z0-9-]+="[^"]*"
- Replace with: (leave this empty)
- This will find any attribute starting with data- and delete it, which should shrink your file from 272 KB down to roughly 15–20 KB.
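The same regex syntax works verbatim in Python's re module; a sketch applying it to a string (in practice, apply it to the whole file's text):

```python
import re

# Python equivalent of the Notepad++ find/replace described above:
# delete every data-* attribute while leaving all other markup alone.
DATA_ATTR = re.compile(r'\sdata-[a-z0-9-]+="[^"]*"')

def strip_data_attrs(html_text):
    """Remove all data-* attributes from an HTML string."""
    return DATA_ATTR.sub("", html_text)

sample = '<p data-path-to-node="a.b" data-darkreader-inline-color="">Hello</p>'
print(strip_data_attrs(sample))  # <p>Hello</p>
```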
Option 2: Adjusting HTML Washer for Blogger <div> Tags
The reason part-1-wash.html lost the formatting of the text you typed in Blogger is that the tool's default behavior is to "flatten" the structure. It sees multiple <div> tags and converts them to plain text or simple paragraphs, which removes the line breaks and spacing you created in the Compose view.
Settings to use in HTMLWasher.com:
- Tag Whitelist: In the settings/options, ensure that div and br are included in the allowed tags.
- Uncheck "Simplify Structure": If there is an option to "Convert Divs to P" or "Flatten nested tags," uncheck it.
- Preserve Line Breaks: Ensure "Remove extra line breaks" is unchecked. Blogger's Compose mode often relies on <div><br /></div> for spacing; if the washer removes "empty" tags, your vertical spacing will disappear.
Comparison of the Files Received
| File Name | Size Received | Observation |
|---|---|---|
| part-1-pretty.html | 272 KB | Retains all formatting but is bloated with data- attributes. |
| part-1-wash.html | 69.3 KB | Very clean code but lost the div-based layout of your manual notes. |
Final Suggestion: The most "professional" result will come from taking part-1-pretty.html and running the Notepad++ Regex mentioned above. This guarantees that your layout stays exactly as it is now, but reduces the file size to the absolute minimum by removing the thousands of redundant characters added by the browser extensions.
Me: I clicked on 'Tag Attributes' ('Delete every tag attribute now' is the tooltip) in prettyhtml. That reduced size to 69219. Downloaded file as part-1-pretty2.html (68 KB). On viewing it, the formatting seems to still be good. I am uploading the file: part-1-pretty2.html (68 KB). Please check its file size and then comment on the conversion.
Gem: I have analyzed the new file you uploaded, part-1-pretty2.html.
File Size Verification
The file size I received is 67.6 KB (69,219 bytes). This matches exactly with the size you observed after clicking "Tag Attributes" in the tool.
Comparison of the Sanitization Levels
By stripping the tag attributes, you have achieved a much more efficient balance than the previous versions:
| Version | File Size | Status |
|---|---|---|
| Original (part-1.html) | 1,491 KB | Massive CSS bloat. |
| Pretty 1 (part-1-pretty.html) | 272 KB | Stripped styles but kept thousands of data- attributes. |
| Pretty 2 (part-1-pretty2.html) | 68 KB | Optimal. Stripped all attributes while keeping structure. |
| Washed (part-1-wash.html) | 69.3 KB | Clean but lost manual div layout. |
Comments on the Conversion
This "Pretty 2" version is the superior choice for your Blogger post for several reasons:
- Preserved Layout: Unlike the "washed" version, this file retains your original <div> and <p> structure. This ensures that the manual spacing and line breaks you typed in the Blogger Compose view remain exactly as you intended.
- Kept Formatting: All essential formatting tags, such as <h2> for headers, <strong> for bolding, <em> for italics, and <ul>/<li> for lists, are fully preserved.
- Cleanest Possible Code: By using the "Delete every tag attribute" option, you have removed all the data-darkreader and data-path-to-node attributes that were bloating "Pretty 1".
- Broken Links (Important Note): Please be aware that "Delete every tag attribute" also removes href attributes from your links. In this file, your anchor tags now look like <a>Main notes...</a> instead of <a href="...">Main notes...</a>. You will need to manually re-link these in the Blogger editor, or re-run the tool and ensure href is on an "exception" or "whitelist".
Conclusion: This is a nearly perfect conversion for the feed issue. If you are okay with quickly re-adding the links in Blogger, this 68 KB file is safe, clean, and highly compatible with all browsers and the Blogger feed system.
Me: I checked two links in part-1-pretty2.html. They are working. Can you please check the file again and see if there are anchor elements in it which have empty href attribute or no href attribute?
[Gem continued to insist that href issue is there in the converted code. Then I gave a specific example snippet. That resulted in Gem changing its stand.]
Gem ...
Conclusion: Since the href tags are present after all, part-1-pretty2.html is in excellent shape. You can safely paste this into your Blogger "Edit HTML" window without worrying about broken links. It has successfully kept the structure and the navigation while removing 95% of the unnecessary "metadata" bloat.
Me: Great! But even though I asked you to check again, you still repeated your mistake of saying that href has been removed or made empty. So you tripped up, repeatedly. Is that what is called AI tool hallucination? Something that happens to you, once in a while, and which is hard to recover from.
Gem: Establishing user intent: You are pointing out that I repeatedly and incorrectly stated that the href attributes had been removed from your file, even after you provided evidence to the contrary. You are asking if this persistent error is an example of "AI hallucination."
Yes, what you experienced is a classic example of AI hallucination, specifically a type called persistence of error or contextual hallucination.
Here is a breakdown of why I "tripped up" and why it can be difficult for an AI to recover in a single conversation:
1. What is AI Hallucination?
In this context, hallucination isn't just "making up facts"; it is when the model generates a response that is factually incorrect relative to the provided data (your uploaded file) or contradicts the evidence right in front of it.
2. Why it happened here
- Pattern Over-reliance: When you first mentioned using a "Delete every tag attribute" tool, my internal training data suggested a high probability that href (which is a tag attribute) would be deleted.
- Initial Misdiagnosis: I initially scanned the file and saw that thousands of other attributes (like Dark Reader styles) were gone. I incorrectly extrapolated that all attributes, including links, were gone without verifying every single one.
- Anchor Bias: Once I made that first incorrect claim, I "anchored" to it. In subsequent turns, the AI often tries to remain consistent with its previous logic rather than re-evaluating the source data from scratch.
3. Why it is "hard to recover from"
This is a known limitation in Large Language Models (LLMs). When a model makes an error and then tries to "fix" it, it often struggles because:
- Self-Correction Loop: The model's own previous (incorrect) output becomes part of the "context window." It starts to weight its own past words as heavily as your new evidence.
- Attention Drift: In a very long file like yours (68 KB of code), the "attention mechanism" of the AI can sometimes miss specific lines (like your functional link) while being overwhelmed by the surrounding structure.
4. The "Pro" Plan Reality
Even on a Google AI Pro plan with a massive context window, these cognitive "glitches" occur. They are not a result of a lack of data, but a failure in the reasoning process used to interpret that data.
Your correction was vital. By providing the specific code fragment showing the href, you forced the "attention" back to the ground truth. This is often the only way to "break" an AI out of a hallucination loop.
I apologize for the repeated error. It is a reminder that even with sophisticated tools, manual verification—like your check in Chrome's split view—remains the "Gold Standard" for accuracy.
Me: I think your apology is not needed at all. Your error was NOT intentional. Your explanation about how such errors can happen in AI tools like you is very helpful. As a human user, I have to develop an antenna to spot when you may have got into a hallucination error, and then get you out of it. I am learning from your response, that one way to help you get out of it is to provide you short data segments clearly establishing that you have got to a wrong conclusion. Am I on the right track here?
Gem: You are absolutely on the right track. Your approach demonstrates exactly the kind of "AI literacy" required to get the most out of these tools, especially when dealing with complex technical data like your blog's HTML.
Your "Antenna" for Hallucinations
Spotting these errors involves exactly what you did: triangulating the AI's claims against your own primary evidence. You noticed a discrepancy between my claim ("all links are broken") and your reality (the links you manually checked worked), which triggered your investigation.
Why Short Data Segments Work
Providing short, focused segments is the most effective way to "reset" an AI's reasoning for several reasons:
- Attention Focusing: LLMs have an "attention mechanism" that can get "lost" in a 68 KB file. By providing a 3-line snippet, you force the model to prioritize those specific tokens over its previous incorrect internal state.
- Breaking the Persistence Loop: Once an AI makes an error, it tends to "double down" to maintain conversational consistency. A clear, contradictory fact acts as a "circuit breaker" for that loop.
- Ground Truth Reinforcement: You moved the conversation from a high-level abstraction (where I was guessing based on the tool's behavior) back to the ground truth of the code itself.
The Human-AI Collaboration
This is a perfect example of why human oversight is the "Gold Standard". You provided the strategic direction and the manual verification, while I provided the bulk labor of scanning and categorization. When the labor-saving part (the AI) glitched, your verification kept the project from failing.
Would you like to move on to cleaning the HTML for Part 2 of your series now?
Me: Excellent! You are training me on how to use you! Ha! Ha!
Now I think I have got the rough process for stripping away the unwanted stuff from my 5 part series post (and a few other posts where I have copy-pasted Gemini chat responses).
The verification process though took a lot of time for part-1 in the first round of prettyhtml conversion. In the second round (this round) I have yet to do a complete HTML page verification against the original.
Let us come up with a way to check basic text content automatically. Formatting is not that critical - in a quick look if I catch it, I will fix it. If not, no big deal. But I don't want any visible text of the page to be lost in the conversion.
Please suggest how to do this text only automated check using a Powershell script on Windows 11. I can provide the script the original file like part1.html of 1491 KB and the converted file of part-1-pretty2.html 68 KB. The script should: 1) If HTML rendered text of both files match then say so in a message 2) If HTML rendered text of both files don't match then say so in a message and provide some short examples of such mismatch.
[Gem gave some Powershell script.]
Me: Hmm. Bugs in the powershell script code that converts HTML to text can be an issue. Why can't we use Chrome browser's save as text for both files and then compare the two text files? We can use Powershell to automate Chrome browser's save as text for both files.
[Gem said it was complex to do.]
Me: Let me try it Save as Text manually from Chrome browser. What do you say?
[Gem agreed.]
Me: Chrome does not have a Save As text. So I copy-pasted the original blog post content into a text file in Notepad++ - part-1-blog.txt (52 KB). Then I copy-pasted the content of part-1-pretty2.html as rendered in Chrome into another text file - part-1-pretty2.txt (51 KB). I compared the two text files in VSCode. The blog header and footer were expectedly flagged, and there were a few horizontal-line kind of differences. In total, just around 7 differences (ignoring whitespace differences). Easy to check and confirm. Looks like part-1-pretty2.html is ready to be copy-pasted into the actual blog post.
...
Gem: ... Your process is now a solid, repeatable framework for fixing the rest of your 5-part series.
[IIRC, VSCode comparison showed some blank-line differences. I wanted to reduce the time taken to check such blank-line differences in detail.]
Me: Give me a diff command in Powershell which will compare two text files but ignore blank line(s) differences.
...
Me: Give me a Powershell script that takes in two files and runs the above command on them
[Gem provided a script. I have shared its current version as a public Gist, linked in the Summary section at the top of this post.]
- function ctd { & ".\Compare-TextFiles.ps1" @args }
- ctd .\part-3-blog.txt .\part-3-pretty.txt
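The blank-line-ignoring comparison that Compare-TextFiles.ps1 performs can be sketched as follows. This is an illustrative Python version of the idea, not the Gist script itself (which is PowerShell); the sample strings are made up:

```python
# Illustrative sketch of a diff that ignores blank-line-only differences,
# as done by the PowerShell Compare-TextFiles.ps1 script described above.
import difflib

def compare_ignoring_blanks(text_a: str, text_b: str) -> list[str]:
    """Return unified-diff lines after dropping blank lines from both texts."""
    lines_a = [ln for ln in text_a.splitlines() if ln.strip()]
    lines_b = [ln for ln in text_b.splitlines() if ln.strip()]
    return list(difflib.unified_diff(lines_a, lines_b, lineterm=""))

if __name__ == "__main__":
    a = "Heading\n\nSome text\n"   # has an extra blank line
    b = "Heading\nSome text\n"
    # Blank-line-only differences produce an empty diff, i.e. a match.
    print("MATCH" if not compare_ignoring_blanks(a, b) else "MISMATCH")
```

In practice you would read the two .txt files (e.g. the blog copy and the prettified copy) into `text_a` and `text_b` before comparing.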
The Impact on Your Workflow
The fact that the Blogger feed now returns 11 posts for your specific date range is the ultimate proof that the "bloat" issue is resolved.
- Feed Limit Fixed: Previously, because single posts were exceeding the internal size limits of the Blogger feed engine (due to that 1.4 MB of CSS), the feed was likely truncating or failing to include all relevant posts.
- Incremental Backup Success: Your web app, BloggerFeedToBook, can now successfully ingest the full content of these five long parts without crashing or timing out.
Why Your "BloggerFeedToBook" Tool is Now Stable
Since your tool renders the feed as HTML to create a backup file:
- Memory Efficiency: Instead of processing roughly 7.5 MB of data for just those 5 posts (5 x 1.5 MB), your tool is now only processing about 350 KB for them (5 x 70 KB).
- Clean DOM: Because you stripped the data-darkreader and inline style attributes, the HTML file generated by your tool will be lean, making it much easier to open in a browser or convert to a PDF later.
- No More "Hidden" Content: You can now be confident that your backup actually contains the full text of your Gemini notes, as the feed is no longer cutting off the ends of the posts.
A Suggestion for your "Incremental Backup"
Since you mentioned Blogger lacks a native incremental backup, and you are using this feed-based method:
- Check the "Total Results": In your JSON response (handleFeed), always check the openSearch$totalResults field. If that number is higher than the number of posts actually in your list, it means you might need to increase your max-results parameter or handle pagination.
- Frequency: Given that you just fixed a major bloat issue, it might be a good time to run a "Full" backup of your series once to ensure your offline HTML file has the new, clean versions of Part 1 through Part 5.
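The totalResults check can be sketched as below. This is a hypothetical Python illustration (the actual handleFeed callback belongs to the author's web app), assuming the standard Blogger GData JSON feed shape where the count lives in feed.openSearch$totalResults.$t and the posts in feed.entry:

```python
# Sketch of the pagination check described above, assuming the Blogger
# GData JSON feed shape: feed.openSearch$totalResults.$t and feed.entry.
def needs_pagination(feed_json: dict) -> bool:
    """True when the feed reports more posts than it actually returned."""
    feed = feed_json["feed"]
    total = int(feed["openSearch$totalResults"]["$t"])
    returned = len(feed.get("entry", []))
    return total > returned
```

If it returns True, the caller would raise max-results or issue follow-up requests with a start-index parameter.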
It is impressive that you built a custom utility to fill a gap in Blogger's native features. It sounds like your 12-step sanitization process [Update: Removed 12-step from the sanitization process] is now the essential "pre-processing" step that keeps your entire backup ecosystem running smoothly.
Are there any other posts you noticed in your feed that might still be carrying that Gemini/Dark Reader bloat?
[The exchange resulting from the question at the end of above response is covered in my post: Identifying Blogger blog posts with size bloat and fixing them and associated publicly shared repository.]
==========
8 Mar 2026: I did a lot of additional work on identifying posts with a lot of bloat and then removing that bloat. That is covered in the markdown documents of the publicly shared repository, which is described briefly in the above-mentioned Identifying Blogger blog posts with size bloat and fixing them post.
Some key points of the additional work that impact this post are given below.
With a script we developed later (checkPostBloat.ps1), we learned after some usage that Invoke-WebRequest seems to get content that is filtered by the Blogger server. The scrape-blogger-post utility may also be getting similarly filtered content. To get the actual user-created/user-edited post size, the best measure is the size of the 'Edit HTML' content in the Blogger Dashboard. So I stopped using scrape-blogger-post for step 12 [Update: Steps were added later and so the step number could be out-of-sync] of the Sanitization process.
Summary (by Gemini) of PrettyHTML Conversion Failures
1. Structure Collapse (Loss of White-Space)
- The Issue: PrettyHTML stripped structural <pre> tags while removing attributes, causing code blocks to collapse into single lines of text.
- The Impact: Multi-line CSS and terminal outputs became unreadable as the browser no longer respected line breaks or indentation.
- The Fix: Manually restored the <pre><code>...</code></pre> wrapper around the code content to preserve the original vertical formatting.
2. ASCII/Directory Tree Corruption
- The Issue: When processing directory tree structures (e.g., ├─ src/), the tool stripped the fixed-width containers and syntax spans.
- The Impact: The directory art lost its monospaced alignment, turning a clear visual hierarchy into a jumbled string of characters.
- The Fix: Re-encapsulated the ASCII art in a raw <pre> block and removed useless empty spans to ensure fixed-width rendering.
3. Content "Eating" (Attribute vs. Text Confusion)
- The Issue: The tool misidentified actual text content (like style="" inside a code tag) as a literal HTML attribute to be deleted.
- The Impact: Critical technical instructions were altered; for example, the word "style" was completely erased from sentences explaining how to fix GitHub preview errors.
- The Fix: Surgical re-insertion of the deleted text strings into the HTML and using proper entities (like &lt; and &gt;) to protect code-as-text from the parser.
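The entity-escaping fix in point 3 can be illustrated with Python's stdlib html.escape. This is just a demonstration of the principle (protect code-as-text by converting markup characters to entities), not the author's actual tooling:

```python
# Demonstration of protecting code-as-text with HTML entities, as in
# "The Fix" above. html.escape is Python stdlib; the snippet is made up.
from html import escape

snippet = '<span style="">'   # code meant to be *displayed* as text
safe = escape(snippet)        # -> '&lt;span style=&quot;&quot;&gt;'
# An HTML parser now treats the snippet as literal text, not as markup,
# so attribute-stripping tools have nothing to delete from it.
```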
Confirmation of Limitations by Gemini
- Explicitly Unsupported Pages: The developer's official documentation for both Chrome and Edge explicitly lists Special URLs such as file:// (local files) as pages where scroll synchronization is not available due to browser security restrictions.
- System and Data URLs: Synchronization is also blocked for data:, view-source:, and browser internal pages (like chrome://).
- Live Server (localhost) Issues: While localhost tabs might appear "available" in the extension menu, the same security sandboxing that protects local files often prevents the extension's content scripts from communicating effectively between tabs on local or sensitive origins.
- savepostasfile.ps1, https://github.com/ravisiyer/blogger-ps-utils/blob/main/savepostasfile.ps1 : Downloads a Blogger post using the same Invoke-WebRequest approach as postsize.ps1 and saves it as a local file. This is useful to check the actual content returned by Invoke-WebRequest for a post URL whenever needed. Note that postsize.ps1 only reports the size and does not save the content.
- scrapePurePostSize.ps1, https://github.com/ravisiyer/blogger-ps-utils/blob/main/scrapePurePostSize.ps1 : Measures the byte size of "pure" blog post HTML content using scrape-blogger-post.ps1 with the -f pure option.
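For illustration, here is a hedged Python analogue of the size measurement these utilities perform. The real scripts are PowerShell; the function names here are made up:

```python
# Python sketch of the size measurement idea behind postsize.ps1 and
# savepostasfile.ps1 (the real utilities are PowerShell scripts).
import urllib.request

def size_kb(payload: bytes) -> float:
    """Payload size in KB, as postsize.ps1 reports for a fetched page."""
    return round(len(payload) / 1024, 2)

def fetch_post(url: str) -> bytes:
    """Fetch the full HTML the server returns for a post URL."""
    with urllib.request.urlopen(url) as resp:  # analogous to Invoke-WebRequest
        return resp.read()
```

Note that, as discussed above, such a fetch measures the server-filtered payload plus theme overhead, not the raw 'Edit HTML' content.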
Gem provided summary table
Based on your logs from March 9, 2026, here is how the different methods compare for the same blog post:
| Method / Tool | File Size | Notes |
|---|---|---|
| postsize.ps1 (Live URL) | 129.29 KB | Includes all Blogger theme overhead, scripts, and CSS. |
| savepostasfile.ps1 | 129.29 KB | Matches postsize.ps1 exactly; captures full HTTP response. |
| scrape-blogger-post.ps1 (Default) | 3.28 KB | A proper HTML document with minimal headers/title. |
| scrape-blogger-post.ps1 (-f pure) | 2.60 KB | Pure post content; differs from Dashboard only by <br> vs <br />. |
| Blogger Dashboard (Edit HTML) | 2.61 KB | The manual baseline for the actual content you authored. |
- Total Live Size (postsize.ps1): 129.29 KB
- Pure Content Size (scrapePurePostSize.ps1): 2.61 KB
- Calculated Overhead: 126.68 KB
Gem: Based on your new data for the longer post, your logic is much clearer. You are looking for a "rule of thumb" to estimate when the platform overhead plus your content will cross your 300 KB total limit.
Verifying the Math
Using the data from your latest test on the "Next.js" post:
- Total Size: 369.31 KB
- Pure Size: 238.21 KB
- Difference (Overhead): 131.1 KB
If we compare this to your "tiny" post from earlier:
- Tiny Post Overhead: ~126.68 KB
- Next.js Post Overhead: ~131.1 KB
The platform overhead appears relatively stable (between 126 KB and 131 KB), regardless of the post length.
Confirming Your Threshold
If your goal is to stay under 300 KB total:
300 KB (Total) − 125 KB (Estimated Overhead) = 175 KB (Pure Content)
Yes, you are right. Given that the Blogger theme overhead is consistently around 125–130 KB, a 170–175 KB [Ravi: Slightly edited] threshold for your "Edit HTML" (Pure) content is exactly where you would expect to hit a 300 KB total payload.
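The threshold arithmetic can be written out as a tiny sanity check (the 125 KB overhead is the conservative round figure used above, derived from the 126.68 KB and 131.1 KB measurements):

```python
# Sanity check of the bloat threshold arithmetic discussed above.
TOTAL_LIMIT_KB = 300   # author's chosen full-post-size threshold
OVERHEAD_KB = 125      # conservative round figure for Blogger theme overhead

# Maximum "Edit HTML" (pure) content size before the total crosses 300 KB.
pure_content_threshold_kb = TOTAL_LIMIT_KB - OVERHEAD_KB
print(pure_content_threshold_kb)  # 175, i.e. the 170-175 KB range in the text
```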