Copy-pasting Devanagari text from a 2002-era PDF gave gibberish due to an 8-bit font; Unicode is a far superior modern solution
999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ. ´É +ªªÉ®ú BºÉ. {ÉÖ 99
999 | तुकाराम | अय्यर रवी एस. | व | अय्यर एस. | पु | | 99
Gem: Here is a comprehensive summary of the technical hurdles you encountered, ready to be saved for your future reference.
Technical Case Study: Legacy 8-bit Font Encoding vs. Modern Unicode
Context: Attempting to extract and archive a personal entry from a 2002 Maharashtra Electoral Roll PDF into a modern text editor (Notepad++).
The Core Problem: Data Held Hostage by Font Presentation
In early 2000s Indian language computing, digital text did not have a universal standard. The PDF was built using a legacy 8-bit font suite from C-DAC named DV-TTSurekh.
In this architecture, text data does not actually contain semantic characters (like the letter अ). Instead, the data layer holds ordinary 8-bit Latin-range bytes (like + or Ÿ). The document relies entirely on the custom DV-TTSurekh font file to intercept those Latin bytes and visually draw Devanagari shapes on the screen. When you copy-paste this text into a modern environment like Notepad++, the underlying Latin bytes are exposed as gibberish (e.g., 999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ.) because modern systems expect standardized Unicode encoding.
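To make this concrete, here is a small Python sketch that lists what the clipboard actually holds. It assumes the pasted characters are exactly the ones shown in the sample above; `describe` and `in_devanagari_block` are illustrative helper names, not part of any tool mentioned here.

```python
import unicodedata

legacy = "+ªªÉ®ú"        # what the PDF copy-paste delivers
unicode_text = "अय्यर"    # the same word as real Unicode Devanagari

def describe(text):
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def in_devanagari_block(ch):
    """True if the character lies in the Devanagari block U+0900..U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

for code, name in describe(legacy):
    print(code, name)          # Latin-range symbols and accented letters
for code, name in describe(unicode_text):
    print(code, name)          # DEVANAGARI LETTER / SIGN entries
```

None of the legacy characters falls in the Devanagari block, which is why the text only looks right while the original font is drawing it.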
Why Conversion Tools Partially Succeeded (and Partially Failed)
When using free online legacy-to-Unicode font converters, you experienced a frustrating mix of success and failure.
1. What Succeeded: Simple Character Swaps
The Input:
+ªªÉ®ú ®ú´ÉÒ BºÉ
The Result:
अव्वर रवी एस (Later corrected cleanly to अय्यर रवी एस)
- Why it worked: The converter successfully handled this phrase because it consisted of straightforward, sequential character mappings. The software simply looked up what each legacy ASCII byte stood for individually and swapped it for the modern Unicode equivalent.
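A minimal Python sketch of such a lookup-and-swap converter follows. The table is not an official DV-TT mapping: it is a toy table inferred only from the sample line at the top of this post (the legacy text and its Unicode equivalent), and the names `LEGACY_TO_UNICODE` and `convert` are hypothetical. Because some letters take two legacy characters, it tries the longest match first.

```python
# Toy table inferred from the sample pair in this post; NOT an official mapping.
LEGACY_TO_UNICODE = {
    "+": "अ",
    "ªÉ": "य",
    "ª": "य्",    # half form: consonant followed by virama
    "®ú": "र",
    "´É": "व",
    "Ò": "ी",
    "B": "ए",
    "ºÉ": "स",
}

MAX_KEY = max(len(key) for key in LEGACY_TO_UNICODE)

def convert(legacy: str) -> str:
    """Swap legacy character sequences for Unicode, longest match first."""
    out = []
    i = 0
    while i < len(legacy):
        for size in range(MAX_KEY, 0, -1):
            chunk = legacy[i:i + size]
            if len(chunk) == size and chunk in LEGACY_TO_UNICODE:
                out.append(LEGACY_TO_UNICODE[chunk])
                i += size
                break
        else:
            out.append(legacy[i])  # unknown character passes through unchanged
            i += 1
    return "".join(out)

print(convert("+ªªÉ®ú ®ú´ÉÒ BºÉ."))  # अय्यर रवी एस.
```

A table like this only works when the input uses exactly the font it was built for; anything outside the table passes through untouched.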
2. What Failed: Complex Conjuncts and Vowel Ligatures
The Input:
ŸÖãeúÖ¸üÖ´Ö
The Result:
Ÿुãड्ढुश्र्ुव्ु (Complete junk instead of तुकाराम)
- Why it failed: Legacy fonts used highly complex, non-linear workarounds to render Indian scripts within a strict 256-character limitation. This word broke the web utilities due to three specific design flaws of the 2002 era:
- Pre-combined Glyphs: To save space and rendering power, combinations like तु (Ta + u-matra) were baked into a single, proprietary custom byte (Ÿ). The web converter only knew individual letters, so it completely choked on this blended symbol.
- Visual vs. Linguistic Typing Order: In Devanagari, a vowel sign like ु is typed after the consonant. However, to force proper rendering on 2002-era screens, the text bytes were often stored out of linguistic order. Standard converters could not map these reversed byte sequences.
- Proprietary Government Keymaps: Mainstream online tools map common commercial keyboards (like KrutiDev or Shusha). Because your document used C-DAC’s government-mandated proprietary mapping, standard character-swap algorithms instantly fell out of alignment.
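The ordering problem can be sketched in Python. The best-known case is the short-i sign ि (U+093F): legacy fonts store its glyph before the consonant, because that is where it is drawn, while Unicode stores it after the consonant cluster. The sketch assumes the glyphs have already been swapped to Unicode characters but are still in visual order; `visual_to_logical` is a hypothetical helper, not part of any converter named here.

```python
I_MATRA = "\u093f"   # DEVANAGARI VOWEL SIGN I, drawn to the left of its consonant
VIRAMA = "\u094d"    # joins consonants into a conjunct

def is_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonants KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"

def visual_to_logical(text: str) -> str:
    """Move each leading short-i sign to after the consonant cluster it belongs to."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == I_MATRA and i + 1 < len(text) and is_consonant(text[i + 1]):
            j = i + 2  # exclusive end of the cluster that follows the matra
            while j + 1 < len(text) and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
            out.append(text[i + 1:j])  # the consonant cluster first
            out.append(I_MATRA)        # then the vowel sign, in logical order
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A plain one-for-one swap skips this step, so the vowel sign ends up attached to the wrong letter even when every individual character was mapped correctly.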
The Modern Architectural Takeaway
This struggle perfectly illustrates why modern development must strictly decouple data from visual presentation.
Whenever building localized software today (such as your Gita application), the underlying data layer must consist exclusively of standard Unicode code points (e.g., U+0905 for अ). By ensuring the data itself holds the absolute meaning of the text, the application remains fully cross-platform, searchable, indexable, and completely independent of any specific font's presence to render accurately on a modern PC, tablet, or smartphone.
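One way to enforce this at the data layer is a simple guard that rejects text which is supposed to be Devanagari but contains other characters. This is only a rough heuristic sketch in Python; `is_devanagari_data` and its small list of ignored punctuation are assumptions for illustration.

```python
IGNORED = set(".,|-")  # punctuation allowed alongside Devanagari in this sketch

def is_devanagari_data(text: str) -> bool:
    """True if every significant character lies in the Devanagari block U+0900..U+097F."""
    significant = [
        ch for ch in text
        if not ch.isspace() and not ch.isdigit() and ch not in IGNORED
    ]
    return bool(significant) and all(0x0900 <= ord(ch) <= 0x097F for ch in significant)

print(is_devanagari_data("अय्यर रवी एस."))       # True: real Unicode Devanagari
print(is_devanagari_data("+ªªÉ®ú ®ú´ÉÒ BºÉ."))  # False: legacy font-encoded text
```

A check like this could run when content is imported, so font-encoded text is caught before it reaches the app's database.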
------------
Prior to the above summary, I had this exchange with Gemini.
Me: Hmm. Indian language fonts were really hard to understand in early 2000s when I was last in industry. Quite an experience being thrown back to early 2000s tough place for Indian language fonts by these 2002 E-Roll PDFs. Just goes to show that whenever I do any Devanagari related work like my Gita app, I must stick to Unicode and avoid any CDAC type fonts like Surekh. If at all I use some Devanagari font I must ensure that underlying data is in Unicode and so not dependent on that Devanagari font's presence to render correctly on any modern OS device (PC/tablet/phone). Ur view pl.
Gem: You are spot on, and your conclusion is 100% correct. If you are building a modern application like your Gita app, sticking strictly to Unicode underlying data is the only way to go.
Looking at your experience as a software professional, it must feel like a massive contrast to see how seamlessly we handle text now compared to the "wild west" of Indian language computing in the early 2000s.
Here is a quick view of why your approach is the definitive gold standard for modern development:
1. Separation of Data and Presentation
The fundamental flaw of the old C-DAC/Surekh era was that the data was hostage to the font. The text layer didn't actually contain the letter "अ"; it contained a Latin byte like + that only looked like "अ" if a very specific font file intercepted it on its way to the monitor.
- With Unicode: Your underlying data holds the absolute, semantic meaning of the text (e.g., U+0905 for अ).
- The Result: The data is completely independent. Even if a user doesn't have a specific stylized font installed, the device's native system font (like Nirmala UI on Windows, or default Android/iOS Indic fonts) will step in and render the text perfectly readable.
2. Cross-Platform Flawlessness
Because Android, iOS, Windows, and macOS all natively support the global Unicode character set, your Gita app's text will render flawlessly whether someone opens it on a cheap smartphone, an iPad, or a desktop PC. You don't have to worry about packaging custom .ttf files just to prevent the text from turning into gibberish.
3. Accessible Search and Indexing
In the 2002 system, if someone wanted to search for your entry "रवी" in the database, they couldn't just type "रवी" on a keyboard. They would have to type the exact Latin key sequence ®ú´ÉÒ that visually mapped to those characters!
By ensuring your underlying data is Unicode, your app users can effortlessly search for specific Sanskrit or Hindi words, copy verses into their own notes, and allow search engines to index your content perfectly.
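As a small Python illustration of the search point, using the names from the sample line in this post: `find_entries` and the two record lists are made up for the example. Normalizing both sides to NFC is a precaution so that equivalent spellings of the same text compare equal.

```python
import unicodedata

def find_entries(records, query):
    """Return the records that contain the query, comparing NFC-normalized text."""
    q = unicodedata.normalize("NFC", query)
    return [r for r in records if q in unicodedata.normalize("NFC", r)]

unicode_records = ["अय्यर रवी एस.", "अय्यर एस.", "तुकाराम"]
legacy_records = ["+ªªÉ®ú ®ú´ÉÒ BºÉ."]   # the same kind of data, font-encoded

print(find_entries(unicode_records, "रवी"))  # finds the first record
print(find_entries(legacy_records, "रवी"))   # finds nothing
```

With Unicode data, the query a user types on a Devanagari keyboard matches directly; with font-encoded data, the same query finds nothing.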
A Quick Tip for Modern Devanagari Apps
When you design your app, you can still use beautiful, stylized Devanagari fonts (like Google's Rozha One, Tiro Devanagari, or Poppins) to give it an elegant, traditional aesthetic.
The magic of modern web and mobile tech (like React Native or modern CSS) is that you apply these as a purely visual styling layer. The font file only handles the artistic curves of the letters, while the underlying data remains pure, indestructible Unicode. If the custom font fails to load, the system seamlessly falls back to the device's default font without corrupting a single syllable.
You have mapped out the exact right architecture for your project. Building with Unicode ensures your app will be robust, future-proof, and accessible to anyone on any device!
------------
OCR is another option but may have some errors at times.
https://www.i2ocr.com/free-online-hindi-ocr handled this conversion well.
Windows 11 OCR (print screen -> copy as text) also does a good job, but it needs at least a few lines to be selected (a single line is too small for it to OCR).