Copy-pasting Devanagari text from a 2002-era PDF gave gibberish due to an 8-bit font; Unicode is the far superior modern solution

Recently, for the Election Commission of India's SIR (Special Intensive Revision) enrollment, I had to search its 2002 E-Roll PDFs. The search facility and the PDF outputs were quite impressive, given the huge scale of the data involved.

In this context, I ran into a problem when copy-pasting Devanagari text from a PDF, which rendered the Devanagari correctly, into a text file open in Notepad++.

Copy-pasting the key line for what seems to be my entry into Notepad++ gave the following (I have changed the English numerals to hide key data; the numerals themselves pasted correctly in Notepad++):

999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ. ´É +ªªÉ®ú BºÉ. {ÉÖ 99

The Gemini AI tool did a great job of giving me the right Devanagari text in Unicode, matching what the PDF rendered:

999 | तुकाराम | अय्यर रवी एस. | व | अय्यर एस. | पु | | 99

This led to an exchange with Gemini on why this issue was happening and how to convert the old data correctly without the Gemini AI tool's help.

After I used Foxit Reader to inspect the fonts in the PDF and learned that the DV-TTSurekh font was being used, I tried these tools: https://www.fontconverter.in/index.php?q=DV-TTSurekhEN-to-Unicode, https://www.hindietools.com/hindi-font-converter/dv-ttsurekh-normal-font-to-unicode/, https://www.fontconverter.online/hindi/unicode-cdac-ism-gist-font-converter. None of them produced proper output.

Before I knew that the DV-TTSurekh font was being used, I searched for well-known legacy-to-Unicode converters and tried some of the listed tools, such as https://www.krutidevunicodeconverter.com/ and https://www.fontconverter.in/index.php?q=Shivaji-to-Unicode. These did not produce proper output either.

At the end, I asked Gemini to provide me a summary...

Me: Now can you give me a summary (to save in a document/post) of the struggles I had to go through in trying to convert Devanagari text in the 2002 E-Roll PDF to readable Devanagari text in my text file edited with Notepad++. Pls mention how some words got converted but some didn't.

Gem: Here is a comprehensive summary of the technical hurdles you encountered, ready to be saved for your future reference.

Technical Case Study: Legacy 8-bit Font Encoding vs. Modern Unicode

Context: Attempting to extract and archive a personal entry from a 2002 Maharashtra Electoral Roll PDF into a modern text editor (Notepad++).

The Core Problem: Data Held Hostage by Font Presentation

In early 2000s Indian language computing, digital text did not have a universal standard. The PDF was built using a legacy 8-bit font suite from C-DAC named DV-TTSurekh.

In this architecture, text data does not actually contain semantic characters (like the letter अ). Instead, the data layer holds ordinary 8-bit Latin character codes (like + or Ÿ). The document relies entirely on the custom DV-TTSurekh font file to intercept those codes and visually draw Devanagari shapes on the screen. When you copy-paste this text into a modern environment like Notepad++, the underlying Latin characters are exposed as gibberish (e.g., 999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ.) because modern systems expect standardized Unicode encoding.
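
To make the point above concrete, here is a minimal Python sketch that lists what the clipboard actually holds for one pasted word next to the same word in real Unicode. The helper names describe and is_devanagari are made up for this illustration; the two sample strings are taken from the pasted line and the Gemini-corrected line earlier in this post.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in text]

def is_devanagari(ch):
    """True if ch lies in the Unicode Devanagari block (U+0900 to U+097F)."""
    return 0x0900 <= ord(ch) <= 0x097F

pasted = "+ªªÉ®ú"     # clipboard text for the word the PDF draws as अय्यर
expected = "अय्यर"    # the same word as real Unicode Devanagari

for label, text in (("pasted", pasted), ("unicode", expected)):
    print(label)
    for ch, code, name in describe(text):
        # Print code point and official name rather than the glyph itself,
        # so the listing does not depend on any font being installed.
        print(f"  {code}  {name}  devanagari={is_devanagari(ch)}")
```

The pasted word consists only of Latin-range symbols, while the Unicode word consists only of code points from the Devanagari block, which is why the first needs a special font to look right and the second does not.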

Why Conversion Tools Partially Succeeded (and Partially Failed)

When using free online legacy-to-Unicode font converters, you experienced a frustrating mix of success and failure.

1. What Succeeded: Simple Character Swaps

The Input: +ªªÉ®ú ®ú´ÉÒ BºÉ
The Result: अव्वर रवी एस (Later corrected cleanly to अय्यर रवी एस)

Why it worked: The converter successfully handled this phrase because it consisted of straightforward, sequential character mappings. The software simply looked up what each legacy ASCII byte stood for individually and swapped it for the modern Unicode equivalent.
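
A table-driven converter of this kind can be sketched in a few lines of Python. The mapping below is an assumption: it is inferred only from the single sample phrase in this post and is not the real DV-TTSurekh table. The names SAMPLE_MAP and convert are likewise made up for the illustration.

```python
# Hypothetical mapping, inferred only from the one sample phrase in this post.
# It is NOT the real DV-TTSurekh table and covers just these few glyphs.
SAMPLE_MAP = {
    "+": "अ",
    "ªÉ": "य",    # half ya plus the closing vertical bar gives full ya
    "ª": "य्",    # half ya on its own becomes ya + virama
    "®ú": "र",
    "´É": "व",
    "Ò": "ी",
    "B": "ए",
    "ºÉ": "स",
}

def convert(legacy, table=SAMPLE_MAP):
    """Replace legacy glyph codes with Unicode, trying longer keys first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(legacy):
        for key in keys:
            if legacy.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here: pass the character through unchanged.
            out.append(legacy[i])
            i += 1
    return "".join(out)

print(convert("+ªªÉ®ú ®ú´ÉÒ BºÉ."))
```

Trying longer keys first matters because a two-character sequence such as ªÉ must win over the single ª. Any character missing from the table simply passes through unchanged, so a converter built on an incomplete or mismatched table will leave gibberish behind.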

2. What Failed: Complex Conjuncts and Vowel Ligatures

The Input: ŸÖãeúÖ¸üÖ´Ö
The Result: Ÿुãड्ढुश्र्ुव्ु (Complete junk instead of तुकाराम)

Why it failed: Legacy fonts used highly complex, non-linear workarounds to render Indian scripts within a strict 256-character limitation. This word broke the web utilities due to three specific design flaws of the 2002 era:
Pre-combined Glyphs: To save space and rendering power, combinations like तु (Ta + u-matra) were baked into a single, proprietary custom byte (Ÿ). The web converter only knew individual letters, so it completely choked on this blended symbol.
Visual vs. Linguistic Typing Order: In Devanagari, a vowel sign like ु is typed after the consonant. However, to force proper rendering on 2002-era screens, the text bytes were often stored out of linguistic order. Standard converters could not map these reversed byte sequences.
Proprietary Government Keymaps: Mainstream online tools map common commercial keyboards (like KrutiDev or Shusha). Because your document used C-DAC’s government-mandated proprietary mapping, standard character-swap algorithms instantly fell out of alignment.
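
The ordering problem can be illustrated with a small Python sketch. It rests on an assumption that does not come from this PDF: the commonly cited case in legacy Devanagari fonts is the short-i sign ि, which is drawn to the left of its consonant and so was often stored before it. The function name fix_short_i_order is made up, and the sketch assumes the glyph codes have already been swapped for Unicode characters but are still in visual order.

```python
import re

# Devanagari consonants occupy U+0915 (ka) through U+0939 (ha).
# U+093F is the short-i vowel sign, which is drawn left of its consonant.
VISUAL_SHORT_I = re.compile("\u093F([\u0915-\u0939])")

def fix_short_i_order(text):
    """Move a short-i sign stored before a consonant to just after it.

    Apply this only to text known to be in visual order; running it on
    text that is already in logical order would wrongly move the sign
    onto the next consonant. Conjuncts are not handled in this sketch.
    """
    return VISUAL_SHORT_I.sub("\\1\u093F", text)

visual = "\u093F\u0915\u0924\u093E\u092C"   # short-i stored first, then ka, ta, aa, ba
print(fix_short_i_order(visual))
```

A real converter needs a rule like this for every glyph that is stored out of order, on top of the plain lookup table, which is one reason a tool built for a different font family fails.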

The Modern Architectural Takeaway

This struggle perfectly illustrates why modern development must strictly decouple data from visual presentation.

Whenever building localized software today (such as your Gita application), the underlying data layer must consist exclusively of standard Unicode code points (e.g., U+0909). By ensuring the data itself holds the absolute meaning of the text, the application remains fully cross-platform, searchable, indexable, and completely independent of any specific font's presence to render accurately on a modern PC, tablet, or smartphone.

------------

Prior to above summary, I had this exchange with Gemini.

Me: Hmm. Indian language fonts were really hard to understand in early 2000s when I was last in industry. Quite an experience being thrown back to early 2000s tough place for Indian language fonts by these 2002 E-Roll PDFs. Just goes to show that whenever I do any Devanagari related work like my Gita app, I must stick to Unicode and avoid any CDAC type fonts like Surekh. If at all I use some Devanagari font I must ensure that underlying data is in Unicode and so not dependent on that Devanagari font's presence to render correctly on any modern OS device (PC/tablet/phone). Ur view pl.

Gem: You are spot on, and your conclusion is 100% correct. If you are building a modern application like your Gita app, sticking strictly to Unicode underlying data is the only way to go.

Looking at your experience as a software professional, it must feel like a massive contrast to see how seamlessly we handle text now compared to the "wild west" of Indian language computing in the early 2000s.

Here is a quick view of why your approach is the definitive gold standard for modern development:

1. Separation of Data and Presentation

The fundamental flaw of the old C-DAC/Surekh era was that the data was hostage to the font. The text layer didn't actually contain the letter "अ"; it contained a Latin byte like + that only looked like "अ" if a very specific font file intercepted it on its way to the monitor.

With Unicode: Your underlying data holds the absolute, semantic meaning of the text (e.g., U+0905 for अ).
The Result: The data is completely independent. Even if a user doesn't have a specific stylized font installed, the device's native system font (like Nirmala UI on Windows, or default Android/iOS Indic fonts) will step in and render the text perfectly readable.

2. Cross-Platform Flawlessness

Because Android, iOS, Windows, and macOS all natively support the global Unicode character set, your Gita app's text will render flawlessly whether someone opens it on a cheap smartphone, an iPad, or a desktop PC. You don't have to worry about packaging custom .ttf files just to prevent the text from turning into gibberish.

3. Accessible Search and Indexing

In the 2002 system, if someone wanted to search for your entry "रवी" in the database, they couldn't just type "रवी" on a keyboard. They would have to type the exact Latin key sequence ®ú´ÉÒ that visually mapped to those characters! By ensuring your underlying data is Unicode, your app users can effortlessly search for specific Sanskrit or Hindi words, copy verses into their own notes, and allow search engines to index your content perfectly.
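
The search point can be checked with a short Python sketch using the two versions of the line from earlier in this post. The variable names are made up for the illustration.

```python
# The Unicode line (as corrected by Gemini) and the legacy line (as pasted).
unicode_line = "999 | तुकाराम | अय्यर रवी एस. | व | अय्यर एस. | पु | | 99"
legacy_line = "999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ. ´É +ªªÉ®ú BºÉ. {ÉÖ 99"

query = "रवी"            # what a user would naturally type
legacy_query = "®ú´ÉÒ"   # the Latin-range sequence the old font draws as रवी

print(query in unicode_line)        # plain substring search on Unicode text
print(query in legacy_line)         # same search against the legacy data
print(legacy_query in legacy_line)  # only the font-specific sequence matches
```

An ordinary substring search finds the name in the Unicode line, while the legacy line can only be searched with the font-specific character sequence.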

A Quick Tip for Modern Devanagari Apps

When you design your app, you can still use beautiful, stylized Devanagari fonts (like Google's Rozha One, Tiro Devanagari, or Poppins) to give it an elegant, traditional aesthetic.

The magic of modern web and mobile tech (like React Native or modern CSS) is that you apply these as a purely visual styling layer. The font file only handles the artistic curves of the letters, while the underlying data remains pure, indestructible Unicode. If the custom font fails to load, the system seamlessly falls back to the device's default font without corrupting a single syllable.

You have mapped out the exact right architecture for your project. Building with Unicode ensures your app will be robust, future-proof, and accessible to anyone on any device!

------------

OCR is another option but may have some errors at times.

https://www.i2ocr.com/free-online-hindi-ocr handled this conversion well.

Windows 11 OCR (print screen -> copy as text) also does a good job but needs at least a few lines to be selected (a single line is too small for it to OCR).

Ravi S. Iyer: Software Development