Copy-pasting Devanagari text from a 2002-era PDF gave gibberish due to an 8-bit font; Unicode is a far superior modern solution
Recently, for the Election Commission of India's SIR enrollment, I had to search its 2002 E-Roll PDFs. Given the huge scale of the data involved, the search facility and the PDF outputs were quite impressive. In this context, I ran into an issue when copy-pasting Devanagari text from a PDF, which rendered the Devanagari font correctly, into a text file open in Notepad++.

Copy-pasting the key line for what seems to be my entry gave the following in Notepad++ (I have changed the English numerals to hide key data, but the numerals themselves rendered correctly in Notepad++):

999 ŸÖãeúÖ¸üÖ´ÖË +ªªÉ®ú ®ú´ÉÒ BºÉ. ´É +ªªÉ®ú BºÉ. {ÉÖ 99

The Gemini AI tool did a great job of giving me the right Devanagari text in Unicode, which matches what the PDF rendered:

999 | तुकाराम | अय्यर रवी एस. | व | अय्यर एस. | पु | | 99

This led to an exchange with Gemini on why this issue was happening and how to convert the old data correctly without the Gemini AI tool's help. After I used Foxit Reader to inspect fonts in the ...