OCR and Pinyin texts

[This entry is largely for my own reference. But feel free to read on, especially if you’re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]

What’s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn’t have a “Pinyin” setting. I’d be surprised if any OCR software currently does.

Getting second tones, fourth tones, and umlauts to be read correctly shouldn’t be a big problem, given how the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is a more-or-less regular rendering of vowels with first- and third-tone marks as something else that can be fixed quickly through a search-and-replace procedure.
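As a rough illustration of that search-and-replace step, here is a minimal Python sketch. It assumes the OCR engine renders a given tone-marked vowel as the same stand-in character every time, which may well not hold in practice. The mapping table and the name `fix_tone_marks` are hypothetical and would have to be adjusted to whatever a particular engine and language setting actually produce.

```python
import unicodedata

# Hypothetical mapping from OCR stand-ins to the intended Pinyin vowels.
# These pairs are assumptions for illustration, not a tested table.
REPLACEMENTS = {
    "\u00e3": "\u01ce",  # a with tilde -> a with caron (third tone)
    "\u00f5": "\u01d2",  # o with tilde -> o with caron (third tone)
    "\u00d5": "\u01d1",  # O with tilde -> O with caron (third tone)
}

def fix_tone_marks(text, replacements=REPLACEMENTS):
    """Replace each stand-in character with its assumed Pinyin equivalent."""
    # Normalize to NFC so accented letters are single code points
    # and match the single-character keys in the table.
    text = unicodedata.normalize("NFC", text)
    return text.translate(str.maketrans(replacements))

if __name__ == "__main__":
    print(fix_tone_marks("w\u00f5 y\u00f5u j\u00f9di\u00e3n"))
```

If a stand-in turns out to cover both a first-tone and a third-tone vowel, a one-to-one table like this cannot decide between them, and those words would still need checking by hand.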

Here’s an image, slightly reduced, of what was being scanned:
[Image: scan of sentences in Pinyin]

Here’s the text:
Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn. Dànshì wǒ yǒu cóng zìjǐ shēnghuó lǐ délái de wǔ ge zhēnshí lìzi, dōu biǎomíng Hànzì bìng bù tèbié biǎoyì.

Here are the results of OCR, with various language settings applied:

DUTCH
WÖ bu shi xuézhë, bù néng yinjing jùdiän. Danshi wö yöu cóng zip shënghuó
li délái de wü ge zhënshí Iizi, döu biäomíng Hanzì bing bu tebié biäoyì.

CATALAN
W6 bu shi xuezh8, bir neng yinjing jhdisn. Danshi w6 y5u cong ziji shenghuó
li delai de wü ge zhenshí lizi, d6u bibmíng Hanzi bing bu tebie bigoyi.

DANISH
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi w6 y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dóu bigoming Hanzi bing bu tebie biaoyi.

FINNISH
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
Ii delai de wii ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

FRENCH
W6 bù shi xuézhë, bù néng yinjing jùdian. Dànshi wO y5u cong ziji shënghu6
li délai de wü ge zhënshi Iizi, dou bigoming Hànzi bing bu tèbié biaoyi.

GERMAN
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

GERMAN (SWISS)
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

ITALIAN
W6 bu shì xuézhe, bù néng yinjing jùdian. Dànshì w6 y5u cong ziji shènghu6
li délai de wii ge zhenshi Iizi, dou bigoming Hànzì bing bu tèbié biaoyì.

NYNORSK
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi wO y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dou biaoming Hanzi bing bu tebie biaoyi.

PORTUGUESE (BRAZILIAN)
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

PORTUGUESE
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

SPANISH
W6 bu shi xuézhe, bU néng yinjing jhdian. Danshi wO y5u cóng ziji shenghuó
li délái de wü ge zhenshí Iizi, dóu bigomíng Hanzi bing bu tebié biaoyi.

There’s no clear winner. The best results, such as they are, appear to come from the Dutch and Portuguese (Brazilian or standard) settings.
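For a rough number to go with that impression, each output could be scored against the intended text. The sketch below uses Python’s standard `difflib` for a character-level similarity ratio. The reference transcription of the first sentence is my assumption about what the scan says, the helper name `similarity` is only illustrative, and just two of the outputs quoted above are included to keep it short.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumed reference transcription of the first scanned sentence.
REFERENCE = "Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn."

# First sentence of two of the OCR outputs quoted above.
OCR_OUTPUTS = {
    "Dutch": "WÖ bu shi xuézhë, bù néng yinjing jùdiän.",
    "Catalan": "W6 bu shi xuezh8, bir neng yinjing jhdisn.",
}

def similarity(reference, candidate):
    """Return a 0-to-1 character-level similarity ratio."""
    # Normalize both strings so composed and decomposed accents compare equal.
    a = unicodedata.normalize("NFC", reference)
    b = unicodedata.normalize("NFC", candidate)
    return SequenceMatcher(None, a, b).ratio()

if __name__ == "__main__":
    scores = {lang: similarity(REFERENCE, text) for lang, text in OCR_OUTPUTS.items()}
    # Print the settings from most to least similar to the reference.
    for lang, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
        print(f"{lang}: {score:.3f}")
```

A ratio like this treats every wrong character equally, so it says nothing about which errors are easy to fix by search-and-replace and which are not.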

11 thoughts on “OCR and Pinyin texts”

  1. You hit on exactly the issue I have been trying, unsuccessfully,
    to brainstorm and resolve.

    How to scan Pinyin text and make it editable…even translatable.

    If I find anything I will come back; if you find some solutions
    (nice try, btw), let me know.

    mykl he

  2. Looks like my other entry didn’t make it. I reported that ocr software like FineReader (Abbyy) can be made to work for pinyin. It requires quite a bit of tweaking and training – but the result is excellent once you understand how to use the software. I’ll try to put together an online tutorial if I have the time…

  3. Hello Farang,
    A most interesting topic. Do you have a tutorial on how to use OCR with Pinyin input texts?
    Thanks in advance.

  4. Hi everybody! I’m a complete newbie, so first of all I wanna apologize if I’m posting in the wrong place; cut to the chase:

    Issue 1: I have some PDF files containing only Pinyin, and tried to OCR them. I’ve had no difficulty at all with the basic Latin alphabet, except for the following set of characters { o ? ?? ? ? ? ? ? ? ? ? ? ? ? á ?? é í ó ú ? Á É Í Ó Ú ? ? ?? ? ? ? ? ? ? ? ? ? ? ? à ?? è ì ò ù ? À È Ì Ò Ù ? a ? e i o u ü A E I O U Ü }, which are not recognized by ABBYY, Acrobat, Tesseract, etc. I’ve tried training them, combining different languages, and a million other things, like asking in dozens of forums, but no luck.

    Issue 2: I also have some resources in true-PDF format with those damn embedded font subsets, and when I try copy-and-paste to rearrange the layout, the file becomes unmanageable because the text gets completely illegible. I’ve installed most of the fonts that I didn’t have on my PC, and also the PitStop plugin, but cannot find a solution: for example, replacing throughout the file all the characters that use a certain embedded font subset with a different font, while keeping the original character shapes.

    As you can see, Issue 1 and Issue 2 are related, inasmuch as solving Issue 1 would also put an end to Issue 2.

    So I hope to hear good news soon.
    Thanks in advance
