[This entry is largely for my own reference. But feel free to read on, especially if you’re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]
What’s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn’t have a “Pinyin” setting. I’d be surprised if any OCR software currently does.
Getting second tones, fourth tones, and umlauts to be read correctly shouldn’t be a big problem, given that the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is that vowels with first- and third-tone marks get rendered more or less consistently as some other character, which can then be fixed quickly through a search-and-replace procedure.
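A search-and-replace pass of that sort might look something like the Python sketch below. The correction table is purely hypothetical: its entries are assumptions about how an OCR engine might misread a few words, and the function name fix_ocr is my own. It works at the level of whole words rather than single characters, since one stand-in character could correspond to either a first or a third tone.

```python
import re

# Hypothetical word-level corrections: OCR rendering -> intended Pinyin.
# These entries are illustrative assumptions; a real table would be built
# by inspecting the output of a particular OCR language setting.
CORRECTIONS = {
    "wö": "wǒ",
    "döu": "dōu",
    "biäoyì": "biǎoyì",
}

def fix_ocr(text, corrections=CORRECTIONS):
    """Replace whole words found in the table, keeping an initial capital."""
    def repl(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word  # not in the table: leave untouched
        return fixed.capitalize() if word[0].isupper() else fixed
    # \w matches Unicode letters in Python 3, so accented vowels stay in the word
    return re.sub(r"\w+", repl, text)

if __name__ == "__main__":
    print(fix_ocr("WÖ bu shi, döu biäoyì."))
```

Anything not in the table passes through unchanged, so the table can be grown gradually as new misreadings turn up.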
Here’s an image, slightly reduced, of what was being scanned:
Here’s the text:
Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn. Dànshì wǒ yǒu cóng zìjǐ shēnghuó lǐ délái de wǔ ge zhēnshí lìzi, dōu biǎomíng Hànzì bìng bù tèbié biǎoyì.
Here are the results of OCR, with various language settings applied:
DUTCH
WÖ bu shi xuézhë, bù néng yinjing jùdiän. Danshi wö yöu cóng zip shënghuó
li délái de wü ge zhënshí Iizi, döu biäomíng Hanzì bing bu tebié biäoyì.
CATALAN
W6 bu shi xuezh8, bir neng yinjing jhdisn. Danshi w6 y5u cong ziji shenghuó
li delai de wü ge zhenshí lizi, d6u bibmíng Hanzi bing bu tebie bigoyi.
DANISH
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi w6 y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dóu bigoming Hanzi bing bu tebie biaoyi.
FINNISH
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
Ii delai de wii ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.
FRENCH
W6 bù shi xuézhë, bù néng yinjing jùdian. Dànshi wO y5u cong ziji shënghu6
li délai de wü ge zhënshi Iizi, dou bigoming Hànzi bing bu tèbié biaoyi.
GERMAN
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.
GERMAN (SWISS)
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.
ITALIAN
W6 bu shì xuézhe, bù néng yinjing jùdian. Dànshì w6 y5u cong ziji shènghu6
li délai de wii ge zhenshi Iizi, dou bigoming Hànzì bing bu tèbié biaoyì.
NYNORSK
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi wO y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dou biaoming Hanzi bing bu tebie biaoyi.
PORTUGUESE (BRAZILIAN)
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.
PORTUGUESE
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.
SPANISH
W6 bu shi xuézhe, bU néng yinjing jhdian. Danshi wO y5u cóng ziji shenghuó
li délái de wü ge zhenshí Iizi, dóu bigomíng Hanzi bing bu tebié biaoyi.
There’s no clear winner. The best results, such as they are, appear to come from the Dutch and Portuguese (Brazilian or standard) settings.
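For anyone who wants a rough number to go with the eyeballing, one option is to score each result against the intended text with Python's difflib. This is only a sketch: the REFERENCE string is my assumption about the intended first sentence with tone marks, only two of the results above are included, and the names similarity and rank are my own. Character similarity is also a crude proxy, since it says nothing about how regular the errors are, which is what matters most for fixing them by search-and-replace.

```python
from difflib import SequenceMatcher

# Assumption: the intended first sentence, with tone marks.
REFERENCE = "Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn."

# First sentence of two of the OCR results listed above.
RESULTS = {
    "Dutch": "WÖ bu shi xuézhë, bù néng yinjing jùdiän.",
    "Spanish": "W6 bu shi xuézhe, bU néng yinjing jhdian.",
}

def similarity(reference, ocr_text):
    """Return a 0-to-1 character-level similarity ratio."""
    return SequenceMatcher(None, reference, ocr_text).ratio()

def rank(results, reference=REFERENCE):
    """Return (setting, score) pairs, highest score first."""
    scored = [(name, similarity(reference, text)) for name, text in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, score in rank(RESULTS):
        print(f"{name}: {score:.3f}")
```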