{"id":414,"date":"2006-05-11T13:04:32","date_gmt":"2006-05-11T05:04:32","guid":{"rendered":"https:\/\/pinyin.info\/news\/2006\/ocr-and-pinyin-texts\/"},"modified":"2008-11-04T16:42:00","modified_gmt":"2008-11-04T08:42:00","slug":"ocr-and-pinyin-texts","status":"publish","type":"post","link":"https:\/\/pinyin.info\/news\/2006\/ocr-and-pinyin-texts\/","title":{"rendered":"OCR and Pinyin texts"},"content":{"rendered":"<p>[This entry is largely for my own reference. But feel free to read on, especially if you&#8217;re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]<\/p>\n<p>What&#8217;s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn&#8217;t have a &#8220;Pinyin&#8221; setting. I&#8217;d be surprised if <em>any<\/em> OCR software currently does.<\/p>\n<p>Getting second tones, fourth tones, and umlauts to be read correctly shouldn&#8217;t be a big problem, given that the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is that vowels with first- and third-tone marks get rendered more or less consistently as something else, which can then be fixed quickly through a search-and-replace procedure.<\/p>\n<p>Here&#8217;s an image, slightly reduced, of what was being scanned:<br \/>\n<img decoding=\"async\" src=\"https:\/\/pinyin.info\/news\/news_photos\/2006\/05\/pinyin_ocr.gif\" alt=\"scan of sentences in Pinyin\" \/><\/p>\n<p>Here&#8217;s the text:<br \/>\n<span class=\"py\">W\u01d2 b\u00f9 sh\u00ec xu\u00e9zh\u011b, b\u00f9 n\u00e9ng y\u01d0nj\u012bng j\u00f9di\u01cen. D\u00e0nsh\u00ec w\u01d2 y\u01d2u c\u00f3ng z\u00ecj\u01d0 sh\u0113nghu\u00f3 l\u01d0 d\u00e9l\u00e1i de w\u01d4 
ge zh\u0113nsh\u00ed l\u00eczi, d\u014du bi\u01ceom\u00edng H\u00e0nz\u00ec b\u00ecng b\u00f9 t\u00e8bi\u00e9 bi\u01ceoy\u00ec.<\/span><\/p>\n<p>Here are the results of OCR, with various language settings applied: <\/p>\n<blockquote><p>DUTCH<br \/>\nW\u00d6 bu shi xu\u00e9zh\u00eb, b\u00f9 n\u00e9ng yinjing j\u00f9di\u00e4n. Danshi w\u00f6 y\u00f6u c\u00f3ng zip sh\u00ebnghu\u00f3<br \/>\nli d\u00e9l\u00e1i de w\u00fc ge zh\u00ebnsh\u00ed Iizi, d\u00f6u bi\u00e4om\u00edng Hanz\u00ec bing bu tebi\u00e9 bi\u00e4oy\u00ec.<\/p>\n<p>CATALAN<br \/>\nW6 bu shi xuezh8, bir neng yinjing jhdisn. Danshi w6 y5u cong ziji shenghu\u00f3<br \/>\nli delai de w\u00fc ge zhensh\u00ed lizi, d6u bibm\u00edng Hanzi bing bu tebie bigoyi.<\/p>\n<p>DANISH<br \/>\nW6 bu shi xuezhe, bU neng yinjing jhdian. Danshi w6 y5u cong ziji shGnghu6<br \/>\nli delai de wii ge zhenshi Iizi, d\u00f3u bigoming Hanzi bing bu tebie biaoyi.<\/p>\n<p>FINNISH<br \/>\nW\u00d6 bu shi xuezhe, bU neng yinjing jiidi\u00e4n. Danshi w\u00f6 y\u00f6u cong ziji shGnghu6<br \/>\nIi delai de wii ge zhenshi Iizi, d\u00f6u bi\u00e4oming Hanzi bing bu tebie bi\u00e4oyi.<\/p>\n<p>FRENCH<br \/>\nW6 b\u00f9 shi xu\u00e9zh\u00eb, b\u00f9 n\u00e9ng yinjing j\u00f9dian. D\u00e0nshi wO y5u cong ziji sh\u00ebnghu6<br \/>\nli d\u00e9lai de w\u00fc ge zh\u00ebnshi Iizi, dou bigoming H\u00e0nzi bing bu t\u00e8bi\u00e9 biaoyi.<\/p>\n<p>GERMAN<br \/>\nW\u00d6 bu shi xuezhe, bU neng yinjing jiidi\u00e4n. Danshi w\u00f6 y\u00f6u cong ziji shGnghu6<br \/>\nli delai de w\u00fc ge zhenshi Iizi, d\u00f6u bi\u00e4oming Hanzi bing bu tebie bi\u00e4oyi.<\/p>\n<p>GERMAN (SWISS)<br \/>\nW\u00d6 bu shi xuezhe, bU neng yinjing jiidi\u00e4n. Danshi w\u00f6 y\u00f6u cong ziji shGnghu6<br \/>\nli delai de w\u00fc ge zhenshi Iizi, d\u00f6u bi\u00e4oming Hanzi bing bu tebie bi\u00e4oyi.<\/p>\n<p>ITALIAN<br \/>\nW6 bu sh\u00ec xu\u00e9zhe, b\u00f9 n\u00e9ng yinjing j\u00f9dian. 
D\u00e0nsh\u00ec w6 y5u cong ziji sh\u00e8nghu6<br \/>\nli d\u00e9lai de wii ge zhenshi Iizi, dou bigoming H\u00e0nz\u00ec bing bu t\u00e8bi\u00e9 biaoy\u00ec.<\/p>\n<p>NYNORSK<br \/>\nW6 bu shi xuezhe, bU neng yinjing jhdian. Danshi wO y5u cong ziji shGnghu6<br \/>\nli delai de wii ge zhenshi Iizi, dou biaoming Hanzi bing bu tebie biaoyi.<\/p>\n<p>PORTUGUESE (BRAZILIAN)<br \/>\nW\u00d5 bu shi xu\u00e9zhe, bU n\u00e9ng yinjing j\u00f9di\u00e3n. Danshi w\u00f5 y\u00f5u c\u00f3ng ziji sh\u00e8nghu\u00f3<br \/>\nli d\u00e9l\u00e1i de w\u00fc ge zhensh\u00ed Iizi, d\u00f5u bi\u00e3om\u00edng Hanzi bing bu t\u00e8bi\u00e9 bi\u00e3oyi.<\/p>\n<p>PORTUGUESE<br \/>\nW\u00d5 bu shi xu\u00e9zhe, bU n\u00e9ng yinjing j\u00f9di\u00e3n. Danshi w\u00f5 y\u00f5u c\u00f3ng ziji sh\u00e8nghu\u00f3<br \/>\nli d\u00e9l\u00e1i de w\u00fc ge zhensh\u00ed Iizi, d\u00f5u bi\u00e3om\u00edng Hanzi bing bu t\u00e8bi\u00e9 bi\u00e3oyi.<\/p>\n<p>SPANISH<br \/>\nW6 bu shi xu\u00e9zhe, bU n\u00e9ng yinjing jhdian. Danshi wO y5u c\u00f3ng ziji shenghu\u00f3<br \/>\nli d\u00e9l\u00e1i de w\u00fc ge zhensh\u00ed Iizi, d\u00f3u bigom\u00edng Hanzi bing bu tebi\u00e9 biaoyi.\n<\/p><\/blockquote>\n<p>There&#8217;s no clear winner. The best results, such as they are, appear to come from the Dutch and Portuguese (Brazilian or standard) settings.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>[This entry is largely for my own reference. But feel free to read on, especially if you&#8217;re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.] 
What&#8217;s the best way to run &hellip; <a href=\"https:\/\/pinyin.info\/news\/2006\/ocr-and-pinyin-texts\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,38],"tags":[],"class_list":["post-414","post","type-post","status-publish","format-standard","hentry","category-pinyin","category-software"],"_links":{"self":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/414","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/comments?post=414"}],"version-history":[{"count":1,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/414\/revisions"}],"predecessor-version":[{"id":1645,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/414\/revisions\/1645"}],"wp:attachment":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/media?parent=414"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/categories?post=414"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/tags?post=414"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}