{"id":3077,"date":"2009-12-22T15:38:55","date_gmt":"2009-12-22T07:38:55","guid":{"rendered":"https:\/\/pinyin.info\/news\/?p=3077"},"modified":"2015-11-17T15:06:46","modified_gmt":"2015-11-17T07:06:46","slug":"google-translate-and-romaji","status":"publish","type":"post","link":"https:\/\/pinyin.info\/news\/2009\/google-translate-and-romaji\/","title":{"rendered":"Google Translate and r&#333;maji"},"content":{"rendered":"<p style=\"padding-bottom: 2em; border-bottom: 1px dotted green;\">The following is a guest post by <a href=\"http:\/\/people.cohums.ohio-state.edu\/unger26\/\">Professor J. Marshall Unger<\/a> of the Ohio State University&#8217;s Department of East Asian Languages and Literatures.<\/p>\n<p style=\"padding-top: 2em;\"><em>The challenge<\/em><\/p>\n<p>On 18 November 2009, Mark Swofford posted an item on his website pinyin.info <a href=\"https:\/\/pinyin.info\/news\/2009\/google-translates-new-pinyin-function-sucks\/\">criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text<\/a>. He concluded by saying, \u201cGoogle Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I\u2019ll leave those to others to analyze.\u201d So I decided to take up Swofford\u2019s challenge as it pertains to Japanese. 
Using <a href=\"http:\/\/translate.google.com\/#ja|ja|\">Google Translate<\/a>, I romanized a news item from the <em>Asahi<\/em> of 6 December 2009:<\/p>\n<table style=\"padding: 1em; font-family: georgia, serif;\">\n<tr>\n<th>Original<\/th>\n<th>Google Translate<\/th>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\">&#65302;&#26085;&#21320;&#24460;&#65300;&#26178;&#65299;&#65301;&#20998;&#12372;&#12429;&#12289;&#26481;&#20140;&#37117;&#21315;&#20195;&#30000;&#21306;&#30343;&#23621;&#22806;&#33489;&#12398;&#37117;&#36947;&#65288;&#20869;&#22528;&#36890;&#12426;&#65289;&#12398;&#20108;&#37325;&#27211;&#21069;&#20132;&#24046;&#28857;&#12391;&#12289;&#20013;&#22269;&#12363;&#12425;&#12398;&#35251;&#20809;&#23458;&#12398;&#65300;&#65296;&#20195;&#12398;&#30007;&#24615;&#12364;&#20055;&#29992;&#36554;&#12395;&#12399;&#12397;&#12425;&#12428;&#12289;&#20840;&#36523;&#12434;&#24375;&#12367;&#25171;&#12387;&#12390;&#38291;&#12418;&#12394;&#12367;&#27515;&#20129;&#12375;&#12383;&#12290;&#36554;&#12399;&#27497;&#36947;&#12395;&#20055;&#12426;&#19978;&#12370;&#12390;&#27497;&#12356;&#12390;&#12356;&#12383;&#30007;&#24615;&#65288;&#65302;&#65305;&#65289;&#12418;&#12399;&#12397;&#12289;&#30007;&#24615;&#12399;&#38957;&#12434;&#24375;&#12367;&#25171;&#12387;&#12390;&#24847;&#35672;&#19981;&#26126;&#12398;&#37325;&#20307;&#12290;&#20024;&#12398;&#20869;&#32626;&#12399;&#12289;&#36939;&#36578;&#12375;&#12390;&#12356;&#12383;&#26481;&#20140;&#37117;&#28207;&#21306;&#30333;&#37329;&#65299;&#19969;&#30446;&#12289;&#20250;&#31038;&#24441;&#21729;&#39640;&#27211;&#24310;&#25299;&#23481;&#30097;&#32773;&#65288;&#65298;&#65300;&#65289;&#12434;&#33258;&#21205;&#36554;&#36939;&#36578;&#36942;&#22833;&#20663;&#23475;&#12398;&#30097;&#12356;&#12391;&#29694;&#34892;&#29359;&#36910;&#25429;&#12375;&#12289;&#23481;&#30097;&#12434;&#21516;&#33268;&#27515;&#12395;&#20999;&#12426;&#26367;&#12360;&#12390;&#35519;&#12409;&#12390;&#12356;&#12427;&#12290;<\/td>\n<td 
style=\"padding-right: .5em; vertical-align: top;\">roku nichi gogo yon ji san go fun goro , t&#333;ky&#333; to chiyoda ku k&#333;kyogaien no tod&#333; &#65288; uchibori d&#333;ri &#65289; no nij&#363;bashi zen k&#333;saten de , ch&#363;goku kara no kank&#333; kyaku no yon zero dai no dansei ga j&#333;y&#333;sha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shib&#333; shi ta . kuruma wa hod&#333; ni noriage te arui te i ta dansei &#65288; roku ky&#363; &#65289; mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no j&#363;tai . marunouchi sho wa , unten shi te i ta t&#333;ky&#333; to minato ku hakkin san ch&#333;me , kaisha yakuin takahashi nobe tsubuse y&#333;gi sha &#65288; ni yon &#65289; wo jid&#333;sha unten kashitsu sh&#333;gai no utagai de genk&#333; han taiho shi , y&#333;gi wo d&#333; chishi ni kirikae te shirabe te iru . <\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\">&#12288;&#21516;&#32626;&#12395;&#12424;&#12427;&#12392;&#12289;&#27515;&#20129;&#12375;&#12383;&#30007;&#24615;&#12399;&#27178;&#26029;&#27497;&#36947;&#12434;&#27497;&#12356;&#12390;&#28193;&#12387;&#12390;&#12356;&#12383;&#12392;&#12371;&#12429;&#12434;&#30452;&#36914;&#12375;&#12390;&#12365;&#12383;&#36554;&#12395;&#12399;&#12397;&#12425;&#12428;&#12383;&#12290;&#36554;&#12399;&#24038;&#12395;&#24613;&#12495;&#12531;&#12489;&#12523;&#12434;&#20999;&#12426;&#12289;&#36554;&#36947;&#12392;&#27497;&#36947;&#12398;&#22659;&#12395;&#32622;&#12363;&#12428;&#12383;&#20206;&#35373;&#12398;&#12373;&#12367;&#12434;&#12399;&#12397;&#19978;&#12370;&#12289;&#27497;&#36947;&#12395;&#20055;&#12426;&#19978;&#12370;&#12383;&#12392;&#12356;&#12358;&#12290;&#12373;&#12367;&#12399;&#27497;&#36947;&#12391;&#12521;&#12531;&#12491;&#12531;&#12464;&#12434;&#12375;&#12390;&#12356;&#12383;&#30007;&#24615;&#65288;&#65299;&#65300;&#65289;&#12395;&#24403;&#12383;&#12426;&#12289;&#30007;&#24615;&#12399;&#20001;&#36275;&#12395;&#36605;&#12356;&#12369;&#12364;&#12290;<\/td>\n<t
d style=\"padding-right: .5em; vertical-align: top;\"> d&#333;sho ni yoru to , shib&#333; shi ta dansei wa &#333;dan hod&#333; wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni ky&#363; handoru wo kiri , shad&#333; to hod&#333; no sakai ni oka re ta kasetsu no saku wo haneage , hod&#333; ni noriage ta to iu . saku wa hod&#333; de ran&#8217;ningu wo shi te i ta dansei &#65288; san yon &#65289; ni atari , dansei wa ry&#333;ashi ni karui kega .<\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\">&#12288;&#21516;&#32626;&#12399;&#12289;&#27515;&#20129;&#12375;&#12383;&#30007;&#24615;&#12398;&#36523;&#20803;&#30906;&#35469;&#12434;&#36914;&#12417;&#12427;&#12392;&#12392;&#12418;&#12395;&#12289;&#24403;&#26178;&#12398;&#20132;&#24046;&#28857;&#12398;&#20449;&#21495;&#12398;&#29366;&#27841;&#12434;&#35519;&#12409;&#12390;&#12356;&#12427;&#12290; <\/td>\n<td style=\"padding-right: .5em; vertical-align: top;\"> d&#333;sho wa , shib&#333; shi ta dansei no mimoto kakunin wo susumeru totomoni , t&#333;ji no k&#333;saten no shing&#333; no j&#333;ky&#333; wo shirabe te iru . <\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\">&#12288;&#29694;&#22580;&#21608;&#36794;&#12399;&#26481;&#20140;&#35251;&#20809;&#12398;&#12473;&#12509;&#12483;&#12488;&#12398;&#19968;&#12388;&#12384;&#12364;&#12289;&#26368;&#36817;&#12399;&#12472;&#12519;&#12462;&#12531;&#12464;&#12434;&#27005;&#12375;&#12416;&#20154;&#12418;&#22679;&#12360;&#12390;&#12356;&#12427;&#12290; <\/td>\n<td> genba sh&#363;hen wa t&#333;ky&#333; kank&#333; no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . <\/td>\n<\/tr>\n<\/table>\n<p>Google&#8217;s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. 
To see this, compare the following:<\/p>\n<table style=\"padding: 1em; font-family: georgia, serif;\">\n<thead>\n<tr>\n<th>Google Translate<\/th>\n<th>human transcriber<\/th>\n<\/tr>\n<\/thead>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\"> roku nichi gogo yon ji san go fun goro , t&#333;ky&#333; to chiyoda ku k&#333;kyogaien no tod&#333; &#65288; uchibori d&#333;ri &#65289; no nij&#363;bashi zen k&#333;saten de , ch&#363;goku kara no kank&#333; kyaku no yon zero dai no dansei ga j&#333;y&#333;sha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shib&#333; shi ta . kuruma wa hod&#333; ni noriage te arui te i ta dansei &#65288; roku ky&#363; &#65289; mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no j&#363;tai . marunouchi sho wa , unten shi te i ta t&#333;ky&#333; to minato ku hakkin san ch&#333;me , kaisha yakuin takahashi nobe tsubuse y&#333;gi sha &#65288; ni yon &#65289; wo jid&#333;sha unten kashitsu sh&#333;gai no utagai de genk&#333; han taiho shi , y&#333;gi wo d&#333; chishi ni kirikae te shirabe te iru . <\/td>\n<td style=\"padding-right: .5em; vertical-align: top;\">Muika gogo yo-ji sanj&#363;go-fun goro, T&#333;ky&#333;-to Chiyoda-ku K&#333;kyo Gaien no tod&#333; (Uchibori d&#333;ri) no Nij&#363;bashi-zen k&#333;saten de, Ch&#363;goku kara no kank&#333;-kyaku no yonj&#363;-dai no dansei ga j&#333;y&#333;sha ni hanerare, zenshin o tsuyoku utte mamonaku shib&#333;-shita. Kuruma wa hod&#333; ni noriagete aruite ita dansei (rokuj&#363;ky&#363;) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no j&#363;tai. 
Marunouchi-sho wa, unten-shite ita T&#333;ky&#333;-to Minato-ku Shirogane san-ch&#333;me, kaisha yakuin Takahashi Nobuhiro y&#333;gisha (nij&#363;yon) o jid&#333;sha unten kashitsu sh&#333;gai no utagai de genk&#333;han taiho-shi, y&#333;gi o d&#333;-chishi ni kirikaete shirabete iru.<\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\">d&#333;sho ni yoru to , shib&#333; shi ta dansei wa &#333;dan hod&#333; wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni ky&#363; handoru wo kiri , shad&#333; to hod&#333; no sakai ni oka re ta kasetsu no saku wo haneage , hod&#333; ni noriage ta to iu . saku wa hod&#333; de ran&#8217;ningu wo shi te i ta dansei &#65288; san yon &#65289; ni atari , dansei wa ry&#333;ashi ni karui kega . <\/td>\n<td style=\"padding-right: .5em; vertical-align: top; text-indent: 1.8em;\">D&#333;-sho ni yoru to, shib&#333;-shita dansei wa &#333;dan hod&#333; o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni ky&#363;-handoru o kiri, shad&#333; to hod&#333; no sakai ni okareta kasetsu no saku o haneage, hod&#333; ni noriageta to iu. Saku wa hod&#333; de ranningu o shite ita dansei (sanj&#363;yon) ni atari, dansei wa ry&#333;ashi ni karui kega. <\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\"> d&#333;sho wa , shib&#333; shi ta dansei no mimoto kakunin wo susumeru totomoni , t&#333;ji no k&#333;saten no shing&#333; no j&#333;ky&#333; wo shirabe te iru . <\/td>\n<td style=\"padding-right: .5em; vertical-align: top; text-indent: 1.8em;\">D&#333;-sho wa, shib&#333;-shita dansei no mimoto kakunin o susumeru to tomo ni, t&#333;ji no k&#333;saten no shing&#333; no j&#333;ky&#333; o shirabete iru. <\/td>\n<\/tr>\n<tr>\n<td style=\"padding-right: .5em; vertical-align: top;\"> genba sh&#363;hen wa t&#333;ky&#333; kank&#333; no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . 
<\/td>\n<td style=\"padding-right: .5em; vertical-align: top; text-indent: 1.8em;\">Genba sh&#363;hen wa T&#333;ky&#333; kank&#333; no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru. <\/td>\n<\/tr>\n<\/table>\n<p>For the sake of comparison, I have retained Google&#8217;s Hepburn-style romanization. The following changes have been made in the text in the right-hand column:<\/p>\n<ol style=\"font-family: georgia, serif;\">\n<li>Misread words have been rewritten. Many involve numerals; e.g. <strong>muika<\/strong> for &#8220;roku nichi&#8221;, <strong>yo-ji<\/strong> for &#8220;yon ji&#8221;, <strong>sanj&#363;go-fun<\/strong> for &#8220;san go fun&#8221;. The personal name <strong>Nobuhiro<\/strong> is an educated guess, but &#8220;Nobetsubuse&#8221; is certainly wrong. <strong>Shirogane<\/strong> for &#8220;hakkin&#8221; is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character <em>on-yomi<\/em>).<\/li>\n<li>False spaces and consequent misreadings have been eliminated. E.g. <strong>hanerare<\/strong> for &#8220;hane rare&#8221;, <strong>watatte ita<\/strong> for &#8220;wata~tsu te i ta&#8221;.<\/li>\n<li>Run-together phrases have been parsed correctly. E.g. <strong>to tomo ni<\/strong> for &#8220;totomoni&#8221;.<\/li>\n<li>Capitalization of proper nouns and the first words in sentences has been introduced.<\/li>\n<li>Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with <em>suru<\/em>. <\/li>\n<li>Obsolete &#8220;wo&#8221; for the particle <strong>o<\/strong> has been eliminated. (N.B. 
Google did not produce *ha for the particle <strong>wa<\/strong>, so &#8220;wo&#8221; for <strong>o<\/strong> is just the result of laziness.)<\/li>\n<li>Apostrophes after <strong>n<\/strong> to indicate mora nasals in positions where they are not needed have been eliminated.<\/li>\n<li>Punctuation has been normalized to suit romanized format, and paragraph indentations have been restored.<\/li>\n<\/ol>\n<p>One could make the romanized text more easily readable by restoring arabic numerals, italicizing <em>gairaigo<\/em>, and so on. Of course, if the reporter knew that his\/her copy would be reported orally or in romanization, s\/he might have chosen different wording to avoid homophonic ambiguities. E.g., <strong>Marunouchi-sho<\/strong> could be <strong>Marunouchi Keisatsu-sho<\/strong>, though perhaps in the context of a traffic accident story, it is obvious that the suffix <strong>sho<\/strong> denotes &#8216;police station&#8217;. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing <strong>d&#333;sho<\/strong> for &#21516;&#25152; &#8216;same place&#8217;, then <strong>d&#333;-sho<\/strong> would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.<\/p>\n<p>But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google&#8217;s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.<\/p>\n<p style=\"padding-top: 2em;\"><em>Analysis<\/em><\/p>\n<p>Clearly, there is a vast amount of knowledge an editor needs if s\/he wants to bring Google\u2019s result up to an acceptable level of romanization for human consumption. 
That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google\u2019s algorithm blunders \u2014 the reasons were well understood and described long ago (e.g. in <a href=\"https:\/\/pinyin.info\/readings\/fifth_generation.html\">Unger 1987<\/a>) \u2014 and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.<a href=\"#note1\"><span style=\"position:relative; bottom: .5em;\">1<\/span><\/a>\n<\/p>\n<p>Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google\u2019s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages \u2014 from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future \u201cideal\u201d romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends \u2014 concretely illustrates how, at each stage, different kinds of information \u2014 from the easily programmable to genuine expert knowledge \u2014 must be brought to bear on the raw data. <\/p>\n<p>Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. 
The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not <em>totally<\/em> fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company\u2019s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the <em>whole<\/em> distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.<\/p>\n<p style=\"padding-top: 2em;\"><em>Note<\/em><\/p>\n<ol>\n<li style=\"font-size: smaller; padding: 1em; font-family: georgia, serif;\" id=\"note1\">Indeed, starting in the 1980s, <em>Asahi<\/em> pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of <em>kanji henkan<\/em> methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or <em>kana<\/em> input stream alongside the selected characters is not available for <em>Asahi<\/em> stories. Of course, such information is routinely thrown away by many other input systems too.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>The following is a guest post by Professor J. Marshall Unger of the Ohio State University&#8217;s Department of East Asian Languages and Literatures. 
The challenge On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the &hellip; <a href=\"https:\/\/pinyin.info\/news\/2009\/google-translate-and-romaji\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[113,29,112,13,28,95,126,19,31],"tags":[669,670,673,863,837],"class_list":["post-3077","post","type-post","status-publish","format-standard","hentry","category-computers","category-japanese","category-kana","category-kanji","category-languages","category-linguistics","category-romaji","category-romanization","category-writing-systems","tag-google","tag-google-translate","tag-j-marshall-unger","tag-romaji","tag-romanization"],"_links":{"self":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/3077","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/comments?post=3077"}],"version-history":[{"count":34,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/3077\/revisions"}],"predecessor-version":[{"id":6802,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/posts\/3077\/revisions\/6802"}],"wp:attachment":[{"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/media?parent=3077"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/categories?post=3077"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pinyin.info\/news\/wp-json\/wp\/v2\/tags?post=3077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}