Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original: [Japanese text of the four paragraphs not recoverable]

Google Translate:

roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō （ uchibori dōri ） no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei （ roku kyū ） mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jōtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha （ ni yon ） wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .

dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei （ san yon ） ni atari , dansei wa ryōashi ni karui kega .

dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .

genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate: roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō （ uchibori dōri ） no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei （ roku kyū ） mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jōtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha （ ni yon ） wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .
Human transcriber: Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.

Google Translate: dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei （ san yon ） ni atari , dansei wa ryōashi ni karui kega .
Human transcriber: Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.

Google Translate: dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .
Human transcriber: Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Google Translate: genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .
Human transcriber: Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the human transcriber’s version:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to suit romanized text, and paragraph indentations have been restored.
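
Some of the changes listed above are mechanical enough to automate, while others (misread words, personal names, false spaces) need lexical knowledge. As a rough illustration only, the following Python sketch applies parts of changes 4, 6, 7 and 8 to a line of Google-style output. The function name `tidy_romaji` and its rules are assumptions made for this example; they do not describe Google's algorithm or the procedure used to produce the human transcription.

```python
import re

def tidy_romaji(text: str) -> str:
    """Apply a few purely mechanical clean-ups to spaced-out romaji (illustrative only)."""
    # Change 6: write the particle as "o" instead of the obsolete "wo".
    text = re.sub(r"\bwo\b", "o", text)
    # Change 7: keep an apostrophe after n only before a vowel or y.
    text = re.sub(r"n['’](?![aiueoyāīūēō])", "n", text)
    # Change 8 (in part): remove the space before commas and full stops.
    text = re.sub(r"\s+([,.])", r"\1", text)
    # Change 4 (in part): capitalize the first word of each sentence.
    # Proper nouns are not handled; that would need a lexicon.
    text = re.sub(r"(^|\.\s+)(\w)",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return text

print(tidy_romaji("kuruma wa hidari ni kyū handoru wo kiri , hodō ni noriage ta ."))
# Kuruma wa hidari ni kyū handoru o kiri, hodō ni noriage ta.
```

Note that the false space in "noriage ta" survives: joining it correctly requires knowing that ta is an inflectional ending here, which is exactly the kind of knowledge a simple pattern substitution lacks.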

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.
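
Restoring arabic numerals is among the easier steps to mechanize, because the Google output quoted above reads multi-digit numbers digit by digit ("san go" for 35, "yon zero" for 40). A minimal sketch follows. The table `DIGITS` and the function `restore_numerals` are illustrative names invented here, and the sketch deliberately converts only runs of two or more digit words, on the assumption that single words such as ni and go are too often particles or prefixes to convert safely.

```python
import re

# Digit-by-digit readings as they appear in the Google output (assumed list).
DIGITS = {"zero": "0", "ichi": "1", "ni": "2", "san": "3", "yon": "4",
          "go": "5", "roku": "6", "nana": "7", "hachi": "8", "kyū": "9"}

_WORD = "|".join(DIGITS)
# Two or more digit words in a row, separated by single spaces.
_RUN = re.compile(rf"\b(?:{_WORD})(?: (?:{_WORD}))+\b")

def restore_numerals(text: str) -> str:
    """Replace runs of digit words with the corresponding arabic numeral."""
    return _RUN.sub(
        lambda m: "".join(DIGITS[w] for w in m.group(0).split()), text)

print(restore_numerals("kankō kyaku no yon zero dai no dansei"))
# kankō kyaku no 40 dai no dansei
```

Single-digit expressions such as "roku nichi" or "yon ji" are left alone by design; deciding that they are numerals (and that the former is read muika) again calls for knowledge beyond pattern matching.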

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders (the reasons were well understood and described long ago, e.g. in Unger 1987), and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization (see note 1).

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

7 thoughts on “Google Translate and rōmaji”

  1. I agree about Japanese. There are other automated tools for romanising Japanese, and they do it with only about 95% accuracy. Choices about pronunciation are much more complicated than they are in Chinese.

    I can attest that Russian and other languages written in Cyrillic are romanised very well. Google uses one of several methods; there are many standards, but they are fairly consistent with one another. The task is straightforward.

    Thai is romanised extremely poorly. Admittedly, the rules are complicated. But Google even swaps around syllables and words! (Some vowels can be written on the left of the consonant, like in some other languages.)

    Hindi and Korean are acceptable for those who know the rules of pronunciation: it shows the letters as spelled, not taking into account consonant changes (Korean) and silent vowels (Hindi).

  2. I was not aware that romanizing Japanese was such a difficult task. I wonder, do any good screen-readers exist for the Japanese language, for the use of blind people? And if yes, how well do they do?

  3. Is there any JIS standard, or are there books, describing how to transliterate Japanese into Latin script? In particular, the use of “-”, SPACE (“ ”) or NO SPACE between words.

  4. The latest version of Google Translate shows

    6-Nichi gogo 4-ji 35-fun-goro, Tōkyō-to Chiyoda-ku Kōkyogaien no todō (uchibori-dōri) no Nijūbashi zen kōsaten de, Chūgoku kara no kankō kyaku no 40-dai no dansei ga jōyōsha ni hane rare, zenshin o tsuyoku utte mamonaku shibō shi ta.-Sha wa hodō ni noriagete aruite i ta dansei (69) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten shite i ta Tōkyō-to Minato-ku hakkin 3-chōme, kaisha yakuin Takahashi nobe Tsubuse yōgi-sha (24) o jidōsha unten kashitsu shōgai no utagai de genkō-han taiho shi, yōgi o dō chishi ni kirikaete shirabete iru.

    Dōsho ni yoru to, shibō shi ta dansei wa ōdan hodō o aruite watatte i ta tokoro o chokushin shite ki ta kuruma ni hane rare ta.-Sha wa hidari ni kyū handoru o kiri, shadō to hodō no sakai ni oka re ta kasetsu no saku o haneage, hodō ni noriage ta toyuu. Saku wa hodō de ran’ningu o shite i ta dansei (34) niatari, dansei wa ryōashi ni karui kega.

    Dōsho wa, shibō shi ta dansei no mimoto kakunin o susumeru totomoni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

    Genba shūhen wa Tōkyō kankō no supotto no hitotsu daga, saikin wa jogingu o tanoshimu hito mo fuete iru.

  5. The Japanese government officially adopts Kunrei, although its use is voluntary. The de facto standard used in almost all cases is Hepburn. Both systems have few to no written rules for hyphens or spaces. *1 Unlike pinyin, romaji was not seriously considered as a replacement for the current writing system, so it did not acquire intricate rules.

    Unofficial groups advocating the abandonment of kanji and kana, however, do provide intricate rules of their own. *2 They were active back in the Meiji period, and right after WW2.

    There is no JIS romaji, while there is ISO romaji, which is identical to Kunrei.

    *1 http://www.bunka.go.jp/kokugo/main.asp?fl=list&id=1000003935&clc=1000000068
    *2 http://xembho.s59.xrea.com/rb/index.html

    Screen readers in Japan exist, and they don’t do much better than Google translation.

  6. Professor Unger has performed a real service with his detailed, exacting analysis of a passage romanized by Google Translate.

  7. Pingback: Pinyin news » Google Translate and romaji revisited
