Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original | Google Translate
6日午後4時35分ごろ、東京都千代田区皇居外苑の都道(内堀通り)の二重橋前交差点で、中国からの観光客の40代の男性が乗用車にはねられ、全身を強く打って間もなく死亡した。車は歩道に乗り上げて歩いていた男性(69)もはね、男性は頭を強く打って意識不明の重体。丸の内署は、運転していた東京都港区白金3丁目、会社役員高橋延拓容疑者(24)を自動車運転過失傷害の疑いで現行犯逮捕し、容疑を同致死に切り替えて調べている。 | roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .
 同署によると、死亡した男性は横断歩道を歩いて渡っていたところを直進してきた車にはねられた。車は左に急ハンドルを切り、車道と歩道の境に置かれた仮設のさくをはね上げ、歩道に乗り上げたという。さくは歩道でランニングをしていた男性(34)に当たり、男性は両足に軽いけが。 | dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega .
 同署は、死亡した男性の身元確認を進めるとともに、当時の交差点の信号の状況を調べている。 | dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .
 現場周辺は東京観光のスポットの一つだが、最近はジョギングを楽しむ人も増えている。 | genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate | Human transcriber
roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru . | Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jūtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.
dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega . | Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.
dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru . | Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.
genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . | Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the righthand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized for romanized text, and paragraph indentations have been restored.
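
A few of the changes listed above are purely mechanical, and a rough sketch shows how little of the job they cover. The function below is hypothetical: the name `tidy_romaji` and its regular expressions are illustrative assumptions, not anything Google is known to use. It handles only items 4, 6, 7, and 8, plus the mis-split geminate consonant from item 2.

```python
import re


def tidy_romaji(text: str) -> str:
    """Apply a few mechanical fixes to Google-style romaji (illustrative only)."""
    # Item 6: the standalone particle "wo" is written "o".
    text = re.sub(r"\bwo\b", "o", text)
    # Item 2 (in part): "~tsu " marks a small tsu split off from its verb;
    # rejoin it by doubling the consonant that follows ("u~tsu te" -> "utte").
    text = re.sub(r"~tsu (\w)", r"\1\1", text)
    # Item 7: drop the apostrophe after n unless a vowel or y follows,
    # since only there does it disambiguate the mora nasal.
    text = re.sub(r"n[’'](?![aiueoyāīūēō])", "n", text)
    # Item 8: no space before , . ) and none after (.
    text = re.sub(r"\s+([,.)])", r"\1", text)
    text = re.sub(r"\(\s+", "(", text)
    # Item 4 (in part): capitalize the first word of each sentence.
    # Proper nouns are NOT handled; that needs a lexicon.
    text = re.sub(r"(^|\. )(\w)", lambda m: m.group(1) + m.group(2).upper(), text)
    return text


sample = ("genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , "
          "saikin wa jogingu wo tanoshimu hito mo fue te iru .")
print(tidy_romaji(sample))
```

Run on the last paragraph of Google's output, this still leaves tōkyō in lower case and fue te iru split in two. Fixing those means knowing which strings are proper nouns and where word boundaries fall, which is exactly the lexical knowledge that items 1 to 3 depend on.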

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders — the reasons were well understood and described long ago (e.g. in Unger 1987) — and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.1

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

China’s earliest romanization system

The most recent rerelease from Sino-Platonic Papers is Dì-yī ge Lādīng zìmǔ de Hànyǔ Pīnyīn Fāng’àn shì zěnyàng chǎnshēng de? (How Was the First Romanized Spelling System for Sinitic Produced? / 第一个拉丁字母的汉语拼音方案是怎样产生的), by YIN Binyong (尹斌庸).

The author should be familiar to regular readers of this site, as he wrote the standard works on Hanyu Pinyin orthography — Chinese Romanization: Pronunciation and Orthography and the Xinhua Pinxie Cidian — as well as Pinyin-to-Chinese Character Computer Conversion Systems and the Realization of Digraphia in China.

The text is in Mandarin in Chinese characters. Here is the introduction.
[Image: the Mandarin text (in Chinese characters) of the first two paragraphs of the article]

This is issue no. 50 of Sino-Platonic Papers. It was first published in November 1994.

Taipei to stick with Hanyu Pinyin, despite pressure from central gov’t: mayor

Taipei Mayor Hau Lung-bin (Hǎo Lóngbīn / 郝龍斌) said on Sunday that Taipei will not switch from Hanyu Pinyin to Tongyong Pinyin, despite pressure from the Ministry of the Interior to do so.

Questioned by reporters at the wedding of Taiwan’s top “Go” player, Hau stressed that the Taipei City Government would continue to use Hanyu Pinyin despite the Interior Ministry’s push as it’s the most commonly used pinyin system in the international community.

“Taipei City has decided to continue using Hanyu Pinyin to connect with other countries in the world,” Hau said.

He suggested that the Interior Ministry consult with linguistic scholars and learn to respect their expertise when standardizing the romanization of Taiwan’s place and street names.

Yes, the MOI would do well to follow this advice — as would the Taipei City Government itself. Taipei’s stupid @#$%! InTerCaPiTaLiZaTion and lack of apostrophes are significant errors. And sometimes the lack of tone marks is a problem. And don’t get me started about Taipei’s “nicknumbering” system.

Taipei City is the only city in Taiwan that has adopted Hanyu Pinyin.

This is incorrect. Several cities around Taiwan use Hanyu Pinyin, such as Xinzhu and Taizhong, though none as consistently as Taipei.

TVBS is reporting that Taipei will be forced to switch, which I very much doubt will happen — certainly not before the presidential election in March 2008.

Nèizhèngbù de xíngzhèng mìnglìng yīdàn bānbù, bùyòng sòngjiāo Lìyuàn tóngyì, Táiběi shìzhèngfǔ zhǐyǒu zhàobàn de fèn, 5 nián qián, Mǎ Yīngjiǔ qiángshì zhǔdǎo Hànyǔ Pīnyīn, ràng Táiběi Shì chéngwéi tā yǎnzhōng, gēn guójì jiēguǐ de dūshì, 5 nián hòu, Nèizhèngbù dìngdìng fǎlìng qiǎngpò zhíxíng, gěi le yī jì huímǎqiāng.

TVBS also gives the cost for changing Taipei’s signs at NT$8 million (US$250,000).

The TVBS video gives lots of pictures of signs.

MOI and Tongyong Pinyin: update

I have spent many hours over the past few days trying to find out exactly what is behind the recent news story about the Ministry of the Interior and moves to expand Tongyong Pinyin by the end of the year.

I have sent out no fewer than five e-mail messages to various government officials but have received no responses. I have also made more than a dozen phone calls to various ministries and government-information lines. But nobody I spoke with knows what is going on. My wife helped by making some calls on her own. She was eventually able to get through to someone at the Ministry of the Interior who does have a clue about all this.

Here is basically what is happening.

On October 30, Taiwan’s Ministry of the Interior promulgated the government’s guidelines for writing place names (including not just town names but physical features, such as rivers, mountains, temples, bridges, etc.) in English and romanization: Yùgào dìngdìng “biāozhǔn dìmíng yì xiě zhǔnzé” (預告訂定「標準地名譯寫準則」) (MS Word document).

Most of the pages of this document are simply a list of townships and districts throughout Taiwan, as given in Tongyong Pinyin. But it also contains a few pages of general guidelines. Local governments and interested individuals (yes, that could include you, o reader) who wish to comment on these guidelines may do so before the deadline of Thursday, November 8. The question of Tongyong Pinyin vs. Hanyu Pinyin, however, is supposedly off the table, as the Ministry of the Interior must follow the administration in this — though I encourage anyone who writes the ministry to bring up the issue anyway. I will post contact information as soon as I get it.

To return to the matter of the promulgated document, these are the guidelines that Taiwan’s local governments are ordered to use, with local governments’ offices of land administration compiling lists of place names to be standardized within their jurisdictions and submitting these lists to the MOI’s Department of Land Administration (dìzhèng sī fāngyù kē / 地政司方域科).

If local governments reject Tongyong Pinyin and use a different romanization system, the MOI does not have the authority to compel them to switch to Tongyong. But the central government can and almost certainly will exert pressure on them to toe the line.

Making matters worse for advocates of Hanyu Pinyin, the international standard romanization system for Mandarin, is the fact that many local officials, even in “blue” regions, do not believe they have autonomy in this matter, as I know from having spoken with several of them about precisely this topic. Nor, unsurprisingly, do they take the word of a foreigner over what they “know” to be “correct”: that they must use Tongyong whether they like it or not. As an example, the city of Jilong (“Keelung”), which is controlled by the anti-Tongyong “blues,” instituted a plan to standardize street names there with Tongyong Pinyin. Nor will most officials bother to look up the rule they are supposedly following (which, BTW, I can’t show them because it doesn’t exist).

The recently promulgated proposal has extremely limited guidelines. These are most certainly inferior to the fuller guidelines for Hanyu Pinyin — to say nothing of the book-length supplementary guidelines for Hanyu Pinyin (Chinese Romanization: Pronunciation and Orthography and the Xinhua Pinxie Cidian) and carefully produced dictionaries in Hanyu Pinyin.

Probably the best thing I could say about the guidelines is a negative: At least they didn’t adopt Taipei’s StuPid, StuPid PolICy Of InTerCapITaLiZaTion.

The problem that is likely to affect more names than other deficiencies — other than the fundamental matter of Tongyong Pinyin, that is — is the recommended use of the hyphen. Basically, the guidelines call for a hyphen where Hanyu Pinyin would use an apostrophe: before any syllable that begins with a, e, or o, unless that syllable comes at the beginning of a word or immediately follows a hyphen or other dash.
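
To make the rule concrete, here is a minimal sketch of the separator logic, assuming toneless syllables that have already been segmented. The function name `join_syllables` and its signature are my own invention for illustration, not part of either standard, and the exception for a syllable that "immediately follows a hyphen or other dash" is not modeled.

```python
def join_syllables(syllables: list[str], separator: str) -> str:
    """Join pinyin syllables into one capitalized word.

    `separator` is placed before any non-initial syllable that begins
    with a, e, or o: pass "'" for Hanyu Pinyin or "-" for the MOI
    guidelines. The first syllable never gets a separator.
    """
    word = syllables[0]
    for syllable in syllables[1:]:
        if syllable[0].lower() in "aeo":
            word += separator
        word += syllable
    # First letter upper case, all others lower case (no intercapping).
    return word.capitalize()


# The same syllables under the two conventions:
print(join_syllables(["ren", "ai"], "'"))   # Hanyu Pinyin style
print(join_syllables(["ren", "ai"], "-"))   # MOI guideline style
print(join_syllables(["ban", "ciao"], "-"))  # no a/e/o onset, so no separator
```

The only difference between the two conventions is the character passed as `separator`. That is the heart of the problem: once the hyphen is spent on this job, it is no longer free to mark anything else.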

The reason that is a big problem, beyond the failure to follow the standard of Hanyu Pinyin, is that hyphens cannot then be put to the good use they have in Hanyu Pinyin. Hyphens are often needed in signage because they are used in short forms of proper nouns; for example, the correct short form of Taiwan Daxue (National Taiwan University) is “Tai-Da.”

Hyphens can thus help clarify names a great deal because they often indicate an abbreviation. Consider bridge names, in which the hyphen helps indicate the reason for the name:

  • not Huazhong but Hua-Zhong (for [Wan]hua to Zhong[he])
  • not Huajiang but Hua-Jiang (for [Wan]hua to Jiang[zicui])

Or the case given in the guidelines of 嘉南大圳. The recommendation there is for “Jianan dazun.” But giving “Jia-Nan” instead of “Jianan” would help clarify that this is something in Jiayi and Tainan counties.

The government guidelines’ failure to employ the hyphen in the same manner as Hanyu Pinyin is a major deficiency.

Taiwan should have Tongyong Pinyin’s orthography follow the well-established guidelines for Hanyu Pinyin. But the administration’s petty difference-for-the-sake-of-difference policy will likely rule out that course.

more on Taiwan’s new Tongyong move

This morning all three of Taiwan’s English-language newspapers ran the AP story on the Ministry of the Interior’s plan to expand the use of Tongyong Pinyin. (Bonus points to the copy editor at the Taipei Times who changed the original article’s sloppy “Taiwan will standardize the English transliterations of its Chinese Mandarin place names by the end of the year” to “The Romanization of Mandarin place names will be standardized by the end of this year.”)

I have made a few calls about this, but to little effect so far. Unfortunately, I haven’t had the time today to track down someone at the Ministry of the Interior who can give some definitive information about this.

Meanwhile, here’s another article. It gives a little more information: no intercapping (good), hyphens instead of apostrophes (bad), some screwed-up word parsing (bad).

But all of this sounds like old news. How this will be any different in implementation is still unclear.

Wàijí rénshì lái Táiwān gōngzuò huò lǚyóu, zǒng bèi Táiwān de dìmíng yì xiě gǎo de “wù shàsha,” jiéjú cháng yǐ mílù shōuchǎng. Nèizhèngbù 30 rì gōng bù “biāozhǔn dìmíng yì xiě zhǔnzé” cǎo’àn, míng dìng dìmíng yì xiě yǐ “yīnyì” wèi yuánzé, bìng cǎi “Tōngyòng Pīnyīn” wèi jīzhǔn, ruò dìmíng yǒu lìshǐ, yǔyán, guójì guànyòng, shùzì děng tèxìng, zé yǐ dìmíng xìngzhì fānyì, rú Rìyuè Tán yì wéi “Sun Moon Lake;” 306 gāodì yì wéi “Highland 306.”

Gāi cǎo’àn shì yījù “guótǔ cèhuì fǎ” dìngdìng, bìng nàrù Jiàoyùbù zhìdìng de “Zhōngwén yìyīn shǐyòng yuánzé” zuòwéi yì xiě biāozhǔn, dìmíng yì xiě fāngshì yóu dìmíng zhǔguǎn jīguān zìxíng juédìng.

Cǎo’àn zhǐchū, wèi bìmiǎn yì xiě zhě duì wényì rènzhī bùtóng, chǎnshēng yì xiě chāyì, tǒngyī xíngzhèng qūyù de biāozhǔn yì xiě fāngshì, shěng “Province,” shì “City,” xiàn “County,” xiāng-zhèn “Township,” qū “District,” cūnli “Village.” Jiēdào míngchēng yě tǒngyī yì xiě, dàdào “Boulevard,” lù “Road,” jiē “Street,” xiàng “lane,” nòng “Alley.” Lìrú Kǎidágélán Dàdào wéi “Kaidagelan Boulevard.”

Cǎo’àn míng dìng, biāozhǔn dìmíng de yì xiě cǎi tōngyòng pīnyīn, dàn dìmíng hányǒu “shǔxìng míngchēng” shí, yǐ shǔxìng míngchēng yìyì fāngshì yì xiě, rú Dōng Fēng zhíyì wéi “East Peak.”

Ruò shǔxìng míngchēng yǔ biāozhǔn dìmíng zhěngtǐ shìwéi yī ge zhuānyǒu míngchēng shí, bù lìng yǐ yìyì fāngshì fēnkāi yì xiě, rú “Jiā-Nán dà zùn [zhèn?]” yì wéi “Jianan dazun;” Yángmíng Shān yì wéi “Yangmingshan;” Zhúzi Hú yì wéi “Jhuzihhu.”

Lìngwài dìmíng yǒu dāngdì lìshǐ, yǔyán, fēngsúxíguàn, zōngjiào xìnyǎng, guójì guànyòng huò qítā tèshū yuányīn, jīng zhǔguǎn jīguān bào zhōngyāng zhǔguǎn jīguān hédìng hòu, bù shòu “shǔxìng míngchēng” xiànzhì, rú Yù Shān zhíyì wéi Jade Mountain; zhōngyāng shānmài yì wéi “Central Mountains.”

Cǎo’àn guīdìng, biāozhǔn dìmíng yì xiě shūxiě fāngshì, dì-yī ge zìmǔ dàxiě, qíyú zìmǔ xiǎoxiě, rú bǎnqiáo yì wéi “Banciao,” ér fēi “Ban Ciao” huò “Ban-ciao.” Dàn dìmíng de dì-yī ge zì yǐhòu de pīnyīn zìmǔ, chūxiàn a, o, e shí, yǔ qián dānzì jiān yǐ duǎnxiàn liánjiē, rú Qīlǐ’àn yì wéi “Cili-an,” Rén’ài Xiāng wéi “Ren-ai Township.”

Cǐwài, cǎo’àn yě tǒngyī zìrán dìlǐ shítǐ shǔxìng míngchēng, rú píngyuán, péndì, dǎoyǔ, qúndǎo, liè yǔ, jiāo, tān, shāzhōu, jiǎjiǎo, shān, shānmài, fēng, hé xī, hú, tán děng shíwǔ zhǒng yì xiě fāngshì. Lìrú, Dōngshā Qúndǎo yì wéi “Dongsha Islands;” Diàoyútái liè yǔ “Diaoyutai Archipelago;” Běiwèi Tān “Beiwei Bank;” “Ālǐ Shān shānmài” yì wéi “Alishan Mountains;” zhǔfēng yì wéi “Main Peak;” Shānhútán zhíyì wéi “Shanhu Pond.”

source: Yīngyì yǒu “zhǔn” — lǎowài zhǎo lù bùzài wù shàsha (英譯有「準」 老外找路不再霧煞煞), China Times, October 31, 2007