Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original Google Translate
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? roku nichi gogo yon ji san go fun goro , t?ky? to chiyoda ku k?kyogaien no tod? ? uchibori d?ri ? no nij?bashi zen k?saten de , ch?goku kara no kank? kyaku no yon zero dai no dansei ga j?y?sha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shib? shi ta . kuruma wa hod? ni noriage te arui te i ta dansei ? roku ky? ? mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no j?tai . marunouchi sho wa , unten shi te i ta t?ky? to minato ku hakkin san ch?me , kaisha yakuin takahashi nobe tsubuse y?gi sha ? ni yon ? wo jid?sha unten kashitsu sh?gai no utagai de genk? han taiho shi , y?gi wo d? chishi ni kirikae te shirabe te iru .
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? d?sho ni yoru to , shib? shi ta dansei wa ?dan hod? wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni ky? handoru wo kiri , shad? to hod? no sakai ni oka re ta kasetsu no saku wo haneage , hod? ni noriage ta to iu . saku wa hod? de ran’ningu wo shi te i ta dansei ? san yon ? ni atari , dansei wa ry?ashi ni karui kega .
???????????????????????????????????????????? d?sho wa , shib? shi ta dansei no mimoto kakunin wo susumeru totomoni , t?ji no k?saten no shing? no j?ky? wo shirabe te iru .
????????????????????????????????????????? genba sh?hen wa t?ky? kank? no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate: roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō （ uchibori dōri ） no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei （ roku kyū ） mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jōtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha （ ni yon ） wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .

Human transcriber: Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.

Google Translate: dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei （ san yon ） ni atari , dansei wa ryōashi ni karui kega .

Human transcriber: Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.

Google Translate: dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .

Human transcriber: Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Google Translate: genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Human transcriber: Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the righthand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to suit romanized text, and paragraph indentations have been restored.
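
Only a few of these changes are purely mechanical. As a rough illustration, the sketch below applies the ones that need no dictionary: item 6 (standalone “wo”), item 7 (unneeded apostrophes after n), part of item 8 (spaces before commas and periods), and the sentence-initial half of item 4. The function name tidy_romaji and the exact patterns are assumptions made for this example; nothing here reflects how Google's algorithm works, and misreadings, false spaces, and proper nouns (items 1 to 3) are deliberately left alone because they require lexical knowledge.

```python
import re

def tidy_romaji(text: str) -> str:
    """Apply a few dictionary-free clean-ups to machine-romanized Japanese.

    Hypothetical helper for illustration only.
    """
    # Item 6: write the object particle as "o" when "wo" stands alone.
    text = re.sub(r"\bwo\b", "o", text)
    # Item 7: drop an apostrophe after n unless a vowel or y follows,
    # since only then is it needed to mark the mora nasal.
    text = re.sub(r"n['’](?![aiueoyāīūēō])", "n", text)
    # Item 8 (in part): remove the space left before commas and periods.
    text = re.sub(r"\s+([,.])", r"\1", text)
    # Item 4 (in part): capitalize the first letter of each sentence.
    text = re.sub(
        r"(^|\.\s+)([a-zāīūēō])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text.strip()


sample = "saku wa hodo de ran'ningu wo shi te i ta . kuruma wa hidari ni kiri , hodo ni noriage ta ."
print(tidy_romaji(sample))
```

The false spaces in “shi te i ta” and “noriage ta” survive this pass, which is the point: joining them correctly means knowing where words begin and end.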

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders (the reasons were well understood and described long ago, e.g. in Unger 1987), and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization. [1]

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

China’s earliest romanization system

The most recent rerelease from Sino-Platonic Papers is Dì-yī ge Lādīng zìmǔ de Hànyǔ Pīnyīn Fāng’àn shì zěnyàng chǎnshēng de? (How Was the First Romanized Spelling System for Sinitic Produced? / 第一个拉丁字母的汉语拼音方案是怎样产生的), by YIN Binyong (尹斌庸).

The author should be familiar to regular readers of this site, as he wrote the standard works on Hanyu Pinyin orthography — Chinese Romanization: Pronunciation and Orthography and the Xinhua Pinxie Cidian — as well as Pinyin-to-Chinese Character Computer Conversion Systems and the Realization of Digraphia in China.

The text is in Mandarin in Chinese characters. Here is the introduction.
[Image: the Mandarin text (in Chinese characters) of the first two paragraphs of the article]

This is issue no. 50 of Sino-Platonic Papers. It was first published in November 1994.

Taipei to stick with Hanyu Pinyin, despite pressure from central gov’t: mayor

Taipei Mayor Hau Lung-bin (Hǎo Lóngbīn / 郝龍斌) said on Sunday that Taipei will not switch from Hanyu Pinyin to Tongyong Pinyin, despite pressure from the Ministry of the Interior to do so.

Questioned by reporters at the wedding of Taiwan’s top “Go” player, Hau stressed that the Taipei City Government would continue to use Hanyu Pinyin despite the Interior Ministry’s push as it’s the most commonly used pinyin system in the international community.

“Taipei City has decided to continue using Hanyu Pinyin to connect with other countries in the world,” Hau said.

He suggested that the Interior Ministry consult with linguistic scholars and learn to respect their expertise when standardizing the romanization of Taiwan’s place and street names.

Yes, the MOI would do well to follow this advice — as would the Taipei City Government itself. Taipei’s stupid @#$%! InTerCaPiTaLiZaTion and lack of apostrophes are significant errors. And sometimes the lack of tone marks is a problem. And don’t get me started about Taipei’s “nicknumbering” system.

Taipei City is the only city in Taiwan that has adopted Hanyu Pinyin.

This is incorrect. Several cities around Taiwan use Hanyu Pinyin, such as Xinzhu and Taizhong, though none as consistently as Taipei.

TVBS is reporting that Taipei will be forced to switch, which I very much doubt will happen — certainly not before the presidential election in March 2008.

Nèizhèngbù de xíngzhèng mìnglìng yīdàn bānbù, bùyòng sòngjiāo Lìyuàn tóngyì, Táiběi shìzhèngfǔ zhǐyǒu zhàobàn de fèn, 5 nián qián, Mǎ Yīngjiǔ qiángshì zhǔdǎo Hànyǔ Pīnyīn, ràng Táiběi Shì chéngwéi tā yǎnzhōng, gēn guójì jiēguǐ de dūshì, 5 nián hòu, Nèizhèngbù dìngdìng fǎlìng qiǎngpò zhíxíng, gěi le yī jì huímǎqiāng.

TVBS also gives the cost for changing Taipei’s signs at NT$8 million (US$250,000).

The TVBS video gives lots of pictures of signs.


MOI and Tongyong Pinyin: update

I have spent many hours over the past few days trying to find out exactly what is behind the recent news story about the Ministry of Education and moves to expand Tongyong Pinyin by the end of the year.

I have sent out no fewer than five e-mail messages to various government officials but have received no responses. I have also made more than a dozen phone calls to various ministries and government-information lines. But nobody I spoke with knows what is going on. My wife helped by making some calls on her own. She was eventually able to get through to someone at the Ministry of the Interior who does have a clue about all this.

Here is basically what is happening.

On October 30, Taiwan’s Ministry of the Interior promulgated the government’s guidelines for writing place names (including not just town names but physical features, such as rivers, mountains, temples, bridges, etc.) in English and romanization: Yùgào dìngdìng “biāozhǔn dìmíng yì xiě zhǔnzé” (預告訂定「標準地名譯寫準則」) (MS Word document).

Most of the pages of this document are simply a list of townships and districts throughout Taiwan, as given in Tongyong Pinyin. But it also contains a few pages of general guidelines. Local governments and interested individuals (yes, that could include you, o reader) who wish to comment on these guidelines may do so before the deadline of Thursday, November 8. The question of Tongyong Pinyin vs. Hanyu Pinyin, however, is supposedly off the table, as the Ministry of the Interior must follow the administration in this — though I encourage anyone who writes the ministry to bring up the issue anyway. I will post contact information as soon as I get it.

To return to the matter of the promulgated document, these are the guidelines that Taiwan’s local governments are ordered to use, with local governments’ offices of land administration compiling lists of place names to be standardized within their jurisdictions and submitting these lists to the MOI’s Department of Land Administration (dìzhèng sī fāngyù kē / 地政司方域科).

If local governments reject Tongyong Pinyin and use a different romanization system, the MOI does not have the authority to compel them to switch to Tongyong. But the central government can and almost certainly will exert pressure on them to toe the line.

Making matters worse for advocates of Hanyu Pinyin, the international standard romanization system for Mandarin, is the fact that many local officials — even in “blue” regions — do not believe they have autonomy in this matter, as I know from having spoken with several of them about precisely this topic. Nor, unsurprisingly, do they take the word of a foreigner over what they “know” to be “correct”: that they must use Tongyong whether they like it or not. As an example, the city of Jilong (“Keelung”), which is controlled by the anti-Tongyong “blues,” instituted a plan to standardize street names there with Tongyong Pinyin. Nor will most officials bother to look up the rule they are supposedly following — and which, BTW, I can’t show them because it doesn’t exist.

The recently promulgated proposal has extremely limited guidelines. These are most certainly inferior to the fuller guidelines for Hanyu Pinyin — to say nothing of the book-length supplementary guidelines for Hanyu Pinyin (Chinese Romanization: Pronunciation and Orthography and the Xinhua Pinxie Cidian) and carefully produced dictionaries in Hanyu Pinyin.

Probably the best thing I could say about the guidelines is a negative: At least they didn’t adopt Taipei’s StuPid, StuPid PolICy Of InTerCapITaLiZaTion.

The problem that is likely to affect more names than other deficiencies — other than the fundamental matter of Tongyong Pinyin, that is — is the recommended use of the hyphen. Basically, the guidelines call for a hyphen where Hanyu Pinyin would use an apostrophe: before any syllable that begins with a, e, or o, unless that syllable comes at the beginning of a word or immediately follows a hyphen or other dash.
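
The rule just described is mechanical enough to sketch in a few lines. The function below joins the syllables of a single name and inserts a separator before any non-initial syllable that begins with a, e, or o; passing a hyphen gives the form the new guidelines call for, and passing an apostrophe gives the Hanyu Pinyin form. The names join_syllables and base_letter are hypothetical, the input is assumed to be already split into syllables, and capitalizing only the first letter follows the guidelines' own examples (“Banciao,” “Ren-ai”).

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return a letter without its tone mark, lowercased (e.g. 'à' -> 'a')."""
    return unicodedata.normalize("NFD", ch)[0].lower()

def join_syllables(syllables, separator="-"):
    """Join one name's syllables, marking a/e/o onsets inside the word.

    Hypothetical helper: separator="-" mimics the MOI guidelines,
    separator="'" mimics Hanyu Pinyin.
    """
    word = ""
    for syllable in syllables:
        needs_mark = (
            word != ""                               # not at the start of the word
            and base_letter(syllable[0]) in "aeo"    # syllable begins with a, e, or o
            and not word.endswith(("-", separator))  # not right after a hyphen
        )
        word += (separator if needs_mark else "") + syllable.lower()
    # First letter capitalized, the rest lowercase.
    return word[:1].upper() + word[1:]


print(join_syllables(["ren", "ai"]))                 # guidelines' hyphen form
print(join_syllables(["ren", "ai"], separator="'"))  # Hanyu Pinyin apostrophe form
print(join_syllables(["ban", "ciao"]))               # no separator needed
```

Because the hyphen is spent on syllable boundaries here, it is no longer free to carry any other meaning in a name.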

The reason that is a big problem, beyond the failure to follow the standard of Hanyu Pinyin, is that hyphens cannot then be put to the good use they have in Hanyu Pinyin. Hyphens are often needed in signage because they are used in short forms of proper nouns. For example, the correct short form of Taiwan Daxue (National Taiwan University) is “Tai-Da.”

Hyphens can thus help clarify names a great deal because they often indicate an abbreviation. Consider bridge names, in which the hyphen helps indicate the reason for the name:

  • not Huazhong but Hua-Zhong (for [Wan]hua to Zhong[he])
  • not Huajiang but Hua-Jiang (for [Wan]hua to Jiang[zicui])

Or the case given in the guidelines of 嘉南大圳. The recommendation there is for “Jianan dazun.” But giving “Jia-Nan” instead of “Jianan” would help clarify that this is something in Jiayi and Tainan counties.

The government guidelines’ failure to employ the hyphen in the same manner as Hanyu Pinyin is a major deficiency.

Taiwan should have Tongyong Pinyin’s orthography follow the well-established guidelines for Hanyu Pinyin. But the administration’s petty difference-for-the-sake-of-difference policy will likely rule out that course.

more on Taiwan’s new Tongyong move

This morning all three of Taiwan’s English-language newspapers ran the AP story on the Ministry of the Interior’s plan to expand the use of Tongyong Pinyin. (Bonus points to the copy editor at the Taipei Times who changed the original article’s sloppy “Taiwan will standardize the English transliterations of its Chinese Mandarin place names by the end of the year” to “The Romanization of Mandarin place names will be standardized by the end of this year.”)

I have made a few calls about this, but to little effect so far. Unfortunately, I haven’t had the time today to track down someone at the Ministry of the Interior who can give some definitive information about this.

Meanwhile, here’s another article. It gives a little more information: no intercapping (good), hyphens instead of apostrophes (bad), some screwed-up word parsing (bad).

But all of this sounds like old news. How this will be any different in implementation is still unclear.

Wàijí rénshì lái Táiwān gōngzuò huò lǚyóu, zǒng bèi Táiwān de dìmíng yì xiě gǎo de “wù shàsha,” jiéjú cháng yǐ mílù shōuchǎng. Nèizhèngbù 30 rì gōng bù “biāozhǔn dìmíng yì xiě zhǔnzé” cǎo’àn, míng dìng dìmíng yì xiě yǐ “yīnyì” wèi yuánzé, bìng cǎi “Tōngyòng Pīnyīn” wèi jīzhǔn, ruò dìmíng yǒu lìshǐ, yǔyán, guójì guànyòng, shùzì děng tèxìng, zé yǐ dìmíng xìngzhì fānyì, rú Rìyuè Tán yì wéi “Sun Moon Lake;” 306 gāodì yì wéi “Highland 306.”

Gāi cǎo’àn shì yījù “guótǔ cèhuì fǎ” dìngdìng, bìng nàrù Jiàoyùbù zhìdìng de “Zhōngwén yìyīn shǐyòng yuánzé” zuòwéi yì xiě biāozhǔn, dìmíng yì xiě fāngshì yóu dìmíng zhǔguǎn jīguān zìxíng juédìng.

Cǎo’àn zhǐchū, wèi bìmiǎn yì xiě zhě duì wényì rènzhī bùtóng, chǎnshēng yì xiě chāyì, tǒngyī xíngzhèng qūyù de biāozhǔn yì xiě fāngshì, shěng “Province,” shì “City,” xiàn “County,” xiāng-zhèn “Township,” qū “District,” cūnli “Village.” Jiēdào míngchēng yě tǒngyī yì xiě, dàdào “Boulevard,” lù “Road,” jiē “Street,” xiàng “lane,” nòng “Alley.” Lìrú Kǎidágélán Dàdào wéi “Kaidagelan Boulevard.”

Cǎo’àn míng dìng, biāozhǔn dìmíng de yì xiě cǎi tōngyòng pīnyīn, dàn dìmíng hányǒu “shǔxìng míngchēng” shí, yǐ shǔxìng míngchēng yìyì fāngshì yì xiě, rú Dōng Fēng zhíyì wéi “East Peak.”

Ruò shǔxìng míngchēng yǔ biāozhǔn dìmíng zhěngtǐ shìwéi yī ge zhuānyǒu míngchēng shí, bù lìng yǐ yìyì fāngshì fēnkāi yì xiě, rú “Jiā-Nán dà zùn [zhèn?]” yì wéi “Jianan dazun;” Yángmíng Shān yì wéi “Yangmingshan;” Zhúzi Hú yì wéi “Jhuzihhu.”

Lìngwài dìmíng yǒu dāngdì lìshǐ, yǔyán, fēngsúxíguàn, zōngjiào xìnyǎng, guójì guànyòng huò qítā tèshū yuányīn, jīng zhǔguǎn jīguān bào zhōngyāng zhǔguǎn jīguān hédìng hòu, bù shòu “shǔxìng míngchēng” xiànzhì, rú Yù Shān zhíyì wéi Jade Mountain; zhōngyāng shānmài yì wéi “Central Mountains.”

Cǎo’àn guīdìng, biāozhǔn dìmíng yì xiě shūxiě fāngshì, dì-yī ge zìmǔ dàxiě, qíyú zìmǔ xiǎoxiě, rú bǎnqiáo yì wéi “Banciao,” ér fēi “Ban Ciao” huò “Ban-ciao.” Dàn dìmíng de dì-yī ge zì yǐhòu de pīnyīn zìmǔ, chūxiàn a, o, e shí, yǔ qián dānzì jiān yǐ duǎnxiàn liánjiē, rú Qīlǐ’àn yì wéi “Cili-an,” Rén’ài Xiāng wéi “Ren-ai Township.”

Cǐwài, cǎo’àn yě tǒngyī zìrán dìlǐ shítǐ shǔxìng míngchēng, rú píngyuán, péndì, dǎoyǔ, qúndǎo, liè yǔ, jiāo, tān, shāzhōu, jiǎjiǎo, shān, shānmài, fēng, hé xī, hú, tán děng shíwǔ zhǒng yì xiě fāngshì. Lìrú, Dōngshā Qúndǎo yì wéi “Dongsha Islands;” Diàoyútái liè yǔ “Diaoyutai Archipelago;” Běiwèi Tān “Beiwei Bank;” “Ālǐ Shān shānmài” yì wéi “Alishan Mountains;” zhǔfēng yì wéi “Main Peak;” Shānhútán zhíyì wéi “Shanhu Pond.”

source: Yīngyì yǒu “zhǔn”: lǎowài zhǎo lù bùzài wù shàsha (英譯有「準」 老外找路不再霧煞煞), China Times, October 31, 2007