Google Translate and romaji revisited

OK, Google has improved its Pinyin converter some, though it still fails in important areas. So that’s the present situation for Google and Mandarin.

How about for Google and Japanese?

Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures generously agreed to reexamine Google’s performance in conversions to rōmaji (Japanese written in romanization).

Below is his latest evaluation.

For his initial analysis (in December 2009), see Google Translate and rōmaji.

I ran the test passage through Google Translate again. There’s some improvement, but it’s still pretty mediocre.

Original:

6日午後4時35分ごろ、東京都千代田区皇居外苑の都道(内堀通り)の二重橋前交差点で、中国からの観光客の40代の男性が乗用車にはねられ、全身を強く打って間もなく死亡した。車は歩道に乗り上げて歩いていた男性(69)もはね、男性は頭を強く打って意識不明の重体。丸の内署は、運転していた東京都港区白金3丁目、会社役員高橋延拓容疑者(24)を自動車運転過失傷害の疑いで現行犯逮捕し、容疑を同致死に切り替えて調べている。

同署によると、死亡した男性は横断歩道を歩いて渡っていたところを直進してきた車にはねられた。車は左に急ハンドルを切り、車道と歩道の境に置かれた仮設のさくをはね上げ、歩道に乗り上げたという。さくは歩道でランニングをしていた男性(34)に当たり、男性は両足に軽いけが。

同署は、死亡した男性の身元確認を進めるとともに、当時の交差点の信号の状況を調べている。

現場周辺は東京観光のスポットの一つだが、最近はジョギングを楽しむ人も増えている。

Google Translate:

6-Nichi gogo 4-ji 35-fun-goro, Tōkyō-to Chiyoda-ku Kōkyogaien no todō (uchibori-dōri) no Nijūbashi zen kōsaten de, Chūgoku kara no kankō kyaku no 40-dai no dansei ga jōyōsha ni hane rare, zenshin o tsuyoku Utte mamonaku shibō shita. Kuruma wa hodō ni noriagete aruite ita dansei (69) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jūtai. Marunouchi-sho wa, unten shite ita Tōkyō-to Minato-ku hakkin 3-chōme, kaisha yakuin Takahashi nobe Tsubuse yōgi-sha (24) o jidōsha unten kashitsu shōgai no utagai de genkō-han taiho shi, yōgi o dō chishi ni kirikaete shirabete iru.

Dōsho ni yoru to, shibō shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin shite kita kuruma ni hane rareta. Kuruma wa hidari ni kyū handoru o kiri, shadō to hodō no sakai ni oka reta kasetsu no saku o haneage, hodō ni noriageta toyuu. Saku wa hodō de ran’ningu o shite ita dansei (34) niatari, dansei wa ryōashi ni karui kega.

Dōsho wa, shibō shita dansei no mimoto kakunin o susumeru totomoni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Genba shūhen wa Tōkyō kankō no supotto no hitotsudaga, saikin wa jogingu o tanoshimu hito mo fuete iru.

Notes:

  • The use of numerals dodges a plethora of errors, but “6-Nichi” is still wrong for Muika.
  • Lots of correct capitalizations have been added, but “uchibori” was missed and “Utte” capitalized by mistake.
  • Some false spaces or lack of spaces persist: “hane rare”, “oka reta”; “hitotsudaga” and “niatari” were correctly hitotsu da ga and ni atari in the original test.
  • Names still get butchered (“hakkin” for Shirogane, “nobe Tsubuse” for Nobuhiro).
  • The needless apostrophe in “ran’ningu” is still there.
  • Interestingly, “toyuu” is a new error: it should be to iu.
  • There’s evidence of some attempt to use hyphens, but why not in “kankō kyaku” or “Nijūbashi zen”?

So, to update: Google gets kudos for conscientiousness, but I stick by my original comments.

For more by Prof. Unger, see Pinyin.info’s recommended readings, which include selections from The Fifth Generation Fallacy: Why Japan Is Betting Its Future on Artificial Intelligence, Literacy and Script Reform in Occupation Japan: Reading Between the Lines, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí, wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà, nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ(bǐ rú“de, de, de“ hé“le” de bù tóng yòng fǎ) ma?zhī dào zhè ge yǔ yán de jī běn yīn jié(bù bāo kuò shēng diào) zhǐ yǒu408gè ma?

Now the same passage will look like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '在西安, 在长安, 在重庆, 在北京'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

是馬英九
是毛泽东
是陳水扁

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the missing instances of “Is” in the second and third lines are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn”. So even though the software understands that these names are units, the capitalization and word parsing are still wrong and they are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”

There is nothing obscure about capitalizing proper nouns. How did this get missed?
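To give a sense of how little is involved, here is a minimal Python sketch. It assumes the converter already knows that a run of syllables is a personal name and where the family name ends (which Google's parsing suggests it does). The function name and its inputs are hypothetical, made up for illustration, and the sketch ignores apostrophes.

```python
def format_personal_name(family, given):
    """Format a personal name in Hanyu Pinyin.

    family, given: lists of lowercase Pinyin syllables (hypothetical input
    format). The family name and the given name are each written solid and
    capitalized, with a space between them.
    """
    family_part = "".join(family).capitalize()
    given_part = "".join(given).capitalize()
    return family_part + " " + given_part


print(format_personal_name(["mǎ"], ["yīng", "jiǔ"]))  # Mǎ Yīngjiǔ
print(format_personal_name(["máo"], ["zé", "dōng"]))  # Máo Zédōng
```

Place names would work the same way once the converter has flagged them as proper nouns: join the syllables, then capitalize the first letter.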

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

阿爾巴尼亞
然而
仁愛
蓮藕

The Pinyin is rendered as “Āěrbāníyǎ Ránér Rénài Liánǒu” rather than the correct forms of Ā’ěrbāníyǎ, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
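Here is roughly what such a script could look like in Python. This is only a sketch: it assumes the converter can hand over each word as a list of syllables (a hypothetical input format), and it applies the basic rule that an apostrophe goes before any syllable that begins with a, e, or o unless that syllable starts the word.

```python
import unicodedata

APOSTROPHE = "\u2019"  # the typographic apostrophe, as in Xi’an


def base_letter(char):
    """Return the plain lowercase letter under a tone mark, e.g. 'ǒ' -> 'o'."""
    return unicodedata.normalize("NFD", char)[0].lower()


def join_syllables(syllables):
    """Join the Pinyin syllables of one word, inserting apostrophes.

    An apostrophe is added before every non-initial syllable that begins
    with a, e, or o. Syllables are assumed to be non-empty strings.
    """
    word = syllables[0]
    for syllable in syllables[1:]:
        if base_letter(syllable[0]) in "aeo":
            word += APOSTROPHE
        word += syllable
    return word


print(join_syllables(["rán", "ér"]))                   # rán’ér
print(join_syllables(["Ā", "ěr", "bā", "ní", "yǎ"]))   # Ā’ěrbāníyǎ
print(join_syllables(["běi", "jīng"]))                 # běijīng
```

Stripping the tone mark before checking the first letter is what keeps the rule this short; without it the script would need a table of every accented a, e, and o.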

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).

Next would come something requiring a bit more care: the proper handling of Mandarin’s three aspect-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
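A rough Python sketch of that last fix, working on the converter's space-separated output. It assumes that a free-standing, toneless "men" is always the suffix (the word for "door," mén, carries a tone mark and is left alone), and it does not try to handle punctuation stuck to a token. The function name is made up for illustration.

```python
def attach_men(text):
    """Attach a free-standing neutral-tone 'men' to the preceding word."""
    words = []
    for token in text.split():
        if token == "men" and words:
            words[-1] += token  # háizi + men -> háizimen
        else:
            words.append(token)
    return " ".join(words)


print(attach_men("háizi men"))  # háizimen
print(attach_men("kāi mén"))    # kāi mén (unchanged)
```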

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

Oxford Chinese Dictionary goes online

cover image of the Oxford Chinese Dictionary

Oxford University Press has just announced that its massive Oxford Chinese Dictionary is now available through its Oxford Language Dictionaries Online subscription service.

I haven’t seen the online version yet myself; but from the publisher’s description it appears to be largely the same as the published edition, whose paucity of Pinyin is disappointing. The publisher, however, is promising that “Pinyin will be added to all Chinese translations” in November, which should be a major step forward.

Perhaps some of you at universities have institutional access. I would welcome reports.

source: What’s New, Oxford Language Dictionaries Online, May 2011.

old fashioned

photo of donuts and their label: 'Choco Fashioned / 巧克力歐菲香', price NT$35 (about US$1.20)

Here’s a shot of some Hanzified, Mandarinized English I recently came across. Qiǎokèlì (巧克力) is of course a well-established loan word, from the English “chocolate” (though here the English is given in the more Japanese-English form of choco, as befits a Japanese donut chain store in Taiwan). Ōufēixiāng (歐菲香) is a rendering of “old fashioned.” Although the “old” is missing from the English above, it can be seen in both of the tags pictured below.

photo of donuts and their labels: 'White Cocoa Old Fashioned / 白可可歐菲香' and 'Old Fashioned / 原味歐菲香'
Bái kěkě ōufēixiāng (白可可歐菲香) and yuán wèi ōufēixiāng (原味歐菲香).

And if that’s not enough to fill you up with Hanzified English, perhaps try a piece of Bōshìdùn pài (波士頓派), i.e., “Boston [cream] pie.”

Feichang nankan!

The sign in the photo below has been up for years; but only recently did I finally get a chance to take a halfway decent photo of it. It’s just outside the second terminal of Taiwan’s main international airport and thus is the first example of road signage that many visitors to Taiwan see.

The atrocious typography displayed in how “Nankan” (南崁/Nánkàn) is written is certainly a good introduction to the chabuduo world of Taiwan’s signage.

Truly nánkàn (ugly)!

a directional sign pointing the way to Nankan -- but 'Nankan' is written with all letters the same height (i.e., the capital 'N' is reduced to the height of the letter 'a' and the 'k' is similarly shrunken)

Conferences in Hawaii

Tomorrow morning I’m off to Honolulu for the Zhang Liqing Memorial International Conference on Hanyu Pinyin. This promises to be a tremendously exciting event, with select scholars from throughout the United States, Asia, and Oceania participating. I’ll have more to say about this after the gathering.

While I’m in Hawaii I may drop in on the joint conference of the Association for Asian Studies and the International Convention of Asia Scholars (March 31–April 3). You might think, though, that with nearly 800 sessions on just about everything under the sun, at least a few of them would discuss romanization. (But nooo.) Still, session 282, “Beyond Cultural Essentialism: Neo-Orientalism in Chinese Studies” (Friday morning, 10:15-12:15), sounds interesting, especially Edward McDonald’s talk on character fetishization in Chinese studies. McDonald’s new book, Learning Chinese, Turning Chinese: Challenges to Becoming Sinophone in a Globalised World, also covers this topic.

If you know of anything else particularly interesting going on at the AAS-ICAS conference or in Honolulu at large, please let me know. (For example, what’s the best bookstore there?)

Spreading the good news

Behold, I bring you good tidings.

As I keep having to note, most of the things that are supposedly in Pinyin are terrible. This is not because Pinyin itself is inherently poor or difficult. It’s because most people who produce such things have a fundamental lack of understanding of Pinyin as a system. (And, yes, that includes most users in China.) So it is with amazement that I report today on a journal that not only offers dozens of pages in Hanyu Pinyin — good Hanyu Pinyin — but does so twice every month. It’s also well worth noting that the journal is aimed primarily at adult native speakers of Mandarin, not foreigners trying to pick up the language, though certainly it could also be read by people in the latter group.

From what I’ve seen so far, this journal gets right the things most commonly written incorrectly elsewhere, including:

And it doesn’t use the atrocious ɑ that some people mistakenly believe is required either.

Unfortunately, punctuation and alphanumerics are not included in the Pinyin. But other than that there’s very little that doesn’t follow standard Pinyin orthography, the main exception being the indication of the tone sandhi related to the special cases of 一 and 不 (e.g., the journal gives “bú shì” and “búdà” instead of the standard “bù shì” and “bùdà,” and “yìhuíshì” and “yí wèi” instead of the standard “yīhuíshì” and “yī wèi”). That said, though, tone changes related to yi and bu can be something of a pain. So although this isn’t standard, I can see why it was done and am not entirely unsympathetic to this approach.
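For readers curious about what those tone changes involve, here is a simplified Python sketch of the usual textbook rules: bù is pronounced bú before a fourth-tone syllable, and yī is pronounced yí before a fourth tone and yì before a first, second, or third tone. This is an illustration only, not the journal's method. It does not cover the cases where yī keeps its first tone (ordinals, counting, end of a phrase) or neutral-tone syllables such as ge, and all the names in it are my own.

```python
import unicodedata

# Combining marks that carry the four tones in decomposed (NFD) Pinyin.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def nfc(text):
    """Normalize to composed form so that string comparisons are reliable."""
    return unicodedata.normalize("NFC", text)


def tone_of(syllable):
    """Return the tone number (1-4) of a syllable, or 0 for neutral tone."""
    for char in unicodedata.normalize("NFD", syllable):
        if char in TONE_MARKS:
            return TONE_MARKS[char]
    return 0


def spoken_form(syllable, next_syllable):
    """Return yi or bu as pronounced before next_syllable (simplified)."""
    following = tone_of(next_syllable)
    if nfc(syllable) == nfc("bù") and following == 4:
        return nfc("bú")
    if nfc(syllable) == nfc("yī"):
        if following == 4:
            return nfc("yí")
        if following in (1, 2, 3):
            return nfc("yì")
    return nfc(syllable)


print(spoken_form("bù", "shì"))  # bú
print(spoken_form("yī", "wèi"))  # yí
print(spoken_form("yī", "huí"))  # yì
```

Standard orthography writes the underlying tones and leaves this calculation to the reader; the journal in effect does it for them.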

Here are a few sample lines (click to enlarge):
screenshot of some text in the journal, showing text in simplified Chinese characters with word-parsed Hanyu Pinyin above the Hanzi. Note: Yifusuoshu/以弗所書 = Ephesians

It would be nice if this were in Unicode, to aid searches and cutting and pasting. The text, however, appears to have been made in a system devised years ago by the people at the journal. Regardless, I’m happy to see the Pinyin.

Overall, despite the lamentable absence of punctuation and Arabic numerals in the Pinyin, this is quality work, which is perhaps all the more remarkable in that the Pinyin and simplified Hanzi edition of this journal is not truly free to circulate in the land of its target audience. That’s because its publishers are Jehovah’s Witnesses, a group suppressed by the PRC (though it appears that at least at the moment their sites are not blocked by the great firewall). The journal, Shǒuwàngtái, may be more familiar to you by its English name: Watchtower. Whatever you might think of Jehovah’s Witnesses, I hope you’ll recognize the considerable accomplishment of those who put together this publication.

Getting to the Jehovah’s Witnesses Web pages that link to Shǒuwàngtái can be tricky. (Go to the magazines page, select “Chinese (Simplified)” for the language; then choose the month and file with Pinyin.) So I’m providing direct links to some documents below:

I haven’t found any Pinyin editions other than those. Perhaps old ones are taken offline.

Rénrén Dōu Xūyào Zhīdao De Hǎo Xiāoxi (I'd prefer 'de' instead of 'De' -- but that's no big deal) 人人都需要知道的好消息

With thanks to Victor Mair.

Weishenme Zhongwen zheme TM nan?

David Moser’s essay Why Chinese Is So Damn Hard — which is one of the most popular readings here on Pinyin Info, with perhaps half a million page views to date (nothing to dǎ pēntì at!) — has been translated into Mandarin: Wèishénme Zhōngwén zhème TM nán? (为什么中文这么TM难?). (Gotta love the use of Roman letters there.)

Although the translation has been online for only 24 hours or so, it has already received more than 150 comments.

A suggestion for readers and translators looking for something similar: Moser’s Some Things Chinese Characters Can’t Do-Be-Do-Be-Do.