Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí, wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà, nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ(bǐ rú“de, de, de“ hé“le” de bù tóng yòng fǎ) ma?zhī dào zhè ge yǔ yán de jī běn yīn jié(bù bāo kuò shēng diào) zhǐ yǒu408gè ma?

Now the same passage will look like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '在西安, 在长安, 在重庆, 在北京'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

是馬英九
是毛泽东
是陳水扁

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the two missing instances of “Is” are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn”. So even though the software understands that these names are units, the capitalization and word parsing are still wrong, and the names are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”

There is nothing obscure about capitalizing proper nouns. How did this get missed?
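To illustrate how small the missing step is, here is a minimal Python sketch of formatting a personal name once the converter already has the syllables. It is only a sketch under stated assumptions: the function name `format_personal_name` and the `surname_syllables` parameter are hypothetical, the caller is assumed to know how many syllables belong to the family name, and apostrophes inside the given name are not handled here.

```python
def format_personal_name(syllables, surname_syllables=1):
    """Join tone-marked syllables into 'Surname Givenname' form.

    Hypothetical helper: assumes the caller already knows that the
    syllables form a personal name and how long the surname is.
    """
    surname = "".join(syllables[:surname_syllables]).capitalize()
    given = "".join(syllables[surname_syllables:]).capitalize()
    # strip() covers the case of a name with no given-name syllables
    return f"{surname} {given}".strip()


print(format_personal_name(["mǎ", "yīng", "jiǔ"]))
print(format_personal_name(["máo", "zé", "dōng"]))
print(format_personal_name(["chén", "shuǐ", "biǎn"]))
```

A real converter would also need a list of two-syllable family names (Ōuyáng, Sīmǎ, and so on) to set `surname_syllables` correctly.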

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

阿爾巴尼亞
然而
仁愛
蓮藕

The Pinyin is rendered as “Āěrbāníyǎ Ránér Rénài Liánǒu” rather than the correct forms of Ā’ěrbāníyǎ, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
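Here is roughly what such a script might look like in Python. This is a minimal sketch, not a description of how Google's converter works: it assumes the converter already has each word as a list of tone-marked syllables, and the names `starts_with_a_o_e` and `join_syllables` are my own. The rule it applies is the standard one: within a word, any syllable after the first that begins with a, o, or e gets an apostrophe in front of it.

```python
import unicodedata

APOSTROPHE = "\u2019"  # the typographic apostrophe used in this post


def starts_with_a_o_e(syllable):
    """True if the syllable begins with a, o, or e, ignoring tone marks."""
    if not syllable:
        return False
    # NFD splits a tone-marked vowel into base letter + combining mark
    base = unicodedata.normalize("NFD", syllable)[0].lower()
    return base in ("a", "o", "e")


def join_syllables(syllables):
    """Join the syllables of one word, adding apostrophes where required."""
    word = ""
    for index, syllable in enumerate(syllables):
        if index > 0 and starts_with_a_o_e(syllable):
            word += APOSTROPHE
        word += syllable
    return word


for parts in (["rán", "ér"], ["rén", "ài"], ["lián", "ǒu"],
              ["Ā", "ěr", "bā", "ní", "yǎ"], ["běi", "jīng"]):
    print(join_syllables(parts))
```

Note that the check looks only at the following syllable, so it places the apostrophe in Xī’ān and Cháng’ān alike, whether or not the spelling would otherwise be ambiguous.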

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).
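The easy part of the number rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `number_to_pinyin` is hypothetical, it covers just 1 through 100, and it leaves aside questions such as liǎng versus èr before measure words.

```python
DIGITS = ["", "yī", "èr", "sān", "sì", "wǔ", "liù", "qī", "bā", "jiǔ"]
APOSTROPHE = "\u2019"


def number_to_pinyin(n):
    """Return the Pinyin for an integer from 1 to 100, written solid."""
    if not 1 <= n <= 100:
        raise ValueError("this sketch only covers 1 through 100")
    if n == 100:
        return "yībǎi"
    tens, ones = divmod(n, 10)
    parts = []
    if tens > 1:
        parts.append(DIGITS[tens])  # e.g. sān in sānshí
    if tens >= 1:
        parts.append("shí")
    if ones:
        parts.append(DIGITS[ones])
    word = parts[0]
    for part in parts[1:]:
        # èr is the only syllable here that begins with a, o, or e,
        # so it is the only one needing an apostrophe inside a word
        if part == "èr":
            word += APOSTROPHE
        word += part
    return word


for n in (7, 10, 12, 20, 35, 99, 100):
    print(n, number_to_pinyin(n))
```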

Next would come something requiring a bit more care: the proper handling of Mandarin’s three aspect-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
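As a rough illustration, attaching -men could be done as a post-processing pass over the converter's word tokens. The Python below is a sketch under assumptions of my own: the function name `attach_men` is hypothetical, the input is assumed to be a list of tone-marked tokens, and a neutral-tone "men" standing alone is treated as the suffix (mén, "door," carries a tone mark and is left untouched).

```python
def attach_men(tokens):
    """Merge a free-standing neutral-tone 'men' into the preceding word."""
    merged = []
    for token in tokens:
        if token == "men" and merged:
            merged[-1] += token  # háizi + men -> háizimen
        else:
            merged.append(token)
    return merged


print(" ".join(attach_men(["háizi", "men", "lái", "le"])))
print(" ".join(attach_men(["dà", "mén"])))
```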

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

Oxford Chinese Dictionary goes online

Oxford University Press has just announced that its massive Oxford Chinese Dictionary is now available through its Oxford Language Dictionaries Online subscription service.

I haven’t seen the online version yet myself; but from the publisher’s description it appears to be largely the same as the published edition, whose paucity of Pinyin is disappointing. The publisher, however, is promising that “Pinyin will be added to all Chinese translations” in November, which should be a major step forward.

Perhaps some of you at universities have institutional access. I would welcome reports.

source: What’s New, Oxford Language Dictionaries Online, May 2011.

A clang on the Taipei MRT announcements

People generally don’t listen carefully to the announcements on the Taipei MRT, a subway/elevated train mass-transit system. With four languages to get through (Mandarin, Taiwanese, Hakka, and English), that’s a lot of talking. And the cars can be so full that it’s hard to hear such things clearly over all the background noise anyway. Still, you’d think that at least the people who make the recordings would be paying attention.

Below is a link to a recording of a relatively new announcement, advising people on the Danshui line that Minquan West Road is the place to change trains for the Luzhou line, which opened late last year: Mínquán West Road Station. Attention: passengers transferring to Sānchóng, Lúzhōu, or Zhōngxiào-Xīnshēng please change trains at this station.

Or at least what I typed above is what the announcement is supposed to give. As you may have noticed, however, “Zhōngxiào-Xīnshēng” is rendered “Zhongxiao-Xinshang,” with a very un-Mandarin shang that rhymes with the English words clang, pang, hang, and sang. And that’s without getting into the matter of tones.

I pointed out this error to Taipei City Hall and the authorities in charge of the MRT. As usual, I had to spend some time repeatedly explaining: “No, Xinshang is not the English pronunciation of Xīnshēng. Xīnshēng isn’t English. It’s Mandarin. What the announcement gives is simply an error….” I was pleasantly surprised, however, that the main person I spoke to at TRTS did not require the usual explanations. He understood the problem and said it would be fixed.

This, however, was a couple of months ago. The recordings have not yet been changed. I haven’t been holding my breath over this, though, because the official with the MRT system warned that it would take time to run a public bid notice for a new recording, make the new recording, and then install the recording in the front and back cars of some 100 trains. Still, the system has been known to move fairly quickly; unfortunately, this usually happens only when the change is for the worse, such as renaming Xindian City Hall as Xindian City Office (now Xindian District Office), or renaming the whole Muzha line because some superstitious nitwits thought that a joking, non-official nickname was bringing the system bad luck.

For longtime residents of Taipei, the shang mispronunciation will likely bring back memories of the bad old days when the MRT system first opened. Back then the signage was predominantly in bastardized Wade-Giles, with the pronunciations in the English announcements matching what a clueless Westerner might say when shown names like Kuting and Nanking (properly: Gǔtíng and Nánjīng, respectively). Perhaps the most offensive pronunciation on the system then was given to Dànshuǐ, which at the time was [mis]spelled Tamshui on the MRT system. This was pronounced as three syllables: Tam (rhymes with the English word “dam”) + shu (“shoe”) + i (as in “machine”).

By the way, the Xinbei City Government has been changing signs around Danshui from Danshui to the old Taiwanese spelling of Tamsui (note: not Tamshui). But more about that in a different post.

Conferences in Hawaii

Tomorrow morning I’m off to Honolulu for the Zhang Liqing Memorial International Conference on Hanyu Pinyin. This promises to be a tremendously exciting event, with select scholars from throughout the United States, Asia, and Oceania participating. I’ll have more to say about this after the gathering.

While I’m in Hawaii I may drop in on the joint conference of the Association for Asian Studies and the International Convention of Asia Scholars (March 31–April 3). You might think, though, that with nearly 800 sessions on just about everything under the sun, at least a few of them would discuss romanization. (But nooo.) Still, session 282, “Beyond Cultural Essentialism: Neo-Orientalism in Chinese Studies” (Friday morning, 10:15-12:15), sounds interesting, especially Edward McDonald’s talk on character fetishization in Chinese studies. McDonald’s new book, Learning Chinese, Turning Chinese: Challenges to Becoming Sinophone in a Globalised World, also covers this topic.

If you know of anything else particularly interesting going on at the AAS-ICAS conference or in Honolulu at large, please let me know. (For example, what’s the best bookstore there?)

Spreading the good news

Behold, I bring you good tidings.

As I keep having to note, most of the things that are supposedly in Pinyin are terrible. This is not because Pinyin itself is inherently poor or difficult. It’s because most people who produce such things have a fundamental lack of understanding of Pinyin as a system. (And, yes, that includes most users in China.) So it is with amazement that I report today on a journal that not only offers dozens of pages in Hanyu Pinyin — good Hanyu Pinyin — but does so twice every month. It’s also well worth noting that the journal is aimed primarily at adult native speakers of Mandarin, not foreigners trying to pick up the language, though certainly it could also be read by people in the latter group.

From what I’ve seen so far, this journal gets right the things most commonly written incorrectly elsewhere, including:

And it doesn’t use the atrocious ɑ that some people mistakenly believe is required either.

Unfortunately, punctuation and alphanumerics are not included in the Pinyin. But other than that there’s very little that doesn’t follow standard Pinyin orthography, the main exception being the indication of the tone sandhi related to the special cases of bu and yi (e.g., the journal gives “bú shì” and “búdà” instead of the standard “bù shì” and “bùdà,” and “yìhuíshì” and “yí wèi” instead of the standard “yīhuíshì” and “yī wèi”). That said, though, tone changes related to yi and bu can be something of a pain. So although this isn’t standard, I can see why it was done and am not entirely unsympathetic to this approach.

Here are a few sample lines:
screenshot of some text in the journal, showing text in simplified Chinese characters with word-parsed Hanyu Pinyin above the Hanzi. Note: Yifusuoshu/以弗所書 = Ephesians

It would be nice if this were in Unicode, to help aid searches and cutting and pasting. The text, however, appears to have been made in a system devised years ago by the people at the journal. Regardless, I’m happy to see the Pinyin.

Overall, despite the lamentable absence of punctuation and Arabic numerals in the Pinyin, this is quality work, which is perhaps all the more remarkable in that the Pinyin and simplified Hanzi edition of this journal is not truly free to circulate in the land of its target audience. That’s because its publishers are Jehovah’s Witnesses, a group suppressed by the PRC (though it appears that at least at the moment their sites are not blocked by the great firewall). The journal, Shǒuwàngtái, may be more familiar to you by its English name: Watchtower. Whatever you might think of Jehovah’s Witnesses, I hope you’ll recognize the considerable accomplishment of those who put together this publication.

Getting to the Jehovah’s Witnesses Web pages that link to Shǒuwàngtái can be tricky. (Go to the magazines page, select “Chinese (Simplified)” for the language; then choose the month and file with Pinyin.) So I’m providing direct links to some documents below:

I haven’t found any Pinyin editions other than those. Perhaps old ones are taken offline.

Rénrén Dōu Xūyào Zhīdao De Hǎo Xiāoxi (I'd prefer 'de' instead of 'De' -- but that's no big deal) 人人都需要知道的好消息

With thanks to Victor Mair.

Weishenme Zhongwen zheme TM nan?

David Moser’s essay Why Chinese Is So Damn Hard — which is one of the most popular readings here on Pinyin Info, with perhaps half a million page views to date (nothing to dǎ pēntì at!) — has been translated into Mandarin: Wèishénme Zhōngwén zhème TM nán? (为什么中文这么TM难?). (Gotta love the use of Roman letters there.)

Although the translation has been online for only 24 hours or so, it has already received more than 150 comments.

A suggestion for readers and translators looking for something similar: Moser’s Some Things Chinese Characters Can’t Do-Be-Do-Be-Do.

Bing Maps for Taiwan

The maps of Taiwan put out by GooGle are plagued with errors in their use of Pinyin. But what about that other big company with deep pockets? You know: Microsoft. How good a job does Microsoft’s Bing do with its maps of Taiwan?
map of Taiwan from Bing, showing Wade-Giles place names

I won’t keep y’all waiting: After examining Bing’s maps of Taiwan the two words that came first to mind were incompetent and atrocious.

The country-level map is odd, offering Wade-Giles. And although the use of the hyphen is irregular, I will give Bing points for getting at least Wade-Giles’ apostrophes right. So, although some place names on the map are decades out of date (e.g., Hsin-chuang, Chungli, Chunan, Kuang-fu), at least they’re not horribly misspelled within that system.

It’s at the street level that Bing’s weirdness becomes most apparent. For example, below is part of Bing’s map of Banqiao.

I added the highlighting.


This tiny but representative fragment of the map has not one but four romanization systems:

  • MPS2: Gung Guang, Min Chiuan, Shin Fu (Even within MPS2, none of those should have spaces or extra capital letters.)
  • Hanyu Pinyin: Banqiao (This is the only properly written place name on this map fragment.)
  • Tongyong Pinyin: Jhancian, Sianmin, Sin Jhan
  • Gwoyeu Romatzyh(!): Shinjann (This is the same road as the one marked “Sin Jhan”. In Hanyu Pinyin, which is what officially should be used here, this is written “Xinzhan”.)

A few more points about this small fragment of the map:

  • Wen Hua could be either MPS2 or Hanyu Pinyin, but not Tongyong Pinyin. And it should be Wenhua.
  • Minan is missing an apostrophe. (It should be Min’an.)
  • Banchiao is just wrong, regardless of the system. They were probably going for MPS2 but erroneously used an o instead of a u: Banchiau.
  • Sec 1 Rd should be Rd Sec. 1.
  • Mrt should be MRT.

So that’s four systems, plus additional errors.

There’s much, much more that’s wrong with this than is right. That’s even more evident on a larger map — and that’s without me bothering to mark orthographic problems in the Pinyin (e.g., Wen Hua instead of the correct Wenhua).

Here bastardized Wade-Giles (e.g., “Mrt-Hsinpu” at top, center — and, FWIW, in the wrong location) has been added to the mix, making a total of five different romanization systems, as well as some weird spellings, e.g., U Nung, Win De, Bah De, Ying Sh — and that’s without including my favorite, JRLE, because that one is correct in MPS2 (“Zhile” in Hanyu Pinyin).

The main point is that the vast majority of names are spelled wrong. And among the few that are spelled correctly, those that are written with correct orthography can be counted on one hand. So, to the words above (incompetent and atrocious) let me add FUBAR.

The copyright statement lists not only Microsoft but also Navteq. The Taiwan maps on the latter company’s site, however, are different from those on Bing. Navteq’s are generally in Hanyu Pinyin, though almost invariably improperly written (e.g., Tai bei Shi, Ban Qiao Shi). And despite the prevalence of Hanyu Pinyin, they still contain other romanization systems (e.g., Jhong Shan) and outright errors (e.g., Shin Jahn).

So an update from Navteq wouldn’t be nearly enough to fix Bing’s problems, which are fundamental.

Banqiao — the Xinbei ways

Xinbei, formerly known as Taipei County and now officially bearing the atrocious English name of “New Taipei City,” has made available an online map of its territory.

Interestingly, the map is available not just in Mandarin with traditional Chinese characters and English with Hanyu Pinyin (most of the time — but more on that soon) but also in Mandarin with simplified Chinese characters. A Japanese interface is also available.

The interface for all versions opens to a map centered on Xinbei City Hall. What struck me upon seeing this for the first time was that, in just one small section, Banqiao is spelled four different ways:

  • Banqiao (Hanyu Pinyin)
  • Panchiao (bastardized Wade-Giles)
  • Ban-Chiau (MPS2, with an added hyphen)
  • Banciao (Tongyong Pinyin)

Click the map to see an enlargement.

I want to stress that these are not typos. These are the result of an inattention to detail that is all too common here.

The spelling for the city, er, district is also wrong in the interface, with Tongyong used. Since Banqiao is the seat of the Xinbei City Government and has more than half a million inhabitants,* it’s not exactly so obscure that spelling its name correctly should be much of a challenge. Tongyong and other systems also crop up in some other names outside the interface.

It should be admitted, however, that the Xinbei map’s romanization is still better overall than the error-filled mess issued by GooGle.

*: including me