Google Translate’s Pinyin converter: now with apostrophes

Google has taken another major step toward making Google Translate’s Pinyin converter decent. Finally, apostrophes.

Not long ago “阿爾巴尼亞然而仁愛蓮藕普洱茶” would have yielded “Āěrbāníyǎ ránér rénài liánǒu pǔěr chá.” But now Google produces the correct “Ā’ěrbāníyǎ rán’ér rén’ài lián’ǒu pǔ’ěr chá.” (Well, one could debate whether that last one should be pǔ’ěr chá, pǔ’ěrchá, Pǔ’ěr chá, Pǔ’ěr Chá, or Pǔ’ěrchá. But the apostrophe is undoubtedly correct regardless.)

Also, the -men suffix is now solid with words (e.g., 朋友們 --> péngyoumen and 孩子們 --> háizimen). This is a small thing but nonetheless welcome.

The most significant remaining fundamental problem is the capitalization and parsing of proper nouns.

And numbers are still wrong, with everything being written separately. For example, “七千九百四十三萬五千六百五十八” should be rendered as “qīqiān jiǔbǎi sìshísān wàn wǔqiān liùbǎi wǔshíbā.” But Google is still giving this as “qī qiān jiǔ bǎi sì shí sān wàn wǔ qiān liù bǎi wǔ shí bā.”
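The grouping in the correct rendering above can be sketched in a few lines of Python. This is only an illustration inferred from that one example: digits and shí run together, bǎi and qiān attach to the digit before them and end the word, and wàn and yì stand alone. The function name and the two unit sets are assumptions made for the sketch, the input is assumed to be already split into syllables, and cases such as líng (zero) are not handled.

```python
import unicodedata

LARGE_UNITS = {"wàn", "yì"}      # assumed: written as separate words
CLOSING_UNITS = {"bǎi", "qiān"}  # assumed: attach to the preceding digit, then end the word


def _starts_with_a_o_e(syllable: str) -> bool:
    # NFD splits "è" into "e" plus a combining mark, so the base letter comes first.
    return unicodedata.normalize("NFD", syllable)[:1].lower() in {"a", "o", "e"}


def group_number(syllables: list[str]) -> str:
    """Group the syllables of a spelled-out number into Pinyin words."""
    words: list[str] = []
    current = ""
    for syllable in syllables:
        if syllable in LARGE_UNITS:
            # Flush whatever has accumulated, then emit the unit on its own.
            if current:
                words.append(current)
                current = ""
            words.append(syllable)
        elif syllable in CLOSING_UNITS:
            words.append(current + syllable)
            current = ""
        else:
            # Digits and shí accumulate; "shí" + "èr" still needs its apostrophe.
            if current and _starts_with_a_o_e(syllable):
                current += "\u2019"
            current += syllable
    if current:
        words.append(current)
    return " ".join(words)
```

Fed the fifteen syllables of the example above, one at a time, this returns “qīqiān jiǔbǎi sìshísān wàn wǔqiān liùbǎi wǔshíbā”.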

On the other hand, Google is starting to deal with “le”, with it being appended to verbs. This is a relatively tricky thing to get right, so I’m not surprised Google doesn’t have the details down yet.

So there’s still a lot of work to be done. But at least progress is being made in areas of fundamental importance, and I’m heartened by that.

The current state:
screen shot of what Google Translate's Pinyin converter produces as of late September 2011

Key Chinese updated, adding new Pinyin features

The program Key, which offers probably the best support for Hanyu Pinyin of any software and thus deserves praise for this alone, has just come out with an update with even more Pinyin features: Key 5.2 (build: August 21, 2011 — earlier builds of 5.2 do not offer all the latest features).

Those of you who already have the program should get the update, as it’s free. But note that if you update from the site, the installer will ask you to uninstall your current version prior to putting in the update, so make sure you have your validation code handy or you’ll end up with no version at all.

(If you don’t already have Key, I recommend that you try it out. A 30-day free trial version can be downloaded from the site.)

Anyway, here’s some of what the latest version offers:

  • Hanzi-with-Pinyin horizontal layout gets preserved when copied into MS Word documents (RT setting), as well as in .html and .pdf files created from such documents.
  • Pinyin Proofing (PP) assistance: with pinyin text displayed, pressing the PP button on the toolbar will colour the background of ambiguous pinyin passages blue; right-clicking on such a blue-background pinyin passage will display the available options.
  • Copy Special: a highlighted Chinese character passage can be copied & pasted automatically in various permutations.
  • Improved number-measureword system: it now works with Chinese-character, pinyin and Arabic numerals.
  • Showing different tones through coloured characters (Language menu under Preferences).
  • Chengyu (fixed four character expression) spacing logic: automatic spacing according to the pinyin standard (Language menu under Preferences).
  • Option to show tone sandhi on grey background (Language menu under Preferences).
  • Full support of standard pinyin orthography in capitalization and spacing.
  • Automatic glossary building.

Some programs, such as Popup Chinese’s “Chinese converter,” will take Chinese characters and then produce pinyin-annotated versions, with the Pinyin appearing on mouseover. Key, however, offers something extra: the ability to produce Hanzi-annotated orthographically correct Pinyin texts (i.e., the reverse of the above). If you have a text in Key in Chinese characters, all you have to do is go to File --> Export to get Key to save your text in HTML format.

Here’s a sample of what this looks like.

Běn biāozhǔn guīdìngle yòngZhōngwén pīnyīn fāng’ànpīnxiě xiàndài Hànyǔ de guīzéNèiróng bāokuò fēncí liánxiě chéngyǔ pīnxiěfǎwàiláicí pīnxiěfǎrénmíng dìmíng pīnxiěfǎbiāodiào yíháng guīzé děng

Basically, this is a “digraphia export” feature — terrific!

If you want something like the above, you do not have to convert the Hanzi to orthographically correct Pinyin first; Key will do it for you automatically. (I hope, though, that they’ll fix those double-width punctuation marks one of these days.)

Let’s say, though, that you want a document with properly word-parsed interlinear Hanzi and Pinyin. Key will do this too. To do this, input a Hanzi text in Key, then select all of the text (CTRL + A) and choose Format --> Hanzi with Pinyin / Kanji-Kana with Romaji.

In the window that pops up, choose Hanzi with Pinyin / Kanji-kana with Romaji / Hangul with Romanization from the Two-Line Mode section and Show all non-Hanzi symbols in Pinyin line from Options. The results will look something like this:

GIF of a screenshot from Key, showing an interlinear text with word-parsed Pinyin above Chinese characters. This is an image of the text after being pasted into Microsoft Word.

This can be extremely useful for those authoring teaching materials.

Furthermore, such interlinear texts can be copied and pasted into Word. For the interlinear-formatted copy-and-paste into Word to work properly, Key must be set to rich text format, so before selecting the text you wish to use click on the button labeled RT. (Note yellow-highlighted area in the image below.)

screenshot identifying the location of the button that needs to be pressed to make the text RTF

Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí, wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà, nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ(bǐ rú“de, de, de“ hé“le” de bù tóng yòng fǎ) ma?zhī dào zhè ge yǔ yán de jī běn yīn jié(bù bāo kuò shēng diào) zhǐ yǒu408gè ma?

Now the same passage will look like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '在西安, 在长安, 在重庆, 在北京'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

是馬英九
是毛泽东
是陳水扁

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the two missing instances of “Is” are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn.” So even though the software understands that these names are units, the capitalization and word parsing are still wrong and they are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”
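The target format for personal names is simple enough to sketch in Python: family name and given name are each written as one capitalized word, separated by a space. The function name below is hypothetical, and the sketch assumes the caller already knows which syllables belong to the family name, which is the part a converter has to get from its dictionary. Apostrophes are left out to keep it short.

```python
def format_personal_name(family: list[str], given: list[str]) -> str:
    """Render a Mandarin personal name in Pinyin from its syllables."""
    # str.capitalize() uppercases the first letter and handles
    # tone-marked vowels as well (e.g. "ā" becomes "Ā").
    family_word = "".join(family).capitalize()
    given_word = "".join(given).capitalize()
    return f"{family_word} {given_word}"
```

For example, `format_personal_name(["mǎ"], ["yīng", "jiǔ"])` gives “Mǎ Yīngjiǔ” rather than the “mǎyīngjiǔ” the converter currently produces.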

There is nothing obscure about capitalizing proper nouns. How did this get missed?

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

阿爾巴尼亞
然而
仁愛
蓮藕

The Pinyin is rendered as “Āěrbāníyǎ Ránér Rénài Liánǒu” rather than the correct forms of Ā’ěrbāníyǎ, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
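To give a sense of how small such a script could be, here is a minimal Python sketch of the rule: within a word, any syllable after the first that begins with a, o, or e gets an apostrophe in front of it. It assumes each word arrives as a list of syllables (presumably available to any converter that starts from Hanzi). The function names are made up for illustration and are not anything Google is known to use.

```python
import unicodedata


def starts_with_a_o_e(syllable: str) -> bool:
    """True if the syllable begins with a, o, or e, ignoring tone marks."""
    # NFD splits "ǒ" into "o" plus a combining mark, so the base letter comes first.
    base = unicodedata.normalize("NFD", syllable)[:1].lower()
    return base in {"a", "o", "e"}


def join_syllables(syllables: list[str]) -> str:
    """Join the syllables of one word, adding apostrophes where required."""
    if not syllables:
        return ""
    word = syllables[0]
    for syllable in syllables[1:]:
        if starts_with_a_o_e(syllable):
            word += "\u2019"  # the typographic apostrophe used in this post
        word += syllable
    return word
```

With this, `join_syllables(["rán", "ér"])` yields “rán’ér” and `join_syllables(["lián", "ǒu"])` yields “lián’ǒu”, while a word such as pǔtōnghuà passes through unchanged.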

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).

Next would come something requiring a bit more care: the proper handling of Mandarin’s three aspect-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
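As a rough sketch, the -men fix could even be done as a post-processing pass over the converter’s output. The Python below assumes the suffix shows up as a separate, toneless “men” after a space; the function name is hypothetical. Because tone-marked mén (as in 門) contains “é” rather than “e”, it is left alone.

```python
import re

# A toneless "men" standing as its own word, preceded by a word and spaces.
_MEN_SUFFIX = re.compile(r"(?<=\w) +men\b")


def attach_men(text: str) -> str:
    """Attach a free-standing neutral-tone -men to the preceding word."""
    return _MEN_SUFFIX.sub("men", text)
```

Applied to “háizi men”, this returns “háizimen”.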

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.