Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí，wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà，nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ（bǐ rú“de、de、de“ hé“le” de bù tóng yòng fǎ）ma？zhī dào zhè ge yǔ yán de jī běn yīn jié（bù bāo kuò shēng diào）zhǐ yǒu408gè ma？

Now the same passage will look like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic rule of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” (zài, “in”) because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names, something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '在西安, 在长安, 在重庆, 在北京'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

是马英九
是毛泽东
是陈水扁

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the missing instances of “Is” are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn”. So even though the software understands that these names are units, the capitalization and word parsing are still wrong, and the names are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”

There is nothing obscure about capitalizing proper nouns. How did this get missed?
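To give a rough idea of how little is involved, here is a minimal Python sketch. It assumes, hypothetically, that the converter's segmenter already hands over each word together with a flag saying whether it is a proper noun, and that surnames and given names arrive as separate tokens. The function names and the token format are made up for this example and have nothing to do with Google's actual software.

```python
def capitalize_first(word: str) -> str:
    """Uppercase only the first letter; a tone-marked vowel such as 'ā' becomes 'Ā'."""
    return word[:1].upper() + word[1:]


def render_pinyin(tokens: list[tuple[str, bool]]) -> str:
    """Join (word, is_proper_noun) pairs into one line of Pinyin.

    The first word of the line and every proper noun are capitalized;
    everything else is left as is. The is_proper_noun flag is an
    assumption: it stands in for whatever the segmenter really knows.
    """
    words = []
    for index, (word, is_proper_noun) in enumerate(tokens):
        if index == 0 or is_proper_noun:
            word = capitalize_first(word)
        words.append(word)
    return " ".join(words)


print(render_pinyin([("shì", False), ("mǎ", True), ("yīngjiǔ", True)]))  # Shì Mǎ Yīngjiǔ
print(render_pinyin([("zài", False), ("běijīng", True)]))  # Zài Běijīng
```

The hard part, deciding what counts as a proper noun, is something the software apparently already does, since it parses these names as units.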

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

阿尔巴尼亚
然而
仁爱
莲藕

The Pinyin is rendered as “Āěrbāníyà Ránér Rénài Liánǒu” rather than the correct forms: Ā’ěrbāníyà, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
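The rule fits in one sentence: within a word, any syllable after the first that begins with a, e, or o takes an apostrophe before it. Here is a minimal Python sketch of the sort of script I have in mind. It assumes the converter already knows where the syllable and word boundaries are; the function names are made up for illustration, and a plain ASCII apostrophe is used for simplicity.

```python
import unicodedata

# Letters that require an apostrophe when they begin a non-initial syllable.
APOSTROPHE_INITIALS = {"a", "e", "o"}


def base_letter(ch: str) -> str:
    """Strip any tone mark so that 'ā', 'é', 'ǒ' compare as 'a', 'e', 'o'."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def join_syllables(syllables: list[str]) -> str:
    """Join the (non-empty) syllables of one word, adding apostrophes where required."""
    word = ""
    for index, syllable in enumerate(syllables):
        if index > 0 and base_letter(syllable[0]) in APOSTROPHE_INITIALS:
            word += "'"
        word += syllable
    return word


print(join_syllables(["rán", "ér"]))    # rán'ér
print(join_syllables(["lián", "ǒu"]))   # lián'ǒu
print(join_syllables(["běi", "jīng"]))  # běijīng
```

Note that the check looks only at the start of the following syllable, not at the end of the preceding one, which is why Xi’an and Chang’an both get apostrophes.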

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).

Next would come something requiring a bit more care: the proper handling of Mandarin’s three aspect-marking particles (zhe, guo, and le).

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
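The -men fix could be as simple as the following Python sketch. It assumes the converter's output has already been split into words, and that a free-standing, toneless "men" is always the pluralizing suffix (mén, "door", carries a tone mark and so is left alone). The function name is hypothetical.

```python
def attach_men(words: list[str]) -> list[str]:
    """Merge a free-standing neutral-tone 'men' into the word before it."""
    merged: list[str] = []
    for word in words:
        if word == "men" and merged:
            merged[-1] += word  # háizi + men -> háizimen
        else:
            merged.append(word)
    return merged


print(" ".join(attach_men(["háizi", "men", "lái", "le"])))  # háizimen lái le
```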

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

Tian’anmen, not Tiananmen

I’m certainly not expecting the Western media to start writing Tiān’ānmén with tone marks. But it’s not like the apostrophe is an obscure glyph to be found only in specialist typefaces that dig deep into Unicode, the sort of thing that might require an English form separate from the Pinyin one.

Microsoft Word certainly isn’t helping matters, as it flags the correct form (Tian’anmen) as a misspelling but does not flag the apostrophe-less form (Tiananmen).

screenshot from Microsoft Word, showing that 'Tian'anmen', unlike 'Tiananmen', is marked as misspelled

Indeed, if you ask the program to help you with the supposedly misspelled “Tian’anmen”, it suggests “Tiananmen”.

screen shot of Microsoft Word's spell checker suggesting 'Tiananmen' as a replacement for 'Tian'anmen'

So my guess would be that the “Tiananmen” form is the result of a combination of (1) the Cupertino effect, (2) laziness, and (3) people thinking that Tian’anmen “looks funny”.

Ugh.

And as long as I’m on this, it’s not Tian An Men, TianAnMen, Tienanmen, Tianan men, etc., either.

But, no, I don’t expect this will do much good; and if I ever work myself into a case of apostrophe rage it will probably be for other names.

further reading: