Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí，wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà，nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ（bǐ rú“de、de、de“ hé“le” de bù tóng yòng fǎ）ma？zhī dào zhè ge yǔ yán de jī běn yīn jié（bù bāo kuò shēng diào）zhǐ yǒu408gè ma？

Now the same passage will look like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” (zài) because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names something other than the name must go first.

[Screenshot showing what happens if the following is entered into Google Translate: '在西安, 在长安, 在重庆, 在北京'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'.]

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

是马英九
是毛泽东
是陈水扁

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the two missing instances of “Is” are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn”. So even though the software understands that these names are units, the capitalization and word parsing are still wrong, and the names are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”

There is nothing obscure about capitalizing proper nouns. How did this get missed?
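
To show how little is involved, here is a minimal Python sketch of the name formatting described above. It assumes the converter already knows which syllables belong to the surname and which to the given name (which Google's word parsing suggests it does). The function name `format_personal_name` is my own invention for illustration, not anything from Google, and the sketch ignores the apostrophe rule discussed in the next section.

```python
def format_personal_name(surname_syllables, given_syllables):
    """Write a personal name as Pinyin expects: surname and given name
    as two separate words, each written solid with a capital letter.

    Assumption: the caller already knows where the surname ends.
    """
    surname = "".join(surname_syllables).capitalize()
    given = "".join(given_syllables).capitalize()
    return f"{surname} {given}"


print(format_personal_name(["mǎ"], ["yīng", "jiǔ"]))    # Mǎ Yīngjiǔ
print(format_personal_name(["máo"], ["zé", "dōng"]))    # Máo Zédōng
print(format_personal_name(["chén"], ["shuǐ", "biǎn"]))  # Chén Shuǐbiǎn
```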

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

阿尔巴尼亚
然而
仁爱
莲藕

The Pinyin is rendered as “Āěrbāníyǎ Ránér Rénài Liánǒu” rather than the correct forms of Ā’ěrbāníyǎ, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
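
For the curious, here is roughly what such a script might look like in Python. This is only a sketch under one assumption: the converter already has each word as a list of syllables. The rule it applies is the standard one: within a word, any syllable after the first that begins with a, e, or o takes an apostrophe before it. The names `base_letter` and `join_syllables` are hypothetical, and the code uses a plain ASCII apostrophe rather than a typographic one.

```python
import unicodedata


def base_letter(ch):
    """Return a letter without its tone mark, lowercased: 'ǎ' -> 'a', 'Ā' -> 'a'."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def join_syllables(syllables):
    """Join the syllables of one word, adding an apostrophe before any
    non-initial syllable that begins with a, e, or o."""
    word = syllables[0]
    for syllable in syllables[1:]:
        if base_letter(syllable[0]) in "aeo":
            word += "'"
        word += syllable
    return word


print(join_syllables(["Xī", "ān"]))     # Xī'ān
print(join_syllables(["rán", "ér"]))    # rán'ér
print(join_syllables(["Běi", "jīng"]))  # Běijīng
```

Stripping the tone mark before checking the first letter is what lets a single test cover ā, á, ǎ, and à alike.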

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).
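
The basic number rule is simple enough to sketch in a few lines of Python. The following is an illustration only, with a hypothetical function name (`number_to_pinyin`), and it covers nothing beyond 1 to 100. Note that the apostrophe rule applies here too: èr is the only digit that begins with a, e, or o, so 12 comes out as shí'èr.

```python
DIGITS = ["", "yī", "èr", "sān", "sì", "wǔ", "liù", "qī", "bā", "jiǔ"]


def number_to_pinyin(n):
    """Return the Pinyin for an integer from 1 to 100, written solid."""
    if not 1 <= n <= 100:
        raise ValueError("this sketch only covers 1 to 100")
    if n == 100:
        return "yībǎi"
    tens, units = divmod(n, 10)
    word = ""
    if tens == 1:
        word = "shí"
    elif tens > 1:
        word = DIGITS[tens] + "shí"
    if units:
        # èr needs an apostrophe when it is not the first syllable
        if word and units == 2:
            word += "'"
        word += DIGITS[units]
    return word


print(number_to_pinyin(12))  # shí'èr
print(number_to_pinyin(35))  # sānshíwǔ
print(number_to_pinyin(99))  # jiǔshíjiǔ
```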

Next would come something requiring a bit more care: the proper handling of Mandarin’s three aspect-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
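
As a rough sketch, assuming the converter produces a list of already-parsed Pinyin words, attaching the suffix could be as simple as the following Python. The function name `attach_men` is hypothetical. The check is for neutral-tone "men" only, so a word such as mén (with a tone mark) is left alone.

```python
def attach_men(words):
    """Merge a free-standing neutral-tone 'men' into the word before it."""
    result = []
    for word in words:
        if word == "men" and result:
            result[-1] += word
        else:
            result.append(word)
    return result


print(" ".join(attach_men(["háizi", "men", "lái", "le"])))  # háizimen lái le
```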

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

15 thoughts on “Google Translate’s Pinyin converter revisited”

  1. I’m a beginner student, so I don’t have the knowledge to stress-test this aspect. But I was wondering, how well does it handle characters which have several different pronunciations? Is it able to choose the right one most of the time?

    Most issues you mention are minor, as they’re only about formatting (except for the missing apostrophes, which can lead to ambiguities even when tone marks are present). If the wrong pronunciation slips in for a character, that’s a much bigger issue.

  2. I don’t see the above as minor issues, though they should require only a minor amount of work to correct. For example, if what is written xian isn’t intended to be that but instead Xi’an, it’s not ambiguous; it’s simply flat-out wrong.

    Imagine the public reaction if Google’s translations from various languages into English were to omit not only all apostrophes but also the capital letters from proper nouns, especially if such errors were allowed to continue for well over a year. Keep in mind, too, that such a practice would be less of a problem for readers of English (who have lots and lots of practice) than readers of Pinyin (who don’t live in a world with a plethora of Pinyin-only Web sites, signage, books, magazines, etc.).

    But to turn to your question: Chinese characters with multiple pronunciations (pòyīnzì/破音字) can be a real @#$% to deal with. (This being a Pinyin site, I’d like to note that Hanyu Pinyin does not suffer from anything even approaching the same level of mafan, there being not much more than a few tone sandhi rules to keep in mind.) As long as Google parses texts as words (and Google Translate, including its Pinyin converter, appears to do a fairly good job of this), there’s not too much problem with this. But failure to do so can certainly result in major errors, such as still plague Google Maps for Taiwan, which still spells Chongqing as ZhongQing and Chengdu as ChengDou. Ugh.

  3. I didn’t receive my education in mainland China but my impression from friends who did is that they weren’t taught Pinyin beyond simply how to spell and use tone marks. To my surprise, some of my friends who were educated in mainland China even write each syllable separately. If Hanyu Pinyin truly isn’t properly taught in mainland China, that would be pretty sad.

    Unfortunately, a similar situation also plagues most of my American friends who learned Chinese in the US either through classes offered by their university or local Chinese school. In fact, in most university classes, they spend no more than 1-2 classes on Pinyin and move straight to characters.

    I am all for getting things right, especially when it requires little effort. However, the comparison of misuse between Pinyin and English isn’t exactly fair. English is the main writing system for, well, English. Pinyin to most Chinese speakers, native or not, is secondary at best. From the examples above, most Chinese speakers don’t even know how to properly use Pinyin. Therefore, Google is actually doing pretty well compared to most websites that deal with Pinyin. Now, if Google messes up on characters, that would be a different story.

  4. Secondary is a big overstatement … I have heard the opinion that “Chinese cannot be spelt with ‘English’ letters” several times. As a speaker of a language with an almost completely phonetic writing system, I wonder where this shocking misconception comes from. (If I had only heard it once, I wouldn’t be surprised, but I have heard it several times.) Now, this is obvious nonsense of course for any spoken language. But is it possible that what they refer to is Classical Chinese, which I heard can be near-unintelligible when spoken out aloud (though probably this was an overstatement too)? What do you think?

  5. Of course you can write Chinese (Mandarin) with Roman letters, that’s what Hanyu Pinyin is. However, replacing Hanzi with Pinyin would be more trouble than it’s worth and would cause the loss of a great part of Chinese culture. I’m sure if you are truly interested in the pros and cons of replacing Hanzi with Pinyin, there are already tons of articles, discussion threads and various other opinion pieces available online so I won’t rehash them here.

    Whether you like it or not, in modern Chinese society, be it Mainland China, Taiwan, Hong Kong/Macau, Singapore or various overseas Chinese communities, Hanzi is the main written script and Pinyin (or Zhuyin) is seen as not much more than a tool to teach children (and foreigners) how to pronounce characters and to transliterate Chinese names into a Roman script (just Pinyin, not Zhuyin).

    But again, you can either remain on your high horse and keep blindly point fingers, or actually learn the language and absorb the culture then perhaps you would understand why Hanzi is so important and irreplaceable. But even if you still don’t change your mind, at least you have more of a ground to stand on.

  6. Weili, I wonder, why do you need to get snide and personally attack Szabolcs (“remain on your high horse and keep blindly point fingers”)? There’s really no call for it.

    If you yourself would do a bit more reading, especially of the material on this Pinyin.info site, you’d know that Szabolcs is certainly not alone in his impression that Hanyu Pinyin not only could serve as a perfectly good writing system for MSM, but that if it eventually did replace Hanzi characters as the *primary* writing system, it would be a Good Thing.

    And yes, I have studied Chinese myself and developed a fair amount of fluency. From the start, I had the impression that Pinyin was a great writing system, and that Chinese characters were difficult and unnecessary, and that impression only deepened with time. Almost all my Chinese friends dismissed the idea as silly, the same way you do. I heard the same arguments over and over, that it “would be more trouble than it’s worth and would cause the loss of a great part of Chinese culture”. But in fact these arguments just don’t hold up to critical examination. This must be what every Chinese person is taught — either formally in school or through some other subtle channels, I don’t know. But I’d invite you to dig into more of the materials on this site, and others that are linked to from here, and maybe open up your mind a little bit.

  7. Wow, I guess Szabolcs wasn’t riding alone on his high horse.

    You essentially just labeled all Chinese as brainwashed drones with closed minds because the majority of us disagree with your claims.

    Good job.

  8. Haha, are you trying to be funny? When did I use the words “brainwashed drones”? I’m actually trying to be respectful, but it seems that you are not.
    I have noticed a certain, shall we say, homogeneity, in Chinese people’s opinions towards a few topics that I find interesting. Am I wrong about that? You yourself admit the “majority of us disagree” with my claims. And at the same time, taking this particular topic as an example, I know for a fact that they’ve researched it less than I have. So where do these opinions come from? Please, I really want to know.

  9. Hanyu pinyin could function perfectly well as a written language for Mandarin speaking people. That’s indisputable. The ‘loss of culture’ argument is basically …. nonsense. Vietnamese culture seems to be going strong.

    On the other hand, at present there’s no desire among Mandarin speakers to use Hanyu pinyin as even a secondary script, much less a primary one. That’s their choice to make.

  10. Pingback: Pinyin news » Google Translate and romaji revisited

  11. Just wondering, Pinyin.info or whoever runs this site, why are you biased in favor of Pinyin, and not of other phonetic alphabets that already exist for Chinese, like the Arabic Xiao’erjing alphabet that has been used to write Chinese (northwestern Mandarin dialects) for hundreds of years by the Chinese-speaking Hui Muslims in northwest China?

    And what about Phags-pa, a phonetic script designed by a Tibetan monk for writing Chinese?

    It sounds like you are more of a Western cultural imperialist than a neutral linguist intent on reforming the Chinese writing system, harping on about Pinyin and Cyrillic.

    You go on and on in your praise for the Cyrillization of Dungan and don’t mention anything about the fact that they used Arabic Xiao’erjing to write Chinese before Cyrillic.

    Oh yeah, and you also quoted from Soviet sources (which would be totally unreliable due to the Sino-Soviet split), in drawing information about whether the Dungan regard themselves as Chinese.

    That’s like drawing information about Jews from Nazi sources.

    http://en.wikipedia.org/wiki/Xiao'erjing

    http://www.sinoglot.com/xiaoerjing/

    http://en.wikipedia.org/wiki/Phags-pa_script

    http://en.wikipedia.org/wiki/File:Yang_Wengshe_1314.jpg

  12. Mansuzi, a little politeness goes a long way. Just sayin’…

    Thanks for the reference on Xiao’erjing, it’s interesting.

    I won’t presume to speak for the host here but as for why he concentrates on pinyin, I can think of a few reasons why I think that focus is a good idea.

    1. it’s the name of the site (duh)

    2. pinyin has official recognition in the PRC (and mainlanders do learn it to some extent, though not as thoroughly as might be advisable); none of the other systems has any real user base that I know of.

    3. pinyin has become a de facto second script for Mandarin in certain areas when characters aren’t sufficient or advised. None of the other systems have that going for them.

    Finally, no one is stopping you from starting a site on alternative phonetic scripts for Mandarin. I think such a site would be really interesting.

  13. I’d add one more reason to favor Hanyu Pinyin over a writing system that uses an Arabic script. You can call it cultural imperialism if you want, but it’s a simple fact that the Roman alphabet is the de facto international script for business and politics. Just as the Chinese government’s promotion of MSM over the regional languages is tragic, in some ways, to those that want to preserve the cultural heritage, nonetheless it makes practical sense, in that it allows everybody in China to talk to each other — which has obvious huge benefits. Just so, if Pinyin became a well-established writing system for Chinese, it would help to make the language accessible to the vast majority of foreigners throughout the world who would try to learn it. I don’t think the same could be said of an Arabic-based script. Maybe that’s a tragic loss, but it’s a practical fact.

  14. Come to think of it, the omission of apostrophes and capitalization for proper nouns points to Google using a reversed Pinyin input method that has been tweaked to handle punctuation and capitalization at the beginning of each sentence.

  15. Uh, I still haven’t figured out what the big deal about capitalisation and apostrophes is.

    AFAIK, Pinyin is intended to support students learning Chinese before they can read characters, not primarily to write prose and passages. Therefore, a lack of capitalisation/punctuation shouldn’t matter because no-one uses it.

    I also take exception to the comments about “the Chinese writing system being redundant when compared with Pinyin”. It’s very simple once you’ve learnt a critical mass of characters, although as with European spelling systems, it could do with an overhaul every once in a while.

    (Note: I read Chinese at a high-school level using Cantonese readings and have never, ever, ever seen more than a sentence transcribed using ????… not that that’s ever been a barrier.)
