Google Translate and romaji revisited

OK, Google has improved its Pinyin converter some, though it still fails in important areas. So that’s the present situation for Google and Mandarin.

How about for Google and Japanese?

Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures generously agreed to reexamine Google’s performance in conversions to rōmaji (Japanese written in romanization).

Below is his latest evaluation.

For his initial analysis (in December 2009), see Google Translate and rōmaji.

I ran the test passage through Google Translate again. There’s some improvement, but it’s still pretty mediocre.

Original: [Japanese source text not recoverable]

Google Translate:

6-Nichi gogo 4-ji 35-fun-goro, Tōkyō-to Chiyoda-ku Kōkyogaien no todō (uchibori-dōri) no Nijūbashi zen kōsaten de, Chūgoku kara no kankō kyaku no 40-dai no dansei ga jōyōsha ni hane rare, zenshin o tsuyoku Utte mamonaku shibō shita. Kuruma wa hodō ni noriagete aruite ita dansei (69) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten shite ita Tōkyō-to Minato-ku hakkin 3-chōme, kaisha yakuin Takahashi nobe Tsubuse yōgi-sha (24) o jidōsha unten kashitsu shōgai no utagai de genkō-han taiho shi, yōgi o dō chishi ni kirikaete shirabete iru.

Dōsho ni yoru to, shibō shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin shite kita kuruma ni hane rareta. Kuruma wa hidari ni kyū handoru o kiri, shadō to hodō no sakai ni oka reta kasetsu no saku o haneage, hodō ni noriageta toyuu. Saku wa hodō de ran’ningu o shite ita dansei (34) niatari, dansei wa ryōashi ni karui kega.

Dōsho wa, shibō shita dansei no mimoto kakunin o susumeru totomoni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Genba shūhen wa Tōkyō kankō no supotto no hitotsudaga, saikin wa jogingu o tanoshimu hito mo fuete iru.

Notes:

  • The use of numerals dodges a plethora of errors, but “6-Nichi” is still wrong for Muika.
  • Lots of correct capitalizations have been added, but “uchibori” was missed and “Utte” capitalized by mistake.
  • Some false spaces or lack of spaces persist: “hane rare”, “oka reta”; “hitotsudaga” and “niatari” were correctly hitotsu da ga and ni atari in the original test.
  • Names still get butchered (“hakkin” for Shirogane, “nobe Tsubuse” for Nobuhiro).
  • The needless apostrophe in “ran’ningu” is still there.
  • Interestingly, “toyuu” is a new error: it should be to iu.
  • There’s evidence of some attempt to use hyphens, but why not in “kankō kyaku” or “Nijūbashi zen”?

So, to update: Google gets kudos for conscientiousness, but I stick by my original comments.

For more by Prof. Unger, see Pinyin.info’s recommended readings, which include selections from The Fifth Generation Fallacy: Why Japan Is Betting Its Future on Artificial Intelligence, Literacy and Script Reform in Occupation Japan: Reading Between the Lines, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

Google Translate’s Pinyin converter revisited

When Google Translate’s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time to reexamine it.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zhōng guó de“yǔ“hé” wén” de wèn tí， wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。… rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà， nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ（bǐ rú“de、 de、 de“ hé“le” de bù tóng yòng fǎ） ma？zhī dào zhè ge yǔ yán de jī běn yīn jié（bù bāo kuò shēng diào） zhǐ yǒu408gè ma？

Now the same passage looks like this:

Tán zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuì hǎo néng xiān liǎo jiè yīxià zài zhōngguó tōngyòng de yǔyán…. Rúguǒ nǐ shǐyòng zhōngguó de gòngtóng yǔyán pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de, de, de “hé “le” de bùtóng yòngfǎ) ma? Zhīdào zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐyǒu 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “在” because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names, something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '???, ???, ???, ???'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

????
????
????

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the missing instances of “Is” are discounted.

But the Pinyin appears as “Shì mǎyīngjiǔ Shì máozédōng Shì chénshuǐbiǎn.” So even though the software understands that these names are units, the capitalization and word parsing are still wrong and they are still not rendered as they should be in Pinyin: “Mǎ Yīngjiǔ,” “Máo Zédōng,” “Chén Shuǐbiǎn.”

There is nothing obscure about capitalizing proper nouns. How did this get missed?

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

?????
??
??
??

The Pinyin is rendered as “Āěrbāníyǎ Ránér Rénài Liánǒu” rather than the correct forms of Ā’ěrbāníyǎ, rán’ér, rén’ài, and lián’ǒu.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)
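To give a sense of how small such a script could be, here is a minimal sketch in Python. It relies on the standard Hanyu Pinyin rule that an apostrophe goes before any syllable that begins with a, e, or o and follows another syllable within the same word. The function name `join_syllables` is hypothetical, and the sketch assumes the word has already been split into syllables, which Google's converter evidently manages, since it parses these words as units.

```python
# Vowels that require a preceding apostrophe when they open a
# non-initial syllable: a, e, o, with or without tone marks.
# Assumption: input uses precomposed (NFC) characters.
APOSTROPHE_TRIGGERS = set("aeoāáǎàēéěèōóǒò")


def join_syllables(syllables):
    """Join the syllables of one word, inserting apostrophes where needed.

    Hypothetical helper: `syllables` is a list such as ["rán", "ér"].
    """
    if not syllables:
        return ""
    word = syllables[0]
    for syllable in syllables[1:]:
        # Only non-initial syllables can take an apostrophe.
        if syllable[0].lower() in APOSTROPHE_TRIGGERS:
            word += "'"
        word += syllable
    return word


if __name__ == "__main__":
    for parts in (["rán", "ér"], ["rén", "ài"], ["lián", "ǒu"], ["Běi", "jīng"]):
        print(join_syllables(parts))
```

A typographic apostrophe (’) could be substituted for the ASCII one; the rule itself stays the same.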

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).

Next would come something requiring a bit more care: the proper handling of Mandarin’s three tense-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).
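The -men fix in particular looks like a one-line post-processing step. The sketch below is only an illustration under stated assumptions: `attach_men` is a hypothetical name, the input is tone-marked Pinyin, and any free-standing, unmarked "men" is taken to be the plural suffix (syllables such as mén carry a tone mark and are left alone).

```python
import re


def attach_men(text):
    """Attach a free-standing neutral-tone "men" to the preceding word.

    Hypothetical helper. Assumes tone-marked Pinyin, so that an unmarked
    "men" standing alone is the pluralizing suffix.
    """
    # (?<=\w) requires a word character right before the space;
    # \b ensures "men" is a whole token, not the start of a longer word.
    return re.sub(r"(?<=\w) men\b", "men", text)


if __name__ == "__main__":
    print(attach_men("háizi men"))
    print(attach_men("rén mén kǒu"))
```

A real converter would need to confirm that the preceding word is a noun or pronoun that can take the suffix; this sketch does not attempt that.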

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

Oxford Chinese Dictionary goes online

cover image of the Oxford Chinese Dictionary

Oxford University Press has just announced that its massive Oxford Chinese Dictionary is now available through its Oxford Language Dictionaries Online subscription service.

I haven’t seen the online version yet myself; but from the publisher’s description it appears to be largely the same as the published edition, whose paucity of Pinyin is disappointing. The publisher, however, is promising that “Pinyin will be added to all Chinese translations” in November, which should be a major step forward.

Perhaps some of you at universities have institutional access. I would welcome reports.

source: What’s New, Oxford Language Dictionaries Online, May 2011.

old fashioned

photo of donuts and their label: 'Choco Fashioned / 巧克力歐菲香', price NT$35 (about US$1.20)

Here’s a shot of some Hanzified, Mandarinized English I recently came across. Qiǎokèlì (巧克力) is of course a well-established loan word, from the English “chocolate” (though here the English is given in the more Japanese-English form of choco, as befits a Japanese donut chain store in Taiwan). Ōufēixiāng (歐菲香) is a rendering of “old fashioned.” Although the “old” is missing from the English above, it can be seen in both of the tags pictured below.

photo of donuts and their labels: 'White Cocoa Old Fashioned / 白可可歐菲香' and 'Old Fashioned / 原味歐菲香'
Bái kěkě Ōufēixiāng (白可可歐菲香) and yuán wèi Ōufēixiāng (原味歐菲香).

And if that’s not enough to fill you up with Hanzified English, perhaps try a piece of Bōshìdùn pài (波士頓派), i.e., “Boston [cream] pie.”

Spreading the good news

Behold, I bring you good tidings.

As I keep having to note, most of the things that are supposedly in Pinyin are terrible. This is not because Pinyin itself is inherently poor or difficult. It’s because most people who produce such things have a fundamental lack of understanding of Pinyin as a system. (And, yes, that includes most users in China.) So it is with amazement that I report today on a journal that not only offers dozens of pages in Hanyu Pinyin — good Hanyu Pinyin — but does so twice every month. It’s also well worth noting that the journal is aimed primarily at adult native speakers of Mandarin, not foreigners trying to pick up the language, though certainly it could also be read by people in the latter group.

From what I’ve seen so far, this journal gets right the things most commonly written incorrectly elsewhere, including:

And it doesn’t use the atrocious ɑ that some people mistakenly believe is required, either.

Unfortunately, punctuation and alphanumerics are not included in the Pinyin. But other than that there’s very little that doesn’t follow standard Pinyin orthography, the main exception being the indication of the tone sandhi related to the special cases of yī and bù (e.g., the journal gives “bú shì” and “búdà” instead of the standard “bù shì” and “bùdà,” and “yìhuíshì” and “yí wèi” instead of the standard “yīhuíshì” and “yī wèi”). That said, though, tone changes related to yi and bu can be something of a pain. So although this isn’t standard, I can see why it was done and am not entirely unsympathetic to this approach.
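For readers who want to see the sandhi pattern spelled out, here is a rough Python sketch of the usual textbook rule: bù is pronounced bú before a fourth-tone syllable, and yī is pronounced yí before a fourth tone and yì before a first, second, or third tone. The function names are hypothetical, and the sketch ignores the cases in which yī keeps its first tone (for instance in counting, in ordinals, or at the end of a word), so it is an approximation rather than a description of how the journal produces its text.

```python
import unicodedata

# Combining marks that encode tones 1-4 once a syllable is decomposed (NFD).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def tone_of(syllable):
    """Return 1-4 for a tone-marked syllable, or 0 if it carries no mark."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 0


def spoken_form(syllable, next_syllable):
    """Return yī or bù as pronounced before next_syllable (simplified rule)."""
    syllable = unicodedata.normalize("NFC", syllable)
    following = tone_of(next_syllable)
    if syllable == "bù" and following == 4:
        return "bú"
    if syllable == "yī":
        if following == 4:
            return "yí"
        if following in (1, 2, 3):
            return "yì"
    # Everything else is left in its standard dictionary form.
    return syllable


if __name__ == "__main__":
    print(spoken_form("bù", "shì"), spoken_form("yī", "wèi"), spoken_form("yī", "huí"))
```

Standard orthography writes the left-hand (dictionary) forms and leaves the change to the reader; the journal writes the right-hand ones.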

Here are a few sample lines (click to enlarge):
screenshot of some text in the journal, showing text in simplified Chinese characters with word-parsed Hanyu Pinyin above the Hanzi. Note: Yifusuoshu/以弗所书 = Ephesians

It would be nice if this were in Unicode, to aid searching and cutting and pasting. The text, however, appears to have been made in a system devised years ago by the people at the journal. Regardless, I’m happy to see the Pinyin.

Overall, despite the lamentable absence of punctuation and Arabic numerals in the Pinyin, this is quality work, which is perhaps all the more remarkable in that the Pinyin and simplified Hanzi edition of this journal is not truly free to circulate in the land of its target audience. That’s because its publishers are Jehovah’s Witnesses, a group suppressed by the PRC (though it appears that at least at the moment their sites are not blocked by the great firewall). The journal, Shǒuwàngtái, may be more familiar to you by its English name: Watchtower. Whatever you might think of Jehovah’s Witnesses, I hope you’ll recognize the considerable accomplishment of those who put together this publication.

Getting to the Jehovah’s Witnesses Web pages that link to Shǒuwàngtái can be tricky. (Go to the magazines page, select “Chinese (Simplified)” for the language; then choose the month and file with Pinyin.) So I’m providing direct links to some documents below:

I haven’t found any Pinyin editions other than those. Perhaps old ones are taken offline.

Rénrén Dōu Xūyào Zhīdao De Hǎo Xiāoxi (I'd prefer 'de' instead of 'De' -- but that's no big deal) 人人都需要知道的好消息

With thanks to Victor Mair.

US grad enrollments in Mandarin fall

Although the number of people studying Mandarin in the United States has continued to rise (more about that in a later post), enrollments there in graduate courses in Mandarin have declined.

No. of U.S. Graduate School Enrollments in Mandarin from 1998 to 2009

(year: enrollments): 1998: 1220, 2002: 934, 2006: 1127, 2009: 1009

Grad School Enrollments in Mandarin as a Percentage of Total U.S. Post-Secondary Enrollments in Mandarin

1998: 5.15%, 2002: 3.35%, 2006: 2.63%, 2009: 1.96%
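Taken together, the two series above also imply roughly how large total post-secondary Mandarin enrollment was in each year: divide the graduate figure by its share. The short Python sketch below does that arithmetic. The totals it prints are back-calculated estimates, limited by the rounding of the published percentages, and are not figures taken from the source reports.

```python
# Figures as given above.
grad_enrollments = {1998: 1220, 2002: 934, 2006: 1127, 2009: 1009}
grad_share_percent = {1998: 5.15, 2002: 3.35, 2006: 2.63, 2009: 1.96}


def implied_total(year):
    """Estimate total enrollment from the grad count and its percentage share."""
    return grad_enrollments[year] / (grad_share_percent[year] / 100)


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    for year in sorted(grad_enrollments):
        print(year, grad_enrollments[year], round(implied_total(year)))
    change = percent_change(grad_enrollments[1998], grad_enrollments[2009])
    print(f"grad enrollment change, 1998 to 2009: {change:.1f}%")
```

The point of the exercise is the contrast: the graduate count falls while the implied total rises in every survey year.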

Here’s something I wrote the last time I addressed this topic.

The much-ballyhooed but also much-deserved increase in students studying Mandarin has all been at the undergraduate level. Given that the grad enrollment as a percentage of total enrollment for Mandarin is about the same as that for French (2.63 percent and 2.73 percent, respectively) it might appear that Mandarin has simply reached a “normal” ratio in this regard. But native speakers of English generally need much more time to master Mandarin than to master French. Simply put, four years, say, of post-secondary study of French provides students with a much greater level of fluency than four years of post-secondary study of Mandarin.

Also, there is a great deal more work that needs to be done in terms of translations from Mandarin. I do not at all mean to belittle the work being done in French — or in any other language…. I just mean that Mandarin has historically been underrepresented in U.S. universities given the number of speakers it has and its body of texts that have not yet been translated into English. U.S. universities need to be producing many more qualified grad students who can handle this specialized work. And right now, unfortunately, that’s not happening.

That still holds, except that grad enrollment as a percentage of total enrollment for Mandarin is even lower than before (1.96% vs. 2.37% for French, 1.99% for Spanish, and an impressive 4.68% for Korean).

sources:

Xin Tang no. 1: articles in Gwoyeu Romatzyh

click to view the PDF

I’ve just put up another issue of Xin Tang.

As you may have noticed already, the name on the cover is given not as Xin Tang but as Shin Tarng. That’s because the journal started out being published in the Gwoyeu Romatzyh romanization system. But using the Hanyu Pinyin spelling here helps me keep track of these better.

Almost all of this issue is in Mandarin written in Gwoyeu Romatzyh. One article also has an en face translation into English. And as is the case with the other issues of Xin Tang, a variety of topics are covered.

Shin Tarng no. 1 (September/Jiǔyuè 1982)

Hanyu Pinyin Cihui

image of the cover of this book, which gives 'HANYU PINYIN CIHUI', followed on the next line in larger characters by '汉语拼音词汇', followed on the next line, in smaller letters, by '增訂稿' -- the text is white against a blue background

Today, for all you orthography junkies (Hello? Hello? Anybody there?), I have added a selection from the 1963 edition of Hanyu Pinyin Cihui (汉语拼音词汇 / Hànyǔ Pīnyīn Cíhuì).

The book, which is fully alphabetized by Hanyu Pinyin (i.e., like the ABC dictionary series, not like the Hanzi-by-Hanzi Pinyin ordering seen in most dictionaries published in the PRC), is a long list of Mandarin words as written in Hanyu Pinyin and Chinese characters. It’s meant as a reference for word division and other such orthographic concerns. It’s the sort of thing that just cried out to have been made into a full dictionary (especially since that’s what it looks like, minus definitions); but, unfortunately, it never was. It was, however, an important influence on the ABC series.

One can see some interesting instances of differences between Pinyin orthography then and now. For example, in this old edition of Hanyu Pinyin Cihui de tends to be appended to words and written as d, e.g. ái’áid, rather than the current ái’ái de (???). Similarly, zi is written z at the end of a word, e.g. ǎigèz, rather than the current ǎigèzi (???).

Also interesting is the mixed use of simplified and traditional Chinese characters. (It will be easier to see what I’m referring to if you open the PDF file of the introduction and A’s of Hanyu Pinyin Cihui.) The title on the cover is given as 汉语拼音词汇 in Chinese characters, which is perfectly standard. But below this is 增訂稿 (zēngdìng gǎo / revised edition); note how dìng is written as 訂 rather than as 订.

More striking, though, for the modern reader is the script in the foreword. Here, what was written 汉语拼音词汇 on the cover is written ??????, mixing traditional and simplified forms. The full traditional version of this would be written 漢語拼音詞彙. The text of the introduction is similarly mixed. This is because the book was published before many simplified forms that are now standard were fully accepted officially.

The selection from this book here on Pinyin.info comprises the introduction and all of the entries beginning with the letter a.

image of a few entries