Google Web fonts and Pinyin — December 2011 update

When I put up my first post on Google Web fonts (Google Web fonts and Hanyu Pinyin), that site offered 252 font families, 29 of which cover at least parts of Latin Extended. Now, some three months later, the total has grown to 342 font families, with 70 of those covering at least parts of Latin Extended.

Only two of the new families, however, support Hanyu Pinyin with tone marks: Ubuntu Condensed and Ubuntu Mono. That brings the total to eight Google Web fonts that support Hanyu Pinyin: four serifs and four sans serifs.

Serif

  • EB Garamond
  • Gentium Basic
  • Gentium Book Basic
  • Neuton

Sans Serif

  • Andika
  • Ubuntu
  • Ubuntu Condensed
  • Ubuntu Mono

Here’s what the two new families, Ubuntu Condensed and Ubuntu Mono, look like next to the earlier Ubuntu.

example of Ubuntu, Ubuntu Condensed, and Ubuntu Mono in action on Hanyu Pinyin

For reference, here’s the total list of Latin Extended, with Pinyin-compliant fonts in bold.

Serif Faces

  1. Bitter
  2. Cardo
  3. Caudex
  4. EB Garamond
  5. Enriqueta
  6. Gentium Basic
  7. Gentium Book Basic
  8. Neuton
  9. Playfair Display
  10. Radley
  11. Sorts Mill Goudy

Sans Serif Faces

  1. Andika
  2. Anonymous Pro
  3. Anton
  4. Chango
  5. Didact Gothic
  6. Francois One
  7. Fresca
  8. Istok Web
  9. Jockey One
  10. Jura
  11. Marmelad
  12. Open Sans Condensed
  13. Open Sans
  14. Play
  15. Signika Negative
  16. Signika
  17. Tenor Sans
  18. Ubuntu
  19. Ubuntu Condensed
  20. Ubuntu Mono
  21. Varela
  22. Viga

Display Faces (all fail)

  1. Abril Fatface
  2. Arbutus
  3. Bubblegum Sans
  4. Butcherman Caps
  5. Chicle
  6. Eater Caps
  7. Forum
  8. Kelly Slab
  9. Knewave
  10. Lobster
  11. MedievalSharp
  12. Modern Antiqua
  13. Nosifer Caps
  14. Piedra
  15. Passion One
  16. Plaster
  17. Rammetto One
  18. Ribeye Marrow
  19. Ribeye
  20. Righteous
  21. Ruslan Display
  22. Stint Ultra Condensed

Handwriting Faces (all fail)

  1. Aguafina Script
  2. Aladin
  3. Devonshire
  4. Dr Sugiyama
  5. Fondamento
  6. Herr Von Muellerhoff
  7. Marck Script
  8. Miss Fajardos
  9. Miss Saint Delafield
  10. Monsieur La Doulaise
  11. Mr Bedford
  12. Mr Dafoe
  13. Mr De Haviland
  14. Mrs Sheppards
  15. Niconne
  16. Patrick Hand

Google introduces many new errors to Taipei-area maps

What on earth is going on over at Google?

Just last week I had nothing but love for Google Maps because it had finally made some important improvements to its maps of Taiwan. But just a few days later Google went and screwed up its maps again. The names of most of Taipei’s MRT stations are now written incorrectly. In most cases, this is merely a matter of form, with capitalization — and the important designation of MRT — missing. But in more than just a few instances some astonishing typos have been introduced. What’s especially puzzling and irksome about this is that in most of these cases Google Maps swapped good information for bad.

Meow tipped me off in a comment yesterday that “In Google Maps, Jiannan Rd. Station and Gangqian Station become Jianan road station and Ganggian station.”

Here’s a screenshot taken today of some MRT stations in Dazhi and Neihu:

As Meow said, Jiannan is written Jianan, and Gangqian is written Ganggian. What’s more, Dazhi is written Dachi, and Xihu is written His-Hu (Cupertino effect?).

There are now many such errors.

Here’s a screenshot taken last week.
dfd

And here’s the same place today.

As you can see, one of the instances of Jieyunsongjiangnanjing has been removed, which is good. But that’s the end of the good news. Another Jieyunsongjiangnanjing remains. And the one that was removed was replaced by Songjian nanjing station, with Songjiang misspelled and Nanjing and Station erroneously in lower case. And “MRT” is missing too.

It’s not just the station name that was changed, as the switch of one location from the Thai tourism office to the Panamanian embassy shows. (Perhaps both are in the same building.)

Here are some more examples of recently introduced errors.

Luchou should be Luzhou.
screenshot from Google maps showing 'Luchou' for 'Luzhou'

click to see unrotated image

screenshot from Google maps showing 'Sun-yat-sen memorial hall station' for 'Sun Yat-sen Memorial Hall Station'

The westernmost station on the blue line is now labeled Tongning. The pain! The pain! It should be Yongning, which is also visible.
screenshot from Google maps showing 'Tongning' instead of 'Yongning'

In perhaps the oddest example, Qili’an, which has been miswritten Qilian for years, has been redesignated Chlian.
screenshot from Google maps showing 'Chlian' instead of 'Qili'an'

Above we saw Gangqian written incorrectly as Ganggian and Minquan written incorrectly as Minguan. Here’s another example of a q being turned into a g: Banqiao has become Bangiao. Even the train station, which is a different rail system than the MRT, has been affected. But the High Speed Rail Station name remains in Tongyong Pinyin, which I most certainly disapprove of but which at least represents the current state of signage in the HSR system.
screenshot from Google maps showing 'Bangiao' instead of 'Banqiao'

Sloppy work, Google. Very sloppy. How could this have happened?

Google improves its maps of Taiwan

Two years ago when Google switched to Hanyu Pinyin in its maps of Taiwan, it did a poor job … despite the welcome use of tone marks.

Here are some of the problems I noted at the time:

  • The Hanyu Pinyin is given as Bro Ken Syl La Bles. (Terrible! Also, this is a new style for Google Maps. Street names in Tongyong were styled properly: e.g., Minsheng, not Min Sheng.)
  • The names of MRT stations remain incorrectly presented. For example, what is referred to in all MRT stations and on all MRT maps as “NTU Hospital” is instead referred to in broken Pinyin as “Tái Dà Y? Yuàn” (in proper Pinyin this would be Tái-Dà Y?yuàn); and “Xindian City Hall” (or “Office” — bleah) is marked as X?n Diàn Shì G?ng Su? (in proper Pinyin: “X?ndiàn Shìg?ngsu?” or perhaps “X?ndiàn Shì G?ngsu?“). Most but not all MRT stations were already this incorrect way (in Hanyu Pinyin rather than Tongyong) in Google Maps.
  • Errors in romanization point to sloppy conversions. For example, an MRT station in Banqiao is labeled X?n Bù rather than as X?np?. (? is one of those many Chinese characters with multiple Mandarin pronunciations.)
  • Tongyong Pinyin is still used in the names of most cities and townships (e.g., Banciao, not Banqiao).

I’m pleased to report that Google Maps has recently made substantial improvements.

First, and of fundamental importance, word parsing has finally been implemented for the most part. No more Bro Ken Syl La Bles. Hallelujah!

Here’s what this section of a map of Tainan looked like two years ago:

And here’s how it is now:

Oddly, “Jiànx?ng Jr High School” has been changed to “Tainan Municipal Chien-Shing Jr High School Library” — which is wordy, misleading (library?), and in bastardized Wade-Giles (misspelled bastardized Wade-Giles, at that). And “Girl High School” still hasn’t been corrected to “Girls’ High School”. (We’ll also see that problem in the maps for Taipei.)

But for the most part things are much better, including — at last! — a correct apostrophe: Y?u’ài St.

As these examples from Taipei show, the apostrophe isn’t just a one-off. Someone finally got this right.

Rén’ài, not Renai.
screenshot from Google Maps, showing how the correct Rén'ài (rather than the incorrect Renai) is used

Cháng’?n, not Changan.
screenshot from Google Maps, showing how the correct Cháng'?n is used

Well, for the most part right. Here we have the correct Dà’?n (and correct Ruì’?n) but also the incorrect Daan and Ta-An. But at least the street names are correct.
click for larger screenshot from Google Maps, showing how the correct Dà'?n (and correct Ruì'?n) is used but also the incorrect Daan and Ta-An

Second, MRT station names have been fixed … mostly. Most all MRT station names are now in the mixture of romanization and English that Taipei uses, with Google Maps also unfortunately following even the incorrect ones. A lot of this was fixed long ago. The stops along the relatively new Luzhou line, however, are all written wrong, as one long string of Pinyin.

To match the style used for other stations, this should be MRT Songjiang Nanjing, not Jieyunsongjiangnanjing.
screenshot from Google Maps, showing how the Songjiang-Nanjing MRT station is labeled 'Jieyunsongjiangnanjing Station' (with tone marks)

Third, misreadings of poyinzi (pòy?nzì/???) have largely been corrected.

Chéngd?, not Chéng D?u.
screenshot from Google Maps, showing how the correct 'Chéngd? Rd' is used

Like I said: have largely been corrected. Here we have the correct Chéngd? and Chóngqìng (rather than the previous maps’ Chéng D?u and Zhòng Qìng) but also the incorrect Houbu instead of the correct Houpu.
screenshot from Google Maps, showing how the correct Chóngqìng Rd and Chéngd? St are used but also how the incorrect Houbu (instead of Houpu) is shown

But at least the major ones are correct.

Unfortunately, the fourth point I raised two years ago (Tongyong Pinyin instead of Hanyu Pinyin at the district and city levels) has still not been addressed. So Google is still providing Tongyong Pinyin rather than the official Hanyu Pinyin at some levels. Most of the names in this map, for example, are distinctly in Tongyong Pinyin (e.g., Lujhou, Sinjhuang, and Banciao, rather than Luzhou, Xinzhuang, and Banqiao).

Google did go in and change the labels on some places from city to district when Taiwan revised their names; but, oddly enough, the company didn’t fix the romanization at the same time. But with any luck we won’t have to wait so long before Google finally takes care of that too.

Or perhaps we’ll have a new president who will revive Tongyong Pinyin and Google will throw out all its good work.

Google Translate’s Pinyin converter: now with apostrophes

Google has taken another major step toward making Google Translate‘s Pinyin converter decent. Finally, apostrophes.

Not long ago “??????????????” would have yielded “??rb?níy? ránér rénài lián?u p??r chá.” But now Google produces the correct “?’?rb?níy? rán’ér rén’ài lián’?u p?’?r chá.” (Well, one could debate whether that last one should be p?’?r chá, p?’?rchá, P?’?r chá, P?’?r Chá, or P?’?rchá. But the apostrophe is undoubtedly correct regardless.)

Also, the -men suffix is now solid with words (e.g., ??? –> péngyoumen and ??? –> háizimen). This is a small thing but nonetheless welcome.

The most significant remaining fundamental problem is the capitalization and parsing of proper nouns.

And numbers are still wrong, with everything being written separately. For example, “???????????????” should be rendered as “q?qi?n ji?b?i sìshís?n wàn w?qi?n liùb?i w?shíb?.” But Google is still giving this as “q? qi?n ji? b?i sì shí s?n wàn w? qi?n liù b?i w? shí b?.”

On the other hand, Google is starting to deal with “le”, with it being appended to verbs. This is a relatively tricky thing to get right, so I’m not surprised Google doesn’t have the details down yet.

So there’s still a lot of work to be done. But at least progress is being made in areas of fundamental importance. I’m heartened by the progress.

Related posts:

The current state:
screen shot of what Google Translate's Pinyin converter produces as of late September 2011

Google Translate’s Pinyin converter revisited

When Google Translate‘s Pinyin converter was first released about a year and a half ago, it sucked. Wow, did it ever suck. Since then, however, Google has instituted some changes. So it seems about time this was reexamined.

Fortunately, Google’s Pinyin converter is now much better than before.

Here’s the sort of FUBAR romanization — it certainly doesn’t deserve to be called Hanyu Pinyin — Google used to produce:

tán zh?ng guó de“y?“hé” wén” de wèn tí? w? jué de zuì h?o néng xi?n li?o jiè y? xià zài zh?ng guó t?ng yòng de y? yán?… rú gu? n? sh? yòng zh?ng guó de gòng tóng y? yán p? t?ng huà? n? li?o ji? zhè ge y? yán de y? f??b? rú“de? de? de“ hé“le” de bù tóng yòng f?? ma?zh? dào zhè ge y? yán de j? b?n y?n jié?bù b?o kuò sh?ng diào? zh? y?u408gè ma?

Now the same passage will look like this:

Tán zh?ngguó de “y?” hé “wén” de wèntí, w? juéde zuì h?o néng xi?n li?o jiè y?xià zài zh?ngguó t?ngyòng de y?yán…. Rúgu? n? sh?yòng zh?ngguó de gòngtóng y?yán p?t?nghuà, n? li?oji? zhège y?yán de y?f? (b?rú “de, de, de “hé “le” de bùtóng yòngf?) ma? Zh?dào zhège y?yán de j?b?n y?njié (bù b?okuò sh?ngdiào) zh?y?u 408 gè ma?

At last! Capitalization at the beginning of a sentence and word parsing! But — you knew there was going to be a but, didn’t you? — Google’s Pinyin converter falls significantly short because it still fails completely in two fundamental areas: capitalization of proper nouns and proper use of the apostrophe.

1. Proper Nouns

Google’s Pinyin converter fails to follow the basic point of capitalizing proper nouns. For example, here are some well-known place names. I have prefixed the names with “?” because Google automatically capitalizes the first word in a line; so to see how it handles capitalization of place names something other than the name must go first.

screenshot showing what happens if the following is entered into Google Translate: '???, ???, ???, ???'. That leads to the following in Google Translate: 'in Xi'an, in Chang [sic], in Chongqing, in Beijing'. But the romanization line reads 'Zai xian, Zai changan, Zai chongqing, Zai beijing'

Google Translate gets these right, other than the odd truncation of Chang’an. But the Pinyin converter (see the gray text at the bottom of the image above) fails to capitalize these, even though it correctly parses them as units and thus must “know” their meanings.

The same thing happens with personal names.

Input this:

????
????
????

Google Translate provides this:

Is Ma Ying-jeou
Mao Zedong
Chen Shui-bian

Those are correct, if the missing Iss are discounted.

But the Pinyin appears as “Shì m?y?ngji? Shì máozéd?ng Shì chénshu?bi?n“. So even though the software understands that these names are units, the capitalization and word parsing are still wrong and they are still not rendered as they should be in Pinyin: “M? Y?ngji?,” “Máo Zéd?ng,” “Chén Shu?bi?n.

There is nothing obscure about capitalizing proper nouns. How did this get missed?

2. Apostrophes

The cases of Xi’an and Chang’an above already demonstrate apostrophe omission. Let’s try a few more tests, including some words that are not proper nouns.

Input this:

?????
??
??
??

The Pinyin is rendered as “??rb?níy? Ránér Rénài Lián?u” rather than the correct forms of ?’?rb?níy?, rán’ér, rén’ài, and lián’?u.

As always I want to stress that, whatever you might have heard elsewhere, apostrophes are not optional. But the rules for their use are easy — so easy that I suspect a fairly simple computer script could fix this problem quickly and simply. (Only about 2 percent of Mandarin words, as written in Hanyu Pinyin, have apostrophes.)

As is the case with the mistakes with proper nouns, these apostrophe errors are all the more puzzling because Google Translate does not appear to share them. Fortunately, these problems should not be particularly difficult to fix, especially if the Pinyin converter can make better use of Google Translate’s database.

Although Google’s failures to implement capitalization of proper nouns and apostrophe use are significant problems, they could likely be corrected quickly and easily. (I strongly suspect this would take considerably less time than it has taken for me to write this post.) The result would be a vastly improved converter. So I am hopeful that Google will work on this soon.

3. Additional work

Once Google gets those basics fixed, it should focus on the simple matter of correcting spacing before and after some quotations (which would surely take just a few minutes to take care of) and any other such spacing errors, and fixing its word parsing related to numbers (which is a bit more complicated, though the basics are easy: everything from 1 to 100 is written solid).

Next would come something requiring a bit more care: the proper handling of Mandarin’s three tense-marking particles: zhe, guo, and le.

And Google should attach the pluralizing suffix -men to the word it modifies rather than leaving it separate (e.g., háizimen, not háizi men).

Then, with all of those taken care of, Google would have a pretty good Pinyin converter that I would be happy to praise. Of course even then it could still use other improvements; but those would most likely deal more with particulars than the fundamentals of how Pinyin is meant to be written.

A separate post, to be written soon, will compare the performance of several Pinyin converters (including Google’s). Stay tuned.

Google Translate and r?maji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original Google Translate
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? roku nichi gogo yon ji san go fun goro , t?ky? to chiyoda ku k?kyogaien no tod? ? uchibori d?ri ? no nij?bashi zen k?saten de , ch?goku kara no kank? kyaku no yon zero dai no dansei ga j?y?sha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shib? shi ta . kuruma wa hod? ni noriage te arui te i ta dansei ? roku ky? ? mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no j?tai . marunouchi sho wa , unten shi te i ta t?ky? to minato ku hakkin san ch?me , kaisha yakuin takahashi nobe tsubuse y?gi sha ? ni yon ? wo jid?sha unten kashitsu sh?gai no utagai de genk? han taiho shi , y?gi wo d? chishi ni kirikae te shirabe te iru .
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? d?sho ni yoru to , shib? shi ta dansei wa ?dan hod? wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni ky? handoru wo kiri , shad? to hod? no sakai ni oka re ta kasetsu no saku wo haneage , hod? ni noriage ta to iu . saku wa hod? de ran’ningu wo shi te i ta dansei ? san yon ? ni atari , dansei wa ry?ashi ni karui kega .
???????????????????????????????????????????? d?sho wa , shib? shi ta dansei no mimoto kakunin wo susumeru totomoni , t?ji no k?saten no shing? no j?ky? wo shirabe te iru .
????????????????????????????????????????? genba sh?hen wa t?ky? kank? no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate human transcriber
roku nichi gogo yon ji san go fun goro , t?ky? to chiyoda ku k?kyogaien no tod? ? uchibori d?ri ? no nij?bashi zen k?saten de , ch?goku kara no kank? kyaku no yon zero dai no dansei ga j?y?sha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shib? shi ta . kuruma wa hod? ni noriage te arui te i ta dansei ? roku ky? ? mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no j?tai . marunouchi sho wa , unten shi te i ta t?ky? to minato ku hakkin san ch?me , kaisha yakuin takahashi nobe tsubuse y?gi sha ? ni yon ? wo jid?sha unten kashitsu sh?gai no utagai de genk? han taiho shi , y?gi wo d? chishi ni kirikae te shirabe te iru . Muika gogo yo-ji sanj?go-fun goro, T?ky?-to Chiyoda-ku K?kyo Gaien no tod? (Uchibori d?ri) no Nij?bashi-zen k?saten de, Ch?goku kara no kank?-kyaku no yonj?-dai no dansei ga j?y?sha ni hanerare, zenshin o tsuyoku utte mamonaku shib?-shita. Kuruma wa hod? ni noriagete aruite ita dansei (rokuj?ky?) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no j?tai. Marunouchi-sho wa, unten-shite ita T?ky?-to Minato-ku Shirogane san-ch?me, kaisha yakuin Takahashi Nobuhiro y?gisha (nij?yon) o jid?sha unten kashitsu sh?gai no utagai de genk?han taiho-shi, y?gi o d?-chishi ni kirikaete shirabete iru.
d?sho ni yoru to , shib? shi ta dansei wa ?dan hod? wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni ky? handoru wo kiri , shad? to hod? no sakai ni oka re ta kasetsu no saku wo haneage , hod? ni noriage ta to iu . saku wa hod? de ran’ningu wo shi te i ta dansei ? san yon ? ni atari , dansei wa ry?ashi ni karui kega . D?-sho ni yoru to, shib?-shita dansei wa ?dan hod? o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni ky?-handoru o kiri, shad? to hod? no sakai ni okareta kasetsu no saku o haneage, hod? ni noriageta to iu. Saku wa hod? de ranningu o shite ita dansei (sanj?yon) ni atari, dansei wa ry?ashi ni karui kega.
d?sho wa , shib? shi ta dansei no mimoto kakunin wo susumeru totomoni , t?ji no k?saten no shing? no j?ky? wo shirabe te iru . D?-sho wa, shib?-shita dansei no mimoto kakunin o susumeru to tomo ni, t?ji no k?saten no shing? no j?ky? o shirabete iru.
genba sh?hen wa t?ky? kank? no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . Genba sh?hen wa T?ky? kank? no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the righthand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanj?go-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, wattate ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to match for romanized format and paragraph indentations have been restored.

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such as great problem. If, for instance, readers were accumstomed to seeing d?sho for ?? ‘same place’, then d?-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders — the reasons were well understood and described long ago (e.g. in Unger 1987) — and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.1

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

Google Translate’s new Pinyin function sucks

Google Translate has a new function: conversion to Hanyu Pinyin, which would be exciting and wonderful if it were any good. But unfortunately it’s terrible, all things considered.

What Google has created is about at the same level as scripts hobbyists cobbled together the hard way about a decade ago from early versions of CE-DICT. Don’t get me wrong: I greatly admire what sites such as Ocrat achieved way back when. But for Google — with all of its data, talent, and money — to do essentially no better so many years later is nothing short of a disgrace.

To see Google Translate’s Pinyin function in action you must select “Chinese (Simplified)” or “Chinese (Traditional)” — not English — for the “Translate into” option. And then click on “Show romanization”.

For example, here’s what happens with the following text from an essay on simplified and traditional Chinese characters by Zhang Liqing:

????“?”?“?”?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????“?? ?? ?“ ?“?” ???????? ????????????????????408???

screenshot of Google Translate with the text above

Google Translate will produce this:
screenshot of Google Translate with the text above and how Google Translate puts this into Pinyin (see text below)

tán zh?ng guó de“y?“hé” wén” de wèn tí? w? jué de zuì h?o néng xi?n li?o jiè y? xià zài zh?ng guó t?ng yòng de y? yán?zh?ng guó de zh? yào y? yán y?u n? xi??wéi shèn me w? shu? zhè ge? ér bù shu? nà gè?y?n wèi huán jìng?y?n wèi bèi qi?ng pò?y?n wèi w? ài zhè ge y? yán?y?n wèi y?u bì yào?y?n wèi zhè ge y? yán h?n zhòng yào?y? xi?ng xi?ng shén me shì zh?ng guó rén de gòng tóng y? yán?yòng y? gè gòng tóng y? yán y?u bì yào ma?wèi shé me?bié de hàn y? de qù xiàng huì z?n me yàng?rú gu? n? sh? yòng zh?ng guó de gòng tóng y? yán p? t?ng huà? n? li?o ji? zhè ge y? yán de y? f??b? rú“de? de? de“ hé“le” de bù tóng yòng f?? ma?zh? dào zhè ge y? yán de j? b?n y?n jié?bù b?o kuò sh?ng diào? zh? y?u408gè ma?

Here’s what’s wrong:

  • This is all bro ken syl la bles instead of word parsing. (So it’s never even a question if they get the use of the apostrophe correct.)
  • Proper nouns are not capitalized (e.g., zh?ng guó vs. Zh?ngguó).
  • The first letter in each sentence is not capitalized.
  • Punctuation is not converted but remains in double-width Chinese style, which is wrong for Pinyin.
  • Spacing around most punctuation is also incorrect (e.g., although a space is added after a comma and a closing parenthesis, there’s no space after a period or a question mark. See also the spacing or lack thereof around quotation marks, numerals, etc.)
  • Because of lack of word parsing, some given pronunciations are wrong.

In my previous post I complained about Google Maps’ unfortunately botched switch to Hanyu Pinyin. I stated there that, unlike Google Maps, Google Translate would correctly produce “Chengdu” from “??” (which it does when “translate into” is set for English). But I see that the romanization bug feature of Google Translate also fails this simple test. It generates the incorrect “chéng d?u“.

All of this indicates that Google apparently is using a poor database and not only has no idea of how Pinyin is meant to be written but also lacks an understanding of even the basic rules of Pinyin.

If you should need to use a free Web-based Pinyin converter, avoid Google Translate. Instead use Adso (from the fine folk at Popup Chinese) or perhaps NCIKU or MDBG — all of which, despite their limitations (c’mon, guys, sentences begin with capital letters), are significantly better than what Google offers.

By the way, Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.

For lagniappe, here’s a real Hanyu Pinyin version of the text above:

Tán Zh?ngguó de “y?” hé “wén” de wèntí, w? juéde zuìh?o néng xi?n li?oji? y?xià zài Zh?ngguó t?ngyòng de y?yán. Zh?ngguó de zh?yào y?yán y?u n?xi?? Wèishénme w? shu? zhège, ér bù shu? nàge? Y?nwei huánjìng? Y?nwei bèi qi?ngpò? Y?nwei w? ài zhège y?yán? Y?nwei y?u bìyào? Y?nwei zhè ge y?yán h?n zhòngyào? Y? xi?ngxiang shénme shì Zh?ngguórén de gòngtóng y?yán? Yòng y?ge gòngtóng y?yán y?u bìyào ma? Weishenme? Biéde Hàny? de qùxiàng huì z?nmeyàng? Rúgu? n? sh?yòng Zh?ngguó de gòng tóng y?yán P?tónghuà, n? li?oji? zhège y?yán de y?f? (b?rú “de” hé “le” de bùtóng y?ngf?) ma? Zh?dao zhège y?yán de j?b?n y?njié (bù bàokuò sh?ngdiào) zh? y?u 408 ge ma?

Google Maps switches to Hanyu Pinyin for Taiwan (sloppily)

Until very recently, Google Maps gave street names in Taiwan in Tongyong Pinyin — most of the time, at least. This was the case even for Taipei, which most definitely has long used Hanyu Pinyin, not Tongyong Pinyin. The romanization on Google Maps was really a hodgepodge in the maps of Taiwan. And it’s still kind of a mess; but now it’s at least more consistent — and more consistent in Hanyu Pinyin.

First the good. In Google Maps:

  • Hanyu Pinyin, not Tongyong Pinyin, is now used for street names throughout Taiwan
  • Tone marks are indicated. (Previous maps with Tongyong did not indicate tones.)

Now the bad, and unfortunately there’s a lot of it and it’s very bad indeed:

  • The Hanyu Pinyin is given as Bro Ken Syl La Bles. (Terrible! Also, this is a new style for Google Maps. Street names in Tongyong were styled properly: e.g., Minsheng, not Min Sheng.)
  • The names of MRT stations remain incorrectly presented. For example, what is referred to in all MRT stations and on all MRT maps as “NTU Hospital” is instead referred to in broken Pinyin as “Tái Dà Y? Yuàn” (in proper Pinyin this would be Tái-Dà Y?yuàn); and “Xindian City Hall” (or “Office” — bleah) is marked as X?n Diàn Shì G?ng Su? (in proper Pinyin: “X?ndiàn Shìg?ngsu?” or perhaps “X?ndiàn Shì G?ngsu?“). Most but not all MRT stations were already this incorrect way (in Hanyu Pinyin rather than Tongyong) in Google Maps.
  • Errors in romanization point to sloppy conversions. For example, an MRT station in Banqiao is labeled X?n Bù rather than as X?np?. (? is one of those many Chinese characters with multiple Mandarin pronunciations.)
  • Tongyong Pinyin is still used in the names of most cities and townships (e.g., Banciao, not Banqiao).

Screenshot from earlier this evening, showing that Tongyong Pinyin is still being used in Google Maps for some city and district names (e.g., Gueishan, Sinjhuang, Banciao, Jhonghe, Sindian, and Jhongjheng rather than Hanyu Pinyin’s Guishan, Xinzhuang, Banqiao, Zhonghe, Xindian, and Zhongzheng, respectively).
map of Taipei area, with names as shown above

I don’t have any old screenshots of my own available at the moment, so for now I’ll refer you to an image that Fili used in an old post of his. Compare that with this screenshot I took a few minutes ago from Google Maps of the same section of Tainan:
tainan_google_maps2

Note especially how the name of the junior high school is presented.

  • Previously “Jian Xing Junior High School”.
  • Now “Jiàn Xìng Jr High School”.

This is typical of how in old maps some things were labeled (poorly) in Hanyu Pinyin. (Words, not bro ken syl la bles, are the basis for Pinyin orthography. This is a big deal, not a minor error.) And now such places are still labeled poorly in Hanyu Pinyin, but with the addition of tone marks.

I’d like to return to the point earlier on sloppy conversions. Surprisingly, ??? is given as “Chéng Do? Road” rather than as “Chéngd? Road“.
screenshot from Google Maps of 'Cheng Dou [sic] Rd', near Taipei's Ximending
Although “Xinpu” might not be the sort of name to be contained in some romanization databases, there is nothing in the least obscure about Chengdu, the name of a city of some 11 million people. Google Translate certainly knows the right thing to do with ???:
screenshot from Google Translate, showing how Google will translate '???' as 'Chengdu Rd'

But Google Maps doesn’t get this simple point right, which likely points to outsourcing. Why would Google do this? And why wouldn’t it ensure that a better job was done? Because, really, so far the long-overdue conversion to Hanyu Pinyin in Google Maps for Taiwan is something of a botch.