ɑ vs. a

image of the rounded 'a' and the normal 'a' with the example given of the word 'Hanyu' (with tone marks)

About a year ago (which is roughly how overdue this post is), a commenter noted that some Chinese publishers “are convinced that Pinyin must be printed with ɑ (single-story „Latin alpha“, as opposed to double-story a), and with ɡ (single story; not double story g).”

But does Hanyu Pinyin in fact call for this longstanding Chinese habit of bad typography? This was one of the first questions I asked of Zhou Youguang, the father of Hanyu Pinyin, when I met with him: Are those who insist upon the ɑ-style letter correct?

“Oh, no,” Zhou replied. “That ‘ɑ’ is just for babies!” And he laughed that wonderful laugh of his that no doubt has contributed to his remarkable longevity.

Zhou was referring to the fact that the “ɑ” style of letter is usually found specifically in books for infants … and that this style generally does not belong elsewhere. In fact, ɑ and ɡ (written thus, as opposed to g) are often referred to as infant characters. A variant of the letter y is sometimes included in this set.

Letters in that style are also found in the West — but almost always in books for toddlers, and often not even in those. Furthermore, even in those cases the use of such letters appears to have no positive effect on children’s reading.

The correct-style letters for Pinyin are the same as those for English, Zhou stated.

I hope that anyone who has been using “ɑ” will both officially and in practice switch to “a”. It’s long past time that the supposed rule calling for “ɑ” was treated as a dead letter.

Long live good typography!

Chinese characters: Like, wow.

Some recent comments on the Hanzi domain name situation brought to mind a rant I was working on last month and then abandoned. But it seems worth finishing (relatively speaking, because this is a topic that touches upon so many areas that I could never get through it all) because the problem I discuss is a fairly common one. So today I’d like to address what I think of as the “like, wow” fetish of Chinese characters. In this, Chinese characters are regarded as if they bestowed a wonderful gift upon the reader that no other script could. But exactly how they do that, and what exactly that gift is, generally doesn’t make too much sense.

This sort of thing is common, and not just among New Age nonsense. A good example of this approach is found in Search Engine of the Song Dynasty, an op-ed piece published in the New York Times in mid-May. Basically, the author discusses how having URLs in Chinese characters is a good thing, but does so in a vague, flowery way that brings to mind a stoned grad student with a large vocabulary, which might not be so bad if the author had gotten the facts straight.

I had hoped for at least a little better, given that the author, Ruiyan Xu, has completed a novel, The Lost and Forgotten Languages of Shanghai, whose protagonist has bilingual aphasia. So one would expect Xu, who was born in Shanghai and moved to the United States at the age of 10, to have a better-than-basic understanding of linguistics. Alas, no — not if the article is anything to go by.

My annoyance here, though, isn’t specifically with Xu, who seems like a nice person and whose book has been getting some good advance reviews. It’s more with the “like, wow” phenomenon in general and the eagerness of the mainstream press to publish things about “Chinese” even though the substance of such articles falls apart if one devotes even just a little effort to examining it.

So let’s get into it.

Baidu.com, the popular search engine often called the Chinese Google, got its name from a poem written during the Song Dynasty (960-1279). The poem is about a man searching for a woman at a busy festival, about the search for clarity amid chaos. Together, the Chinese characters bi [sic] and dù mean “hundreds of ways,” and come out of the last lines of the poem: “Restlessly I searched for her thousands, hundreds of ways./ Suddenly I turned, and there she was in the receding light.”

For reference, I’ll provide the poem. I’ve put the Chinese characters used by Baidu.com in bold and red.

東風夜放花千樹。
更吹落、星如雨。
寶馬雕車香滿路。
鳳簫聲動,
玉壺光轉,
一夜魚龍舞。  

蛾兒雪柳黃金縷。
笑語盈盈暗香去。
眾裡尋他千百度
驀然迴首,
那人卻在,
燈火闌珊處。

The author of the poem, Xin Qiji (Xīn Qìjī / 辛棄疾 / 辛弃疾), lived from 1140 to 1207 and was thus a contemporary of such Western poets as the troubadours Bertran de Born, Bernart de Ventadorn, and Giraut de Borneil — hardly poets whose work suffered for having been written with an alphabet.

Baidu, rendered in Chinese, is rich with linguistic, aesthetic and historical meaning. But written phonetically in Latin letters (as I must do here because of the constraints of the newspaper medium and so that more American readers can understand), it is barely anchored to the two original characters; along the way, it has lost its precision and its poetry.

Ugh. Where to start?

I’ll go ahead and skip “precision,” even though that’s perhaps not a word best applied to most poetry written in Literary Sinitic, and start with “rendered in Chinese.” However common the word might be, “Chinese” is a poor choice. In this case, the word seems to be intended to mean not any particular language but rather “Chinese characters,” which are not a language. Here, too, she appears to be blaming Pinyin for having lost something from Literary Sinitic, which is what the poem was written in. But Pinyin isn’t for Literary Sinitic; it’s for modern standard Mandarin. Also, whatever language Xin Qiji spoke could have been written with an alphabet with no loss of meaning, just like all other natural languages.

As Web addresses increasingly transition to non-Latin characters as a result of the changing rules for domain names, that series of Latin letters Chinese people usually see at the top of the screen when they search for something on Baidu may finally turn into intelligible words: “a hundred ways.”

Baidu vs. 百度 on the Baidu.com home page:

Can you feel the difference in precision and poetry? No?

Also, it’s not clear just how much of a “transition to non-Latin characters” there’s going to be, especially where Chinese characters are concerned, especially in places like Singapore.

Of course, this expansion of languages for domain names could lead to confusion: users seeking to visit Web sites with names in a script they don’t read could have difficulty putting in the addresses, and Web browsers may need to be reconfigured to support non-Latin characters. The previous system, with domain names composed of numbers, punctuation marks and Latin letters without accents, promoted standardization, wrangling into consistency and simplicity one small part of the Internet.

For “could have difficulty putting in the addresses” read “could find it next to impossible to enter the correct address.” And by “one small part of the Internet,” she appears to mean the name of every single domain on the entire Internet.

But something else, something important, has been lost.

Part of the beauty of the Chinese language comes from a kind of divisibility not possible in a Latin-based language. Chinese is composed of approximately 20,000 single-syllable characters, 10,000 of which are in common use.

No, no, and no.

  • By “Latin-based language” the author seems to be referring not to a Romance language but to a language that uses the Latin alphabet for its standard script.
  • What exactly is this divisibility? Mandarin words can be divided into morphemes. The words of English, French, etc. work the same way.
  • No language is composed of Chinese characters.
  • There are a hell of a lot more than 20,000 Chinese characters.
  • “Common use” is difficult to pin down. But most authorities would give a lower number.

These characters each mean something on their own; they are also combined with other characters to form hundreds of thousands of multisyllabic words.

No, that’s wrong. Again: words — whether they be multisyllabic or monosyllabic — are not made of Chinese characters. Instead, Chinese characters are the script most seen for written Mandarin.

Níhăo, for example, Chinese for “Hello,” is composed of ní — “you,” and hăo — “good.” Isn’t “You good” — both as a statement and a question — a marvelous and strangely precise breakdown of what we’re really saying when we greet someone?

Again, this is assigning meaning to characters, when the meaning is of course in the word itself.

Note, too, that “níhăo” is incorrect in several ways.

  • One of the basic rules of Hanyu Pinyin is that tone sandhi is not indicated. In Mandarin, when two third-tone syllables occur in a row, the first shifts to second tone, so the greeting is pronounced níhǎo. Even so, the diacritical mark over the i should indicate third tone (ǐ) rather than second (í).
  • The diacritic over the a is wrong. It should be ǎ (a with caron, Unicode U+01CE), not ă (a with breve, Unicode U+0103): sharp vs. rounded. (You may need to enlarge the fonts on the screen to see this.)
  • Most careful authorities write this with a space rather than as solid: nǐ hǎo rather than nǐhǎo. This, though, is something I don’t much care about. Popular usage of Pinyin as a real script will eventually work this out one way or the other. Also, if someone is going to err in word parsing, I’d much rather they do it by making words solid rather than by breaking up the syllables.
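For anyone who can't see the difference between the two diacritics on screen, Unicode can tell them apart. What follows is only a rough sketch, not anything the newspaper or I actually used: the function name `fix_breves` is made up for illustration, and it assumes the text contains nothing but Pinyin, since a breve is perfectly correct in some other languages (Romanian, for one).

```python
import unicodedata

# Combining marks: the rounded breve and the pointed caron (hacek).
COMBINING_BREVE = "\u0306"
COMBINING_CARON = "\u030c"


def fix_breves(text):
    """Swap breves for carons (hypothetical helper, Pinyin-only text assumed)."""
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    repaired = decomposed.replace(COMBINING_BREVE, COMBINING_CARON)
    # NFC puts the letters back together as single code points.
    return unicodedata.normalize("NFC", repaired)


wrong = "h\u0103o"  # "hao" written with a-breve, as in the op-ed
right = fix_breves(wrong)

for label, word in (("wrong", wrong), ("right", right)):
    letter = word[1]
    print(label, word, f"U+{ord(letter):04X}", unicodedata.name(letter))
```

The point of going through decomposition is that the same few lines handle all five vowels, rather than needing a lookup table of letter pairs.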

The Romanization of Chinese into a phonetic system called Pinyin, using the Latin alphabet and diacritics (to indicate the four distinguishing tones in Mandarin), was developed by the Chinese government in the 1950s.

I’m a bit surprised the copy editors at the New York Times let that muddled sentence through. But I’ll pass over it without further observation.

Pinyin makes the language easier to learn and pronounce, and it has the added benefit of making Chinese characters easy to input into a computer. Yet Pinyin, invented for ease and standards, only represents sound.

In other words, Pinyin represents language — that being what writing systems are designed to do. And, yes, it’s easy to learn and use, which happens to be a good thing, not a bad one.

In Chinese, there are multiple characters with the exact same sound. The sound “băi,” for example, means 100, but it can also mean cypress, or arrange. And “Baidu,” without diacritics, can mean “a failed attempt to poison” or “making a religion of gambling.”

My dictionary gives some different phrases. But whatever. Then there’s also the simple point: If there’s a problem with writing Pinyin without diacritics, then don’t write Pinyin without diacritics, write it with diacritics. But I have a hard time imagining how anyone would get such things confused in context.

Q: “Honey, could you check Baidu for information on when that new movie is coming out?”
A: “Baidu? Sorry, could you write that in Chinese characters for me? I can’t tell if you mean ‘a failed attempt to poison,’ ‘making a religion of gambling,’ or the search engine.”

Behind this is just the usual homonym canard. In English, as in other languages, there are many morphemes with the exact same pronunciation (sound). If we look at the closest English has to the Mandarin sounds bai and du, we can get by, buy, bye, bi-, and dew, do, due, etc. — all of which have various meanings. Take a look.

Those who don’t need to be hit over the head again and again to understand the simple point that English has plenty of homonyms but does just fine with an alphabet (as would every other natural language, including of course Mandarin and the other Sinitic languages) may wish to just skim the following blockquote.

bi, bi-, buy, by, bye

buy (transitive verb)

  1. : to acquire possession, ownership, or rights to the
    use or services of by payment especially of money :
    purchase
    1. : to obtain in exchange for something often at
      a sacrifice <they bought peace with their
      freedom>
    2. : redeem
  2. : bribe, hire
  3. : to be the purchasing equivalent of <the dollar
    buys less today than it used to>
  4. : accept, believe <I don’t buy that hooey> -often
    used with into

buy (intransitive verb)

  1. : to make a purchase

buy (noun)

  1. : something of value at a favorable price; especially :
    bargain <it’s a real buy at that price>
  2. : an act of buying : purchase

bi (noun or adjective)

  1. : bisexual

bi- (prefix)

    1. : two <bilateral>
    2. : coming or occurring every two
      <bicentennial>
    3. : into two parts <bisect>
    1. : twice : doubly : on both sides
      <biconvex>
    2. : coming or occurring two times
      <biannual> – compare semi-
  1. : between, involving, or affecting two (specified)
    symmetrical parts <bilabial>
    1. : containing one (specified) constituent in
      double the proportion of the other constituent or
      in double the ordinary proportion
      <bicarbonate>
    2. : di- 2 <biphenyl>

bi- (Variant(s): or bio-) (combining form)

  1. : life : living organisms or tissue
    <bioluminescence> <biosphere>
  2. : biographical <biopic>

bye (noun)

  1. : the position of a participant in a tournament who
    advances to the next round without playing

by (preposition)

  1. : in proximity to : near <standing by the
    window>
    1. : through or through the medium of : via
      <enter by the door>
    2. : in the direction of : toward <north by
      east>
    3. : into the vicinity of and beyond : past
      <went right by him>
    1. : during the course of <studied by
      night>
    2. : not later than <by 2 p.m.>
    1. : through the agency or instrumentality of
      <by force>
    2. : born or begot of
    3. : sired or borne by
  2. : with the witness or sanction of <swear by all that
    is holy>
    1. : in conformity with <acted by the
      rules>
    2. : according to <called her by name>
    1. : on behalf of <did right by his
      children>
    2. : with respect to <a lawyer by
      profession>
    1. : in or to the amount or extent of <win by a
      nose>
    2. b chiefly Scottish : in comparison with :
      beside
  3. -used as a function word to indicate successive units or increments <little by little> <walk two by two>
  4. -used as a function word in multiplication, in division, and in measurements <divide a by b>
    <multiply 10 by 4> <a room 15 feet by 20 feet>
  5. : in the opinion of : from the point of view of
    <okay by me>

by (adjective)

  1. : being off the main route : side
  2. : incidental

by (noun)

  1. : something of secondary importance : a side issue

by/bye (interjection)

  1. : short for goodbye

dew, do, due

dew (noun)

  1. : moisture condensed upon the surfaces of cool bodies especially at night
  2. : something resembling dew in purity, freshness, or power to refresh
  3. : moisture especially when appearing in minute droplets: as

    1. : tears
    2. : sweat
    3. : droplets of water produced by a plant in transpiration

due (adjective)

  1. : owed or owing as a debt
    1. : owed or owing as a natural or moral right
      <everyone’s right to dissent is due the full protection of the Constitution – Nat Hentoff>
    2. : according to accepted notions or procedures : appropriate <with all due respect>
    1. : satisfying or capable of satisfying a need, obligation, or duty : adequate <giving the matter due attention>
    2. : regular, lawful <due proof of loss>
  2. : capable of being attributed : ascribable -used with to <this advance is partly due to a few men of genius –
    A. N. Whitehead>
  3. : having reached the date at which payment is required
    : payable <the rent is due>
  4. : required or expected in the prescribed, normal, or logical course of events : scheduled <the train is due at noon>; also : expected to give birth

due (noun)

  1. : something due or owed: as

    1. : something that rightfully belongs to one
      <give him his due>
    2. : a payment or obligation required by law or custom : debt
    3. plural : fees, charges <membership dues>

due (adverb)

  1. : directly, exactly <due north>
  2. <obsolete> : duly

do (transitive verb)

  1. : to bring to pass : carry out <do another’s wishes>
  2. : put -used chiefly in do to death
  3. : perform, execute

    1. <do some work> <did his duty>
    2. : commit <crimes done deliberately>
    1. : bring about, effect <trying to do good>
      <do violence>
    2. : to give freely : pay <do honor to her memory>
  4. : to bring to an end : finish -used in the past participle <the job is finally done>
  5. : to put forth : exert <did her best to win the race>
  6. : to wear out especially by physical exertion : exhaust
    <at the end of the race they were pretty well done> b
    : to attack physically : beat; also : kill
  7. : to bring into existence : produce <do a biography on the general>
  8. -used as a substitute verb especially to avoid
    repetition <if you must make such a racket, do it
    somewhere else>
  9. : to play the role or character of b : mimic; also : to behave like <do a Houdini and disappear> c : to perform in or serve as producer of <do a play>
  10. : to treat unfairly; especially : cheat <did him out of his inheritance>
  11. : to treat or deal with in any way typically with the sense of preparation or with that of care or attention:

      1. : to put in order : clean <was doing the kitchen>
      2. : wash <did the dishes after supper>
    1. : to prepare for use or consumption; especially : cook <like my steak done rare>
    2. : set, arrange <had her hair done>
    3. : to apply cosmetics to <wanted to do her face before the party>
    4. : decorate, furnish <did the living room in Early American> <do over the kitchen>
  12. : to be engaged in the study or practice of <do science>; especially : to work at as a vocation <what to do after college>
    1. : to pass over (as distance) : traverse <did 20 miles yesterday>
    2. : to travel at a speed of <doing 55 on the turnpike>
  13. : tour <doing 12 countries in 30 days>
    1. : to spend (time) in prison <has been doing time in a federal penitentiary>
    2. : to serve out (a period of imprisonment)
      <did ten years for armed robbery>
  14. : to serve the needs of : suit, suffice <worms will do us for bait>
  15. : to approve especially by custom, opinion, or propriety <you oughtn’t to say a thing like that — it’s not done – Dorothy Sayers>
  16. : to treat with respect to physical comforts <did themselves well>
  17. : use 3 <doesn’t do drugs>
  18. : to have sexual intercourse with
  19. : to partake of <let’s do lunch>

do (intransitive verb)

  1. : act, behave <do as I say>
    1. : get along, fare <do well in school>
    2. : to carry on business or affairs : manage
      <we can do without your help>
  2. : to take place : happen <what’s doing across the street>
  3. : to come to or make an end : finish -used in the past participle
  4. : to be active or busy <let us then be up and doing
    – H. W. Longfellow>
  5. : to be adequate or sufficient : serve <half of that will do>
  6. : to be fitting : conform to custom or propriety
    <won’t do to be late>
  7. -used as a substitute verb to avoid repetition
    <wanted to run and play as children do> ; used especially in British English following a modal auxiliary or perfective have <a great many people had died, or would do – Bruce Chatwin>
  8. -used in the imperative after an imperative to add emphasis <be quiet do>

do (verbal auxiliary)

    1. -used with the infinitive without to to form present and past tenses in legal and parliamentary language <do hereby bequeath> and in poetry <give what she did crave – Shakespeare>
    2. -used with the infinitive without to to form present and past tenses in declarative sentences with inverted word order <fervently do we pray –
      Abraham Lincoln>, in interrogative sentences
      <did you hear that?>, and in negative sentences <we don’t know> <don’t go>
  1. -used with the infinitive without to to form present and past tenses expressing emphasis <I do say> <do be careful>

do (noun)

  1. chiefly dialect : fuss, ado
  2. archaic : deed, duty
    1. : a festive get-together : affair, party
    2. chiefly British : battle
  3. : a command or entreaty to do something <a list of dos and don’ts>
  4. British : cheat, swindle
  5. : hairdo

All that’s without me bothering to get out a big dictionary.

Alas, poor English! How confused we must be to be using a mere alphabet. Oh, if only we could achieve linguistic, aesthetic, and historical meaning!

In the case of Baidu.com, the word, in Latin letters, has slipped away from its original context and meaning, and been turned into a brand.

Baidu is a brand, and is generally thought of as such regardless of what script it is written in. Furthermore, it’s understood as a “word” only as that search engine. In the poem the characters “百度” are used to write not one word but two. And even written in Hanzi, this is not something more than a relative handful of people in China or Taiwan would recognize as having come from that poem unless someone told them about it first.

Language is such a basic part of our lives, it seems ordinary and transparent. But language is strange and magical, too: it dredges up history and memory; it simultaneously bestows and destabilizes meaning. Each of the thousands of languages spoken around the world has its own system and rules, its own subversions, its own quixotic beauty. Whenever you try to standardize those languages, whether on the Internet, in schools or in literature, you lose something. What we gain in consistency costs us in precision and beauty.

When Chinese speakers Baidu (like Google, it too is a verb), we look for information on the Internet using a branded search engine. But when we see the characters for băi dù, we might, for one moment, engage with the poetry of our language, remember that what we are really trying to do is find what we were seeking in the receding light. Those sets of meanings, layered like a palimpsest, might appear suddenly, where we least expect them, in the address bar at the top of our browsers. And in some small way, those words, in our own languages, might help us see with clarity, and help us to make sense of the world.

Clarity? Clarity?!

I understand that the author, as a novelist rather than a linguist, might be preoccupied with the whole Ezra Pound “make it new” and “give people new eyes” thing. If so, good for her. But, still, one should not confuse flights of fancy, no matter how cool they might sound, with facts and should at least attempt not to be completely wrong about almost everything, especially when publishing in the New York Times.

If the argument for Chinese characters is supposed to be that their continued, indeed expanded, use is necessary so people can quote poems in Literary Sinitic out of context so that what would be at best a low-single-digit percentage of native speakers of Mandarin or another modern Sinitic language might recognize the allusion despite a lack of context and might get a Hanzi-licious frisson out of the experience … that would have to be one of the most ridiculous things I’ve ever read.

Kicking the irony meter way up on all this is that the author of those remarks on the really cool feelings one can get from reading Chinese characters cannot herself read texts written in them, though she neglected to mention that little bit of information in her New York Times piece.

And for irony on top of irony, as someone who left China at the age of 10, she likely still knows her native Sinitic language, so texts written in romanization could give her the literacy in that language that she lacks in Chinese characters. Romanization could provide meaning; but instead she harps upon the virtues of Chinese characters.

Oh, and for a final bit of irony, here’s something else the author apparently didn’t bother to check: 百度.com already exists. And is anyone surprised to hear that the site at that address is not a search engine of the Song dynasty? Here’s what it looks like.
screenshot of 百度.com -- a linkspam site -- as of July 1, 2010

That’s right: 百度.com is just a linkspam site. But apparently because, unlike baidu.com, it has Chinese characters in the URL it’s linkspam with its own quixotic beauty; it’s linkspam with its own sets of meanings, layered like a palimpsest; and it’s linkspam that is rich with linguistic, aesthetic and historical meaning.

C’mon, people! Feel the poetry of it! The precision!

Like, wow.

X marks the spot?

In December Taiwan will be getting a new city. In fact, it will be the most populous city in the entire country: Xīnběi Shì (新北市).

For those not familiar with the situation, I should perhaps give a bit of background. Taiwan won’t suddenly have more people or buildings. Instead, the area known as Taipei County (which does not include the city of Taipei but which occupies a much greater area than Taipei and has a much greater total population) will be getting a long-overdue official upgrade to a “special municipality,” which means that it will get a lot more money and civil servants per capita from the central government. And as such the area will be dubbed a city, even though in appearance and demographic patterns it isn’t really a city at all but still a county containing several cities (which are to become “districts” despite having hundreds of thousands more inhabitants than some other places labeled “cities”), lots of towns, and plenty of empty countryside.

The Mandarin name will change from Táiběi Xiàn to Xīnběi Shì. (Xīn is the Mandarin word for “new.” Xiàn is “county.” Shì is “city.” And běi is “north.”) The official so-called English name is, tentatively, “Xinbei City.” Hanyu Pinyin! Yea!

Talking about “English” names is often misleading, since many people conflate English and romanization of Mandarin; and the usual pattern of Taiwanese place names not written in Chinese characters tends to be MANDARIN PROPER NAME + ENGLISH CATEGORY (e.g., “Taoyuan County”). So, at least in this post, I’m going to be a bit sloppy about what I’m calling “English.” Forgive me. OK, now back to the subject.

A couple of days ago, however, both major candidates for the powerful position of running the area currently known as Taipei County (Táiběi Xiàn) had a rare bit of agreement: both expressed a preference for using “New Taipei City” instead of “Xinbei City.” Ugh.

And to top things off, a couple dozen pro-Tongyong Pinyin protesters were outside Taipei County Hall the same day to protest against using Xinbei because it contains what they characterize as China’s demon letter X. Actually, that last part of hyperbole isn’t all that much of an exaggeration of their position. The X makes it look like the city is being crossed out, some of the protesters claimed.

This is, of course, stupid. But unfortunately it’s the sort of stupidity that sometimes plays well here, given how this is a country that pandered to the superstitious by removing 4s from license plate numbers and ID cards and by changing the name of a subway line because if you cherry-picked from its syllables you could come up with a nickname that might remind people of a term for cheating in mah-jongg (májiàng). (Why bother with letting competent engineers do things the way they need to be done when problems can be fixed magically through attempts to eliminate puns!)

pro-Tongyong protesters hold up signs against using Hanyu Pinyin

The protesters would prefer the Tongyong form, Sinbei. I suspect foreigners here would rapidly change that to the English name “Sin City,” which I must admit would have a certain ring to it and might even be a tourist draw. Still, Tongyong has already done enough damage. Those wanting to promote Taiwan’s identity would be much better off channeling their energy into projects that might actually be useful to their cause.

The reason the government selected “Xinbei City” is that “New Taipei City” would be too similar to “Taipei City,” according to the head of the Taipei County Government’s Department of Civil Affairs. And, yes, they would be too similar. Also, Xinbei is simply the correct form in Hanyu Pinyin, which is Taiwan’s (and Taipei County’s) official romanization system. It would also be much better still to omit “city” altogether.

Consider how this might work on signs, keeping in mind that Taipei and Xīnběi Shì are right next to each other. So such similar names as “New Taipei City” and “Taipei City” would run the risk of confusion, unlike, say, the case of New Jersey and Jersey. I wonder if the candidates for mayor of Xinbei are under the impression that they should change the name of the town across from Danshui from Bālǐ to something else because visitors to Taiwan might otherwise think they could drive to the Indonesian island of Bali from northern Taiwan.

They probably said they liked “New Taipei City” better because it sounds “more English” to them. And it is more English than “Xinbei.” But that’s not a good thing.

Once again it may be necessary to point out what ought to be obvious: The reason so-called English place names are needed is not because foreigners need places to have names in the English language. If it were, I suppose we could redub many places with appropriate names in real English: “Ugly Dump Filled With Concrete Buildings” (with numbers appended so the many possibilities could be distinguished from each other), “Nuclear Waste Depository,” “Armpit of Taiwan,” “Beautiful Little Town that Turns Into a Tourist Hell on Weekends,” etc. The possibilities are endless, though perhaps some of the nicer places would need to be given awful names — following the Iceland/Greenland model — lest they be overrun. The problem is that Chinese characters are too damn hard, and people who can’t read them (i.e., most foreign residents and tourists) need to be able to find places on maps, on Web pages, through signs, etc. And they need to be able to communicate through speech with people in Taiwan about places. Having two different names — the Mandarin one and the so-called English one — is just confusing. Having one name in Mandarin written in two systems (Chinese characters and romanization), however, makes sense and works best. (If Taiwan were to switch to using Taiwanese instead of Mandarin, that would be a whole ‘nother kettle of fish.)

But things that make sense and politicians don’t often fit well together.

Consider the signs. What a @#$% mess this could be. Let’s compare a few ramifications of using Xinbei and Taipei vs. using New Taipei City and Taipei City.

Xinbei and Taipei.

  • basically no chance of confusing one with the other
  • short (6 characters each), thus fitting better on signs
  • preexisting “Taipei [City]” signs wouldn’t have to be changed
  • Xinbei would be the correct romanization and not repeat the misleading pei of bastardized Wade-Giles
  • definitely no need to add “city” to either name, because there would be no “Taipei County” that might need to be distinguished from the city of Taipei, nor would there be a “Xinbei County” that would need to be distinguished from the city of Xinbei

Now let’s look at the case of New Taipei City and Taipei City.

  • relatively easy to confuse at a glance
  • relatively easy to confuse in general
  • long, and don’t fit as easily on signs (“New Taipei City” = 15 characters, including spaces; “Taipei City” = 11 characters, including the space)
  • “New Taipei City” would continue the ill-advised and outdated practice of using bastardized Wade-Giles spellings
  • any time the common adjective new needs to be applied to something dealing with “New Taipei City” or “Taipei City” the chances for confusion and mistakes would increase even more, esp. in headlines
  • the worst choice
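The character counts in the two lists above are easy to check, and the "easy to confuse at a glance" point can be roughly quantified too. This is only a back-of-the-envelope sketch: the similarity score comes from Python's standard `difflib` module and is my own crude stand-in for visual confusability, not any recognized measure of sign legibility.

```python
from difflib import SequenceMatcher

# The two candidate pairs of names as they would appear on signs.
pairs = {
    "pinyin": ("Xinbei", "Taipei"),
    "english": ("New Taipei City", "Taipei City"),
}


def sign_length(name):
    """Count every character on the sign, spaces included."""
    return len(name)


def similarity(a, b):
    """Rough 0-to-1 score of how much two strings share (an assumed proxy)."""
    return SequenceMatcher(None, a, b).ratio()


lengths = {name: sign_length(name) for pair in pairs.values() for name in pair}
scores = {key: similarity(a, b) for key, (a, b) in pairs.items()}

for name, length in lengths.items():
    print(f"{name!r}: {length} characters")
for key, score in scores.items():
    print(f"{key} pair similarity: {score:.2f}")
```

A higher score means the two names share more of their letters in the same order, which is exactly what makes them harder to tell apart on a road sign glimpsed at speed.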

The Taipei County Council will determine the final version of the name in September.

(By the way, if any Taiwan reporters want to pick up on this blog post, please don’t just follow the usual practice here of simply asking one or two random foreigners if they think the name “New Taipei City” sounds OK, so then you conclude that there’s no problem. Try to get people who’ve actually thought about the situation for more than a few seconds and who could give you an informed opinion. My apologies to those reporters who of course know better.)

Taiwanese-English, English-Taiwanese dictionaries posted

Maryknoll Language Service Center has put online the complete texts of its Taiwanese-English and English-Taiwanese dictionaries. Better still, these have been released under a Creative Commons license. These are a terrific resource for anyone who’s interested in Hoklo.

Maryknoll deserves praise for this great work. Thanks are due, too, to Tailingua, which I know has been working behind the scenes to help make this happen.

From the English Amoy Dictionary (英語閩南語字典):
screenshot from the English-Taiwanese dictionary

And from the Taiwanese-English Dictionary (台語英語字典):
screenshot from the dictionary

source: Maryknoll dictionaries now free to download, Tailingua, June 17, 2010

Pinyin subtitles for Crouching Tiger, Hidden Dragon

Er, someone has created Hanyu Pinyin subtitles for the film Crouching Tiger, Hidden Dragon (Wòhǔcánglóng / 臥虎藏龍 / 卧虎藏龙). They’re in UTF-8 (Unicode) and come in two varieties: one with tone marks (link above), the other without. The latter would be useful primarily for those who have trouble getting diacritics to appear properly, such as many of those watching the movie through a TV hooked up to a DivX DVD player.
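For anyone curious how a toneless version can be derived from the tone-marked one, here is a minimal sketch in Python. It is only an illustration, not necessarily how the subtitle files themselves were produced, and the function name strip_tone_marks is made up for this example. It decomposes each letter, drops only the four tone diacritics, and recomposes the rest so that ü survives.

```python
import unicodedata

# Combining marks used for the four Mandarin tones in Hanyu Pinyin:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone_marks(text: str) -> str:
    """Remove Pinyin tone marks but keep other diacritics, such as the dots on ü."""
    decomposed = unicodedata.normalize("NFD", text)
    without_tones = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", without_tones)


print(strip_tone_marks("Wòhǔcánglóng"))  # Wohucanglong
print(strip_tone_marks("nǚ"))            # nü
```

Removing every combining mark would be simpler, but it would also turn ü into u, which changes the syllable rather than just dropping the tone.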

The set of subtitles also includes English and Mandarin in Chinese characters (both traditional and simplified versions).

The subtitles might seem to go by a bit quickly. But that’s generally because people don’t have much experience reading Hanyu Pinyin. (Also, the English subtitles leave out a lot. But the Pinyin ones are comprehensive.) Practice reading and you’ll get much faster at it.

Remember to use these only for good (e.g., practice reading Pinyin, Mandarin learning, helping those with problems reading Chinese characters) and not bad (e.g., piracy).

still from the movie, showing the subtitled text of Li Mubai saying 'Jianghu li wohucanglong'

Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original | Google Translate
6日午後4時35分ごろ、東京都千代田区皇居外苑の都道(内堀通り)の二重橋前交差点で、中国からの観光客の40代の男性が乗用車にはねられ、全身を強く打って間もなく死亡した。車は歩道に乗り上げて歩いていた男性(69)もはね、男性は頭を強く打って意識不明の重体。丸の内署は、運転していた東京都港区白金3丁目、会社役員高橋延拓容疑者(24)を自動車運転過失傷害の疑いで現行犯逮捕し、容疑を同致死に切り替えて調べている。 | roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .
同署によると、死亡した男性は横断歩道を歩いて渡っていたところを直進してきた車にはねられた。車は左に急ハンドルを切り、車道と歩道の境に置かれた仮設のさくをはね上げ、歩道に乗り上げたという。さくは歩道でランニングをしていた男性(34)に当たり、男性は両足に軽いけが。 | dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega .
同署は、死亡した男性の身元確認を進めるとともに、当時の交差点の信号の状況を調べている。 | dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .
現場周辺は東京観光のスポットの一つだが、最近はジョギングを楽しむ人も増えている。 | genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate | human transcriber
roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru . | Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jūtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.
dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega . | Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.
dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru . | Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.
genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . | Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the righthand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to match romanized format, and paragraph indentations have been restored.
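A few of the changes listed above are purely mechanical and can be sketched in code. The following Python fragment is a rough illustration only, not a description of Google's algorithm or of how the righthand column was actually prepared; the function name tidy_romaji is invented here. It handles the particle “wo” (item 6), unneeded apostrophes after n (item 7), spacing around punctuation (item 8), and sentence-initial capitals (part of item 4). The misread words, false spaces, run-together phrases, and proper nouns in items 1 to 4 require lexical knowledge that no short set of rules supplies.

```python
import re

# The apostrophe after n is only needed before a vowel or y (e.g. kon'ya).
APOSTROPHE_NOT_NEEDED = "n[\u2019'](?![aiueoyāīūēō])"


def tidy_romaji(text: str) -> str:
    # Item 6: the particle written "wo" becomes "o" (whole word only).
    text = re.sub(r"\bwo\b", "o", text)
    # Item 7: drop the apostrophe after n unless a vowel or y follows.
    text = re.sub(APOSTROPHE_NOT_NEEDED, "n", text)
    # Item 8: no space before commas and periods, none just inside parentheses.
    text = re.sub(r"\s+([,.])", r"\1", text)
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    # Item 4 (in part): capitalize the first letter of each sentence.
    text = re.sub(r"(^|\.\s+)(\w)",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return text.strip()


sample = ("saku wa hodō de ran\u2019ningu wo shi te i ta dansei ( san yon ) "
          "ni atari , dansei wa ryōashi ni karui kega .")
print(tidy_romaji(sample))
```

On the sample sentence this yields “Saku wa hodō de ranningu o shi te i ta dansei (san yon) ni atari, dansei wa ryōashi ni karui kega.”, which still has the false spaces in “shi te i ta” and the digit-by-digit “san yon”. That is the part a human editor, or a much better lexicon, has to supply.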

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders — the reasons were well understood and described long ago (e.g. in Unger 1987) — and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.1

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

Journal issue focuses on romanization

cover of this issue of the Journal of the Royal Asiatic Society

The most recent issue of the Journal of the Royal Asiatic Society of Great Britain and Ireland (third series, volume 20, part 1, January 2010) features the following articles on romanization movements and script reforms.

  • Editorial Introduction: Romanisation in Comparative Perspective, by İlker Aytürk
  • The Literati and the Letters: A Few Words on the Turkish Alphabet Reform, by Laurent Mignon
  • Alphabet Reform in the Six Independent ex-Soviet Muslim Republics, by Jacob M. Landau
  • Politics of Romanisation in Azerbaijan (1921–1992), by Ayça Ergun
  • Romanisation in Uzbekistan Past and Present, by Mehmet Uzman
  • Romanisation of Bengali and Other Indian Scripts, by Dennis Kurzon
  • The Rōmaji movement in Japan, by Nanette Gottlieb
  • Postscript from the JRAS Editor, Sarah Ansari

Unfortunately, none of these cover any Sinitic languages or the case of Vietnam. And Gottlieb’s take on rōmaji is certainly more conservative than Unger’s. But I expect this will all make for interesting reading.

I am able to view all of the articles on my system. But perhaps others will run up against a subscription wall.

I thank Victor H. Mair for drawing this publication to my attention.

Google Translate’s new Pinyin function sucks

Google Translate has a new function: conversion to Hanyu Pinyin, which would be exciting and wonderful if it were any good. But unfortunately it’s terrible, all things considered.

What Google has created is about at the same level as scripts hobbyists cobbled together the hard way about a decade ago from early versions of CE-DICT. Don’t get me wrong: I greatly admire what sites such as Ocrat achieved way back when. But for Google — with all of its data, talent, and money — to do essentially no better so many years later is nothing short of a disgrace.

To see Google Translate’s Pinyin function in action you must select “Chinese (Simplified)” or “Chinese (Traditional)” — not English — for the “Translate into” option. And then click on “Show romanization”.

For example, here’s what happens with the following text from an essay on simplified and traditional Chinese characters by Zhang Liqing:

談中國的“語”和“文”的問題,我覺得最好能先了解一下在中國通用的語言。中國的主要語言有哪些?為甚麼我說這個,而不說那個?因為環境?因為被強迫?因為我愛這個語言?因為有必要?因為這個語言很重要?也想想什麼是中國人的共同語言。用一個共同語言有必要嗎?為什麼?別的漢語的去向會怎麼樣?如果你使用中國的共同語言普通話,你了解這個語言的語法(比如“的, 得, 地“ 和“了” 的不同用法)嗎? 知道這個語言的基本音節(不包括聲調)只有408個嗎?

screenshot of Google Translate with the text above

Google Translate will produce this:
screenshot of Google Translate with the text above and how Google Translate puts this into Pinyin (see text below)

tán zhōng guó de“yǔ“hé” wén” de wèn tí, wǒ jué de zuì hǎo néng xiān liǎo jiè yī xià zài zhōng guó tōng yòng de yǔ yán。zhōng guó de zhǔ yào yǔ yán yǒu nǎ xiē?wéi shèn me wǒ shuō zhè ge, ér bù shuō nà gè?yīn wèi huán jìng?yīn wèi bèi qiǎng pò?yīn wèi wǒ ài zhè ge yǔ yán?yīn wèi yǒu bì yào?yīn wèi zhè ge yǔ yán hěn zhòng yào?yě xiǎng xiǎng shén me shì zhōng guó rén de gòng tóng yǔ yán。yòng yī gè gòng tóng yǔ yán yǒu bì yào ma?wèi shé me?bié de hàn yǔ de qù xiàng huì zěn me yàng?rú guǒ nǐ shǐ yòng zhōng guó de gòng tóng yǔ yán pǔ tōng huà, nǐ liǎo jiě zhè ge yǔ yán de yǔ fǎ(bǐ rú“de, de, de“ hé“le” de bù tóng yòng fǎ) ma?zhī dào zhè ge yǔ yán de jī běn yīn jié(bù bāo kuò shēng diào) zhǐ yǒu408gè ma?

Here’s what’s wrong:

  • This is all bro ken syl la bles instead of word parsing. (So it’s never even a question if they get the use of the apostrophe correct.)
  • Proper nouns are not capitalized (e.g., zhōng guó vs. Zhōngguó).
  • The first letter in each sentence is not capitalized.
  • Punctuation is not converted but remains in double-width Chinese style, which is wrong for Pinyin.
  • Spacing around most punctuation is also incorrect (e.g., although a space is added after a comma and a closing parenthesis, there’s no space after a period or a question mark. See also the spacing or lack thereof around quotation marks, numerals, etc.)
  • Because of lack of word parsing, some given pronunciations are wrong.
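The punctuation, spacing, and sentence-capitalization problems in that list are the easy ones, which is part of what makes their presence so irritating. As a rough illustration, here is a short Python sketch; the function name and the small punctuation table are my own inventions for this example, not anything Google or the other converters are known to use. It does nothing about word parsing or proper nouns, which need a real lexicon.

```python
import re

# Full-width Chinese punctuation mapped to half-width equivalents.
PUNCT = str.maketrans({
    "\u3002": ".",  # ideographic full stop
    "\uff0c": ",",  # full-width comma
    "\uff1f": "?",  # full-width question mark
    "\uff01": "!",  # full-width exclamation mark
    "\uff08": "(",  # full-width left parenthesis
    "\uff09": ")",  # full-width right parenthesis
    "\uff1a": ":",  # full-width colon
    "\uff1b": ";",  # full-width semicolon
})


def tidy_pinyin_punctuation(text: str) -> str:
    text = text.translate(PUNCT)
    # No space before sentence punctuation, exactly one space after it.
    text = re.sub(r"\s*([.,?!:;])\s*", r"\1 ", text)
    # One space before an opening parenthesis, none just inside either one.
    text = re.sub(r"\s*\(\s*", " (", text)
    text = re.sub(r"\s*\)", ")", text)
    # Capitalize the first letter of the text and of each sentence.
    text = re.sub(r"(^|[.?!]\s+)(\w)",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    return text.strip()


print(tidy_pinyin_punctuation("yīn wèi huán jìng\uff1fyīn wèi bèi qiǎng pò\uff1f"))
```

The sketch is deliberately naive: it would, for instance, put a space after the decimal point in a number, and it leaves quotation marks alone.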

In my previous post I complained about Google Maps’ unfortunately botched switch to Hanyu Pinyin. I stated there that, unlike Google Maps, Google Translate would correctly produce “Chengdu” from “成都” (which it does when “translate into” is set for English). But I see that the romanization bug, er, feature of Google Translate also fails this simple test. It generates the incorrect “chéng dōu”.
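The “chéng dōu” result is what one would expect from character-by-character lookup: 都 is most often read dōu, but in 成都 it is dū. A converter that checks a word list before falling back to single characters avoids this. The Python sketch below shows the idea with greedy longest matching; the two tiny tables and the function name are hypothetical stand-ins, and I am not claiming this is how any particular converter is built.

```python
# Tiny illustrative tables; a real converter would need a large lexicon.
CHAR_READINGS = {"成": "chéng", "都": "dōu", "我": "wǒ", "去": "qù"}
WORD_READINGS = {"成都": "Chéngdū"}


def to_pinyin(text: str, max_word_len: int = 4) -> str:
    out = []
    i = 0
    while i < len(text):
        # Greedy longest match: try the longest candidate word first.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            reading = WORD_READINGS.get(chunk)
            if reading is None and length == 1:
                reading = CHAR_READINGS.get(chunk)
            if reading is not None:
                out.append(reading)
                i += length
                break
        else:
            # Pass unknown characters through unchanged.
            out.append(text[i])
            i += 1
    return " ".join(out)


print(to_pinyin("我去成都"))                          # wǒ qù Chéngdū
print(" ".join(CHAR_READINGS[c] for c in "成都"))     # chéng dōu
```

Greedy matching is itself a crude form of word parsing and will make mistakes of its own on ambiguous strings, but even this much is enough to get the word readings, spacing, and capitalization of entries like Chéngdū right.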

All of this indicates that Google apparently is using a poor database and not only has no idea of how Pinyin is meant to be written but also lacks an understanding of even the basic rules of Pinyin.

If you should need to use a free Web-based Pinyin converter, avoid Google Translate. Instead use Adso (from the fine folk at Popup Chinese) or perhaps NCIKU or MDBG — all of which, despite their limitations (c’mon, guys, sentences begin with capital letters), are significantly better than what Google offers.

By the way, Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.

For lagniappe, here’s a real Hanyu Pinyin version of the text above:

Tán Zhōngguó de “yǔ” hé “wén” de wèntí, wǒ juéde zuìhǎo néng xiān liǎojiě yīxià zài Zhōngguó tōngyòng de yǔyán. Zhōngguó de zhǔyào yǔyán yǒu nǎxiē? Wèishénme wǒ shuō zhège, ér bù shuō nàge? Yīnwei huánjìng? Yīnwei bèi qiǎngpò? Yīnwei wǒ ài zhège yǔyán? Yīnwei yǒu bìyào? Yīnwei zhège yǔyán hěn zhòngyào? Yě xiǎngxiang shénme shì Zhōngguórén de gòngtóng yǔyán? Yòng yīge gòngtóng yǔyán yǒu bìyào ma? Wèishénme? Biéde Hànyǔ de qùxiàng huì zěnmeyàng? Rúguǒ nǐ shǐyòng Zhōngguó de gòngtóng yǔyán Pǔtōnghuà, nǐ liǎojiě zhège yǔyán de yǔfǎ (bǐrú “de” hé “le” de bùtóng yòngfǎ) ma? Zhīdao zhège yǔyán de jīběn yīnjié (bù bāokuò shēngdiào) zhǐ yǒu 408 ge ma?