Chinese characters: Like, wow.

Some recent comments on the Hanzi domain name situation brought to mind a rant I was working on last month and then abandoned. But it seems worth finishing (relatively speaking, because this is a topic that touches upon so many areas that I could never get through it all) because the problem I discuss is a fairly common one. So today I’d like to address what I think of as the “like, wow” fetish of Chinese characters. In this, Chinese characters are regarded as if they bestowed a wonderful gift upon the reader that no other script could. But exactly how they do that, and what exactly that gift is, generally doesn’t make much sense.

This sort of thing is common, and not just among New Age nonsense. A good example of this approach is found in Search Engine of the Song Dynasty, an op-ed piece published in the New York Times in mid May. Basically, the author discusses how having URLs in Chinese characters is a good thing, but does so in a vague, flowery way that brings to mind a stoned grad student with a large vocabulary — which might not be so bad if the author had gotten the facts straight.

I had hoped for at least a little better, given that the author, Ruiyan Xu, has completed a novel, The Lost and Forgotten Languages of Shanghai, whose protagonist has bilingual aphasia. So one would expect Xu, who was born in Shanghai and moved to the United States at the age of 10, to have a better-than-basic understanding of linguistics. Alas, no — not if the article is anything to go by.

My annoyance here, though, isn’t specifically with Xu, who seems like a nice person and whose book has been getting some good advance reviews. It’s more with the “like, wow” phenomenon in general and the eagerness of the mainstream press to publish things about “Chinese” even though the substance of such articles falls apart if one devotes even just a little effort to examining it.

So let’s get into it.

Baidu.com, the popular search engine often called the Chinese Google, got its name from a poem written during the Song Dynasty (960-1279). The poem is about a man searching for a woman at a busy festival, about the search for clarity amid chaos. Together, the Chinese characters bi [sic] and dù mean “hundreds of ways,” and come out of the last lines of the poem: “Restlessly I searched for her thousands, hundreds of ways./ Suddenly I turned, and there she was in the receding light.”

For reference, I’ll provide the poem. I’ve put the Chinese characters used by Baidu in bold and red.



The author of the poem, Xin Qiji (Xīn Qìjí / 辛棄疾 / 辛弃疾), lived from 1140 to 1207 and was thus a contemporary of such Western poets as the troubadours Bertran de Born, Bernart de Ventadorn, and Giraut de Borneil, hardly poets whose work suffered for having been written with an alphabet.

Baidu, rendered in Chinese, is rich with linguistic, aesthetic and historical meaning. But written phonetically in Latin letters (as I must do here because of the constraints of the newspaper medium and so that more American readers can understand), it is barely anchored to the two original characters; along the way, it has lost its precision and its poetry.

Ugh. Where to start?

I’ll go ahead and skip “precision,” even though that’s perhaps not a word best applied to most poetry written in Literary Sinitic, and start with “rendered in Chinese.” However common the word might be, “Chinese” is a poor choice. In this case, the word seems to be intended to mean not any particular language but rather “Chinese characters,” which are not a language. Here, too, she appears to be blaming Pinyin for having lost something from Literary Sinitic, which is what the poem was written in. But Pinyin isn’t for Literary Sinitic; it’s for modern standard Mandarin. Also, whatever language Xin Qiji spoke could have been written with an alphabet with no loss of meaning, just like all other natural languages.

As Web addresses increasingly transition to non-Latin characters as a result of the changing rules for domain names, that series of Latin letters Chinese people usually see at the top of the screen when they search for something on Baidu may finally turn into intelligible words: “a hundred ways.”

Baidu vs. 百度 on the home page:

Can you feel the difference in precision and poetry? No?

Also, it’s not clear just how much of a “transition to non-Latin characters” there’s going to be, especially where Chinese characters are concerned and particularly in places like Singapore.

Of course, this expansion of languages for domain names could lead to confusion: users seeking to visit Web sites with names in a script they don’t read could have difficulty putting in the addresses, and Web browsers may need to be reconfigured to support non-Latin characters. The previous system, with domain names composed of numbers, punctuation marks and Latin letters without accents, promoted standardization, wrangling into consistency and simplicity one small part of the Internet.

For “could have difficulty putting in the addresses” read “could find it next to impossible to enter the correct address.” And by “one small part of the Internet,” she appears to mean the name of every single domain on the entire Internet.

But something else, something important, has been lost.

Part of the beauty of the Chinese language comes from a kind of divisibility not possible in a Latin-based language. Chinese is composed of approximately 20,000 single-syllable characters, 10,000 of which are in common use.

No, no, and no.

  • By “Latin-based language” the author seems to be referring not to a Romance language but to a language that uses the Latin alphabet for its standard script.
  • What exactly is this divisibility? Mandarin words can be divided into morphemes. The words of English, French, etc. work the same way.
  • No language is composed of Chinese characters.
  • There are a hell of a lot more than 20,000 Chinese characters.
  • “Common use” is difficult to pin down. But most authorities would give a lower number.

These characters each mean something on their own; they are also combined with other characters to form hundreds of thousands of multisyllabic words.

No, that’s wrong. Again: words — whether they be multisyllabic or monosyllabic — are not made of Chinese characters. Instead, Chinese characters are the script most seen for written Mandarin.

Níhăo, for example, Chinese for “Hello,” is composed of ní — “you,” and hăo — “good.” Isn’t “You good” — both as a statement and a question — a marvelous and strangely precise breakdown of what we’re really saying when we greet someone?

Again, this is assigning meaning to characters, when the meaning is of course in the word itself.

Note, too, that “níhăo” is incorrect in several ways.

  • One of the basic rules of Hanyu Pinyin is that tone sandhi is not indicated. So even though the greeting is pronounced níhǎo (because in Mandarin, when two third tones occur in a row, the first shifts to second tone), the diacritical mark over the i should indicate third tone (ǐ) rather than second (í).
  • The diacritic over the a is wrong. It should be ǎ (a with a caron, Unicode U+01CE), not ă (a with a breve, Unicode U+0103): sharp vs. rounded. (You may need to enlarge the fonts on the screen to see this.)
  • Most careful authorities write this with a space rather than as solid: nǐ hǎo rather than nǐhǎo. This, though, is something I don’t much care about. Popular usage of Pinyin as a real script will eventually work this out one way or the other. Also, if someone is going to err in word parsing, I’d much rather they do it by making words solid rather than by breaking up the syllables.
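The breve-for-caron mix-up described in the second point is the sort of thing that can be checked mechanically. What follows is only a rough sketch in Python using the standard unicodedata module; the function name fix_breves is made up for illustration, and the sketch assumes the input is Hanyu Pinyin, where a breve never legitimately appears.

```python
import unicodedata

COMBINING_BREVE = "\u0306"  # rounded mark, as in ă (not used in Pinyin)
COMBINING_CARON = "\u030C"  # pointed mark, as in ǎ (third tone)

def fix_breves(pinyin: str) -> str:
    """Swap breves for carons. Assumes the text is Hanyu Pinyin."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    repaired = decomposed.replace(COMBINING_BREVE, COMBINING_CARON)
    # NFC puts the letters back together as single precomposed characters.
    return unicodedata.normalize("NFC", repaired)

print(fix_breves("h\u0103o"))  # input spelled with a-breve
```

This does nothing about tone sandhi or word parsing; it only repairs the shape of the third-tone mark.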

The Romanization of Chinese into a phonetic system called Pinyin, using the Latin alphabet and diacritics (to indicate the four distinguishing tones in Mandarin), was developed by the Chinese government in the 1950s.

I’m a bit surprised the copy editors at the New York Times let that muddled sentence through. But I’ll pass over it without further observation.

Pinyin makes the language easier to learn and pronounce, and it has the added benefit of making Chinese characters easy to input into a computer. Yet Pinyin, invented for ease and standards, only represents sound.

In other words, Pinyin represents language — that being what writing systems are designed to do. And, yes, it’s easy to learn and use, which happens to be a good thing, not a bad one.

In Chinese, there are multiple characters with the exact same sound. The sound “băi,” for example, means 100, but it can also mean cypress, or arrange. And “Baidu,” without diacritics, can mean “a failed attempt to poison” or “making a religion of gambling.”

My dictionary gives some different phrases. But whatever. Then there’s also the simple point: If there’s a problem with writing Pinyin without diacritics, then don’t write Pinyin without diacritics; write it with diacritics. But I have a hard time imagining how anyone would get such things confused in context.

Q: “Honey, could you check Baidu for information on when that new movie is coming out?”
A: “Baidu? Sorry, could you write that in Chinese characters for me? I can’t tell if you mean ‘a failed attempt to poison,’ ‘making a religion of gambling,’ or the search engine.”

Behind this is just the usual homonym canard. In English, as in other languages, there are many morphemes with the exact same pronunciation (sound). If we look at the closest sounds English has to the Mandarin bai and du, we get by, buy, bye, bi-, and dew, do, due, etc., all of which have various meanings. Take a look.

Those who don’t need to be hit over the head again and again to understand the simple point that English has plenty of homonyms but does just fine with an alphabet (as would every other natural language, including of course Mandarin and the other Sinitic languages) may wish to just skim the following blockquote.

bi, bi-, buy, by, bye

buy (transitive verb)

  1. : to acquire possession, ownership, or rights to the
    use or services of by payment especially of money :
    1. : to obtain in exchange for something often at
      a sacrifice <they bought peace with their
    2. : redeem
  2. : bribe, hire
  3. : to be the purchasing equivalent of <the dollar
    buys less today than it used to>
  4. : accept, believe <I don’t buy that hooey> -often
    used with into

buy (intransitive verb)

  1. : to make a purchase

buy (noun)

  1. : something of value at a favorable price; especially :
    bargain <it’s a real buy at that price>
  2. : an act of buying : purchase

bi (noun or adjective)

  1. : bisexual

bi- (prefix)

    1. : two <bilateral>
    2. : coming or occurring every two
    3. : into two parts <bisect>
    1. : twice : doubly : on both sides
    2. : coming or occurring two times
      <biannual> – compare semi-
  1. : between, involving, or affecting two (specified)
    symmetrical parts <bilabial>
    1. : containing one (specified) constituent in
      double the proportion of the other constituent or
      in double the ordinary proportion
    2. : di- 2 <biphenyl>

bi- (Variant(s): or bio-) (combining form)

  1. : life : living organisms or tissue
    <bioluminescence> <biosphere>
  2. : biographical <biopic>

bye (noun)

  1. : the position of a participant in a tournament who
    advances to the next round without playing

by (preposition)

  1. : in proximity to : near <standing by the
    1. : through or through the medium of : via
      <enter by the door>
    2. : in the direction of : toward <north by
    3. : into the vicinity of and beyond : past
      <went right by him>
    1. : during the course of <studied by
    2. : not later than <by 2 p.m.>
    1. : through the agency or instrumentality of
      <by force>
    2. : born or begot of
    3. : sired or borne by
  2. : with the witness or sanction of <swear by all that
    is holy>
    1. : in conformity with <acted by the
    2. : according to <called her by name>
    1. : on behalf of <did right by his
    2. : with respect to <a lawyer by
    1. : in or to the amount or extent of <win by a
    2. b chiefly Scottish : in comparison with :
  3. -used as a function word to indicate successive units or increments <little by little> <walk two by two>
  4. -used as a function word in multiplication, in division, and in measurements <divide a by b>
    <multiply 10 by 4> <a room 15 feet by 20 feet>
  5. : in the opinion of : from the point of view of
    <okay by me>

by (adjective)

  1. : being off the main route : side
  2. : incidental

by (noun)

  1. : something of secondary importance : a side issue

by/bye (interjection)

  1. : short for goodbye

dew, do, due

dew (noun)

  1. : moisture condensed upon the surfaces of cool bodies especially at night
  2. : something resembling dew in purity, freshness, or power to refresh
  3. : moisture especially when appearing in minute droplets: as

    1. : tears
    2. : sweat
    3. : droplets of water produced by a plant in transpiration

due (adjective)

  1. : owed or owing as a debt
    1. : owed or owing as a natural or moral right
      <everyone’s right to dissent is due the full protection of the Constitution – Nat Hentoff>
    2. : according to accepted notions or procedures : appropriate <with all due respect>
    1. : satisfying or capable of satisfying a need, obligation, or duty : adequate <giving the matter due attention>
    2. : regular, lawful <due proof of loss>
  2. : capable of being attributed : ascribable -used with to <this advance is partly due to a few men of genius –
    A. N. Whitehead>
  3. : having reached the date at which payment is required
    : payable <the rent is due>
  4. : required or expected in the prescribed, normal, or logical course of events : scheduled <the train is due at noon>; also : expected to give birth

due (noun)

  1. : something due or owed: as

    1. : something that rightfully belongs to one
      <give him his due>
    2. : a payment or obligation required by law or custom : debt
    3. plural : fees, charges <membership dues>

due (adverb)

  1. : directly, exactly <due north>
  2. <obsolete> : duly

do (transitive verb)

  1. : to bring to pass : carry out <do another’s wishes>
  2. : put -used chiefly in do to death
  3. : perform, execute

    1. <do some work> <did his duty>
    2. : commit <crimes done deliberately>
    1. : bring about, effect <trying to do good>
      <do violence>
    2. : to give freely : pay <do honor to her memory>
  4. : to bring to an end : finish -used in the past participle <the job is finally done>
  5. : to put forth : exert <did her best to win the race>
  6. : to wear out especially by physical exertion : exhaust
    <at the end of the race they were pretty well done> b
    : to attack physically : beat; also : kill
  7. : to bring into existence : produce <do a biography on the general>
  8. -used as a substitute verb especially to avoid
    repetition <if you must make such a racket, do it
    somewhere else>
  9. : to play the role or character of b : mimic; also : to behave like <do a Houdini and disappear> c : to perform in or serve as producer of <do a play>
  10. : to treat unfairly; especially : cheat <did him out of his inheritance>
  11. : to treat or deal with in any way typically with the sense of preparation or with that of care or attention:

      1. : to put in order : clean <was doing the kitchen>
      2. : wash <did the dishes after supper>
    1. : to prepare for use or consumption; especially : cook <like my steak done rare>
    2. : set, arrange <had her hair done>
    3. : to apply cosmetics to <wanted to do her face before the party>
    4. : decorate, furnish <did the living room in Early American> <do over the kitchen>
  12. : to be engaged in the study or practice of <do science>; especially : to work at as a vocation <what to do after college>
    1. : to pass over (as distance) : traverse <did 20 miles yesterday>
    2. : to travel at a speed of <doing 55 on the turnpike>
  13. : tour <doing 12 countries in 30 days>
    1. : to spend (time) in prison <has been doing time in a federal penitentiary>
    2. : to serve out (a period of imprisonment)
      <did ten years for armed robbery>
  14. : to serve the needs of : suit, suffice <worms will do us for bait>
  15. : to approve especially by custom, opinion, or propriety <you oughtn’t to say a thing like that — it’s not done – Dorothy Sayers>
  16. : to treat with respect to physical comforts <did themselves well>
  17. : use 3 <doesn’t do drugs>
  18. : to have sexual intercourse with
  19. : to partake of <let’s do lunch>

do (intransitive verb)

  1. : act, behave <do as I say>
    1. : get along, fare <do well in school>
    2. : to carry on business or affairs : manage
      <we can do without your help>
  2. : to take place : happen <what’s doing across the street>
  3. : to come to or make an end : finish -used in the past participle
  4. : to be active or busy <let us then be up and doing
    – H. W. Longfellow>
  5. : to be adequate or sufficient : serve <half of that will do>
  6. : to be fitting : conform to custom or propriety
    <won’t do to be late>
  7. -used as a substitute verb to avoid repetition
    <wanted to run and play as children do> ; used especially in British English following a modal auxiliary or perfective have <a great many people had died, or would do – Bruce Chatwin>
  8. -used in the imperative after an imperative to add emphasis <be quiet do>

do (verbal auxiliary)

    1. -used with the infinitive without to to form present and past tenses in legal and parliamentary language <do hereby bequeath> and in poetry <give what she did crave – Shakespeare>
    2. -used with the infinitive without to to form present and past tenses in declarative sentences with inverted word order <fervently do we pray –
      Abraham Lincoln>, in interrogative sentences
      <did you hear that?>, and in negative sentences <we don’t know> <don’t go>
  1. -used with the infinitive without to to form present and past tenses expressing emphasis <I do say> <do be careful>

do (noun)

  1. chiefly dialect : fuss, ado
  2. archaic : deed, duty
    1. : a festive get-together : affair, party
    2. chiefly British : battle
  3. : a command or entreaty to do something <a list of dos and don’ts>
  4. British : cheat, swindle
  5. : hairdo

All that’s without me bothering to get out a big dictionary.

Alas, poor English! How confused we must be to be using a mere alphabet. Oh, if only we could achieve linguistic, aesthetic, and historical meaning!

In the case of Baidu.com, the word, in Latin letters, has slipped away from its original context and meaning, and been turned into a brand.

Baidu is a brand, and is generally thought of as such regardless of what script it is written in. Furthermore, it’s understood as a “word” only as that search engine. In the poem the characters “百度” are used to write not one word but two. And even written in Hanzi, this is not something that more than a relative handful of people in China or Taiwan would recognize as having come from that poem unless someone told them about it first.

Language is such a basic part of our lives, it seems ordinary and transparent. But language is strange and magical, too: it dredges up history and memory; it simultaneously bestows and destabilizes meaning. Each of the thousands of languages spoken around the world has its own system and rules, its own subversions, its own quixotic beauty. Whenever you try to standardize those languages, whether on the Internet, in schools or in literature, you lose something. What we gain in consistency costs us in precision and beauty.

When Chinese speakers Baidu (like Google, it too is a verb), we look for information on the Internet using a branded search engine. But when we see the characters for băi dù, we might, for one moment, engage with the poetry of our language, remember that what we are really trying to do is find what we were seeking in the receding light. Those sets of meanings, layered like a palimpsest, might appear suddenly, where we least expect them, in the address bar at the top of our browsers. And in some small way, those words, in our own languages, might help us see with clarity, and help us to make sense of the world.

Clarity? Clarity?!

I understand that the author, as a novelist rather than a linguist, might be preoccupied with the whole Ezra Pound “make it new” and “give people new eyes” thing. If so, good for her. But, still, one should not confuse flights of fancy, no matter how cool they might sound, with facts and should at least attempt not to be completely wrong about almost everything, especially when publishing in the New York Times.

If the argument for Chinese characters is supposed to be that their continued, indeed expanded, use is necessary so people can quote poems in Literary Sinitic out of context so that what would be at best a low-single-digit percentage of native speakers of Mandarin or another modern Sinitic language might recognize the allusion despite a lack of context and might get a Hanzi-licious frisson out of the experience … that would have to be one of the most ridiculous things I’ve ever read.

Kicking the irony meter way up on all this is that the author of those remarks on the really cool feelings one can get from reading Chinese characters cannot herself read texts written in them, though she neglected to mention that little bit of information in her New York Times piece.

And for irony on top of irony, as someone who left China at the age of 10, she likely still knows her native Sinitic language, so texts written in romanization could give her the literacy in that language that she lacks in Chinese characters. Romanization could provide meaning; but instead she harps upon the virtues of Chinese characters.

Oh, and for a final bit of irony, here’s something else the author apparently didn’t bother to check: 百度.com already exists. And is anyone surprised to hear that the site at that address is not a search engine of the Song dynasty? Here’s what it looks like.
screenshot of 百度.com (a linkspam site) as of July 1, 2010

That’s right: 百度.com is just a linkspam site. But apparently because, unlike Baidu.com, it has Chinese characters in the URL it’s linkspam with its own quixotic beauty; it’s linkspam with its own sets of meanings, layered like a palimpsest; and it’s linkspam that is rich with linguistic, aesthetic and historical meaning.

C’mon, people! Feel the poetry of it! The precision!

Like, wow.

.sg domain names in Chinese characters lag

Between November 23, 2009, when Singapore first began registering .sg names in Chinese characters, and June 10, 2010, when registrations of Chinese-character .sg domain names opened to all without any additional fee, only 1,024 such names were registered, or just 0.88 percent of all .sg domain names. This apparently includes not just second-level domains (e.g., ??.sg) but also third-level domains (e.g., ??

The percentage will likely rise in the coming months, as the process has only recently opened to everyone on a first-come, first-served basis. But, still, demand for such names in Singapore has so far been underwhelming.

A bit more information:

Registrations were accepted in phases, with registrations for government organizations starting on Nov. 23, 2009. Beginning in January, SGNIC began accepting domain name registrations from trademark holders.

During the third phase, the general public was allowed to register domain names starting on March 25, but applicants were charged a “priority fee” of S$100 (US$72) for each domain name, with domain names sought by several applicants awarded to the highest bidder.

In all three phases, applicants could apply for a domain name made up of Chinese numbers or a name with just one Chinese character for a fee of S$500 [US$360]….

The fourth and final phase began on June 10, with SGNIC accepting domain name applications on a first-come, first-served basis. The S$100 priority fee is no longer required, but applicants are no longer allowed to register domain names using Chinese numbers or names with just one Chinese character….

When IDA announced the introduction of Chinese-language domain names last year, SGNIC said the effort was partly intended to help Singaporean businesses target the Chinese market.

source: Singapore registers 1,000 Chinese-language domain names, IDG News Service, June 23, 2010

Taiwanese-English, English-Taiwanese dictionaries posted

Maryknoll Language Service Center has put online the complete texts of its Taiwanese-English and English-Taiwanese dictionaries. Better still, these have been released under a Creative Commons license. These are a terrific resource for anyone who’s interested in Hoklo.

Maryknoll deserves praise for this great work. Thanks are due, too, to Tailingua, which I know has been working behind the scenes to help make this happen.

From the English Amoy Dictionary (???????):
screenshot from the English-Taiwanese dictionary

And from the Taiwanese-English Dictionary (??????):
screenshot from the dictionary

source: Maryknoll dictionaries now free to download, Tailingua, June 17, 2010

OMG, it’s Hanzified English

Taiwanese movie poster in Mandarin for 'Date Night', a.k.a. '約會喔麥尬'
In Taiwan, the new movie Date Night has been given the Mandarin title Yuēhuì o mài gà (約會喔麥尬/约会喔麦尬).

Yuēhuì is simply the word for “date.” The interesting part is “o mài gà” (喔麥尬), which is a Mandarinized form of the English “oh my god.” (I wonder if this, being written in Hanzi despite still being basically English, would pass China’s new need for supposed purity.)

Most people here, especially those younger than about 40, would simply write “oh my god” (or, less frequently, “o my god”) in English in the middle of an otherwise Mandarin text. (I’ll spare everyone the chart of Google searches; but it backs this up.) But brevity is standard in movie titles here, and “喔麥尬” is a lot more compact on a movie poster than “oh my god.” This, however, raises the question of why “喔麥尬” instead of the equally concise “OMG”. I don’t know the answer to that. But the path of lettered words in Mandarin is certainly not without twists and turns.

Like most other uses of Hanzified English, the results are not entirely faithful to the original sounds.

Mandarin’s ou would be a closer phonetic fit than o for the English “oh”. There’s Ōu (區/区), a surname. But most of the time this Chinese character is pronounced qū (being one of those many Chinese characters with multiple pronunciations), so that certainly wouldn’t work well. There’s ǒu, which has a more clearly phonetic Hanzi (嘔/呕), but which has to do with vomit (ǒutù/嘔吐/呕吐). Another possible choice would be ōu (歐/欧); but that is associated mainly with Europe and doesn’t get used much as a phonetic component in non-Europe-related loan words outside the word for ohm: ōumǔ (歐姆/欧姆).

Mài (the Mandarin word for wheat), unlike most other Mandarin morphemes pronounced mai (various tones), gets used phonetically in lots of various loan words, such as Màidāngláo (McDonald’s/麥當勞/麦当劳), Màijiā (Mecca/麥加/麦加), Dānmài (Denmark/丹麥/丹麦), and Kāmàilóng (Cameroon/喀麥隆/喀麦隆). So its use is to be expected, though semantically there’s no link. And mài is certainly a better fit for the English my than it is for the Mc of McDonald’s, the Mec of Mecca, the mark of Denmark, or the me of Cameroon.

For ga there’s not a lot of choice. 咖 is often seen in the phonetic loan gālí (curry). The biggest problem here is that the same 咖 is also used as kā in a different, common phonetic loan: kāfēi (coffee). There’s ?; but, like ?, it’s not exactly a well-known character.

Anyway, I could go on for a long time listing various possibilities. But the main point is that Chinese characters just don’t do well at this sort of thing.

As for Pinyin, I suppose the orthography could get interesting: o mài gà, o màigà, omài gà, or omàigà. But a Pinyin orthography would probably simply encourage people to write this in the original: oh my god.

BTW, you may wish to try the following experiment. The 尬 in o mài gà is most often seen in writing the word gāngà (尷尬/尴尬), which means awkward/embarrassed. Ask native speakers of Mandarin to write gāngà in Hanzi for you by hand without using a dictionary, a computer, or any other form of assistance. I bet that most people, even those with university degrees, won’t be able to write this common, ordinary word correctly.

And for lagniappe, the character 尬 is also sometimes seen in written Taiwanese as the equivalent of Mandarin’s jiā (加/add). I spotted an example of this just the other day on a cafe sign (in the sense of “buy something and ga something else for a special price”) but didn’t have a camera with me.

How to create Hanyu Pinyin subtitles

Since posting about the Pinyin subtitles for Crouching Tiger, Hidden Dragon and The Story of Stuff I have received several messages inquiring about how someone might make Pinyin subtitles themselves. So I might as well put the answer online.

Although at the present stage of software implementation subtitle conversion isn’t as simple as pushing a button, the process is not particularly difficult, assuming you have a good source text to work from. But this does require some time and the right tools.

The Right Tools

The most important tool is, of course, the one that performs the conversion to Hanyu Pinyin. And it’s crucial to keep in mind that not all Pinyin converters are created equal; in fact, the vast majority of so-called Pinyin converters are best avoided entirely. The world does not need any more texts in the hobbled, poorly written mess that many people erroneously think of as Hanyu Pinyin; but it very much needs texts in real Hanyu Pinyin. So don’t waste your time with a program that doesn’t do a good job of word parsing, etc.

At present the clear front-runner for converting Chinese characters to Hanyu Pinyin texts (real Hanyu Pinyin texts) with a minimum need for user assistance is Key Chinese (Windows and Mac). The demo version is fully functional for 30 days. Key’s considerably less expensive “Hanzi To Pinyin With Tones Conversion Utility” for MS Word texts (also with a 30-day demo) would probably also work well, though I haven’t tried it myself.

Wenlin (Windows and Mac) is another excellent program that can produce properly spelled and word-parsed Hanyu Pinyin. But it requires users to run some disambiguation themselves, which can take a lot of time when you’re talking about something with as much text as a screenplay. Nonetheless, Wenlin’s incorporation of John DeFrancis’s ABC Chinese-English Comprehensive Dictionary makes it a helpful reference when performing post-conversion checks. Also, especially if one does not have Key, Wenlin — even the function-limited but non-expiring demo version — is useful for handling some adjustments (such as removing tone marks or providing a workaround when dealing with programs that don’t handle Chinese characters well).

You’ll also need a Unicode-friendly text editor with good support for regular expressions (to allow wildcard searches). I like EmEditor, which is Windows based. But lots of other programs would work. One could even use MS Word if so inclined.

Finally, having subtitles in an additional language (usually but not necessarily English) is often desirable, not just for others who would use these subtitles but for yourself as you create the Pinyin subtitles. But often the subtitles one may find in Mandarin are not in synch with those in another language. Software can fix this problem. But I don’t have enough experience with this to recommend certain programs over others.

To sum up, the tools I recommend for creating Hanyu Pinyin subtitles are

  1. Key Chinese
  2. Wenlin
  3. EmEditor (or another Unicode-friendly text editor)
  4. a subtitle synchronizer

Actually, just the first one, Key, is sufficient to produce Pinyin subtitles. But in my experience using a combination of all four programs is preferable.

Now it’s time to get down to business.

The Main Steps

  1. acquire source-version subtitles
  2. synchronize subtitle files
  3. identify names of the movie’s characters (dramatis personae)
  4. perform initial conversion of subtitles in Chinese characters to Pinyin
  5. double check the results and perform necessary cleanup
  6. create additional version without tone marks
  7. share your work

1. Acquire subtitles for conversion and reference

At present the most useful site for finding Mandarin subtitles written in Chinese characters is probably Shooter. You may need to try searching for your desired title in both simplified and traditional characters. Also, be aware that movies — especially movies not filmed in Mandarin — often have different names in China, Taiwan, Hong Kong, etc.

You may find it useful to look for subtitles in other languages, too. Shooter can be useful for that, though you may have better luck finding English subtitles at English-language subtitle sites.

One can often find different subtitle files for the same movie, so you may wish to examine more than one for quality. Another thing that’s worth keeping in mind: Converting from traditional Chinese characters to simplified Chinese characters is less problematic than vice versa.

2. Synchronize subtitle files

Once you have the files, you should synchronize them with each other according to the directions for the particular program you are using.
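If the two files differ only by a constant delay, a few lines of Python may be enough to line them up. What follows is only a rough sketch, not a replacement for a real synchronizer: it assumes the subtitles are in SubRip (.srt) format with timestamps like 00:01:02,500, and the function names are my own. It won't help when the files drift apart gradually (for example, because of different frame rates).

```python
import re

# SubRip timestamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _shift_one(match, offset_ms):
    """Shift a single timestamp by offset_ms, never going below zero."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_srt(text, offset_ms):
    """Return the subtitle text with every timestamp moved by offset_ms."""
    return TIMESTAMP.sub(lambda mt: _shift_one(mt, offset_ms), text)


if __name__ == "__main__":
    # Delay a cue by one and a half seconds.
    print(shift_srt("00:01:02,500 --> 00:01:04,000", 1500))
```

A positive offset delays the subtitles and a negative one makes them appear earlier.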

If the program you’re using for this chokes on Chinese characters, though, you’ll need to take a couple of extra steps. First, convert the Chinese characters to Unicode numerical character references using either Pinyin Info’s NCR conversion tool or Wenlin (full or demo version). The reason for this is that even synchronizers that screw up “李慕白” should be able to handle the NCR equivalent: “&#26446;&#24917;&#30333;”.

In Wenlin,
Edit –> Make transformed copy –> Encode &#; [decimal]

Take the NCR text and synchronize the files. After you get this taken care of, reconvert to Chinese characters.

In Wenlin,
Edit –> Make transformed copy –> Decode &#;
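
For anyone who would rather script the round trip, here is a minimal Python sketch of the same idea. It is not how Wenlin or the Pinyin Info tool works internally. The function names are mine, and it handles only decimal NCRs.

```python
import re


def encode_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR such as &#26446;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def decode_ncr(text: str) -> str:
    """Turn decimal NCRs back into characters; everything else is left alone."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    encoded = encode_ncr("李慕白")
    print(encoded)              # &#26446;&#24917;&#30333;
    print(decode_ncr(encoded))  # 李慕白
```

Because plain ASCII passes through untouched, the timestamps and line numbers in a subtitle file survive both directions unchanged.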

3. Identify names of the movie’s characters

You must teach your software to know which strings of Hanzi represent names. For example, it’s crucial for clarity that the character name “李慕白” is written “Lǐ Mùbái” rather than as “lǐ mù bái”. This part takes some time up front. But do not skip this step, because it is not only crucial but will save a lot of trouble in the long run.

Before doing this, however, people may want to refamiliarize themselves with Hanyu Pinyin’s rules for proper nouns (PDF). Note especially what is supposed to be capitalized and what isn’t.

The Mandarin version of Wikipedia is one resource that can be helpful in identifying the names of at least the main characters in the movie. But you’ll want to look for more names and forms than will be listed there. Keep in mind that characters aren’t always addressed by their full names. You need to look for other forms as well (e.g., in Crouching Tiger, Hidden Dragon Li Mubai is sometimes referred to as “Li Mubai” but other times as “Li ye” or simply as “Mubai”) and enter them.

English subtitles can be very useful for locating most proper nouns in the text. (Hooray for word parsing and capitalization of proper nouns!) The following search of an English subtitle file should help pinpoint the location of proper nouns.

find (with “Match Case” and “Use Regular Expressions” checked):
[^\.] [A-Z][a-z]

in MS Word, find (with “Use wildcards” checked):
[!\.] [A-Z][a-z]

Since you’ve already synchronized your subtitles, you’ll easily be able to find the corresponding point in the Mandarin subtitles by looking at the time the line appears.
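
The same proper-noun hunt can be run outside a text editor. The Python sketch below is only one way to do it, and the function name is my own invention. It mirrors the search above by looking for a capitalized word that follows a space and some character other than a period, then counts how often each candidate turns up.

```python
import re
from collections import Counter

# A capitalized word preceded by "<anything but a period><space>",
# i.e. a capital that is probably not just the start of a sentence.
CANDIDATE = re.compile(r"(?<=[^.] )([A-Z][a-z]+)")


def candidate_names(lines):
    """Count mid-sentence capitalized words in an iterable of subtitle lines."""
    counts = Counter()
    for line in lines:
        counts.update(CANDIDATE.findall(line))
    return counts


if __name__ == "__main__":
    sample = ["I saw Li Mubai yesterday. He left.", "Where is Mubai now?"]
    for word, n in candidate_names(sample).most_common():
        print(word, n)
```

Expect false positives (words after question marks, capitalized titles, and so on); the point is just to produce a short list to check by hand against the Mandarin subtitles.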

As you gather the names, or after you compile the full list, add your findings to the Pinyin converter’s user dictionary. In Key, perform Language –> Add Record, then fill in the Hanzi and Pinyin fields.

4. Perform initial conversion to Pinyin

OK, I know you’re eager to run the conversion and see all of those Hanzi turn into lovely Hanyu Pinyin. But there’s one quick step you need to do first. If you’re using Key Chinese, the program won’t make use of all of those character names you just painstakingly added to the user dictionary unless you first run “linguistic reconstruction” on the subtitles you wish to convert:
Language –> Linguistic Reconstruction

Now you’re ready for the big step:
Language –> Convert to Pinyin

5. Double check the results and perform necessary cleanup

Unfortunately, most Pinyin converters — even the best — tend to be lazy about inserting spaces in some of the places they belong, such as around numeric and alphabetic strings. For example, “自3月22日（星期一）起至5月31日（星期一）” will generally convert to something that looks like this:
“zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)”.
But it should look like this:
“zì 3 yuè 22 rì (Xīngqīyī) qǐ zhì 5 yuè 31 rì (Xīngqīyī)”.

To fix this in your Pinyin text, run the following regular expression in EmEditor. Make sure “Match Case” is not checked.

\1 \2 \3

If you do this in Word, you’ll need to use the following instead in your wildcard search.

\1 \2 \3
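
For those working outside EmEditor or Word, the spacing fix can also be sketched in Python. This is my own pattern, not necessarily the expression the editors above would use: it simply puts a space wherever a letter (with or without a tone mark) touches a digit.

```python
import re
import unicodedata

# Any Unicode letter: a word character that is neither a digit nor an underscore.
LETTER = r"[^\W\d_]"


def space_out_numbers(pinyin: str) -> str:
    """Insert a space between letters and adjacent digits, in both directions."""
    # Compose base letters and combining tone marks so each vowel is one character.
    text = unicodedata.normalize("NFC", pinyin)
    text = re.sub(rf"({LETTER})(\d)", r"\1 \2", text)  # letter followed by digit
    text = re.sub(rf"(\d)({LETTER})", r"\1 \2", text)  # digit followed by letter
    return text


if __name__ == "__main__":
    print(space_out_numbers("zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)"))
```

Note that this will also split things that really should stay together, such as “MP3”, so a read-through afterward is still needed.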

The rest of the cleanup work usually involves simply reading through the text, looking for errors, perhaps while listening to the movie.

6. Create additional version without tone marks

If you have Key, this is very easy: Highlight the entire text, then
Format –> Strip Tone Marks.

And you’re done, though because Key keeps u-umlaut as such, if your television or other device doesn’t show the letter ü correctly you may wish to convert “ü” to “v”.
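
Tone marks can also be stripped with a short script. The following Python sketch is not what Key does internally, and the function and parameter names are my own. It decomposes each letter, throws away the four Pinyin tone diacritics, and leaves the umlaut on ü alone unless asked to turn it into “v”.

```python
import unicodedata

# Combining macron, acute, caron, and grave: tones 1 through 4.
TONE_MARKS = {"\u0304", "\u0301", "\u030C", "\u0300"}


def strip_tone_marks(text: str, umlaut_to_v: bool = False) -> str:
    """Remove Pinyin tone marks; optionally write ü as v for limited displays."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    result = unicodedata.normalize("NFC", kept)  # recombine u + umlaut into ü
    if umlaut_to_v:
        result = result.replace("ü", "v").replace("Ü", "V")
    return result


if __name__ == "__main__":
    print(strip_tone_marks("Lǐ Mùbái"))
    print(strip_tone_marks("nǚ'ér", umlaut_to_v=True))
```

One caveat: any acute or grave accents in non-Pinyin text (French or Spanish names in the subtitles, say) would be stripped as well.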

If you don’t have Key or access to another program that can do the same thing as easily, then use a combination of Wenlin (again, even the demo will do what you need) and a text editor. First, paste your Pinyin text into Wenlin. Then select all of the text and perform
Edit –> Make transformed copy… –> Replace tone marks with 1-4

Copy and paste the results into a new document in your text editor. Then run the following search-and-replace. Make certain the “Use Regular Expressions” or “Use Wildcards” box is checked.


replace with:
Then click “Replace All”.

What this looks like in EmEditor:
image showing the search-and-replace dialog box for the above

What this looks like in MS Word:

7. Share your work

It’s much better if people can concentrate on producing new material rather than having to redo things others have already taken care of. So if you make a good Hanyu Pinyin version of something, please let me know.

Pinyin subtitles for ‘The Story of Stuff’

screenshot from the video, showing Pinyin subtitles: Shìde, shìde, shìde: wǒmen quándōu yào huíshōu, kěxī guāng kào huíshōu hái bùgòu.

The Story of Stuff is a 20-minute video on the costs and absurdities of having a culture wrapped up in unchecked consumerism. It gained especially wide attention after the New York Times published a front-page article about it. A related book was released earlier this month.

The entire video can be downloaded freely in high- and low-resolution versions. And now there’s a collection of subtitles of possible interest to many readers of Pinyin News.

The zip file contains seven subtitle files:

  • Hanyu Pinyin with tone marks
  • Hanyu Pinyin without tone marks
  • traditional Chinese characters (Unicode)
  • traditional Chinese characters (Big5)
  • simplified Chinese characters (Unicode)
  • simplified Chinese characters (GB)
  • English

The star of the video, author Annie Leonard, has a lot to get through in just 20 or so minutes, so many people may find it easier, at least at first, to read the Pinyin subtitles that do not include tone marks.

Google Maps switches to Hanyu Pinyin for Taiwan (sloppily)

Until very recently, Google Maps gave street names in Taiwan in Tongyong Pinyin — most of the time, at least. This was the case even for Taipei, which most definitely has long used Hanyu Pinyin, not Tongyong Pinyin. The romanization on Google Maps was really a hodgepodge in the maps of Taiwan. And it’s still kind of a mess; but now it’s at least more consistent — and more consistent in Hanyu Pinyin.

First the good. In Google Maps:

  • Hanyu Pinyin, not Tongyong Pinyin, is now used for street names throughout Taiwan
  • Tone marks are indicated. (Previous maps with Tongyong did not indicate tones.)

Now the bad, and unfortunately there’s a lot of it and it’s very bad indeed:

  • The Hanyu Pinyin is given as Bro Ken Syl La Bles. (Terrible! Also, this is a new style for Google Maps. Street names in Tongyong were styled properly: e.g., Minsheng, not Min Sheng.)
  • The names of MRT stations remain incorrectly presented. For example, what is referred to in all MRT stations and on all MRT maps as “NTU Hospital” is instead referred to in broken Pinyin as “Tái Dà Yī Yuàn” (in proper Pinyin this would be Tái-Dà Yīyuàn); and “Xindian City Hall” (or “Office” — bleah) is marked as Xīn Diàn Shì Gōng Suǒ (in proper Pinyin: “Xīndiàn Shìgōngsuǒ” or perhaps “Xīndiàn Shì Gōngsuǒ”). Most but not all MRT stations were already labeled this incorrect way (in Hanyu Pinyin rather than Tongyong) in Google Maps.
  • Errors in romanization point to sloppy conversions. For example, an MRT station in Banqiao is labeled Xīn Bù rather than as Xīnpǔ. (埔 is one of those many Chinese characters with multiple Mandarin pronunciations.)
  • Tongyong Pinyin is still used in the names of most cities and townships (e.g., Banciao, not Banqiao).

Screenshot from earlier this evening, showing that Tongyong Pinyin is still being used in Google Maps for some city and district names (e.g., Gueishan, Sinjhuang, Banciao, Jhonghe, Sindian, and Jhongjheng rather than Hanyu Pinyin’s Guishan, Xinzhuang, Banqiao, Zhonghe, Xindian, and Zhongzheng, respectively).
map of Taipei area, with names as shown above

I don’t have any old screenshots of my own available at the moment, so for now I’ll refer you to an image that Fili used in an old post of his. Compare that with this screenshot I took a few minutes ago from Google Maps of the same section of Tainan:

Note especially how the name of the junior high school is presented.

  • Previously “Jian Xing Junior High School”.
  • Now “Jiàn Xìng Jr High School”.

This is typical of how in old maps some things were labeled (poorly) in Hanyu Pinyin. (Words, not bro ken syl la bles, are the basis for Pinyin orthography. This is a big deal, not a minor error.) And now such places are still labeled poorly in Hanyu Pinyin, but with the addition of tone marks.

I’d like to return to the point earlier on sloppy conversions. Surprisingly, 成都路 is given as “Chéng Doū Road” rather than as “Chéngdū Road”.
screenshot from Google Maps of 'Cheng Dou [sic] Rd', near Taipei's Ximending
Although “Xinpu” might not be the sort of name to be contained in some romanization databases, there is nothing in the least obscure about Chengdu, the name of a city of some 11 million people. Google Translate certainly knows the right thing to do with 成都路:
screenshot from Google Translate, showing how Google will translate '成都路' as 'Chengdu Rd'

But Google Maps doesn’t get this simple point right, which likely points to outsourcing. Why would Google do this? And why wouldn’t it ensure that a better job was done? Because, really, so far the long-overdue conversion to Hanyu Pinyin in Google Maps for Taiwan is something of a botch.