Pinyin sort order

The standard for alphabetically sorting Hanyu Pinyin is given in the ABC dictionary series edited by John DeFrancis and issued by the University of Hawaii Press.

Here’s the basic idea:

The ordering is primarily alphabetical. Diacritical marks, punctuation, juncture, and capitalization are taken into account only when the strings being compared are otherwise identical. For example, píng’ān sorts before pīnyīn, because pingan sorts before pinyin, because g precedes y alphabetically.

Only when two strings are alphabetically identical is non-alphabetical information taken into account.
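For readers who want to try this first step in code, here is a minimal Python sketch of the "letters only" comparison. The function name `base_letters` is my own invention, not anything from the ABC series, and the sketch deliberately folds ü into u, which is a simplification the full scheme does not make.

```python
import unicodedata


def base_letters(entry):
    """Return only the letters of a Pinyin string, lower-cased, without diacritics."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", entry)
    # Combining marks, apostrophes, hyphens and spaces are not alphabetic, so they drop out.
    return "".join(ch.lower() for ch in decomposed if ch.isalpha())


words = ["pīnyīn", "píng’ān"]
print(sorted(words, key=base_letters))
```

Sorting on this key alone is only the primary comparison; ties between strings with the same letters still need the tone, apostrophe, and capitalization rules.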

The series’ Reader’s Guide presents the specifics of the sort order. Since I don’t have to worry about how much space this takes up on my site, I have reformatted the information slightly to give the examples as numbered lists.

Head entry transcriptions with the same sequence of letters are ordered first strictly by letter sequence regardless of tones, then by initial syllable tone in the sequence 0 1 2 3 4. For entries with the same initial tone, arrangement is by the tone of the second syllable, again in the order 0 1 2 3 4. For example:

  1. shīshi
  2. shīshī
  3. shīshí
  4. shīshǐ
  5. shīshì
  6. shíshī
  7. shíshì
  8. shǐshī
  9. shìshī

Irrespective of tones, entries with the vowel u precede those with ü.
For example:

  1. l?
  2. l?
  3. l?
  4. l?
  1. n?

Entries without apostrophe precede those with apostrophe. For example:

  1. biàn (argue)
  2. bǐ’àn (the other shore)

Lower-case entries precede upper-case entries. For example:

  1. hòujìn (aftereffect)
  2. Hòu Jìn (Later Jin dynasty)

For entries with identical spelling, including tones, arrangement is by order of frequency….

For most users, the most important thing to note is that the neutral tone is regarded as 0, not as 5. Thus, the order is not “ā á ǎ à a,” but “a ā á ǎ à.” And, because lowercase comes before uppercase, not “A a Ā ā Á á Ǎ ǎ À à” but “a A ā Ā á Á ǎ Ǎ à À.”
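The rules above can be sketched as a single sort key in Python. This is only an illustration under stated assumptions, not the ABC editors' implementation. The name `pinyin_sort_key` is hypothetical. The priority order (letters, then u before ü, then apostrophes, then tones, then case) is my reading of the examples above. Each run of vowels is treated as one syllable for the purpose of assigning tones. Hyphens and spaces are ignored apart from ending a syllable, and the frequency ordering for identical spellings is not handled.

```python
import unicodedata

# Combining marks produced by NFD normalization, mapped to tone numbers.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
DIAERESIS = "\u0308"              # the two dots of ü
APOSTROPHES = {"'", "\u2019"}
VOWELS = set("aeiou")


def pinyin_sort_key(entry):
    letters = []   # base letters, lower-cased, no diacritics
    umlaut = []    # per letter: True if the letter is ü
    apos = []      # per letter: True if an apostrophe precedes it
    upper = []     # per letter: True if upper-case
    tones = []     # per vowel run (assumed to equal one syllable): 0-4
    in_vowel_run = False
    pending_apostrophe = False

    for ch in unicodedata.normalize("NFD", entry):
        if ch in TONE_MARKS:
            if tones and in_vowel_run:
                tones[-1] = TONE_MARKS[ch]
        elif ch == DIAERESIS:
            if umlaut:
                umlaut[-1] = True
        elif ch.isalpha():
            low = ch.lower()
            if low in VOWELS:
                if not in_vowel_run:
                    tones.append(0)   # neutral tone (0) unless a mark follows
                in_vowel_run = True
            else:
                in_vowel_run = False
            letters.append(low)
            umlaut.append(False)
            apos.append(pending_apostrophe)
            upper.append(ch.isupper())
            pending_apostrophe = False
        else:
            # Apostrophes, hyphens and spaces all end the current syllable.
            in_vowel_run = False
            if ch in APOSTROPHES:
                pending_apostrophe = True

    # Tuples compare left to right, so earlier fields take priority.
    return ("".join(letters), tuple(umlaut), tuple(apos),
            tuple(tones), tuple(upper))


print(sorted(["Á", "à", "A", "ā", "a", "á"], key=pinyin_sort_key))
```

Because False sorts before True and 0 before 1 through 4, entries without ü, without apostrophes, in the neutral tone, and in lowercase each come first within their level.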

One can see this in action in the A entries for the ABC English-Chinese, Chinese-English Dictionary. And here are some sample pages from an earlier ABC dictionary.

The ABC series follows the example of the Hanyu Pinyin Cihui (汉语拼音词汇 / Hànyǔ Pīnyīn Cíhuì) (example), with only one minor difference, as noted by Tom Bishop:

HPC [Hanyu Pinyin Cihui] gave hyphens and spaces the same priority as apostrophes, so that lìgōng sorted before lǐ-gōng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting lǐ-gōng before lìgōng. In ABC, hyphens and spaces don’t affect the sort order unless they change the pronunciation in the same way that apostrophe would; for example, ¹míng-àn ?? and ²míng’àn ?? are treated as homophones, and they sort after mǐngǎn ??.
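One way to read the ABC treatment of hyphens and spaces is as a preprocessing step before sorting. The Python sketch below is my own guess at such a step, not something taken from the ABC dictionaries. It assumes a hyphen or space "changes the pronunciation in the same way that apostrophe would" exactly when the next syllable begins with a, o, or e, which is where Hanyu Pinyin requires an apostrophe. The name `normalize_juncture` is hypothetical.

```python
import unicodedata

SEPARATORS = {"-", " "}
APOSTROPHE = "\u2019"


def normalize_juncture(entry):
    """Turn hyphens and spaces into apostrophes where Pinyin would need one; drop the rest."""
    out = []
    for i, ch in enumerate(entry):
        if ch in SEPARATORS:
            nxt = entry[i + 1] if i + 1 < len(entry) else ""
            # Strip any tone mark from the following letter before checking it.
            base = unicodedata.normalize("NFD", nxt)[:1].lower()
            if out and base in ("a", "o", "e"):
                out.append(APOSTROPHE)
            continue
        out.append(ch)
    return "".join(out)


print(normalize_juncture("míng-àn"))   # same form as the apostrophe spelling
print(normalize_juncture("lǐ-gōng"))   # hyphen simply dropped
```

After this step, míng-àn and míng’àn become the same string, while lǐ-gōng is compared as if it had no hyphen at all.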

New database of cross-strait differences in Mandarin goes online

Last week, on the same day President Ma Ying-jeou accepted the resignation of a minister who made some drunken lewd remarks at a wěiyá (year-end office party), Ma was joking to the media about blow jobs.

Classy.

screenshot from a video of a news story on this

But it was all for a good cause, of course. You see, the Mandarin expression chuī lǎba, when not referring to the literal playing of a trumpet, is usually taken in Taiwan to refer to a blow job. But in China, Ma explained, chuī lǎba means the same thing as the idiom pāi mǎpì (pat/kiss the horse’s ass; i.e., flatter). And now that we have the handy-dandy Zhōnghuá Yǔwén Zhīshikù (Chinese Language Database), which Ma was announcing, we can look up how Mandarin differs in Taiwan and China, and thus not get tripped up by such misunderstandings. Or at least that’s supposed to be the idea.

The database, which is the result of cross-strait cooperation, can be accessed via two sites: one in Taiwan, the other in China.

It’s clear that a lot of money has been spent on this. For example, many entries are accompanied by well-documented, precise explanations by distinguished lexicographers. Ha! Just kidding! Many entries are really accompanied by videos (some two hundred of them) of cutesy puppets gabbing about cross-strait differences in Mandarin expressions. But if there’s a video in there of the panda in the skirt explaining to the sheep in the vest that a useful skill for getting ahead in Chinese society is chuī lǎba, I haven’t found it yet. Will NMA take up the challenge?

Much of the site emphasizes not so much language as Chinese characters. For example, another expensively produced video feeds the ideographic myth by showing off obscure Hanzi, such as the one for chěng.

WARNING: The screenshot below links to a video that contains scenes with intense wawa-ing and thus may not be suitable for anyone who thinks it’s not really cute for grown women to try to sound like they’re only thwee-and-a-half years old.

cheng3

In a welcome bit of synchronicity, Victor Mair posted on Language Log earlier the same week on the unpredictability of Chinese character formation and pronunciation, briefly discussing just such patterns of duplication, triplication, etc.

Mair notes:

Most of these characters are of relatively low frequency and, except for a few of them, neither their meanings nor their pronunciations are known by persons of average literacy.

Many more such characters consisting of two, three, or four repetitions of the same character exist, and their sounds and meanings are in most cases equally or more opaque.

The Hanzi for chěng (which looks like ??? run together as one character) in the video above is sufficiently obscure that it likely won’t be shown correctly in many browsers on most systems when written in real text: ????. But never fear: It’s already in Unicode and so should be appearing one of these years in a massively bloated system font.

Further reinforcing the impression that the focus is on Chinese characters, Liú Zhàoxuán, who is the head of the association in charge of the project on the Taiwan side, equated traditional Chinese characters with Chinese culture itself and declared that getting the masses in China to recognize them is an important mission. (Liu really needs to read Lü Shuxiang’s “Comparing Chinese Characters and a Chinese Spelling Script — an evening conversation on the reform of Chinese characters.”)

Then he went on about how Chinese characters are a great system because, supposedly, they have a one-to-one correspondence with language that other scripts cannot match and people can know what they mean by looking at them (!) and that they therefore have a high degree of artistic quality (gāodù de yìshùxìng). Basically, the person in charge of this project seems to have a bad case of the Like Wow syndrome, which is not a reassuring trait for someone in charge of producing a dictionary.

The same cooperation that built the Web sites led to a new book, Liǎng’àn Měirì Yī Cí (《兩岸每日一詞》 / Roughly: Cross-Strait Term-a-Day Book), which was also touted at the press conference.

The book contains Hanyu Pinyin, as well as zhuyin fuhao. But, alas, the book makes the Pinyin look ugly and fails completely at the first rule of Pinyin: use word parsing. (In the online images from the book, such as the one below, all of the words are se pa ra ted in to syl la bles.)

The Web site also has ugly Pinyin, with the CSS file for the Taiwan site calling for Pinyin to be shown in SimSun, which is one of the fonts it’s better not to use for Pinyin. But the word parsing on the Web site is at least not always wrong. Here are a few examples.

  • “跑神兒” is given as pǎoshénr (good).
  • And apostrophes appear to be used correctly: e.g., fàn’?n (??), ch?n’?n (??), and f?i’?n (??).
  • But “第二春” is run together as “dìèrchūn” (no hyphen) rather than written correctly as dì-èr chūn.
  • And “一個頭兩個大” is given as yíge tóu liǎngge dà (for Taiwan) and yīge tóu liǎngge dà (for China). But ge is supposed to be written separately. (The variation of tone for yi is in this case useful.)

Still, my general impression from this is that we should not expect the forthcoming cross-strait dictionary to be very good.

Further reading:

How to handle ‘de’ and interjections in Hanyu Pinyin

Today’s selection from Yin Binyong’s Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》) deals with how to write Mandarin’s various de’s, mood particles, and interjections.

This reading is available in two versions:

  • simplified Chinese characters: ???? ????? (zhùcí, tàncí)
  • traditional Chinese characters: ???? ?????

I’ve already written about the principles in previous posts. For example, see

How to write numbers and measure words in Hanyu Pinyin

Today’s selection from Yin Binyong’s Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》) is about writing numbers and measure words.

This reading is available in two versions:

For more on this, see these posts and the PDFs linked to therein.

How to write verbs in Hanyu Pinyin (Mandarin text)

cover image for the book

Here’s the first of several selected readings from Yin Binyong’s Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》). It covers the writing of verbs.

This reading is available in two versions:

  • simplified Chinese characters: ???? ??
  • traditional Chinese characters: ???? ??

For those who would like to read about this in English, see

important book on Pinyin to be excerpted on this site

Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》) is the second of Yin Binyong’s two books on Pinyin orthography. The first, Chinese Romanization: Pronunciation and Orthography, is in English and Mandarin; much of it is already available here on Pinyin.Info.

Although Xinhua Pinxie Cidian is only in Mandarin, the large number of examples makes it easy to get the point even if you may not read Mandarin in Chinese characters very well.

This week I will begin posting some excerpts from this invaluable work. What’s more, I have made a version in traditional Chinese characters, which I hope that readers in Taiwan, Hong Kong, and elsewhere will take advantage of. So those not used to reading simplified Chinese characters will have a choice (which is more than the government of Taiwan is providing these days).

I’m extremely happy to be able to bring you this information and wish to acknowledge the generosity of the Commercial Press. Stay tuned.

Oxford Chinese Dictionary goes online

Oxford University Press has just announced that its massive Oxford Chinese Dictionary is now available through its Oxford Language Dictionaries Online subscription service.

I haven’t seen the online version yet myself; but from the publisher’s description it appears to be largely the same as the published edition, whose paucity of Pinyin is disappointing. The publisher, however, is promising that “Pinyin will be added to all Chinese translations” in November, which should be a major step forward.

Perhaps some of you at universities have institutional access. I would welcome reports.

source: What’s New, Oxford Language Dictionaries Online, May 2011.

Wenlin releases major upgrade (4.0)

One of my favorite programs, Wenlin (which bills itself as “software for learning Chinese”), has just released a major upgrade for both Mac and Windows versions. This doesn’t happen often; it has been three-and-a-half years since the most recent big change was issued (Wenlin 3.4) and heaven only knows how long since 3.0 came out. So, yes, this release has many substantial improvements.

One of the features nearest and dearest to my heart is that Wenlin 4.0 features greatly improved handling of Pinyin. I was among the field testers for the new version, so I’ve already spent a lot of time examining this feature. Here are a few important aspects of this:

  • Conversions from Chinese characters follow Hanyu Pinyin orthography much more closely than before. This is a major change for the better. (There’s still some room for improvement. But I don’t think we’ll have to wait years for this.)
  • In the past, using Wenlin to convert long texts in Chinese characters into Pinyin could be a real chore, with users having to examine example after example of Chinese characters with multiple pronunciations in order to select the proper pronunciation for that particular context. But now users may, if they so desire, tell Wenlin not to ask for disambiguation input. Of course, that doesn’t mean that Wenlin will always guess right; but many users will be happy that this trade-off allows them to skip the frustration of, for example, having to tell the program over and over and over that, yes, in this case 說 is pronounced shuō rather than shuì.
  • Relative newcomers to Mandarin may appreciate that for common words tone sandhi is indicated in Wenlin with additional marks (a dot or line below the vowel). This feature can also be turned off, for those who want standard Pinyin.

There are, of course, many improvements beyond the area of Pinyin. Here are a few:

  • One limitation of Wenlin 3.x was that its English dictionary wasn’t very large. But Wenlin 4.0 includes not only the ABC Chinese-English Comprehensive Dictionary but also the excellent new ABC English-Chinese, Chinese-English Dictionary (now finally in stock in the printed version).
  • The flashcards are now set up to handle not just individual characters but polysyllabic words.
  • There’s full Unicode Unihan 6.0 support for more than 75,000 Chinese characters.
  • And for those who think 75,000 just isn’t enough, users can now access Wenlin’s CDL technology. Through this, users can create new, variant, and rare characters; moreover, these can be published and shared with other Wenlin users or CDL-friendly devices.
  • Seal script versions of more than 11,000 characters are provided.
  • Wenlin contains an e-edition of the Shuowen Jiezi (Shuōwén Jiězì / 说文解字 / 說文解字).
  • Coders will be interested to know that Wenlin appears to be headed toward becoming open-source.
  • Both Mandarin and English entries are marked with grade levels, which aids learners by indicating relative frequency of use. The levels for Mandarin words are based on the Hanyu Shuiping Kaoshi (Hànyǔ Shuǐpíng Kǎoshì / 汉语水平考试 / 漢語水平考試 / HSK).

The full version (i.e., the CD with the program comes in a box and is likely packaged with a hard copy of the manual) is US$199, or US$179 if you download it from the Wenlin Web store. Upgrades from 3.x cost US$49.

For more information, see the summary of features and outline of what’s new in Wenlin 4.0.
