Pinyin sort order

The standard for alphabetically sorting Hanyu Pinyin is given in the ABC dictionary series edited by John DeFrancis and issued by the University of Hawaii Press.

Here’s the basic idea:

The ordering is primarily alphabetical. Diacritical marks, punctuation, juncture, and capitalization are taken into account only when the strings being compared are otherwise identical. For example, píng’ān sorts before pīnyīn, because pingan sorts before pinyin, because g precedes y alphabetically.

Only when two strings are alphabetically identical is non-alphabetical information taken into account.

The series’ Reader’s Guide presents the specifics of the sort order. Since I don’t have to worry about how much space this takes up on my site, I have reformatted the information slightly to give the examples as numbered lists.

Head entry transcriptions with the same sequence of letters are ordered first strictly by letter sequence regardless of tones, then by initial syllable tone in the sequence 0 1 2 3 4. For entries with the same initial tone, arrangement is by the tone of the second syllable, again in the order 0 1 2 3 4. For example:

  • shīshi
  • shīshī
  • shīshí
  • shīshǐ
  • shīshì
  • shíshī
  • shíshì
  • shǐshī
  • shìshī

    Irrespective of tones, entries with the vowel u precede those with ü.
    For example:

    Entries without apostrophe precede those with apostrophe. For example:

    1. biàn (argue)
    2. bǐ’àn (the other shore)

    Lower-case entries precede upper-case entries. For example:

    1. hòujìn (aftereffect)
    2. Hòu Jìn (Later Jin dynasty)

    For entries with identical spelling, including tones, arrangement is by order of frequency….

    For most users, the most important thing to note is that the neutral tone is regarded as 0, not as 5. Thus, the order is not “ā á ǎ à a” but “a ā á ǎ à.” And, because lowercase comes before uppercase, not “A a Ā ā Á á Ǎ ǎ À à” but “a A ā Ā á Á ǎ Ǎ à À.”

    One can see this in action in the A entries for the ABC English-Chinese, Chinese-English Dictionary. And here are some sample pages from an earlier ABC dictionary.

    The ABC series follows the example of the Hanyu Pinyin Cihui (汉语拼音词汇 / 漢語拼音詞彙 / Hànyǔ Pīnyīn Cíhuì) (example), with only one minor difference, as noted by Tom Bishop:

    HPC [Hanyu Pinyin Cihui] gave hyphens and spaces the same priority as apostrophes, so that lìgōng sorted before lǐ-gōng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting lǐ-gōng before lìgōng. In ABC, hyphens and spaces don’t affect the sort order unless they change the pronunciation in the same way that apostrophe would; for example, ¹míng-àn 明暗 and ²míng’àn 冥暗 are treated as homophones, and they sort after mǐngǎn 敏感.
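
The ABC ordering described above can be sketched as a sort key in Python. This is only a rough illustration, not the dictionary's own procedure, and it rests on several assumptions: the function name `pinyin_sort_key` is made up here; each run of vowels is treated as one syllable (so syllabic consonants are not handled); the relative priority of the tie-breakers (u before ü, then apostrophe-like boundaries, then tones, then case) is inferred from the examples in this post; and the final tie-breaker, frequency, is not modeled at all.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_OF_MARK = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
DIAERESIS = "\u0308"
APOSTROPHES = {"'", "\u2019"}
SOFT_SEPARATORS = {"-", " "}  # count only when they act like an apostrophe
VOWELS = set("aeiou")


def pinyin_sort_key(text):
    """Build a tuple key: letters, then u/ü, boundaries, tones, case."""
    letters = []   # toneless, lower-cased letters, with ü folded to u
    umlauts = []   # 1 where the letter was ü, else 0
    breaks = []    # 1 where an apostrophe-like boundary precedes the letter
    tones = []     # one tone (0-4) per run of vowels; 0 is the neutral tone
    cases = []     # 1 where the letter was upper-case, else 0
    pending = None          # last separator seen, not yet tied to a letter
    in_vowel_run = False
    for ch in unicodedata.normalize("NFD", text):
        if ch in TONE_OF_MARK:
            if in_vowel_run:
                tones[-1] = TONE_OF_MARK[ch]
        elif ch == DIAERESIS:
            if letters and letters[-1] == "u":
                umlauts[-1] = 1
        elif ch in APOSTROPHES or ch in SOFT_SEPARATORS:
            pending = ch
            in_vowel_run = False
        elif ch.isalpha():
            base = ch.lower()
            hard = pending in APOSTROPHES
            # Assumption: a hyphen or space matters only before a, o, or e,
            # where an apostrophe would otherwise be required.
            soft = pending in SOFT_SEPARATORS and base in "aoe"
            letters.append(base)
            umlauts.append(0)
            breaks.append(1 if hard or soft else 0)
            cases.append(1 if ch.isupper() else 0)
            pending = None
            if base in VOWELS:
                if not in_vowel_run:
                    tones.append(0)
                    in_vowel_run = True
            else:
                in_vowel_run = False
    return ("".join(letters), tuple(umlauts), tuple(breaks),
            tuple(tones), tuple(cases))


words = ["shìshī", "shīshí", "shīshi", "Hòu Jìn", "hòujìn", "bǐ’àn", "biàn"]
print(sorted(words, key=pinyin_sort_key))
```

Because the toneless letter string comes first in the tuple, everything else is consulted only when two entries are spelled with the same letters, which is the basic idea of the ABC order.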

    Not the same sound

    Today’s New York Times exhibits one of my pet peeves. (Yes, I do seem to have a lot of those.)

    This particular one is the practice of declaring that some Mandarin word or expression has “the same sound” as something else — even though it doesn’t. Claiming that the Mandarin words for death and four sound identical is a frequent example of this.

    So today we have this:

    Consider Tide detergent, Taizi, whose Chinese characters literally mean “gets rid of dirt.” (Characters are important: the same sound written differently could mean “too purple.”)

    Nope. The Mandarin name for Tide detergent is Tàizì. On the other hand, “too purple” would be “tài zǐ,” which is close but not the same.

    Tàizì ≠ tài zǐ

    So, the answer to the question “When is a homophone not a homophone?” is “When it’s not a @#$%! homophone.”

    But I will give the Times points for not mentioning wax tadpoles.

    source: Picking Brand Names in China Is a Business Itself, New York Times, November 11, 2011

    Now on Weishenme Zhongwen zheme TM nan?

    Earlier this year a Mandarin translation of David Moser’s classic essay Why Chinese Is So Damn Hard appeared on the Web. And then it disappeared. With the permission of both the translator and the original author, I’m placing this work back online.

    It’s available here in two versions:


    Maybe I’ll make a Pinyin version too one of these years.

    Google Web fonts and Hanyu Pinyin

    Back in the last century, getting Web browsers to correctly display Pinyin was such a troublesome task that I remember once even employing GIFs of first- and third-tone letters to get those to look right. So there were a whole lotta IMG tags in my text. Sure, I put the necessary info in ALT tags (e.g., “alt='a3'”), just in case. But, still, I shudder to recall having to resort to that particular hack.

    Things are better now, though still far from ideal. Not all visitors to a website have the font you may wish to use. One thing that promises to improve this situation considerably is CSS3’s @font-face, which allows those creating Web pages to employ fonts that are provided online. Google is helping with this through its Google Web Fonts. (Current count: 252 font families.)

    But is anything in Google’s collection capable of dealing with Hanyu Pinyin? Armed with a handy-dandy Pinyin pangram, I had a look at what Google has made available.

    Not surprisingly, most of the 29 font families marked as offering the “Latin Extended” character set failed to handle the entire Hanyu Pinyin set. The group of ü letters with tone marks (ǖ, ǘ, ǚ, ǜ) is the most likely to be unsupported at present, with third-tone vowels also frequently missing.
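
For anyone who wants to repeat this kind of check, here is a minimal Python sketch of the idea. It is not how I tested the fonts above; it simply lists the tone-marked letters Pinyin needs and reports which ones are absent from a given set of code points. The function names are made up for illustration, and how you obtain the set of code points a font supports is left to you (the demo at the end just uses the Latin-1 range as a stand-in).

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = ["\u0304", "\u0301", "\u030c", "\u0300"]
BASE_VOWELS = "aeiouü"


def pinyin_tone_letters():
    """Return the precomposed tone-marked vowels of Hanyu Pinyin, plus ü and Ü."""
    letters = set("üÜ")
    for base in BASE_VOWELS + BASE_VOWELS.upper():
        for mark in TONE_MARKS:
            # NFC turns base letter + combining mark into a single code point.
            letters.add(unicodedata.normalize("NFC", base + mark))
    return letters


def missing_pinyin_letters(supported_code_points):
    """List the Pinyin letters whose code points are not in the given collection."""
    supported = set(supported_code_points)
    return sorted(ch for ch in pinyin_tone_letters() if ord(ch) not in supported)


# Stand-in for a font's coverage: the Latin-1 range only.
latin1_only = range(0x20, 0x100)
print("".join(missing_pinyin_letters(latin1_only)))
```

Run against Latin-1 alone, the missing letters are the first- and third-tone vowels and every tone-marked ü, which are the same groups that tend to be absent from fonts.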

    Here are the Google Web fonts that do support Hanyu Pinyin with tone marks:

    Serifs

    • EB Garamond (227 KB)
    • Gentium Basic (263 KB — and about the same for each of the three accompanying styles: italic, bold, bold italic)
    • Gentium Book Basic (267 KB — and about the same for each of the three accompanying styles: italic, bold, bold italic)
    • Neuton (56 KB — and about the same for each of the five accompanying styles: italic, bold, light, extra light, extra bold)

    screenshot of the Pinyin fonts above


    • Neuton has relatively weak tone marks, so I wouldn’t recommend it for Web pages aimed at beginning students of Mandarin.

    Sans Serifs

    • Andika (1.4 MB)
    • Ubuntu (350 KB) — available in eight styles

    screenshot of the Pinyin fonts above

    Some Ubuntu sample PDFs: Ubuntu regular, Ubuntu italic, Ubuntu bold, Ubuntu bold italic, Ubuntu light, Ubuntu light italic, Ubuntu medium, Ubuntu medium italic.

    Andika sample PDF.


    • Andika’s relatively large size (1.4 MB) makes it unsuitable for @font-face use because of download time. (Its license, however, would permit someone with the time and energy to crack it open and remove lots of the glyphs not needed for Pinyin, thus reducing the size.) More fundamentally, though, I don’t much like the look of it; but YMMV.

    Since Google is likely to expand the number of fonts it offers, I’m including the list of all 29 faces I tried for this experiment, which should make it easier for those wanting to test only new fonts. (It is possible, however, that Pinyin support will be added later to some fonts that fail in this area now. If anyone hears of any such changes, please let me know.) Faces marked “(supports Pinyin)” passed; everything else failed.

    Display Faces with Latin Extended (all fail)

    • Abril Fatface
    • Forum
    • Kelly Slab
    • Lobster
    • MedievalSharp
    • Modern Antiqua
    • Ruslan Display
    • Tenor Sans

    Handwriting Faces with Latin Extended (all fail)

    • Patrick Hand

    Serif Faces with Latin Extended

    • Cardo
    • Caudex
    • EB Garamond (supports Pinyin)
    • Gentium Basic (supports Pinyin)
    • Gentium Book Basic (supports Pinyin)
    • Neuton (supports Pinyin)
    • Playfair Display
    • Sorts Mill Goudy

    Sans Serif Faces with Latin Extended

    • Andika (supports Pinyin)
    • Anonymous Pro
    • Anton
    • Didact Gothic
    • Francois One
    • Istok Web
    • Jura
    • Open Sans
    • Open Sans Condensed
    • Play
    • Ubuntu (supports Pinyin)
    • Varela

    Additional resource: SIL Fonts for downloading (including the full versions of Andika and Gentium).

    Taiwanese romanization used for Hanzi input method

    Since I just posted about the new Hakka-based Chinese character input method, I would be remiss not to note as well the release early this year of a different Chinese character input method based on Taiwanese romanization.

    This one is available in Windows, Mac, and Linux flavors.

    See the FAQ and documents below for more information (Mandarin only).

    Táiwān Mǐnnányǔ Hànzì shūrùfǎ 2.0 bǎn xiàzài (臺灣閩南語漢字輸入法 2.0版下載) [Readers may wish to note the use of Minnan, which is generally preferred among unificationists and some advocates of Hakka and the languages of Taiwan’s tribes.]

    source: Jiàoyùbù Táiwān Mǐnnányǔ Hànzì shūrùfǎ (教育部臺灣閩南語漢字輸入法); Ministry of Education, Taiwan; June 16, 2010(?) / February 14, 2011(?) [Perhaps the Windows and Linux versions came first, with the Mac version following in 2011.]

    The where and why of missing second tones

    [Image: “zhong” written with first, second, third, and fourth tones, with the second-tone form in light gray instead of black]

    My previous post mentioned that not all tonal permutations exist in the real world. For example, modern standard Mandarin has zhōng, zhǒng, and zhòng, but doesn’t have zhóng. I did not, however, get into any of the reasons for the absence of second-tone zhong.

    Fortunately, my friend James E. Dew, who is much more qualified than I to discuss such fine points of linguistics, was kind enough to send in the explanation below. Jim used to teach the Chinese language and linguistics at the University of Michigan; and for many years he directed the Inter-University Program (a.k.a. the Stanford Center) in Taipei. He is also the author of 6000 Chinese Words: A Vocabulary Frequency Handbook and coauthor of Classical Chinese: A Functional Approach.

    Most simply stated, Mandarin syllable shapes with unaspirated occlusive initials and nasal finals don’t occur in second tone. This can be restated a bit less opaquely for those who have not studied Chinese historical phonology, as follows:

    Syllables that begin with unaspirated stops b, d, g, or affricates j, zh, z, and end in a nasal n or ng, as a rule don’t have second-tone forms. There are a few exceptions, such as béng (甭 / “needn’t”) and zán (咱 / “we”), which were new words formed by contraction (from búyòng and zámén, respectively) after the tone class split described below took place.

    This came about because when Middle Chinese (of Sui-Tang times) píngshēng 平声/平聲 split into yīnpíng 阴平/陰平 (modern Mandarin “first tone”) and yángpíng 阳平/陽平 (M “second tone”), syllables with aspirated initials went into the new yángpíng class, while those with unaspirated initials all fell into the yīnpíng (M first tone) group, thus leaving no unaspirated syllables with nasal finals in the modern Mandarin second tone class.

    An interesting corollary to this rule is that among Mandarin “open” syllables (those that end in a vowel) with the above-listed initials, almost all of the second-tone syllables derive from Middle Chinese rùshēng 入声/入聲, and their cognates have stop endings in the southern dialects that preserve rùshēng, as illustrated by the Cantonese examples given below.

    For those who like to pronounce what they read, Cantonese rùshēng syllables have level tones, either high, mid or low. In the Yale romanization used here, high tone is marked with a macron (e.g., dāk), mid tone is unmarked, and low tone is signified by an h following the vowel. A double “aa” sounds like the “a” in “father,” while a single “a” is a mid central vowel. Thus baht sounds like English “but” and dāk sounds like English “duck.”

    Mandarin  Cantonese  Characters
    bái       baahk
    báo       bohk
    bié       biht       別/别
    dí        dihk       敵/敌
    gé        gok        閣/阁
    guó       gwok       國/国
    jí        gihk       極/极
    jiá       gaap       夾/夹
    jié       git        結/结
    jié       jit        節/节
    jué       gok        覺/觉
    jué       kyut       決/决
    zá        jaahp      雜/杂
    zé        jaahk      澤/泽
    zhá       jaahp      閘/闸
    zhái      jaahk
    zhé       jit
    zhí       jāp        執/执
    zhí       jihk
    zhú       jūk
    zhuó      juhk       濁/浊
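
The rule stated earlier in this explanation (an unaspirated stop or affricate initial plus a nasal final means, as a rule, no second-tone form) can be sketched as a small Python check. This is only an illustration: the names are made up, the input is assumed to be a single toneless Pinyin syllable in plain letters, and the exception list contains just the two contractions mentioned above, so it should not be read as complete.

```python
# Initials named in the rule; "zh" is listed before "z" only for readability.
UNASPIRATED_INITIALS = ("zh", "b", "d", "g", "j", "z")
NASAL_FINALS = ("n", "ng")
# Later contractions that break the rule (from búyòng and zámén).
CONTRACTION_EXCEPTIONS = {"beng", "zan"}


def matches_rule(syllable):
    """True if the syllable has an unaspirated initial and a nasal final."""
    s = syllable.lower()
    return s.startswith(UNASPIRATED_INITIALS) and s.endswith(NASAL_FINALS)


def second_tone_expected_missing(syllable):
    """True if the rule predicts that no second-tone form exists."""
    return matches_rule(syllable) and syllable.lower() not in CONTRACTION_EXCEPTIONS


for syllable in ["zhong", "zhi", "tong", "beng"]:
    print(syllable, second_tone_expected_missing(syllable))
```

Open syllables such as zhi and aspirated ones such as tong fall outside the rule, which fits the existence of zhí and tóng.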

    Pinyin’s never-used letter?

    As most people reading this blog know, Mandarin has about 1,300 syllables (interjections and loan words complicate the count a little). If tones (a basic part of the language) are disregarded, the number drops to 400-something syllables.

    Given 410 or so basic syllables and 4 tones (one of these days I need to write something more on the wrongful neglect of the so-called neutral tone), some people might expect there to be more like 1,640 syllables instead of about 1,300. The reason for the lower number is that not all syllables exist in all four tones. For example, quite clearly the official language of Zhōngguó does not lack zhōng … or zhǒng or zhòng. But zhóng is another matter.

    So not all possible tonal variations of those 400-something syllables appear in modern standard Mandarin. But what about letters?

    If you look at the official alphabet for Hanyu Pinyin, it’s exactly the same as that for English (other than in pronunciation, of course), which is a bit odd, especially considering that Pinyin doesn’t use the letter v (or at least isn’t supposed to for Mandarin words).

    So in this case, I’m excluding v but otherwise being expansionist about the glyphs I’m calling letters. To be specific: I’m referring to a-z, minus v, but including ā, á, ǎ, à, ē, é, ě, è, ī, í, ǐ, ì, ō, ó, ǒ, ò, ū, ú, ǔ, ù, ü, ǖ, ǘ, ǚ, and ǜ. (Even though Ī, Í, Ǐ, Ì, Ū, Ú, Ǔ, Ù, Ü, Ǖ, Ǘ, Ǚ, and Ǜ never come at the beginning of a word, let’s not automatically eliminate them, because there is an occasional need for ALL CAPS.)

    Are there any of those possible glyphs that don’t appear at all — at least as given in the large ABC Comprehensive Chinese-English Dictionary?

    The answer, perhaps surprisingly, is yes.

    Which letter is it?

    a. ǖ b. ǘ c. ǚ d. ǜ

    Have you made your choice?

    It doesn’t take much thought to eliminate C as the answer. “Nǚ” (woman) is one of those first-couple-of-Mandarin-lessons vocabulary terms. And the word for green (lǜsè) is hardly obscure either. It might be harder to think of a word with the letter ǘ; but there are some. Donkey (lǘ) is probably the most common. So the answer is A: ǖ.

    It’s important to note that the lack of ǖ is in appearance only. The sound ǖ occurs in plenty of Mandarin words; it’s just that Pinyin’s simplified orthography calls for writing “u” instead where ǖ follows j, q, x, or y.
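
That spelling convention is simple enough to sketch in a few lines of Python. This is just an illustration with made-up names: it takes an initial and a final separately and drops the two dots from any ü (with or without a tone mark) when the initial is j, q, x, or y.

```python
# Map each ü letter to the matching u letter with the same tone mark.
UMLAUT_TO_U = str.maketrans("üǖǘǚǜ", "uūúǔù")
INITIALS_TAKING_PLAIN_U = {"j", "q", "x", "y"}


def spell_syllable(initial, final):
    """Join initial and final, writing ü as u after j, q, x, or y."""
    if initial in INITIALS_TAKING_PLAIN_U:
        final = final.translate(UMLAUT_TO_U)
    return initial + final


print(spell_syllable("j", "ǖ"))   # written with ū
print(spell_syllable("l", "ǘ"))   # keeps the dots: lǘ
print(spell_syllable("n", "ǚ"))   # keeps the dots: nǚ
```

After l and n the dots have to stay, because lu and lü (and nu and nü) are different syllables; that is why ǘ, ǚ, and ǜ all turn up in the dictionary.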

    But even though I didn’t find an example of ǖ, I’d encourage font designers not to scratch it from their list of must-have glyphs for Pinyin faces, especially since teachers will no doubt want to continue giving tone-pattern drills based on four tones for all vowels, regardless. Also, someone with a searchable edition of the Hanyu Da Cidian or maybe the new Oxford online edition is probably about to use the comments to point me to some obscure entry there….

    How to handle ‘de’ and interjections in Hanyu Pinyin

    Today’s selection from Yin Binyong’s Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》) deals with how to write Mandarin’s various de’s, mood particles, and interjections.

    This reading is available in two versions:

    • simplified Chinese characters: ???? ????? (zhùcí, tàncí)
    • traditional Chinese characters: ???? ?????

    I’ve already written about the principles in previous posts. For example, see