Pinyin sort order

The standard for alphabetically sorting Hanyu Pinyin is given in the ABC dictionary series edited by John DeFrancis and issued by the University of Hawaii Press.

Here’s the basic idea:

The ordering is primarily alphabetical. Diacritical marks, punctuation, juncture, and capitalization are taken into account only when the strings being compared are otherwise identical. For example, píng’ān sorts before pīnyīn because pingan sorts before pinyin: g precedes y alphabetically.

Only when two strings are alphabetically identical is non-alphabetical information taken into account.
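That first, purely alphabetical pass can be sketched in a few lines of Python. This is only an illustration, not anything taken from the ABC dictionaries: the function name `base_letters` is made up here, and folding ü into u at this stage is a simplifying assumption.

```python
import unicodedata

def base_letters(entry: str) -> str:
    """Return only the letters of a Pinyin string, lowercased, without diacritics."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", entry)
    # Combining marks, apostrophes, hyphens, and spaces are not alphabetic,
    # so they drop out here.
    return "".join(ch.lower() for ch in decomposed if ch.isalpha())

print(base_letters("píng’ān"), base_letters("pīnyīn"))   # pingan pinyin
print(base_letters("píng’ān") < base_letters("pīnyīn"))  # True
```

Everything else (tones, apostrophes, capitalization) only matters when two entries come out of this function identical.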

The series’ Reader’s Guide presents the specifics of the sort order. Since I don’t have to worry about how much space this takes up on my site, I have reformatted the information slightly to give the examples as numbered lists.

Head entry transcriptions with the same sequence of letters are ordered first strictly by letter sequence regardless of tones, then by initial syllable tone in the sequence 0 1 2 3 4. For entries with the same initial tone, arrangement is by the tone of the second syllable, again in the order 0 1 2 3 4. For example:

  • shīshi
  • shīshī
  • shīshí
  • shīshǐ
  • shīshì
  • shíshī
  • shíshì
  • shǐshī
  • shìshī

    Irrespective of tones, entries with the vowel u precede those with ü.
    For example:

    Entries without apostrophe precede those with apostrophe. For example:

    1. biàn (argue)
    2. bǐ’àn (the other shore)

    Lower-case entries precede upper-case entries. For example:

    1. hòujìn (aftereffect)
    2. Hòu Jìn (Later Jin dynasty)

    For entries with identical spelling, including tones, arrangement is by order of frequency….
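Here is a rough Python sketch of how the rules quoted above might be turned into a sort key. It reflects my reading of the rules, not the ABC editors' actual implementation, and the name `sort_key` is hypothetical. It assumes that tone marks sit on the same letter whenever two entries share the same letters and syllable breaks, so comparing tones letter by letter matches comparing them syllable by syllable. It ranks u/ü first, then apostrophes, then tones, then case; it ignores hyphens and spaces entirely; and it leaves out the frequency rule, which needs data that is not given here.

```python
import unicodedata

# Combining macron, acute, caron, grave -> tones 1 to 4; no mark means tone 0.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030C": 3, "\u0300": 4}
DIAERESIS = "\u0308"
APOSTROPHES = {"'", "\u2019"}

def sort_key(entry: str):
    letters, umlaut, breaks, tones, upper = [], [], [], [], []
    pending_break = 0
    for ch in unicodedata.normalize("NFD", entry):
        if ch in APOSTROPHES:
            pending_break = 1            # attach the break to the next letter
        elif ch in TONE_MARKS and letters:
            tones[-1] = TONE_MARKS[ch]   # tone carried by the previous letter
        elif ch == DIAERESIS and letters:
            umlaut[-1] = 1               # previous letter is ü rather than u
        elif ch.isalpha():
            letters.append(ch.lower())
            umlaut.append(0)
            breaks.append(pending_break)
            tones.append(0)
            upper.append(1 if ch.isupper() else 0)
            pending_break = 0
        # Spaces, hyphens, and other punctuation are skipped in this sketch.
    return ("".join(letters), tuple(umlaut), tuple(breaks),
            tuple(tones), tuple(upper))

entries = ["shìshī", "shīshì", "shíshī", "shīshi", "shīshǐ"]
print(sorted(entries, key=sort_key))
# ['shīshi', 'shīshǐ', 'shīshì', 'shíshī', 'shìshī']
```

Because the key is a tuple, Python compares the plain letters first and falls through to the later fields only on ties, which is the behavior the rules describe.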

    For most users, the most important thing to note is that the neutral tone is regarded as 0, not as 5. Thus, the order is not “ā á ǎ à a,” but “a ā á ǎ à.” And, because lowercase comes before uppercase, not “A a Ā ā Á á Ǎ ǎ À à” but “a A ā Ā á Á ǎ Ǎ à À.”
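For anyone who wants to check that sequence mechanically, here is a small Python sketch that sorts single vowels by tone (with the neutral tone as 0) and then by case. The helper name `vowel_key` is made up for this example, and it handles only one character at a time.

```python
import unicodedata

# Combining macron, acute, caron, grave -> tones 1 to 4.
TONE = {"\u0304": 1, "\u0301": 2, "\u030C": 3, "\u0300": 4}

def vowel_key(ch: str):
    # Split the character into its base letter and any combining marks.
    base, *marks = unicodedata.normalize("NFD", ch)
    # No tone mark means neutral tone, which counts as 0 rather than 5.
    tone = next((TONE[m] for m in marks if m in TONE), 0)
    # False sorts before True, so lowercase precedes uppercase.
    return (base.lower(), tone, base.isupper())

forms = list("AaĀāÁáǍǎÀà")
print(" ".join(sorted(forms, key=vowel_key)))
# a A ā Ā á Á ǎ Ǎ à À
```

A plain `sorted(forms)` with no key orders by Unicode code point instead, which is not the dictionary order.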

    One can see this in action in the A entries for the ABC English-Chinese, Chinese-English Dictionary. And here are some sample pages from an earlier ABC dictionary.

    The ABC series follows the example of the Hanyu Pinyin Cihui (汉语拼音词汇 / 漢語拼音詞彙 / Hànyǔ Pīnyīn Cíhuì) (example), with only one minor difference, as noted by Tom Bishop:

    HPC [Hanyu Pinyin Cihui] gave hyphens and spaces the same priority as apostrophes, so that lìgōng sorted before lǐ-gōng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting lǐ-gōng before lìgōng. In ABC, hyphens and spaces don’t affect the sort order unless they change the pronunciation in the same way that apostrophe would; for example, ¹míng-àn 明暗 and ²míng’àn 冥暗 are treated as homophones, and they sort after mǐngǎn 敏感.
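One way to picture the ABC treatment of hyphens and spaces is as a cleanup step that runs before any comparison. The Python sketch below is a guess at that idea, not Bishop's code. It assumes that a hyphen or space "changes the pronunciation in the same way that apostrophe would" exactly when the next syllable begins with a, o, or e, which is where Pinyin orthography calls for an apostrophe. The function name `normalize_juncture` is hypothetical.

```python
import unicodedata

def base(ch: str) -> str:
    """Base letter of a possibly accented character, lowercased."""
    return unicodedata.normalize("NFD", ch)[0].lower()

def normalize_juncture(entry: str) -> str:
    """Drop hyphens and spaces, keeping an apostrophe where one would be needed."""
    out = []
    for i, ch in enumerate(entry):
        if ch in "- ":
            nxt = entry[i + 1] if i + 1 < len(entry) else ""
            # Before a, o, or e the break affects syllable division,
            # so keep it as an apostrophe; otherwise discard it.
            if nxt and base(nxt) in "aoe":
                out.append("\u2019")
            continue
        out.append(ch)
    return "".join(out)

print(normalize_juncture("míng-àn"))  # míng’àn
print(normalize_juncture("lǐ-gōng"))  # lǐgōng
```

After this step míng-àn and míng’àn are the same string, so they collate as homophones, while the hyphen in lǐ-gōng no longer has any say in where the entry lands.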

    Pinyin Dongwuyuan: an illustrated Pinyin alphabet

    Here’s a new book I made for fun: Pīnyīn Dòngwùyuán (4.3 MB PDF).

    It goes through the letters of the alphabet: A is for ānchun, B is for bānmǎ, C is for chángjǐnglù, etc., all the way through Z, which is for zhāngyú.

    But X is not for xióngmāo. I’m sick of pandas. Let’s let some other animals have some time in the spotlight.

    Although technically speaking the Pinyin alphabet is the same as that for English, I prefer to go with A–Z, minus V but plus Ü.

    O and R were the tricky ones to find animals for.

    Perhaps some teachers will print this out and hang it up in their classrooms. Or kids could use it as a coloring book. You have my permission to do just about anything you like with this — other than sell it or add Chinese characters. (The world already has plenty of material in Hanzi, but not nearly enough in Pinyin.)

    I made sure to include multiples of some common morphemes (e.g., bān, hǎi, and ; è and zhāng; hǎimǎ and hǎi’ōu; niú, wōniú, and xīniú), which I hope will be useful.

    For fonts, I used the Linux Libertine family.

    This took me far longer to make than I thought it would, so I hope some people enjoy it or at least find it interesting.

    Pinyin font: Linux Libertine

    Linux Libertine is perhaps most familiar as the font used in the Wikipedia logo. This surprisingly large font family also works well with Hanyu Pinyin, though a few adjustments need to be made before all of the fonts in this family work as they should with Pinyin texts.

    Here’s how those working on Linux Libertine describe it:

    We work on a versatile font family. It is designed to give you an alternative for fonts like T*mes New Roman. We’re creating free software and publish our fonts under terms of the GPL and OFL. Please have a look at the paragraph concerning the license.

    It is our aim to support the many western languages and provide many special characters. Our fonts cover the codepages of Western Latin, Greek, Cyrillic (with their specific enhancements), Hebrew, IPA and many more. Furthermore, typographical features such as ligatures, small capitals, different number styles, scientific symbols, etc. are implemented in this font. Linux Libertine thus contains more than 2000 characters.

    Here’s what it looks like with Pinyin. (Click to view a PDF, which is much clearer.)
    screenshot of Linux Libertine in action on Pinyin text

    image of a rhinoceros (xiniu) and the word 'xiniu' in Linux Libertine

    All in all: Not bad.

    New row about old foolishness

    It appears that few things are harder to get rid of than a Taipei City Government official’s bad idea.

    Four years ago I noted that city hall was sponsoring a “festival” for beef noodle soup and promoting it to foreigners through a machine-translated Chinglish Web site and the absurd use of the supposedly English “Newrow Mian” for niúròumiàn (牛肉麵/牛肉面).

    The city has continued to host the annual event. This year, the city appears to have moved to solve its Chinglish problem by simply failing to provide English translations — though one wonders just where the “international” part comes in without much of anything in English. Thus, useful English is lacking; but fake English like “Newrow Mian” remains.

    image of logo that reads '2011 Taipei International New row Mian Festival'

    This has come to the attention of the media. For example, see this video report: Niúròumiàn = New Row Mian? Shì-fǔ zhíyì rěyì.

    Táiběi Shìzhèngfǔ jǔbàn niúròumiàn jié, xiànzài yào tuī wǎng guójì, buguò què yǒu yǎnjiān mínzhòng fāxiàn, huódòng hǎibào, bǎ Zhōngwén “niúròumiàn” zhíjiē yīn yìchéng Yīngwén de “New Row Mian,” bùshǎo guówài lǚkè kànle dōu tǎnyán, wánquán bù dǒng shénme yìsi, zhìyí shì-fǔ shìbushì Yīngwén fānyì yòu chūbāo, buguò shì-fǔ chéngqīng, shuōshì wèile xuānchuán “niúròumiàn” de Zhōngwén niànfǎ, ràng tā xiàng shòusī, pīsà yīyàng, ràng quánshìjiè dōu zhāozhe yuánwén niàn.

    According to the brief write-up above, some people had noticed that foreigners had no idea what this “new row mian” was or even how to say it, so the municipal authorities explained that this is for the sake of publicizing the Chinese pronunciation of niúròumiàn. City authorities dream that English will take on “new row mian” as a loan term, just like sushi and pizza. (Apparently it’s important to convey to the world the Chinese-ness (with Taiwanese characteristics) of this dish, so “beef noodle soup” — which is what just about everyone in Taiwan calls this when speaking in English — just won’t do.)

    Sigh.

    Really, this isn’t that difficult. If you want to use the roman alphabet to write a Mandarin term, use Hanyu Pinyin. Although Pinyin will not be helpful in all situations to people who know nothing about the system, neither will anything else. But Hanyu Pinyin stands the best chance of working because it’s the international system for writing Mandarin in romanization. It’s also Taiwan’s official system for writing Mandarin in romanization. And it’s even the Taipei City Government’s official system for writing Mandarin in romanization, which means the city is supposed to use it rather than employing ad hoc bullshit year after year.

    Anyway, the festival doesn’t start until November 17, so if you have ever wanted to “beef the world” (and who hasn’t?), now’s the time. (That this is being run by an ad “agemcy” that somehow missed getting its own name right, however, doesn’t inspire confidence.)

    If anyone would like to let the city know your thoughts about this, the contact person is Ms. Yè, who can be reached at 1999 ext. 6507, or at 02-2599-2875 ext. 214 or 220. Tell them this concerns the Táiběi Guójì Niúròumiàn Jié.

    Further reading:

    And for still more reading, see the Taipei City Government’s massive PDF (157 MB!) for the 2008 event. This has lots of English (and Japanese!), which appears not to have been machine translated; but some parts could certainly use improvement, such as “The regretful beef noodles have been staying in my memory.” Additionally, the romanization system employed is Tongyong Pinyin, rather than Taipei’s official Hanyu Pinyin (e.g., “Rih Pin Shan Si Dao Siao Mian” instead of “Rì Pǐn Shānxī dāoxiāomiàn” and “HONG SHIH FU SIN JHUAN” instead of Hóng Shīfu Xīn Zhuàn).

    Of course, it’s not consistent even in its incorrect use of Tongyong. It also contains broken bastardized Wade-Giles (e.g., the “Kuan Tu” MRT station instead of “Guandu”) and the city’s “new row” whenever it gets the chance (e.g., HUANG ZAN NEWROW MIAN FANG instead of Huáng Zàn Niúròumiàn Fáng / 皇贊牛肉麵坊).

    Later, all of the stores’ addresses are given in Tongyong Pinyin (e.g., Chongcing, Mincyuan, Jhihnan, Mujha, Singlong, Jhongsiao).

    Saint Joe’s

    A Catholic church in Jinlun (Jīnlún/金崙), Taidong, Taiwan. Note the absence of Chinese characters.

    photo of a church, with 'KIOKAI NI' and 'SANTO YOSEF' written on it in large letters

    The town of Jinlun being in an area with many members of the Paiwan tribe, I checked with Chen Chun-Mei (Chén Chūnměi / 陳春美), a Paiwan specialist at Guólì Zhōngxīng Dàxué (National Chung Hsing University / 國立中興大學), who wrote that kiokai is one of many words Paiwan borrowed from Japanese (kyōkai/教会: meaning church), and that ni in Paiwan means of or by.

    So this is the Church of Saint Joseph.

    I was also interested to hear on the train to Jinlun that some of the announcements in advance of some stations in Taidong County were in not only Mandarin, Taiwanese, and English, but also an aboriginal language. I’m guessing Paiwan. Even in the announcements in that language, however, the place names themselves sounded like they were given in Mandarin forms, though the descriptions were not.

    Further reading:

    Now on Pinyin.info: Weishenme Zhongwen zheme TM nan?

    Earlier this year a Mandarin translation of David Moser’s classic essay Why Chinese Is So Damn Hard appeared on the Web. And then it disappeared. With the permission of both the translator and the original author, I’m placing this work back online.

    It’s available here in two versions:

    Enjoy!

    Maybe I’ll make a Pinyin version too one of these years.

    Script font for Pinyin

    Unfortunately, relatively few fonts support Hanyu Pinyin (with tone marks, that is). So I was surprised to come across Pecita, by Philippe Cochy. This is the first script typeface I recall seeing that covers Pinyin … and a lot more.

    It might be too individualistic for much Pinyin use. But I’m very glad to know it exists and hope to see many more creations like it.

    GIF of Pecita in action: A-Z, a-z, plus the diacritics used in Pinyin and a pinyin pangram

    Pecita is licensed under the SIL Open Font License, Version 1.1.

    Additional links:

    Pinyin pangram challenge

    One of the many things I plan to do eventually is to put up some graphics of how Pinyin looks in various font faces. A Pinyin pangram would do nicely for a sample text. You know: a short Mandarin sentence in Hanyu Pinyin that uses all of the following 26 letters: abcdefghijklmnopqrstuüwxyz (i.e., the English alphabet’s a-z, minus v but plus ü).

    But then I couldn’t find one. So I put the question out to some people I know and quickly got back two Pinyin pangrams.

    Ruanwo bushi yingzuo; putongfan bushi xican; maibuqi lüde kan jusede. (57 letters)

    and

    Zuotian wo bang wo de pengyou Lü Xisheng qu chengli mai yi wan doufuru he ban zhi kaoji. (70 letters)

    from Robert Sanders and Cynthia Ning, respectively.

    James Dew weighed in with some helpful advice. And, with some additional help from the original two contributors and my wife, I made some additional modifications, eventually resulting in a variant reduced to 48 letters:

    Zuotian wo bang nü’er qu yi jia chaoshi mai kele, xifan, doupi.

    With tone marks, that’s “Zuótiān wǒ bāng nǚ’ér qù yī jiā chāoshì mǎi kělè, xīfàn, dòupí.”
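Counting letters and checking coverage by hand gets tedious, so here is a small Python sketch that does both for the toneless versions of the three sentences above. The names `letters_of` and `is_pangram` are made up for this example; tone marks are not handled, so feed it the unmarked spellings.

```python
import unicodedata

# The 26 letters wanted: English a-z, minus v but plus ü.
PINYIN_ALPHABET = set("abcdefghijklmnopqrstuüwxyz")

def letters_of(sentence: str) -> list:
    """Letters of the sentence that belong to the Pinyin alphabet, lowercased."""
    text = unicodedata.normalize("NFC", sentence).lower()
    return [ch for ch in text if ch in PINYIN_ALPHABET]

def is_pangram(sentence: str) -> bool:
    """True if every one of the 26 letters appears at least once."""
    return set(letters_of(sentence)) == PINYIN_ALPHABET

candidates = [
    "Ruanwo bushi yingzuo; putongfan bushi xican; maibuqi lüde kan jusede.",
    "Zuotian wo bang wo de pengyou Lü Xisheng qu chengli mai yi wan doufuru he ban zhi kaoji.",
    "Zuotian wo bang nü’er qu yi jia chaoshi mai kele, xifan, doupi.",
]
for sentence in candidates:
    print(len(letters_of(sentence)), is_pangram(sentence))
# 57 True
# 70 True
# 48 True
```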

    I suppose xīfàn is not really the sort of thing one buys at a chāoshì. On the other hand, people probably don’t worry much about whether jackdaws really do love someone’s big sphinx of quartz, so I think we’re OK. Still, something shorter than 48 letters should be possible — though pangram-friendly brevity is more easily accomplished in English than in Mandarin as spelled in Hanyu Pinyin. As one correspondent noted:

    Most of the “excess” letters are vowels. Trouble is that Chinese doesn’t pile up the consonants much. Brown, for example, takes care of b, r, w, and n, while only expending one little o…. There’s no word like string in Chinese (5 consonants; one vowel). Chinese piles up vowels: zuotian and chaoshi and doufu and kaoji all use more vowels than consonants.

    I’m challenging readers to come up with more Pinyin pangrams.

    But I don’t want this to be a reversed shi shi shi stunt, so let’s stay away from Literary Sinitic. And I’d prefer the equivalent of “The quick brown fox jumps over the lazy dog” to that of “Cwm fjord veg balks nth pyx quiz.” In other words, wherever possible this should be in real-world, sayable Mandarin.

    One possible variant on this would be to use “abcdefghijklmnopqrstuüwxyz” plus all the forms with diacritics: “āáǎàēéěèīíǐìōóǒòūúǔùǘǚǜ.” (No ǖ, i.e., first-tone ü, is necessary.) But that would be even more work.
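For anyone tempted by that harder variant, a quick Python sketch can report which of the required forms a candidate sentence still lacks. The name `missing_forms` is hypothetical, and the sketch assumes each plain letter and each toned form must appear at least once as written.

```python
import unicodedata

# 26 plain letters plus the 23 toned forms listed above (49 characters in all).
REQUIRED = set(unicodedata.normalize(
    "NFC",
    "abcdefghijklmnopqrstuüwxyz" "āáǎàēéěèīíǐìōóǒòūúǔùǘǚǜ",
))

def missing_forms(text: str) -> set:
    """Required characters that do not occur anywhere in the text."""
    # NFC makes sure accented letters are single characters before comparing.
    present = set(unicodedata.normalize("NFC", text).lower())
    return REQUIRED - present

sentence = "Zuótiān wǒ bāng nǚ’ér qù yī jiā chāoshì mǎi kělè, xīfàn, dòupí."
print(sorted(missing_forms(sentence)))
```

An empty list would mean the sentence qualifies; the 48-letter sentence does not, since it has no ǘ or ǜ, among others.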

    Those who devise good pangrams will be covered in róngyào, or something like that.

    Happy hunting.