Pinyin sort order

The standard for alphabetically sorting Hanyu Pinyin is given in the ABC dictionary series edited by John DeFrancis and issued by the University of Hawaii Press.

Here’s the basic idea:

The ordering is primarily alphabetical. Diacritical marks, punctuation, juncture, and capitalization are taken into account only when the strings being compared are otherwise identical. For example, píng’ān sorts before pīnyīn because pingan sorts before pinyin: g precedes y alphabetically.

Only when two strings are alphabetically identical is non-alphabetical information taken into account.
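
To make that first step concrete, here is a minimal Python sketch of the primary comparison. The function name letters_only is my own, chosen for illustration, and folding ü into plain u at this level is an assumption on my part rather than something spelled out in the Reader’s Guide.

```python
import unicodedata


def letters_only(pinyin: str) -> str:
    """Primary key: the bare letters, lowercased.

    NFD normalization splits each accented vowel into a base letter plus
    combining marks; keeping only alphabetic characters then drops the
    tone marks along with apostrophes, hyphens and spaces.
    Assumption: the diaeresis of ü is dropped here too.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch.lower() for ch in decomposed if ch.isalpha())


print(letters_only("píng’ān"))  # pingan
print(letters_only("pīnyīn"))   # pinyin
print(sorted(["pīnyīn", "píng’ān"], key=letters_only))
```

Anything this key cannot separate has to be settled by the tie-breakers described next.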

The series’ Reader’s Guide presents the specifics of the sort order. Since I don’t have to worry about how much space this takes up on my site, I have reformatted the information slightly to give the examples as numbered lists.

Head entry transcriptions with the same sequence of letters are ordered first strictly by letter sequence regardless of tones, then by initial syllable tone in the sequence 0 1 2 3 4. For entries with the same initial tone, arrangement is by the tone of the second syllable, again in the order 0 1 2 3 4. For example:

  1. shīshi
  2. shīshī
  3. shīshí
  4. shīshǐ
  5. shīshì
  6. shíshī
  7. shíshì
  8. shǐshī
  9. shìshī

Irrespective of tones, entries with the vowel u precede those with ü.
For example:

  1. l?
  2. l?
  3. l?
  4. l?
  1. n?

Entries without apostrophe precede those with apostrophe. For example:

  1. biàn (argue)
  2. bǐ’àn (the other shore)

Lower-case entries precede upper-case entries. For example:

  1. hòujìn (aftereffect)
  2. Hòu Jìn (Later Jin dynasty)

For entries with identical spelling, including tones, arrangement is by order of frequency….

For most users, the most important thing to note is that the neutral tone is regarded as 0, not as 5. Thus, the order is not “ā á ǎ à a” but “a ā á ǎ à.” And, because lowercase comes before uppercase, not “A a Ā ā Á á Ǎ ǎ À à” but “a A ā Ā á Á ǎ Ǎ à À.”
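
The rules above can be sketched as a Python sort key. This is only an illustration, not the ABC editors’ actual method: the name pinyin_sort_key is made up, ranking apostrophes ahead of tones is an assumption, tones are compared letter by letter rather than syllable by syllable (which should come to the same thing whenever the letters and syllable breaks match), and the final tie-breaker by frequency is left out because it needs data I don’t have.

```python
import unicodedata

# Combining marks produced by NFD normalization, mapped to tone numbers.
# A vowel with no tone mark keeps the default 0, i.e. the neutral tone.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
DIAERESIS = "\u0308"            # the two dots of ü
APOSTROPHES = {"'", "\u2019"}   # straight and curly apostrophes


def pinyin_sort_key(entry: str):
    """Build a tuple that Python compares in this order:
    letters, u before ü, no apostrophe before apostrophe,
    tones (0 1 2 3 4), lowercase before uppercase.
    """
    letters, umlauts, breaks, tones, cases = [], [], [], [], []
    pending_break = 0
    for ch in unicodedata.normalize("NFD", entry):
        if ch.isalpha():
            letters.append(ch.lower())
            umlauts.append(0)
            breaks.append(pending_break)  # 1 if an apostrophe came just before
            tones.append(0)
            cases.append(1 if ch.isupper() else 0)
            pending_break = 0
        elif ch in APOSTROPHES:
            pending_break = 1
        elif letters and ch == DIAERESIS:
            umlauts[-1] = 1
        elif letters and ch in TONE_MARKS:
            tones[-1] = TONE_MARKS[ch]
        # Hyphens, spaces and any other characters are ignored.
    return ("".join(letters), umlauts, breaks, tones, cases)


words = ["shíshì", "shīshì", "shīshi", "Hòu Jìn", "hòujìn", "bǐ’àn", "biàn"]
print(sorted(words, key=pinyin_sort_key))
# ['biàn', 'bǐ’àn', 'hòujìn', 'Hòu Jìn', 'shīshi', 'shīshì', 'shíshì']
```

Because the neutral tone is simply the absence of a mark, it falls out as 0 without any special handling, which is exactly the point most likely to trip people up.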

One can see this in action in the A entries for the ABC English-Chinese, Chinese-English Dictionary. And here are some sample pages from an earlier ABC dictionary.

The ABC series follows the example of the Hanyu Pinyin Cihui (汉语拼音词汇 / Hànyǔ Pīnyīn Cíhuì) (example), with only one minor difference, as noted by Tom Bishop:

HPC [Hanyu Pinyin Cihui] gave hyphens and spaces the same priority as apostrophes, so that lìg?ng sorted before l?-g?ng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting l?-g?ng before lìg?ng. In ABC, hyphens and spaces don’t affect the sort order unless they change the pronunciation in the same way that apostrophe would; for example, 1míng-àn ?? and 2míng’àn ?? are treated as homophones, and they sort after m?ng?n ??.

Wenlin releases major upgrade (4.0)

One of my favorite programs, Wenlin (which bills itself as “software for learning Chinese”), has just released a major upgrade for both Mac and Windows versions. This doesn’t happen often; it has been three-and-a-half years since the most recent big change was issued (Wenlin 3.4) and heaven only knows how long since 3.0 came out. So, yes, this release has many substantial improvements.

The improvement nearest and dearest to my heart is Wenlin 4.0’s greatly improved handling of Pinyin. I was among the field testers for the new version, so I’ve already spent a lot of time examining this feature. Here are a few important aspects of it:

  • Conversions from Chinese characters follow Hanyu Pinyin orthography much more closely than before. This is a major change for the better. (There’s still some room for improvement. But I don’t think we’ll have to wait years for this.)
  • In the past, using Wenlin to convert long texts in Chinese characters into Pinyin could be a real chore, with users having to examine example after example of Chinese characters with multiple pronunciations in order to select the proper pronunciation for that particular context. But now users may, if they so desire, tell Wenlin not to ask for disambiguation input. Of course, that doesn’t mean that Wenlin will always guess right; but many users will be happy that this trade-off allows them to skip the frustration of, for example, having to tell the program over and over and over that, yes, in this case 说 is pronounced shuō rather than shuì.
  • Relative newcomers to Mandarin may appreciate that for common words tone sandhi is indicated in Wenlin with additional marks (a dot or line below the vowel). This feature can also be turned off, for those who want standard Pinyin.

There are, of course, many improvements beyond the area of Pinyin. Here are a few:

  • One limitation of Wenlin 3.x was that its English dictionary wasn’t very large. But Wenlin 4.0 includes not only the ABC Chinese-English Comprehensive Dictionary but also the excellent new ABC English-Chinese, Chinese-English Dictionary (now finally in stock in the printed version).
  • The flashcards are now set up to handle not just individual characters but polysyllabic words.
  • There’s full Unicode Unihan 6.0 support for more than 75,000 Chinese characters.
  • And for those who think 75,000 just isn’t enough, users can now access Wenlin’s CDL technology. Through this, users can create new, variant, and rare characters; moreover, these can be published and shared with other Wenlin users or CDL-friendly devices.
  • Seal script versions of more than 11,000 characters are provided.
  • Wenlin contains an e-edition of the Shuowen Jiezi (Shuōwén Jiězì / 說文解字 / 说文解字).
  • Coders will be interested to know that Wenlin appears to be headed toward becoming open-source.
  • Both Mandarin and English entries are marked with grade levels, which aids learners by indicating relative frequency of use. The levels for Mandarin words are based on the Hanyu Shuiping Kaoshi (Hànyǔ Shuǐpíng Kǎoshì / 漢語水平考試 / 汉语水平考试 / HSK).

The full boxed version (a CD with the program, likely packaged with a hard copy of the manual) is US$199; the download from the Wenlin Web store is US$179. Upgrades from 3.x cost US$49.

For more information, see the summary of features and outline of what’s new in Wenlin 4.0.


ABC English-Chinese, Chinese-English Dictionary out soon

The ABC Chinese-English Dictionary was published ten years ago. It was revolutionary in that, for the first time, a Mandarin-English dictionary was ordered entirely by the headwords’ pronunciation as written in pinyin. (Stroke and radical indexes are also there to aid finding a character when its shape is known but not its pronunciation.) Other dictionaries in the DeFrancis ABC series have followed. But up to now there has been no ABC dictionary with an English to Mandarin section as well as a Mandarin to English one.

At the end of this month the University of Hawaiʻi Press is releasing the ABC English-Chinese, Chinese-English Dictionary. The new dictionary, which is 1,252 pages long, has 29,670 entries in its English-Mandarin section and 37,963 entries for Mandarin-English (total 67,633 entries). (The much larger ABC Chinese-English Comprehensive Dictionary has some 196,000 entries, all Mandarin-English.)

This is a big year for Mandarin-English dictionaries, with the forthcoming release of the ABC ECCE and the release three months ago of the massive Oxford Chinese Dictionary. From the standpoint of Pinyin, however, the Oxford dictionary is a disappointment. For example, the Oxford dictionary has no Pinyin in the English-Mandarin section, just Chinese characters; in some other places tone marks are missing from some of the Pinyin, where it appears at all. Perhaps this will be rectified in the online edition, which has yet to appear. At the moment, though, the Oxford looks like a fairly traditional dictionary — albeit a huge one — aimed mainly at English learners in China, which isn’t necessarily a bad thing if you happen to be among that very large group of people. For more on the Oxford, see the video at Danwei and the entries at Chinese Forums (with some images) and Language Log.

Unlike the Oxford dictionary, the ABC ECCE offers both Pinyin and Chinese characters for all entries and sample sentences. (See samples below. Click on those for more extensive examples in PDF files.)

From what I’ve seen so far of the ABC English-Chinese, Chinese-English Dictionary, I expect it to become the dictionary for English-speaking students of Mandarin. I’ll write more about this once I’m able to see a hard copy.

The ABC English-Chinese, Chinese-English Dictionary retails for only US$20, compared to US$75 for the Oxford.

From the Mandarin-English section. But don’t expect the text in the printed edition to be this large. I’ve enlarged the image to make it easier to read on the Web.
examples of entries in the Mandarin-English section of the ABC English-Chinese, Chinese-English Dictionary

From the English-Mandarin section:
examples of entries in the English-Mandarin section of the ABC English-Chinese, Chinese-English Dictionary

(ISBN-10: 0824834852; ISBN-13: 978-0824834852)


Xin Tang 6

My previous post linked to a new HTML version of Homographobia, an essay by John DeFrancis. The work was first published in November 1985, in the sixth issue of Xin Tang (New China).

Xin Tang (Xīn Táng) is an especially interesting journal in that it is primarily in Mandarin written in romanization. A variety of romanization systems and methods are employed across its issues. Indeed, over the course of its run one can see many questions of systems and orthographies being worked out.

I want to stress, though, that the journal does not restrict itself to material of interest only to romanization specialists. It also features poetry, illustrated stories, philosophy, letters to the editor, children’s material, and much more.

English and a few Chinese characters are also found; and there are even articles in languages such as Turkish (with Mandarin and English translations).

Most of what appears in English is also translated into Mandarin — romanized Mandarin, of course. So DeFrancis’s essay also appears, appropriately, in Pinyin:

Homographobia is a disorder characterized by an irrational fear of ambiguity when individual lexical items which are now distinguished graphically lose their distinctive features and become identical if written phonemically. The seriousness of the disorder appears to be in direct proportion to the increase in number of items with identical spelling that phonemic rendering might bring about….

Tongyinci-kongjuzheng shi yi zhong xinli shang d shichang, tezheng shi huluande haipa yong pinyin zhuanxie dangqing kao zixing fende hen qingchu d cir hui shiqu tamend bianbiexing. Kan qilai, zhei ge bing d yanzhongxing gen pinyin shuxie keneng zaocheng d tongxing pinshi shuliang d zengjia cheng zhengbi….

The entire issue containing the DeFrancis essay is now online: Xin Tang no. 6.

illustration of a dragon reading a copy of Xin Tang, from an illustrated story
Note the occasional employment of a tonal spelling (shuui).

Homographobia

Twenty-five years ago, John DeFrancis wrote a terrific essay on what he aptly dubbed homographobia (in Mandarin: tóngyīncí-kǒngjùzhèng). It’s a word that deserves wider currency, as the irrational fear he describes still affects a great many people.

Homographobia is a disorder characterized by an irrational fear of ambiguity when individual lexical items which are now distinguished graphically lose their distinctive features and become identical if written phonemically. The seriousness of the disorder appears to be in direct proportion to the increase in number of items with identical spelling that phonemic rendering might bring about. The aberration may not exist at all among people favored by writing systems that are already closely phonemic, such as Spanish and German. It exists to a mild degree among readers of a poorly phonemic (actually morphophonemic) writing system such as English, some of whom suffer anxiety reactions at the thought of the confusion that might arise if, for example, rain, rein, and reign were all written as rane. It exists in its most virulent form among those exposed to Chinese characters, which, among all the writing systems ever created, are unique in their ability to convey meaning under extreme conditions of isolation.

That the fear is a genuine phobia, that is an irrational fear, is attested to by the fact that it is confined only to those cases in which lexical items that are now distinguished in writing would lose their distinctiveness if written phonemically, as in the case of the three English homophones mentioned above. Quite irrationally, the fear is not provoked by lexical items which are not now distinguished in writing, even though the amount of already existing homography might be considerably greater than in projected cases, such as the mere three English words pronounced rane. The English graphic form can, for example, has at least ten different meanings which to a normal mind might appear as ten different words. But no one, either in or out of his right mind in such matters, suffers any anxiety from the problems which in theory should exist in such extensive homography.

The uncritical acceptance of current written forms as an immutable given ignores the accidents in the history of writing that have resulted in current graphic differentiation for some homophones and not for others. Such methodological myopia cannot lead to any useful consideration of ambiguity….

The complete essay is now online: Homographobia.

John DeFrancis video

Ten years ago John DeFrancis was awarded the Chinese Language Teachers Association’s first lifetime achievement award. Since he could not be present at the association’s annual conference to receive the award, he sent a videotape of a 12-minute acceptance speech. The video was recently edited down to 6:27 and uploaded to YouTube: John DeFrancis remarks.

Here’s my summary of the main points:

0:00 — While working on what he intended to be a largely political study of Chinese nationalism, DeFrancis encountered references to people who wanted China to adopt an alphabetic writing system, an idea which he initially dismissed. But discovering Lu Xun’s interest in romanization led him to investigate the matter further. [I'm frustrated by the cut away from this discussion. Perhaps a fuller version of the video will be posted later.]
1:30 — Emphasizes he’s not in favor of completely abandoning Chinese characters. Rather, he favors digraphia.
2:30 — “I’d like to mention three aspects of the Chinese field which have interested me.”

  1. pedagogy (2:50) — lots of advancements
  2. linguistic aspect (3:20) — that’s also progressing well
  3. socio-linguistics (3:52) — the field isn’t doing as well as it should be

5:00 — computers and Chinese characters. DeFrancis tears into the Chinese government for its emphasis on shape-based character-input methods rather than Pinyin.