important book on Pinyin to be excerpted on this site

Xīnhuá Pīnxiě Cídiǎn (《新华拼写词典》 / 《新華拼寫詞典》) is the second of Yin Binyong’s two books on Pinyin orthography. The first, Chinese Romanization: Pronunciation and Orthography, is in English and Mandarin; much of it is already available here on Pinyin.Info.

Although Xinhua Pinxie Cidian is only in Mandarin, the large number of examples makes it easy to get the point even if you may not read Mandarin in Chinese characters very well.

This week I will begin posting some excerpts from this invaluable work. What’s more, I have made a version in traditional Chinese characters, which I hope that readers in Taiwan, Hong Kong, and elsewhere will take advantage of. So those not used to reading simplified Chinese characters will have a choice (which is more than the government of Taiwan is providing these days).

I’m extremely happy to be able to bring you this information and wish to acknowledge the generosity of the Commercial Press. Stay tuned.

Spreading the good news

Behold, I bring you good tidings.

As I keep having to note, most of the things that are supposedly in Pinyin are terrible. This is not because Pinyin itself is inherently poor or difficult. It’s because most people who produce such things have a fundamental lack of understanding of Pinyin as a system. (And, yes, that includes most users in China.) So it is with amazement that I report today on a journal that not only offers dozens of pages in Hanyu Pinyin — good Hanyu Pinyin — but does so twice every month. It’s also well worth noting that the journal is aimed primarily at adult native speakers of Mandarin, not foreigners trying to pick up the language, though certainly it could also be read by people in the latter group.

From what I’ve seen so far, this journal gets right the things most commonly written incorrectly elsewhere, including:

And it doesn’t use the atrocious ɑ that some people mistakenly believe is required either.

Unfortunately, punctuation and alphanumerics are not included in the Pinyin. But other than that there’s very little that doesn’t follow standard Pinyin orthography, the main exception being the indication of the tone sandhi related to the special cases of yī and bù (e.g., the journal gives “bú shì” and “búdà” instead of the standard “bù shì” and “bùdà,” and “yìhuíshì” and “yí wèi” instead of the standard “yīhuíshì” and “yī wèi”). That said, though, tone changes related to yi and bu can be something of a pain. So although this isn’t standard, I can see why it was done and am not entirely unsympathetic to this approach.

Here are a few sample lines (click to enlarge):
screenshot of some text in the journal, showing text in simplified Chinese characters with word-parsed Hanyu Pinyin above the Hanzi. Note: Yifusuoshu/以弗所书 = Ephesians

It would be nice if this were in Unicode, to aid searches and cutting and pasting. The text, however, appears to have been made in a system devised years ago by the people at the journal. Regardless, I’m happy to see the Pinyin.

Overall, despite the lamentable absence of punctuation and Arabic numerals in the Pinyin, this is quality work, which is perhaps all the more remarkable in that the Pinyin and simplified Hanzi edition of this journal is not truly free to circulate in the land of its target audience. That’s because its publishers are Jehovah’s Witnesses, a group suppressed by the PRC (though it appears that at least at the moment their sites are not blocked by the great firewall). The journal, Shǒuwàngtái, may be more familiar to you by its English name: Watchtower. Whatever you might think of Jehovah’s Witnesses, I hope you’ll recognize the considerable accomplishment of those who put together this publication.

Getting to the Jehovah’s Witnesses Web pages that link to Shǒuwàngtái can be tricky. (Go to the magazines page, select “Chinese (Simplified)” for the language; then choose the month and file with Pinyin.) So I’m providing direct links to some documents below:

I haven’t found any Pinyin editions other than those. Perhaps old ones are taken offline.

Rénrén Dōu Xūyào Zhīdao De Hǎo Xiāoxi (I'd prefer 'de' instead of 'De' -- but that's no big deal) 人人都需要知道的好消息

With thanks to Victor Mair.

Wenlin releases major upgrade (4.0)

One of my favorite programs, Wenlin (which bills itself as “software for learning Chinese”), has just released a major upgrade for both Mac and Windows versions. This doesn’t happen often; it has been three-and-a-half years since the most recent big change was issued (Wenlin 3.4) and heaven only knows how long since 3.0 came out. So, yes, this release has many substantial improvements.

The change nearest and dearest to my heart is that Wenlin 4.0 handles Pinyin much better than before. I was among the field testers for the new version, so I’ve already spent a lot of time examining this feature. Here are a few important aspects of this:

  • Conversions from Chinese characters follow Hanyu Pinyin orthography much more closely than before. This is a major change for the better. (There’s still some room for improvement. But I don’t think we’ll have to wait years for this.)
  • In the past, using Wenlin to convert long texts in Chinese characters into Pinyin could be a real chore, with users having to examine example after example of Chinese characters with multiple pronunciations in order to select the proper pronunciation for that particular context. But now users may, if they so desire, tell Wenlin not to ask for disambiguation input. Of course, that doesn’t mean that Wenlin will always guess right; but many users will be happy that this trade-off allows them to skip the frustration of, for example, having to tell the program over and over and over that, yes, in this case 说 is pronounced shuō rather than shuì.
  • Relative newcomers to Mandarin may appreciate that for common words tone sandhi is indicated in Wenlin with additional marks (a dot or line below the vowel). This feature can also be turned off, for those who want standard Pinyin.

There are, of course, many improvements beyond the area of Pinyin. Here are a few:

  • One limitation of Wenlin 3.x was that its English dictionary wasn’t very large. But Wenlin 4.0 includes not only the ABC Chinese-English Comprehensive Dictionary but also the excellent new ABC English-Chinese, Chinese-English Dictionary (now finally in stock in the printed version).
  • The flashcards are now set up to handle not just individual characters but polysyllabic words.
  • There’s full Unicode Unihan 6.0 support for more than 75,000 Chinese characters.
  • And for those who think 75,000 just isn’t enough, users can now access Wenlin’s CDL technology. Through this, users can create new, variant, and rare characters; moreover, these can be published and shared with other Wenlin users or CDL-friendly devices.
  • Seal script versions of more than 11,000 characters are provided.
  • Wenlin contains an e-edition of the Shuowen Jiezi (Shuōwén Jiězì / 说文解字 / 說文解字).
  • Coders will be interested to know that Wenlin appears to be headed toward becoming open-source.
  • Both Mandarin and English entries are marked with grade levels, which aids learners by indicating relative frequency of use. The levels for Mandarin words are based on the Hanyu Shuiping Kaoshi (Hànyǔ Shuǐpíng Kǎoshì / 汉语水平考试 / 漢語水平考試 / HSK).

The full version (i.e., the CD with the program comes in a box and is likely packaged with a hard copy of the manual) is US$199, or US$179 if you download it from the Wenlin Web store. Upgrades from 3.x cost US$49.

For more information, see the summary of features and outline of what’s new in Wenlin 4.0.

screenshot from Wenlin 4.0

Xin Tang 4

The fourth issue of Xin Tang is now online.

For those of you wondering why Xin Tang is spelled Xin Talng on the cover, that’s because parts of this particular issue use a tonal-spelling variation of Hanyu Pinyin, as follows.

Simple rules for tonal spelling

  1. ma (?) / ling (?)
  2. mal (?) / lilng (?)
  3. maa (?) / liing (?)
  4. mah (?) / lihng (?)
  5. “‘” biaaoshih qingsheng, kee’shi “‘de” dou –> “d”.
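
For readers who would rather see the four rules as a procedure, here is a minimal Python sketch. It is not taken from Xin Tang. The function name `tonal_spelling` is hypothetical, and the placement details are my inference from spellings such as lilng, biaao, shuui, and erh rather than rules the journal states: the added letter goes after the whole vowel cluster, the doubled vowel is a, e, or o if present and otherwise the first vowel, and "er" takes its letter after the r. The neutral tone (rule 5) is left out.

```python
VOWELS = "aeiouü"

def tonal_spelling(syllable: str, tone: int) -> str:
    """Respell a toneless Pinyin syllable for tones 1-4 (hypothetical helper)."""
    if tone == 1:
        return syllable  # rule 1: first tone is left unmarked
    if tone not in (2, 3, 4):
        raise ValueError("tone must be 1, 2, 3, or 4")
    # Find the run of vowels inside the syllable.
    start = next((i for i, ch in enumerate(syllable) if ch in VOWELS), None)
    if start is None:
        return syllable
    end = start
    while end < len(syllable) and syllable[end] in VOWELS:
        end += 1
    if syllable == "er":
        end = len(syllable)  # assumption based on the spelling "erh"
    if tone == 3:
        # rule 3: double one vowel (assumed priority: a, e, o, else the first)
        cluster = syllable[start:end]
        target = next((v for v in "aeo" if v in cluster), cluster[0])
        pos = start + cluster.index(target)
        return syllable[:pos] + target + syllable[pos:]
    # rules 2 and 4: add "l" or "h" after the vowel cluster
    marker = "l" if tone == 2 else "h"
    return syllable[:end] + marker + syllable[end:]

print(tonal_spelling("ling", 2), tonal_spelling("ling", 3), tonal_spelling("ling", 4))
```

Under these assumptions, `tonal_spelling("tang", 2)` gives the "talng" seen on the cover.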

Here, for example, is a message from the publisher.

Colng zheih yihqi qii SHIN TARNG gaai weil XIN TALNG, shiiyohng d welnzih yii Pin Yin (jiaan xiee PY) weil jichuu. Duobahn d welnzhang yohng yooudiaoh PY xiee. Biaodiaoh faa qiing kahn fengmiahn erh xiah’tou d jiaandan shuomilng.

The same passage in Pinyin with tone marks:

Cóng zhèi yì qī qǐ SHIN TARNG gǎi wéi XIN TANG, shǐyòng d wénzì yǐ Pīn Yīn (jiǎn xiě PY) wéi jīchǔ. Duōbàn d wénzhāng yòng yǒudiào PY xiě. Biāodiào fǎ qǐng kàn fēngmiàn èr xià’tou d jiǎndān shuōmíng.

Not all of the romanization in this issue follows that form. Some has no special spellings but instead uses tone marks. Some has no tone marks. Give ‘em all a try and see what you think.

Xin Tang 4 (PDF)

Xin Tang 6

My previous post linked to a new HTML version of Homographobia, an essay by John DeFrancis. The work was first published in November 1985, in the sixth issue of Xin Tang (New China).

Xin Tang (Xīn Táng) is an especially interesting journal in that it is primarily in Mandarin written in romanization. A variety of romanization systems and methods are employed over the course of the journal. Indeed, over the course of its run one can see many questions of systems and orthographies being worked out.

I want to stress, though, that the journal does not restrict itself to material of interest only to romanization specialists. It also features poetry, illustrated stories, philosophy, letters to the editor, children’s material, and much more.

English and a few Chinese characters are also found; and there are even articles in languages such as Turkish (with Mandarin and English translations).

Most of what appears in English is also translated into Mandarin — romanized Mandarin, of course. So DeFrancis’s essay also appears, appropriately, in Pinyin:

Homographobia is a disorder characterized by an irrational fear of ambiguity when individual lexical items which are now distinguished graphically lose their distinctive features and become identical if written phonemically. The seriousness of the disorder appears to be in direct proportion to the increase in number of items with identical spelling that phonemic rendering might bring about….

Tongyinci-kongjuzheng shi yi zhong xinli shang d shichang, tezheng shi huluande haipa yong pinyin zhuanxie dangqing kao zixing fende hen qingchu d cir hui shiqu tamend bianbiexing. Kan qilai, zhei ge bing d yanzhongxing gen pinyin shuxie keneng zaocheng d tongxing pinshi shuliang d zengjia cheng zhengbi….

All of the issue with the DeFrancis essay is now online: Xin Tang no. 6.

illustration of a dragon reading a copy of Xin Tang, from an illustrated story
Note the occasional employment of a tonal spelling (shuui).

How to create Hanyu Pinyin subtitles

Since posting about the Pinyin subtitles for Crouching Tiger, Hidden Dragon and The Story of Stuff I have received several messages inquiring about how someone might make Pinyin subtitles themselves. So I might as well put the answer online.

Although at the present stage of software implementation subtitle conversion isn’t as simple as pushing a button, the process is not particularly difficult, assuming you have a good source text to work from. But this does require some time and the right tools.

The Right Tools

The most important tool is, of course, the one that performs the conversion to Hanyu Pinyin. And it’s crucial to keep in mind that not all Pinyin converters are created equal; in fact, the vast majority of so-called Pinyin converters are best avoided entirely. The world does not need any more texts in the hobbled, poorly written mess that many people erroneously think of as Hanyu Pinyin; but it very much needs texts in real Hanyu Pinyin. So don’t waste your time with a program that doesn’t do a good job of word parsing, etc.

At present the clear front-runner for converting Chinese characters to Hanyu Pinyin texts (real Hanyu Pinyin texts) with a minimum need for user assistance is Key Chinese (Windows and Mac). The demo version is fully functional for 30 days. Key’s considerably less expensive “Hanzi To Pinyin With Tones Conversion Utility” for MS Word texts (also with a 30-day demo) would probably also work well, though I haven’t tried it myself.

Wenlin (Windows and Mac) is another excellent program that can produce properly spelled and word-parsed Hanyu Pinyin. But it requires users to run some disambiguation themselves, which can take a lot of time when you’re talking about something with as much text as a screenplay. Nonetheless, Wenlin’s incorporation of John DeFrancis’s ABC Chinese-English Comprehensive Dictionary makes it a helpful reference when performing post-conversion checks. Also, especially if one does not have Key, Wenlin — even the function-limited but non-expiring demo version — is useful for handling some adjustments (such as removing tone marks or providing a workaround when dealing with programs that don’t handle Chinese characters well).

You’ll also need a Unicode-friendly text editor with good support of regular expressions (to allow wildcard searches). I like EmEditor, which is Windows-based. But lots of other programs would work. One could even use MS Word if so inclined.

Finally, having subtitles in an additional language (usually but not necessarily English) is often desirable, not just for others who would use these subtitles but for yourself as you create the Pinyin subtitles. But often the subtitles one may find in Mandarin are not in synch with those in another language. Software can fix this problem. But I don’t have enough experience with this to recommend certain programs over others.

To sum up, the tools I recommend for creating Hanyu Pinyin subtitles are

  1. Key Chinese
  2. Wenlin
  3. EmEditor (or another Unicode-friendly text editor)
  4. a subtitle synchronizer

Actually, just the first one, Key, is sufficient to produce Pinyin subtitles. But in my experience using a combination of all four programs is preferable.

Now it’s time to get down to business.

The Main Steps

  1. acquire source-version subtitles
  2. synchronize subtitle files
  3. identify names of the movie’s characters (dramatis personae)
  4. perform initial conversion of subtitles in Chinese characters to Pinyin
  5. double check the results and perform necessary cleanup
  6. create additional version without tone marks
  7. share your work

1. Acquire subtitles for conversion and reference

At present the most useful site for finding Mandarin subtitles written in Chinese characters is probably Shooter. You may need to try searching for your desired title in both simplified and traditional characters. Also, be aware that movies — especially movies not filmed in Mandarin — often have different names in China, Taiwan, Hong Kong, etc.

You may find it useful to look for subtitles in other languages, too. Shooter can be useful for that, though you may have better luck finding English subtitles at Opensubtitles.org or similar English-language sites.

One can often find different subtitle files for the same movie, so you may wish to examine more than one for quality. Another thing that’s worth keeping in mind: Converting from traditional Chinese characters to simplified Chinese characters is less problematic than vice versa.

2. Synchronize subtitle files

Once you have the files, you should synchronize them with each other according to the directions for the particular program you are using.

If the program you’re using for this chokes on Chinese characters, though, you’ll need to take a couple of extra steps. First, convert the Chinese characters to Unicode numerical character references (NCRs) using either Pinyin Info’s NCR conversion tool or Wenlin (full or demo version). The reason for this is that even synchronizers that screw up “李慕白” should be able to handle the NCR equivalent: “&#26446;&#24917;&#30333;”.

In Wenlin,
Edit –> Make transformed copy –> Encode &#; [decimal]

Take the NCR text and synchronize the files. After you get this taken care of, reconvert to Chinese characters.

In Wenlin,
Edit –> Make transformed copy –> Decode &#;
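
If neither tool is at hand, the same round trip can be sketched in a few lines of Python. This is only an illustration of decimal NCR encoding and decoding, not how Wenlin or the Pinyin Info tool is implemented; the function names `encode_ncr` and `decode_ncr` are hypothetical.

```python
import re

def encode_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR such as &#26446;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def decode_ncr(text: str) -> str:
    """Turn decimal NCRs back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

encoded = encode_ncr("李慕白")
print(encoded)              # the ASCII-only form to feed to the synchronizer
print(decode_ncr(encoded))  # back to Chinese characters afterwards
```

ASCII text, including subtitle timestamps, passes through both functions unchanged.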

3. Identify names of the movie’s characters

You must teach your software which strings of Hanzi represent names. For example, it’s crucial for clarity that the character name “李慕白” is written “Lǐ Mùbái” rather than as “lǐ mù bái”. This part takes some time up front. But do not skip this step, because it is not only crucial but will save a lot of trouble in the long run.

Before doing this, however, people may want to refamiliarize themselves with Hanyu Pinyin’s rules for proper nouns (PDF). Note especially what is supposed to be capitalized and what isn’t.

The Mandarin version of Wikipedia is one resource that can be helpful in identifying the names of at least the main characters in the movie. But you’ll want to look for more names and forms than will be listed there. Keep in mind that characters aren’t always addressed by their full names. You need to look for other forms as well (e.g., in Crouching Tiger, Hidden Dragon Li Mubai is sometimes referred to as “Li Mubai” but other times as “Li ye” or simply as “Mubai”) and enter them.

English subtitles can be very useful for locating most proper nouns in the text. (Hooray for word parsing and capitalization of proper nouns!) The following search of an English subtitle file should help pinpoint the location of proper nouns.

find (with “Match Case” and “Use Regular Expressions” checked):
[^\.]\s[A-Z][a-z]

in MS Word, find (with “Use wildcards” checked):
[!\.] [A-Z][a-z]

Since you’ve already synchronized your subtitles, you’ll easily be able to find the corresponding point in the Mandarin subtitles by looking at the time the line appears.
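
The same search can be run as a small Python script over the lines of an English subtitle file. This is a rough sketch: the function name `find_name_candidates` and the sample lines are made up for illustration, and I have extended the pattern slightly (a capture group and `+`) so that it returns the whole capitalized word rather than just its first two letters. Like the original pattern, it only screens out capitals that follow a period, so expect some false positives after "?" and "!".

```python
import re

# A character other than ".", then whitespace, then a capitalized word.
CANDIDATE = re.compile(r"[^.]\s([A-Z][a-z]+)")

def find_name_candidates(lines):
    """Yield (line number, word) for capitalized words found mid-sentence."""
    for number, line in enumerate(lines, start=1):
        for match in CANDIDATE.finditer(line):
            yield number, match.group(1)

sample = [
    "Master Li is here.",
    "It is late. The gate is shut.",
    "I saw Jen in Peking.",
]
for number, word in find_name_candidates(sample):
    print(number, word)
```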

As you gather the names, or after you compile the full list, add your findings to the Pinyin converter’s user dictionary. In Key, perform Language –> Add Record, then fill in the Hanzi and Pinyin fields.

4. Perform initial conversion to Pinyin

OK, I know you’re eager to run the conversion and see all of those Hanzi turn into lovely Hanyu Pinyin. But there’s one quick step you need to do first. If you’re using Key Chinese, the program won’t make use of all of those character names you just painstakingly added to the user dictionary unless you first run “linguistic reconstruction” on the subtitles you wish to convert:
Language –> Linguistic Reconstruction

Now you’re ready for the big step:
Language –> Convert to Pinyin

5. Double check the results and perform necessary cleanup

Unfortunately, most Pinyin converters, even the best, tend to be lazy about inserting spaces in some of the places they belong, such as around numeric and alphabetic strings. For example, “自3月22日（星期一）起至5月31日（星期一）” will generally convert to something that looks like this:
“zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)”.
But it should look like this:
“zì 3 yuè 22 rì (Xīngqīyī) qǐ zhì 5 yuè 31 rì (Xīngqīyī)”.

To fix this in your Pinyin text, run the following regular expression in EmEditor. Make sure “Match Case” is not checked.
find:
([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]+)([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3

If you do this in Word, you’ll need to use the following instead in your wildcard search.
find:
([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]{1,})([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3
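
For those who prefer a script to an editor, here is the same search-and-replace as a minimal Python sketch. The function name `space_out_numbers` is hypothetical; the character class is the one from the EmEditor pattern, with `re.IGNORECASE` standing in for leaving “Match Case” unchecked.

```python
import re

# Letters that can appear in Pinyin, including vowels with tone marks.
PINYIN_LETTERS = "a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"
NUMBER_BETWEEN_LETTERS = re.compile(
    rf"([{PINYIN_LETTERS}])([0-9]+)([{PINYIN_LETTERS}])", re.IGNORECASE
)

def space_out_numbers(text: str) -> str:
    """Insert spaces around digit strings that are jammed between letters."""
    return NUMBER_BETWEEN_LETTERS.sub(r"\1 \2 \3", text)

print(space_out_numbers("zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)"))
```

As with the editor version, a number at the very start or end of a line, or next to punctuation, is left untouched because it has no letter on one side.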

The rest of the cleanup work usually involves simply reading through the text, looking for errors, perhaps while listening to the movie.

6. Create additional version without tone marks

If you have Key, this is very easy: Highlight the entire text, then
Format –> Strip Tone Marks.

And you’re done, though because Key keeps u-umlaut as such, if your television or other device doesn’t show the letter ü correctly you may wish to convert “ü” to “v”.

If you don’t have Key or access to another program that can do the same thing as easily, then use a combination of Wenlin (again, even the demo will do what you need) and a text editor. First, paste your Pinyin text into Wenlin. Then select all of the text and perform
Edit –> Make transformed copy… –> Replace tone marks with 1-4

Copy and paste the results into a new document in your text editor. Then run the following search-and-replace. Make certain the “Use Regular Expressions” or “Use Wildcards” box is checked.

find:
([A-Za-z])([1-4])

replace with:
\1
Then click “Replace All”.
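
A third route, for anyone comfortable with Python, is to strip the tone marks directly with the standard `unicodedata` module. This is a sketch of one possible approach, not what Key or Wenlin does internally; the function names and the `umlaut_to_v` option are hypothetical. It decomposes each letter, drops only the four tone diacritics, and recomposes, so ü survives unless you ask for “v”. A second helper mirrors the search-and-replace above for text that already has tone numbers.

```python
import re
import unicodedata

# Combining macron, acute, caron, and grave: the four Pinyin tone marks.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tone_marks(text: str, umlaut_to_v: bool = False) -> str:
    """Remove tone marks but keep the two dots of ü (or turn ü into v)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    result = unicodedata.normalize("NFC", kept)
    if umlaut_to_v:
        result = result.replace("ü", "v").replace("Ü", "V")
    return result

def strip_tone_digits(text: str) -> str:
    """Drop a tone number 1-4 that directly follows a letter."""
    return re.sub(r"([A-Za-z])([1-4])", r"\1", text)

print(strip_tone_marks("Lǐ Mùbái"))
print(strip_tone_digits("Li3 Mu4bai2"))
```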

7. Share your work

It’s much better if people can concentrate on producing new material rather than having to redo things others have already taken care of. So if you make a good Hanyu Pinyin version of something, please let me know.

Pinyin subtitles for Crouching Tiger, Hidden Dragon

Er, someone has created Hanyu Pinyin subtitles for the film Crouching Tiger, Hidden Dragon (Wòhǔcánglóng / 卧虎藏龙 / 臥虎藏龍). They’re in UTF-8 (Unicode) and come in two varieties: one with tone marks (link above), the other without. The latter would be useful primarily for those who have trouble getting diacritics to appear properly, such as many of those watching the movie through a TV hooked up to a DivX DVD player.

The set of subtitles also includes English and Mandarin in Chinese characters (both traditional and simplified versions).

The subtitles might seem to go by a bit quickly. But that’s generally because people don’t have much experience reading Hanyu Pinyin. (Also, the English subtitles leave out a lot. But the Pinyin ones are comprehensive.) Practice reading and you’ll get much faster at it.

Remember to use these only for good (e.g., practice reading Pinyin, Mandarin learning, helping those with problems reading Chinese characters) and not bad (e.g., piracy).

still from the movie, showing the subtitled text of Li Mubai saying 'Jianghu li wohucanglong'

Google Maps switches to Hanyu Pinyin for Taiwan (sloppily)

Until very recently, Google Maps gave street names in Taiwan in Tongyong Pinyin — most of the time, at least. This was the case even for Taipei, which most definitely has long used Hanyu Pinyin, not Tongyong Pinyin. The romanization on Google Maps was really a hodgepodge in the maps of Taiwan. And it’s still kind of a mess; but now it’s at least more consistent — and more consistent in Hanyu Pinyin.

First the good. In Google Maps:

  • Hanyu Pinyin, not Tongyong Pinyin, is now used for street names throughout Taiwan
  • Tone marks are indicated. (Previous maps with Tongyong did not indicate tones.)

Now the bad, and unfortunately there’s a lot of it and it’s very bad indeed:

  • The Hanyu Pinyin is given as Bro Ken Syl La Bles. (Terrible! Also, this is a new style for Google Maps. Street names in Tongyong were styled properly: e.g., Minsheng, not Min Sheng.)
  • The names of MRT stations remain incorrectly presented. For example, what is referred to in all MRT stations and on all MRT maps as “NTU Hospital” is instead referred to in broken Pinyin as “Tái Dà Yī Yuàn” (in proper Pinyin this would be Tái-Dà Yīyuàn); and “Xindian City Hall” (or “Office”, bleah) is marked as Xīn Diàn Shì Gōng Suǒ (in proper Pinyin: “Xīndiàn Shìgōngsuǒ” or perhaps “Xīndiàn Shì Gōngsuǒ”). Most but not all MRT stations were already presented this incorrect way (in Hanyu Pinyin rather than Tongyong) in Google Maps.
  • Errors in romanization point to sloppy conversions. For example, an MRT station in Banqiao is labeled Xīn Bù rather than Xīnpǔ. (埔 is one of those many Chinese characters with multiple Mandarin pronunciations.)
  • Tongyong Pinyin is still used in the names of most cities and townships (e.g., Banciao, not Banqiao).

Screenshot from earlier this evening, showing that Tongyong Pinyin is still being used in Google Maps for some city and district names (e.g., Gueishan, Sinjhuang, Banciao, Jhonghe, Sindian, and Jhongjheng rather than Hanyu Pinyin’s Guishan, Xinzhuang, Banqiao, Zhonghe, Xindian, and Zhongzheng, respectively).
map of Taipei area, with names as shown above

I don’t have any old screenshots of my own available at the moment, so for now I’ll refer you to an image that Fili used in an old post of his. Compare that with this screenshot I took a few minutes ago from Google Maps of the same section of Tainan:
screenshot from Google Maps of a section of Tainan

Note especially how the name of the junior high school is presented.

  • Previously “Jian Xing Junior High School”.
  • Now “Jiàn Xìng Jr High School”.

This is typical of how in old maps some things were labeled (poorly) in Hanyu Pinyin. (Words, not bro ken syl la bles, are the basis for Pinyin orthography. This is a big deal, not a minor error.) And now such places are still labeled poorly in Hanyu Pinyin, but with the addition of tone marks.

I’d like to return to the earlier point about sloppy conversions. Surprisingly, 成都路 is given as “Chéng Doū Road” rather than as “Chéngdū Road”.
screenshot from Google Maps of 'Cheng Dou [sic] Rd', near Taipei's Ximending
Although “Xinpu” might not be the sort of name to be contained in some romanization databases, there is nothing in the least obscure about Chengdu, the name of a city of some 11 million people. Google Translate certainly knows the right thing to do with 成都路:
screenshot from Google Translate, showing how Google will translate '成都路' as 'Chengdu Rd'

But Google Maps doesn’t get this simple point right, which likely points to outsourcing. Why would Google do this? And why wouldn’t it ensure that a better job was done? Because, really, so far the long-overdue conversion to Hanyu Pinyin in Google Maps for Taiwan is something of a botch.