How to create Hanyu Pinyin subtitles

Since posting about the Pinyin subtitles for Crouching Tiger, Hidden Dragon and The Story of Stuff I have received several messages inquiring about how someone might make Pinyin subtitles themselves. So I might as well put the answer online.

Although at the present stage of software implementation subtitle conversion isn’t as simple as pushing a button, the process is not particularly difficult, assuming you have a good source text to work from. But this does require some time and the right tools.

The Right Tools

The most important tool is, of course, the one that performs the conversion to Hanyu Pinyin. And it’s crucial to keep in mind that not all Pinyin converters are created equal; in fact, the vast majority of so-called Pinyin converters are best avoided entirely. The world does not need any more texts in the hobbled, poorly written mess that many people erroneously think of as Hanyu Pinyin; but it very much needs texts in real Hanyu Pinyin. So don’t waste your time with a program that doesn’t do a good job of word parsing, etc.

At present the clear front-runner for converting Chinese characters to Hanyu Pinyin texts (real Hanyu Pinyin texts) with a minimum need for user assistance is Key Chinese (Windows and Mac). The demo version is fully functional for 30 days. Key’s considerably less expensive “Hanzi To Pinyin With Tones Conversion Utility” for MS Word texts (also with a 30-day demo) would probably also work well, though I haven’t tried it myself.

Wenlin (Windows and Mac) is another excellent program that can produce properly spelled and word-parsed Hanyu Pinyin. But it requires users to run some disambiguation themselves, which can take a lot of time when you’re talking about something with as much text as a screenplay. Nonetheless, Wenlin’s incorporation of John DeFrancis’s ABC Chinese-English Comprehensive Dictionary makes it a helpful reference when performing post-conversion checks. Also, especially if one does not have Key, Wenlin — even the function-limited but non-expiring demo version — is useful for handling some adjustments (such as removing tone marks or providing a workaround when dealing with programs that don’t handle Chinese characters well).

You’ll also need a Unicode-friendly text editor with good support for regular expressions (to allow wildcard searches). I like EmEditor, which is Windows-based. But lots of other programs would work. One could even use MS Word if so inclined.

Finally, having subtitles in an additional language (usually but not necessarily English) is often desirable, not just for others who would use these subtitles but for yourself as you create the Pinyin subtitles. But often the subtitles one may find in Mandarin are not in sync with those in another language. Software can fix this problem, but I don’t have enough experience with it to recommend certain programs over others.

To sum up, the tools I recommend for creating Hanyu Pinyin subtitles are

  1. Key Chinese
  2. Wenlin
  3. EmEditor (or another Unicode-friendly text editor)
  4. a subtitle synchronizer

Actually, just the first one, Key, is sufficient to produce Pinyin subtitles. But in my experience using a combination of all four programs is preferable.

Now it’s time to get down to business.

The Main Steps

  1. acquire source-version subtitles
  2. synchronize subtitle files
  3. identify names of the movie’s characters (dramatis personae)
  4. perform initial conversion of subtitles in Chinese characters to Pinyin
  5. double check the results and perform necessary cleanup
  6. create additional version without tone marks
  7. share your work

1. Acquire subtitles for conversion and reference

At present the most useful site for finding Mandarin subtitles written in Chinese characters is probably Shooter. You may need to try searching for your desired title in both simplified and traditional characters. Also, be aware that movies — especially movies not filmed in Mandarin — often have different names in China, Taiwan, Hong Kong, etc.

You may find it useful to look for subtitles in other languages, too. Shooter can be useful for that, though you may have better luck finding English subtitles at Opensubtitles.org or similar English-language sites.

One can often find different subtitle files for the same movie, so you may wish to examine more than one for quality. Another thing that’s worth keeping in mind: Converting from traditional Chinese characters to simplified Chinese characters is less problematic than vice versa.

2. Synchronize subtitle files

Once you have the files, you should synchronize them with each other according to the directions for the particular program you are using.

If the program you’re using for this chokes on Chinese characters, though, you’ll need to take a couple of extra steps. First, convert the Chinese characters to Unicode numeric character references (NCRs) using either Pinyin Info’s NCR conversion tool or Wenlin (full or demo version). The reason for this is that even synchronizers that screw up “李慕白” should be able to handle the NCR equivalent: “&#26446;&#24917;&#30333;”.

In Wenlin,
Edit –> Make transformed copy –> Encode &#; [decimal]

Take the NCR text and synchronize the files. After you get this taken care of, reconvert to Chinese characters.

In Wenlin,
Edit –> Make transformed copy –> Decode &#;
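If you would rather script the round trip than use a conversion tool, the following is a minimal Python sketch of the same idea. The function names are my own, not part of Wenlin or any other program mentioned here, and the sketch assumes decimal NCRs only.

```python
import re

def encode_ncr(text):
    """Replace every non-ASCII character with a decimal NCR such as &#26446;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def decode_ncr(text):
    """Turn decimal NCRs back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

name = "李慕白"
encoded = encode_ncr(name)
print(encoded)                      # &#26446;&#24917;&#30333;
print(decode_ncr(encoded) == name)  # True
```

ASCII text, including the subtitle timing lines, passes through unchanged, so the encoded file should still be readable by the synchronizer.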

3. Identify names of the movie’s characters

You must teach your software which strings of Hanzi represent names. For example, it’s crucial for clarity that the character name “李慕白” is written “Lǐ Mùbái” rather than as “lǐ mù bái”. This part takes some time up front. But do not skip this step, because it is not only crucial but will save a lot of trouble in the long run.

Before doing this, however, people may want to refamiliarize themselves with Hanyu Pinyin’s rules for proper nouns (PDF). Note especially what is supposed to be capitalized and what isn’t.

The Mandarin version of Wikipedia is one resource that can be helpful in identifying the names of at least the main characters in the movie. But you’ll want to look for more names and forms than will be listed there. Keep in mind that characters aren’t always addressed by their full names. You need to look for other forms as well (e.g., in Crouching Tiger, Hidden Dragon Li Mubai is sometimes referred to as “Li Mubai” but other times as “Li ye” or simply as “Mubai”) and enter them.

English subtitles can be very useful for locating most proper nouns in the text. (Hooray for word parsing and capitalization of proper nouns!) The following search of an English subtitle file should help pinpoint the location of proper nouns.

find (with “Match Case” and “Use Regular Expressions” checked):
[^\.]\s[A-Z][a-z]

in MS Word, find (with “Use wildcards” checked):
[!\.] [A-Z][a-z]

Since you’ve already synchronized your subtitles, you’ll easily be able to find the corresponding point in the Mandarin subtitles by looking at the time the line appears.
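The same search can be scripted so that each hit is listed alongside the time at which the line appears. The Python sketch below is only an illustration. It assumes the English subtitles are in SRT format (the timing pattern would need changing for other formats), the function name is hypothetical, and the pattern is widened slightly from the one above so that words following “?” or “!” are skipped as well.

```python
import re

# A capitalized word preceded by whitespace, where the character before
# that whitespace does not end a sentence. The lookbehind lets two
# capitalized words in a row (e.g. "Li Mubai") both be reported.
CANDIDATE = re.compile(r"(?<=[^.?!\s])\s+([A-Z][a-z]+)")
# Start time of an SRT timing line (assumed format).
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}[,.]\d{3}) -->")

def proper_noun_candidates(srt_text):
    """Return (start_time, word) pairs for likely proper nouns."""
    hits = []
    start = None
    for line in srt_text.splitlines():
        timing = TIMING.match(line)
        if timing:
            start = timing.group(1)
            continue
        for match in CANDIDATE.finditer(line):
            hits.append((start, match.group(1)))
    return hits

sample = """1
00:01:02,000 --> 00:01:04,500
Master Li is here!

2
00:01:05,000 --> 00:01:07,000
It is good to see Mubai again.
"""
for start, word in proper_noun_candidates(sample):
    print(start, word)
```

As with the editor search, a name that opens a line is not caught, so the list is a starting point rather than a complete inventory.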

As you gather the names, or after you compile the full list, add your findings to the Pinyin converter’s user dictionary. In Key, perform Language –> Add Record, then fill in the Hanzi and Pinyin fields.

4. Perform initial conversion to Pinyin

OK, I know you’re eager to run the conversion and see all of those Hanzi turn into lovely Hanyu Pinyin. But there’s one quick step you need to do first. If you’re using Key Chinese, the program won’t make use of all of those character names you just painstakingly added to the user dictionary unless you first run “linguistic reconstruction” on the subtitles you wish to convert:
Language –> Linguistic Reconstruction

Now you’re ready for the big step:
Language –> Convert to Pinyin

5. Double check the results and perform necessary cleanup

Unfortunately, most Pinyin converters — even the best — tend to be lazy about inserting spaces in some of the places they belong, such as around numeric and alphabetic strings. For example, “自3月22日(星期一)起至5月31日(星期一)” will generally convert to something that looks like this:
“zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)”.
But it should look like this:
“zì 3 yuè 22 rì (Xīngqīyī) qǐ zhì 5 yuè 31 rì (Xīngqīyī)”.

To fix this in your Pinyin text, run the following regular expression in EmEditor. Make sure “Match Case” is not checked.
find:
([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]+)([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3

If you do this in Word, you’ll need to use the following instead in your wildcard search.
find:
([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]{1,})([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3
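For those who prefer a script to an editor, here is a rough Python equivalent of the search-and-replace above. It is a sketch only: the function name is my own, and it lists upper- and lowercase letters explicitly (as in the Word version) rather than relying on a case-insensitive flag.

```python
import re
import unicodedata

# Pinyin letters in both cases, matching the character classes above.
# NFC normalization keeps each accented letter as a single code point.
LETTERS = unicodedata.normalize(
    "NFC",
    "A-Za-z"
    "ĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ"
    "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ",
)
PATTERN = re.compile(f"([{LETTERS}])([0-9]+)([{LETTERS}])")

def space_out_numbers(text):
    """Insert spaces around digit strings wedged between Pinyin letters."""
    text = unicodedata.normalize("NFC", text)
    return PATTERN.sub(r"\1 \2 \3", text)

before = "zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)"
print(space_out_numbers(before))
```

Like the editor version, this touches only numbers that have a letter on both sides, so a number at the very start or end of a line is left alone.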

The rest of the cleanup work usually involves you simply reading through the text, looking for errors, perhaps while listening to the movie.

6. Create additional version without tone marks

If you have Key, this is very easy: Highlight the entire text, then
Format –> Strip Tone Marks.

And you’re done, though because Key keeps u-umlaut as such, if your television or other device doesn’t show the letter ü correctly you may wish to convert “ü” to “v”.
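The same result can also be approximated with a short script. The following Python sketch is not how Key does it internally; it is simply one way to drop the four tone marks while keeping ü, with an optional switch (a name I made up) for turning “ü” into “v”.

```python
import unicodedata

# Combining macron, acute, caron, and grave: the four Pinyin tone marks.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tone_marks(text, u_umlaut_as_v=False):
    """Remove Pinyin tone marks; keep the umlaut on ü unless asked otherwise."""
    # Decompose so each tone mark becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # Recompose so u + diaeresis becomes a single ü again.
    result = unicodedata.normalize("NFC", stripped)
    if u_umlaut_as_v:
        result = result.replace("\u00fc", "v").replace("\u00dc", "V")
    return result

print(strip_tone_marks("Lǐ Mùbái"))                  # Li Mubai
print(strip_tone_marks("nǚ", u_umlaut_as_v=True))    # nv
```

Note that this would also strip acute and grave accents from any non-Pinyin words in the file (French names, for instance), so check the English or other-language lines if they share a file with the Pinyin.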

If you don’t have Key or access to another program that can do the same thing as easily, then use a combination of Wenlin (again, even the demo will do what you need) and a text editor. First, paste your Pinyin text into Wenlin. Then select all of the text and perform
Edit –> Make transformed copy… –> Replace tone marks with 1-4

Copy and paste the results into a new document in your text editor. Then run the following search-and-replace. Make certain the “Use Regular Expressions” or “Use Wildcards” box is checked.

find:
([A-Za-z])([1-4])

replace with:
\1
Then click “Replace All”.
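For anyone working outside an editor, the search-and-replace above translates to Python roughly as follows. This is a sketch with a hypothetical function name. Adding “ü” to the letter class is my own assumption, in case the numbered text keeps that letter; the pattern given above covers only A-Z and a-z.

```python
import re

# A tone number (1-4) directly after a letter. The ü/Ü in the class is an
# assumption beyond the pattern given above.
TONE_NUMBER = re.compile(r"([A-Za-z\u00fc\u00dc])([1-4])")

def strip_tone_numbers(text):
    """Drop tone digits that directly follow a letter."""
    return TONE_NUMBER.sub(r"\1", text)

print(strip_tone_numbers("Li3 Mu4bai2"))        # Li Mubai
print(strip_tone_numbers("zi4 3 yue4 22 ri4"))  # zi 3 yue 22 ri
```

Numbers set off by spaces survive, which is one more reason to fix the spacing around numerals before stripping tones.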

What this looks like in EmEditor:
image showing the search-and-replace dialog box for the above

7. Share your work

It’s much better if people can concentrate on producing new material rather than having to redo things others have already taken care of. So if you make a good Hanyu Pinyin version of something, please let me know.

Pinyin subtitles for ‘The Story of Stuff’

screenshot from the video, showing Pinyin subtitles: Shìde, shìde, shìde: wǒmen quándōu yào huíshōu, kěxī guāng kào huíshōu hái bùgòu.

The Story of Stuff is a 20-minute video on the costs and absurdities of having a culture wrapped up in unchecked consumerism. It gained especially wide attention after the New York Times published a front-page article about it. A related book was released earlier this month.

The entire video can be downloaded freely in high- and low-resolution versions. And now there’s a collection of subtitles of possible interest to many readers of Pinyin News.

The zip file contains seven subtitle files:

  • Hanyu Pinyin with tone marks
  • Hanyu Pinyin without tone marks
  • traditional Chinese characters (Unicode)
  • traditional Chinese characters (Big5)
  • simplified Chinese characters (Unicode)
  • simplified Chinese characters (GB)
  • English

The star of the video, author Annie Leonard, has a lot to get through in just 20 or so minutes, so many people may find it easier, at least at first, to read the Pinyin subtitles that do not include tone marks.

Pinyin subtitles for Crouching Tiger, Hidden Dragon

Er, someone has created Hanyu Pinyin subtitles for the film Crouching Tiger, Hidden Dragon (Wòhǔcánglóng / 臥虎藏龍 / 卧虎藏龙). They’re in UTF-8 (Unicode) and come in two varieties: one with tone marks (link above), the other without. The latter would be useful primarily for those who have trouble getting diacritics to appear properly, such as many of those watching the movie through a TV hooked up to a DivX DVD player.

The set of subtitles also includes English and Mandarin in Chinese characters (both traditional and simplified versions).

The subtitles might seem to go by a bit quickly. But that’s generally because people don’t have much experience reading Hanyu Pinyin. (Also, the English subtitles leave out a lot. But the Pinyin ones are comprehensive.) Practice reading and you’ll get much faster at it.

Remember to use these only for good (e.g., practice reading Pinyin, Mandarin learning, helping those with problems reading Chinese characters) and not bad (e.g., piracy).

still from the movie, showing the subtitled text of Li Mubai saying 'Jianghu li wohucanglong'

Recent milestones for Sino-Platonic Papers

The Web site for Sino-Platonic Papers, Professor Victor Mair’s iconoclastic journal, has expanded to the point that, as of the most recent batch of reissues, it offers more than half of the journal’s 198 (and counting) issues in full and for free. So if you haven’t visited that site recently you might want to have another look.

I’ll mention just a few of the recent additions:

Other recent milestones for SPP include

Below: A chart from SPP 198, Aramaic Script Derivatives in Central Eurasia, by Doug Hitch.
chart of scripts derived from Aramaic. See SPP 198 (the link for this image) for a version of this chart with machine-readable text.

How to learn real Mandarin: an anecdote

The following is a guest post by Professor Victor H. Mair of the University of Pennsylvania’s Department of East Asian Languages and Civilizations.

The personal names used in the original correspondence have been changed to generational designations.

In contrast to the Hànzì-centric pedagogical approach, which forces little children to memorize extremely difficult and complicated characters like 老鼠 and 蝴蝶 instead of teaching them lǎoshǔ and húdié, the news I received today is more hopeful and sane.

A friend of mine is teaching her grandson Mandarin. The way she is doing it is to write out the Xī yóu jì (Journey to the West) in a simple báihuà paraphrase using Pinyin only (with glosses in English for new vocabulary). My friend is a first-generation immigrant to America, and her daughter married a German who was studying in the United States, so that makes the grandson third-generation Chinese-American/German.

The other day, the grandson asked his mom out of the blue: “What’s the difference between shíjiān, shídài, and shífèn?” My friend, the grandmother, explained to me that all of these terms were in the Pinyin text that she had prepared for her grandson, and that she had glossed them as “time” or “period.” She said that the boy’s mother was very pleased, and she was tickled too, because the boy had discerned the common element shí by himself. As my friend (the grandmother) put it, “He spends very little time on Chinese, so we were pleasantly surprised.”

Hearing this account from my friend, I wrote to her: “Thank you so much for the TRULY WONDERFUL story you wrote about your grandson. This is how to learn real Chinese!!!! And you are being a real Chinese teacher to teach your grandson this way. And I’m also happy that your daughter appreciates what you and her son are doing together. Tell your grandson I’m really impressed at the intelligence of his question.”

Hoklo dictionaries: a list

The newly redesigned Tailingua has just issued a useful list of dictionaries of the Taiwanese language and related dialects (PDF).

Here’s a random sample:

  • Dyer, Samuel 萊撒母耳 (1838). A Vocabulary of the Hok-keen Dialect as Spoken in the County of Tsheang-Tshew [漳州音字典]. Malacca: Anglo-Chinese College Press.
  • Embree, Bernard L.M. 晏寶理 (1973). A Dictionary of Southern Min [閩南語英語辭典]. Kowloon: Hong Kong Language Institute.
  • Fùxīng wénhuà shìyèshè 復興文化事業社 (2004). Táiwān mǔyǔ yīnbiāo zìdiǎn 臺灣母語音標字典 [Taiwanese mother tongue pronunciation dictionary]. Táinán: Fùxīng Wénhuà Shìyèshè 復興文化事業社.
  • Hare, G.T. (1904). The Hokkien Vernacular [福建白話英文字典]. Kuala Lumpur: Straits Settlements and Selangor Government Printing Offices.
  • Hóng Guóliáng 洪國良 (2004). Héluòyǔ yīnzì duìzhào diǎn 河洛語音字對照典 [Comparative dictionary of Ho-lo pronunciation]. Gāoxióng: Fùwén 復文.
  • Hóng Hóngyuán 洪宏元 (2009). Xuéshēng Tái–Huá shuāngyǔ huóyòng cídiǎn 學生台華雙語活用辭典 [Bilingual everyday Taiwanese–Mandarin dictionary for students]. Táiběi: Wǔ Nán Túshū Chūbǎn Yǒuxiàn Gōngsī 五南圖書出版有限公司.
  • Hú Xīnlín 胡鑫麟 (1994). Shíyòng Táiyǔ xiǎo cídiǎn 實用臺語小辭典 [Practical pocket Taiwanese dictionary]. Táiběi: Zìlì Wǎnbào Chūbǎnbù 自立晚報出版部.

China and U.S. study abroad programs: update

In one of my posts about a year ago, China and U.S. study abroad programs (Pinyin News, Nov. 23, 2008), I noted that China had become the fifth most popular destination for U.S. students in study abroad programs.

More recent data show that China has remained in fifth place. In fact, the order in the top ten list has not changed, though the figures for each of the countries have increased.

Top 10 destinations for study abroad by U.S. students in the 2006-07 and 2007-08 school years
China shown as the fifth most popular destination for study abroad. The top destination is the U.K., followed by Italy, Spain, and France. See the link to my source material for the actual numbers.

Growth for China as a destination remained strong, at 19.0 percent, while study abroad as a whole increased 8.5 percent. Top growth, however, belonged to India, followed by Austria, then China, and Ireland. If China continues to grow at such rates as a destination, it could knock France out of fourth place in a few years, which would be a dramatic development.

10 highest growth rates for destinations for study abroad by U.S. students (comparing the 2007-08 school year with the 2006-07 school year)

China now accounts for 5 percent of U.S. study abroad, which has helped Asia’s overall growth as a destination region.

Percent of study abroad performed in Asia, 1996-2007
chart showing percentage of study abroad in Asia flat at about 6% from 1996-2000, with growth increasing since 2003 to the present 11.1% for the 2007-08 school year

Some predictions for the next installment:

  • Economic woes are probably going to reduce the rate of study abroad, though that may benefit China, relatively speaking, as students opt for it over more expensive destinations like the U.K. and France.
  • Terrorism could affect India’s numbers, though I expect them to continue to increase dramatically over the long term.
  • And should China revalue its currency, that could slow its growth as a destination for U.S. students.

source: Open Doors Report 2009