Combining Pinyin and Chinese character subtitles

With any luck, this will be the last post for some time in my none too exciting but hopefully useful series on technical aspects of creating Pinyin subtitles.

Some people like to have Pinyin subtitles and Hanzi subtitles appear at the same time. Although I think that’s generally a bad idea (too much text to get through quickly that way, people would benefit from becoming accustomed to reading Pinyin texts as Pinyin texts, etc.), I’ll go ahead and offer instructions on how to make Pinyin subtitles appear above Chinese character subtitles.

These directions are for Microsoft Word, though other programs could be used instead.

Using Word, open copies of the two subtitle files you’d like to combine.

To get the alignment between the two files to match when they’re combined, it’s important that each subtitle entry is only one line long. You can check for possible instances of multi-line subtitles with a wildcard search (CTRL+H –> More –> Use wildcards).

Find what (with “Use wildcards” checked):
([!0-9])^13([!0-9^13])

If that search finds any multi-line subtitles, you’ll need to temporarily adjust those lines in both subtitle files, as follows:

Find what (with “Use wildcards” checked):
([!0-9])^13([!0-9^13])

Replace with:
\1|\2

Again, be sure to run that search-and-replace in both subtitle files. You’ll replace the “|” with a RETURN later.
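For anyone who would rather script this step than run it in Word, here is a rough Python sketch of the same idea. It assumes a well-formed SRT file in which every entry has an index line, a timing line, and then one or more text lines; the function name join_multiline_entries is my own invention for illustration.

```python
import re


def join_multiline_entries(srt_text, sep="|"):
    """Join the text lines of each subtitle entry into one line, separated by sep."""
    text = srt_text.replace("\r\n", "\n").strip()
    blocks = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        # lines[0] is the entry number, lines[1] the timing line;
        # everything after that is subtitle text.
        head, body = lines[:2], lines[2:]
        blocks.append("\n".join(head + [sep.join(body)]))
    return "\n\n".join(blocks) + "\n"
```

As with the Word version, the "|" is only a temporary placeholder, so pick a different separator if the subtitles themselves happen to contain that character.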

Next, in the file with the Chinese characters (not the Pinyin file) strip out everything except for the text of the subtitles, leaving just the Hanzi text. (I wrote about this earlier in How to strip subtitle files down to text. The method is also useful for removing such information if you want to create the text of the screenplay.)

Find what (with “Use wildcards” checked):
^13[0-9:\,\-\> ]{1,}^13

Replace with:
^p

Note: You may need to run the above “replace all” twice for Word to catch everything.

You should have something that looks like this (with paragraph marks shown):


喲! 李爺來啦¶

李爺來啦¶

秀蓮¶

秀蓮¶

秀蓮,李慕白來啦¶

Now add extra lines, so the lines with Chinese characters will fit into the new document in the correct places.

Find what (with “Use wildcards” checked):
^13^13

Replace with:
^p^p^p^p^p

Delete the very first line (the one with the “1” in it). Then add three blank lines above this.

You should have something that looks like this (with paragraph marks shown):




喲! 李爺來啦¶




李爺來啦¶




秀蓮¶

Select all (CTRL+A). Then convert this to a table:
Table –> Convert –> Text to Table

Now switch to the Pinyin subtitles file.

First, add the extra blank lines into which you will later insert the Chinese characters that correspond with the Pinyin.

Find what (with “Use wildcards” checked):
^13^13

Replace with:
^p^p^p

Convert the Pinyin subtitles to a table:
CTRL+A
Table –> Convert –> Text to Table

Switch back to the Chinese character file. Copy the table there and paste it to the right of the table with the Pinyin text.

You should have something that looks like this:

1
00:00:49,000 --> 00:00:51,500
Yō! Lǐ yé lái la
  喲! 李爺來啦

2
00:00:52,200 --> 00:00:53,600
Lǐ yé lái la
  李爺來啦

3
00:01:06,900 --> 00:01:08,400
Xiùlián
  秀蓮

4
00:01:09,000 --> 00:01:10,400
Xiùlián
  秀蓮

Next, change this back into text:
Table –> Convert –> Table to Text

Remove the tabs:
Find what:
^t

Replace with:
[leave blank]

If you combined any lines earlier, break them apart now:
Find what:
|

Replace with:
^p

Your document should now look like this:

1
00:00:49,000 --> 00:00:51,500
Yō! Lǐ yé lái la
喲! 李爺來啦

2
00:00:52,200 --> 00:00:53,600
Lǐ yé lái la
李爺來啦

3
00:01:06,900 --> 00:01:08,400
Xiùlián
秀蓮

4
00:01:09,000 --> 00:01:10,400
Xiùlián
秀蓮

Save the file as plain text (*.txt), not as a Word document (*.doc). Then later rename this to give it the correct file extension (probably *.srt).
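The whole procedure can also be sketched in a few lines of Python for those who prefer a script to Word. This is only a sketch under some assumptions: both files are well-formed SRT, they contain the same number of entries in the same order, and the timing of the Pinyin file is the one to keep. The names parse_srt and combine are hypothetical, not part of any subtitle library.

```python
import re


def parse_srt(text):
    """Split SRT text into a list of (index, timing, text_lines) tuples."""
    text = text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    if not text:
        return []
    entries = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        entries.append((lines[0], lines[1], lines[2:]))
    return entries


def combine(pinyin_srt, hanzi_srt):
    """Put each entry's Pinyin text above the matching Hanzi text."""
    pinyin, hanzi = parse_srt(pinyin_srt), parse_srt(hanzi_srt)
    if len(pinyin) != len(hanzi):
        raise ValueError("the two files have different numbers of entries")
    blocks = []
    for (index, timing, p_lines), (_, _, h_lines) in zip(pinyin, hanzi):
        # Keep the Pinyin file's numbering and timing; append the Hanzi lines.
        blocks.append("\n".join([index, timing, *p_lines, *h_lines]))
    return "\n\n".join(blocks) + "\n"
```

Because the entries are paired one by one rather than line by line, multi-line subtitles need no "|" workaround here. Write the result out as UTF-8 plain text with the appropriate extension, just as with the Word route.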

How to strip subtitle files down to text

Subtitle files are wonderful things. But for those times when you want to just read the text by itself and not bother with the movie (for example, if you want to prepare a script), they can look a little cluttered — what with all of that extra timing information.

1
00:00:49,000 --> 00:00:51,500
Yo! Li ye lai la

2
00:00:52,200 --> 00:00:53,600
Li ye lai la

3
00:01:06,900 --> 00:01:08,400
Xiulian

The directions below for how to remove all of the extra numbers, etc., refer to Microsoft Word, since most people already have that tool.

To strip out everything except for the text of the subtitles, run the following wildcard search (CTRL+H –> More –> Use wildcards).

Find what:
^13[0-9:\,\-\> ]{1,}^13

Replace with:
^p

Replace all.

Note: You may need to run the above “replace all” twice. Also, unless you add an extra return at the top of the document you’ll need to clean up the first entry by hand.

The above search-and-replace will yield

Yo! Li ye lai la

Li ye lai la

Xiulian
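If Word isn't handy, much the same stripping can be done with a short Python script. The following is a minimal sketch, assuming standard SRT timing lines of the form "00:00:49,000 --> 00:00:51,500"; the function name subtitle_text_only is made up for this example. Like the wildcard search, it will also throw away a subtitle line that consists of nothing but digits.

```python
import re

# A timing line such as "00:00:49,000 --> 00:00:51,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def subtitle_text_only(srt_text):
    """Drop entry numbers and timing lines, keeping only the subtitle text."""
    kept = []
    for line in srt_text.replace("\r\n", "\n").split("\n"):
        stripped = line.strip()
        if stripped.isdigit() or TIMING.match(stripped):
            continue
        kept.append(stripped)
    # Collapse any run of blank lines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return text.strip() + "\n"
```

Unlike the Word search, this needs no second pass and no hand cleanup of the first entry, since it works line by line instead of matching paragraph marks.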

If, however, you want to at least temporarily keep the basic timing information (such as to help you identify scene boundaries more quickly), you can do so as follows.

Find what (wildcards):
^13[0-9]{1,}^13([0-9\:]{1,})([0-9\:\-\> \,]{1,})^13

Replace with:
^p\1^p

Again, unless you add an extra return at the top of the document you’ll need to clean up the first entry by hand.

This will result in the document looking like this:

00:00:49
Yo! Li ye lai la

00:00:52
Li ye lai la

00:01:06
Xiulian

Once you’re through with the timing information, you can strip it out using the first search-and-replace above.
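The timing-preserving variant can be sketched in Python as well. This assumes each entry starts with a number on its own line followed by a standard SRT timing line; the names ENTRY_HEAD and keep_start_times are hypothetical.

```python
import re

# Entry number on one line, then the timing line; capture only hh:mm:ss of the start.
ENTRY_HEAD = re.compile(
    r"^\d+\n(\d{2}:\d{2}:\d{2}),\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$",
    re.MULTILINE,
)


def keep_start_times(srt_text):
    """Replace each entry's number and timing line with just its start time."""
    return ENTRY_HEAD.sub(r"\1", srt_text.replace("\r\n", "\n"))
```

Because the pattern is anchored to line starts rather than to a preceding paragraph mark, the first entry gets converted along with the rest.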

How to create Hanyu Pinyin subtitles

Since posting about the Pinyin subtitles for Crouching Tiger, Hidden Dragon and The Story of Stuff I have received several messages inquiring about how someone might make Pinyin subtitles themselves. So I might as well put the answer online.

Although at the present stage of software implementation subtitle conversion isn’t as simple as pushing a button, the process is not particularly difficult, assuming you have a good source text to work from. But this does require some time and the right tools.

The Right Tools

The most important tool is, of course, the one that performs the conversion to Hanyu Pinyin. And it’s crucial to keep in mind that not all Pinyin converters are created equal; in fact, the vast majority of so-called Pinyin converters are best avoided entirely. The world does not need any more texts in the hobbled, poorly written mess that many people erroneously think of as Hanyu Pinyin; but it very much needs texts in real Hanyu Pinyin. So don’t waste your time with a program that doesn’t do a good job of word parsing, etc.

At present the clear front-runner for converting Chinese characters to Hanyu Pinyin texts (real Hanyu Pinyin texts) with a minimum need for user assistance is Key Chinese (Windows and Mac). The demo version is fully functional for 30 days. Key’s considerably less expensive “Hanzi To Pinyin With Tones Conversion Utility” for MS Word texts (also with a 30-day demo) would probably also work well, though I haven’t tried it myself.

Wenlin (Windows and Mac) is another excellent program that can produce properly spelled and word-parsed Hanyu Pinyin. But it requires users to run some disambiguation themselves, which can take a lot of time when you’re talking about something with as much text as a screenplay. Nonetheless, Wenlin’s incorporation of John DeFrancis’s ABC Chinese-English Comprehensive Dictionary makes it a helpful reference when performing post-conversion checks. Also, especially if one does not have Key, Wenlin — even the function-limited but non-expiring demo version — is useful for handling some adjustments (such as removing tone marks or providing a workaround when dealing with programs that don’t handle Chinese characters well).

You’ll also need a Unicode-friendly text editor with good support for regular expressions (to allow wildcard searches). I like EmEditor, which is Windows based. But lots of other programs would work. One could even use MS Word if so inclined.

Finally, having subtitles in an additional language (usually but not necessarily English) is often desirable, not just for others who would use these subtitles but for yourself as you create the Pinyin subtitles. But often the subtitles one may find in Mandarin are not in synch with those in another language. Software can fix this problem. But I don’t have enough experience with this to recommend certain programs over others.

To sum up, the tools I recommend for creating Hanyu Pinyin subtitles are

  1. Key Chinese
  2. Wenlin
  3. EmEditor (or another Unicode-friendly text editor)
  4. a subtitle synchronizer

Actually, just the first one, Key, is sufficient to produce Pinyin subtitles. But in my experience using a combination of all four programs is preferable.

Now it’s time to get down to business.

The Main Steps

  1. acquire source-version subtitles
  2. synchronize subtitle files
  3. identify names of the movie’s characters (dramatis personae)
  4. perform initial conversion of subtitles in Chinese characters to Pinyin
  5. double check the results and perform necessary cleanup
  6. create additional version without tone marks
  7. share your work

1. Acquire subtitles for conversion and reference

At present the most useful site for finding Mandarin subtitles written in Chinese characters is probably Shooter. You may need to try searching for your desired title in both simplified and traditional characters. Also, be aware that movies — especially movies not filmed in Mandarin — often have different names in China, Taiwan, Hong Kong, etc.

You may find it useful to look for subtitles in other languages, too. Shooter can be useful for that, though you may have better luck finding English subtitles at Opensubtitles.org or similar English-language sites.

One can often find different subtitle files for the same movie, so you may wish to examine more than one for quality. Another thing that’s worth keeping in mind: Converting from traditional Chinese characters to simplified Chinese characters is less problematic than vice versa.

2. Synchronize subtitle files

Once you have the files, you should synchronize them with each other according to the directions for the particular program you are using.

If the program you’re using for this chokes on Chinese characters, though, you’ll need to take a couple of extra steps. First, convert the Chinese characters to Unicode numeric character references (NCRs) using either Pinyin Info’s NCR conversion tool or Wenlin (full or demo version). The reason for this is that even synchronizers that screw up “李慕白” should be able to handle the NCR equivalent: “&#26446;&#24917;&#30333;”.

In Wenlin,
Edit –> Make transformed copy –> Encode &#; [decimal]

Take the NCR text and synchronize the files. After you get this taken care of, reconvert to Chinese characters.

In Wenlin,
Edit –> Make transformed copy –> Decode &#;
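If neither tool is at hand, the round trip is simple enough to sketch in Python. This is only an illustration of the idea, not a description of how Wenlin or the Pinyin Info tool work internally, and the function names encode_ncr and decode_ncr are my own.

```python
import re


def encode_ncr(text):
    """Replace every non-ASCII character with a decimal NCR such as &#26446;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def decode_ncr(text):
    """Turn decimal NCRs back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)
```

Entry numbers, timing lines, and other ASCII text pass through unchanged, so the encoded file is still a valid subtitle file as far as the synchronizer is concerned.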

3. Identify names of the movie’s characters

You must teach your software which strings of Hanzi represent names. For example, it’s crucial for clarity that the character name “李慕白” is written “Lǐ Mùbái” rather than as “lǐ mù bái”. This part takes some time up front. But do not skip this step, because it is not only crucial but will save a lot of trouble in the long run.

Before doing this, however, people may want to refamiliarize themselves with Hanyu Pinyin’s rules for proper nouns (PDF). Note especially what is supposed to be capitalized and what isn’t.

The Mandarin version of Wikipedia is one resource that can be helpful in identifying the names of at least the main characters in the movie. But you’ll want to look for more names and forms than will be listed there. Keep in mind that characters aren’t always addressed by their full names. You need to look for other forms as well (e.g., in Crouching Tiger, Hidden Dragon Li Mubai is sometimes referred to as “Li Mubai” but other times as “Li ye” or simply as “Mubai”) and enter them.

English subtitles can be very useful for locating most proper nouns in the text. (Hooray for word parsing and capitalization of proper nouns!) The following search of an English subtitle file should help pinpoint the location of proper nouns.

find (with “Match Case” and “Use Regular Expressions” checked):
[^\.]\s[A-Z][a-z]

in MS Word, find (with “Use wildcards” checked):
[!\.] [A-Z][a-z]

Since you’ve already synchronized your subtitles, you’ll easily be able to find the corresponding point in the Mandarin subtitles by looking at the time the line appears.
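To get a quick overview rather than stepping through matches one at a time, the same kind of search can be turned into a tally. The sketch below is a loose Python adaptation of the searches above, with two changes of my own: it uses a lookbehind so that adjacent capitalized words are all counted, and it also skips words that follow "!" or "?". The name candidate_names is hypothetical.

```python
import re
from collections import Counter

# A capitalized word preceded by a space that does not follow ., ! or ?
# (that is, a capitalized word that is probably not just starting a sentence).
CANDIDATE = re.compile(r"(?<=[^.!?] )([A-Z][a-z]+)")


def candidate_names(english_srt_text):
    """Count capitalized words that appear in mid-sentence position."""
    return Counter(m.group(1) for m in CANDIDATE.finditer(english_srt_text))
```

Calling candidate_names(text).most_common() lists the likeliest names first. Names that only ever occur at the start of a line or sentence will be missed, so treat the tally as a starting point, not a complete list.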

As you gather the names, or after you compile the full list, add your findings to the Pinyin converter’s user dictionary. In Key, perform Language –> Add Record, then fill in the Hanzi and Pinyin fields.

4. Perform initial conversion to Pinyin

OK, I know you’re eager to run the conversion and see all of those Hanzi turn into lovely Hanyu Pinyin. But there’s one quick step you need to do first. If you’re using Key Chinese, the program won’t make use of all of those character names you just painstakingly added to the user dictionary unless you first run “linguistic reconstruction” on the subtitles you wish to convert:
Language –> Linguistic Reconstruction

Now you’re ready for the big step:
Language –> Convert to Pinyin

5. Double check the results and perform necessary cleanup

Unfortunately, most Pinyin converters — even the best — tend to be lazy about inserting spaces in some of the places they belong, such as around numeric and alphabetic strings. For example, “自3月22日(星期一)起至5月31日(星期一)” will generally convert to something that looks like this:
“zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)”.
But it should look like this:
“zì 3 yuè 22 rì (Xīngqīyī) qǐ zhì 5 yuè 31 rì (Xīngqīyī)”.

To fix this in your Pinyin text, run the following regular expression in EmEditor. Make sure “Match Case” is not checked.
find:
([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]+)([a-zāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3

If you do this in Word, you’ll need to use the following instead in your wildcard search.
find:
([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])([0-9]{1,})([A-Za-zĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ])

replace:
\1 \2 \3
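The same cleanup can be sketched in Python. Note that this version is a slight variation on the searches above, not a literal translation: it inserts a space at every boundary between a letter and a digit, so it also catches a number at the very end of a line. It assumes the text is Pinyin only (in mixed text it would also separate Hanzi from digits), and the name space_digits is made up for this example.

```python
import re
import unicodedata


def space_digits(pinyin_text):
    """Insert a space wherever a letter touches a digit, in either order."""
    # Normalize to composed characters so that letters with tone marks
    # count as single letters.
    text = unicodedata.normalize("NFC", pinyin_text)
    text = re.sub(r"(?<=[^\W\d_])(?=\d)", " ", text)  # letter followed by digit
    text = re.sub(r"(?<=\d)(?=[^\W\d_])", " ", text)  # digit followed by letter
    return text
```

Timing lines are untouched, since nothing in them puts a letter next to a digit.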

The rest of the cleanup work usually involves you simply reading through the text, looking for errors, perhaps while listening to the movie.

6. Create additional version without tone marks

If you have Key, this is very easy: Highlight the entire text, then
Format –> Strip Tone Marks.

And you’re done, though because Key keeps u-umlaut as such, if your television or other device doesn’t show the letter ü correctly you may wish to convert “ü” to “v”.

If you don’t have Key or access to another program that can do the same thing as easily, then use a combination of Wenlin (again, even the demo will do what you need) and a text editor. First, paste your Pinyin text into Wenlin. Then select all of the text and perform
Edit –> Make transformed copy… –> Replace tone marks with 1-4

Copy and paste the results into a new document in your text editor. Then run the following search-and-replace. Make certain the “Use Regular Expressions” or “Use Wildcards” box is checked.

find:
([A-Za-z])([1-4])

replace with:
\1
Then click “Replace All”.
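For those comfortable with a little scripting, tone marks can also be removed directly, without the detour through tone numbers. The following Python sketch uses Unicode decomposition from the standard unicodedata module; it is my own illustration, not what Key or Wenlin do internally, and the name strip_tone_marks and its u_umlaut parameter are hypothetical. It is meant for Pinyin-only text.

```python
import unicodedata

# Combining macron, acute, caron, and grave: the four Pinyin tone marks.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone_marks(pinyin_text, u_umlaut="ü"):
    """Remove Pinyin tone marks, keeping the umlaut on ü unless told otherwise."""
    decomposed = unicodedata.normalize("NFD", pinyin_text)
    without_tones = "".join(c for c in decomposed if c not in TONE_MARKS)
    result = unicodedata.normalize("NFC", without_tones)
    if u_umlaut != "ü":
        # Optional substitution for devices that cannot display ü.
        result = result.replace("ü", u_umlaut).replace("Ü", u_umlaut.upper())
    return result
```

Passing u_umlaut="v" covers the case mentioned above of a television or player that doesn't show ü correctly.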

7. Share your work

It’s much better if people can concentrate on producing new material rather than having to redo things others have already taken care of. So if you make a good Hanyu Pinyin version of something, please let me know.

Pinyin subtitles for Crouching Tiger, Hidden Dragon

Er, someone has created Hanyu Pinyin subtitles for the film Crouching Tiger, Hidden Dragon (Wòhǔcánglóng / 臥虎藏龍 / 卧虎藏龙). They’re in UTF-8 (Unicode) and come in two varieties: one with tone marks (link above), the other without. The latter would be useful primarily for those who have trouble getting diacritics to appear properly, such as many of those watching the movie through a TV hooked up to a DivX DVD player.

The set of subtitles also includes English and Mandarin in Chinese characters (both traditional and simplified versions).

The subtitles might seem to go by a bit quickly. But that’s generally because people don’t have much experience reading Hanyu Pinyin. (Also, the English subtitles leave out a lot. But the Pinyin ones are comprehensive.) Practice reading and you’ll get much faster at it.

Remember to use these only for good (e.g., practice reading Pinyin, Mandarin learning, helping those with problems reading Chinese characters) and not bad (e.g., piracy).

still from the movie, showing the subtitled text of Li Mubai saying 'Jianghu li wohucanglong'

Wiki for collaborative Pinyin projects

I have long wanted to expand the range of materials available in and about Pinyin. Possibilities for projects include:

  • Hanyu Pinyin subtitles for movies and videos
  • Hanyu Pinyin versions of Mandarin plays (for example, Cháguǎn, by Lǎo Shě)
  • translations into Mandarin (Hanzi and/or Pinyin) of parts of this site

I can do a lot of the work — in fact, as is my habit, I’ve begun all sorts of such projects but haven’t finished them — but can’t do all of it myself. So I’ve been mulling the idea of setting up a Pinyin-related wiki here on Pinyin.info or perhaps on a spinoff site I set up, which would allow you, o reader, to get involved (a little or a lot, depending on your desire and amount of free time).

I’m thinking that texts could be worked on with the aid of Wenlin, since even contributors without the full version of that enormously useful program could use its free demo to select disambiguation choices in cases of word-parsing ambiguities or characters with multiple pronunciations.

For example, if one were using Wenlin to convert the following into Pinyin,

我在朦胧中,眼前展开一片海边碧绿的沙地来,上面深蓝的天空中挂着一轮金黄的圆月。我想:希望本是无所谓有,无所谓无的。这正如地上的路;其实地上本没有路,走的人多了,也便成了路。

one would first need to choose between potentially ambiguous word boundaries

|我 | 在 | 朦胧 | 中,眼前 | 展开 | 一 | 片 | 海边 | 【◎Fix:◎碧绿 | 的;◎碧 | 绿的】 | 沙地 | 来,上面 | 深蓝 | 的 | 【◎Fix:◎天空 | 中;◎天 | 空中】 | 挂着 | 一 | 轮 | 金黄 | 的 | 圆月。我 | 想:希望 | 本 | 是 | 无所谓 | 有,无所谓 | 无 | 的。这 | 正如 | 【◎Fix:◎地上 | 的;◎地 | 上的】 | 路;其实 | 地上 | 本 | 没有 | 路,走 | 的 | 人 | 多 | 了,也 | 便 | 成了 | 路。

and then take care of items with multiple pronunciations

Wǒ zài ménglóng 【◎Fix:◎zhōng;◎zhòng】, yǎnqián zhǎnkāi yī 【◎Fix:◎piàn;◎piān】 hǎibiān bìlǜ de shādì lái, shàngmian shēnlán de tiānkōng 【◎Fix:◎zhōng;◎zhòng】 guàzhe yī lún jīnhuáng de yuányuè. Wǒ xiǎng: xīwàng běn shì wúsuǒwèi yǒu, wúsuǒwèi 【◎Fix:◎wú;◎mó】 de. Zhè zhèngrú 【◎Fix:◎dìshang;◎dìshàng】 de lù; qíshí 【◎Fix:◎dìshang;◎dìshàng】 běn méiyǒu lù, zǒu de rén duō 【◎Fix:◎le;◎liǎo;◎liāo;◎liào;◎liáo】, yě 【◎Fix:◎biàn;◎pián】 chéngle lù.

I’d prefer to keep things generally on the right side of copyright laws but am also hopeful that those may not be too onerous in the case of Pinyin versions and that Taiwan’s laws may put the situation more in our favor than might be the case elsewhere. Information about the legal situation would be greatly appreciated.

So, is anyone interested in helping out? Have advice? Success/horror stories about wiki projects? Suggestions for additional material?

some common character slips in China

Joel of Danwei has translated the gist of a list of the top errors in Mandarin use for 2006, as submitted by the readers of Yǎowénjiáozì (咬文嚼字), a magazine in China. (Yǎowénjiáozì is tricky to translate. Maybe “Pedantry,” though that sounds a bit harsh.)

I’ve reproduced the errors relating specifically to character use (7 out of 10), making the characters larger in order to help make the distinctions clearer. See Joel’s post for details.

  1. (xiàng) instead of (xiàng)
  2. 丙戍年 (bǐng shù nián) instead of 丙戌年 (bǐngxū nián)
  3. 神州[六号] (Shénzhōu [liù hào]) instead of 神舟[六号] (Shén Zhōu [liù hào]) (Those responsible for naming the spacecraft, however, certainly intended the name to remind people of “the Divine Land” (Shénzhōu, 神州, i.e. China).)
  4. () instead of ()
  5. 美發 (měi fā) instead of 美髮 (měifà) (The characters 發 (fā) and 髮 (fà) were both given the simplified form of 发, so people in China often end up with the wrong character when trying to use the traditional form of 发.)
  6. 启示 (qǐshì) instead of 启事 (qǐshì)
  7. 哈蜜瓜 (hā mì guā) instead of 哈密瓜 (Hāmìguā)

Shanghai theater puts on play in Shanghainese

It’s a sad situation that it’s newsworthy when a play is presented in the native language of most of those in one of the world’s largest cities. But in this case it’s also an occasion for hope.

Recently, for the first time in decades, a drama primarily in Shanghainese was presented in Shanghai. (I would guess that local operas, however, have been performed in Shanghainese with little interruption.) Unfortunately, as the Shanghaiist reports, there were some problems with this production of 《乌鸦与麻雀》 (Mandarin title: Wūyā yǔ Máquè; English title: The Crow and the Sparrow).

[T]he blame is being assigned to the fact that the production was too hastily prepared, leading them to overlook things like subtitles.

You might ask, why, if most of the dialogue is in Shanghainese, would people other than non-locals need subtitles? It turns out that aside from standard Shanghai dialect, Ningbo, Suzhou, Shandong and other dialects were also thrown in—the story takes place during the Republic period (1911-1949) at a time when many immigrants were first putting down roots in Shanghai. The production team also prepared a putonghua version of the play, which they used during the last performance here and will use if they take the play to other parts of China. All in all, it seemed as if this was a less than ideal way to restart this tradition.

Nonetheless, I’m encouraged that the authorities allowed this play to be staged in Shanghainese. Perhaps its roots as a popular film from the late 1940s and its anti-KMT storyline helped it get by the censors.

The Shanghaiist also mentions an interesting-sounding book: Rendering the Regional: Local Language in Contemporary Chinese Media, by Edward M. Gunn. The introduction (663 KB PDF) is available online. I look forward to reading the entire book once I can find it in a library or locate an inexpensive copy.

learn kanji through noh?

Studying kanji while taking in a Japanese noh drama: what could be more exciting? Heh.

A common problem for those new to Japanese traditional performing arts is that–even for native Japanese speakers–it is hard to understand the story and old-fashioned language used in noh recitation or gidayu, a form of narrative chanting that accompanies bunraku performances. With a view to solving this problem, there has been a marked increase in productions using Japanese subtitles at the National Theatre in Tokyo and National Bunraku Theatre in Osaka. The National Noh Theatre in Tokyo also plans to make greater use of subtitles on screens it will introduce in autumn.

The new computer-controlled system to be introduced at the National Noh Theatre in Tokyo, where prior improvements to seats and other theater facilities are scheduled for completion in August and September, will allow Japanese subtitles to be displayed on flat-panel screens installed in seat backs.

“We will provide Japanese and English subtitles for the time being, although the system will allow us to use four channels in total,” said an official at the noh theater. Noh recitation will be displayed as it is in Japanese, while the plot of the play and a briefing on scenes will be provided in English along with a translation of the recitation….

Some bunraku performers at first questioned why Japanese subtitles were necessary since most audience members are Japanese.

“But they don’t voice such objections any more. Some even say the subtitles are useful in learning kanji…,” said Takemoto Sumitayu, a bunraku narrator and a living national treasure.

The National Bunraku Theatre hopes that the service “will help overcome the image of traditional performing arts as hard to understand.”

I suppose as long as the chairback is below the stage, the text would still be subtitling. But I can’t help but wonder if there’s a more precise term. It’s not likely to be real captioning. And what’s the word for texts that are presented on the sides of stages?

source: Does Japanese theater need Japanese subtitles?, Daily Yomiuri, July 8, 2006