Taiwanese-English, English-Taiwanese dictionaries posted

Maryknoll Language Service Center has put online the complete texts of its Taiwanese-English and English-Taiwanese dictionaries. Better still, these have been released under a Creative Commons license. These are a terrific resource for anyone who’s interested in Hoklo.

Maryknoll deserves praise for this great work. Thanks are due, too, to Tailingua, which I know has been working behind the scenes to help make this happen.

From the English Amoy Dictionary:
screenshot from the English-Taiwanese dictionary

And from the Taiwanese-English Dictionary:
screenshot from the dictionary

source: Maryknoll dictionaries now free to download, Tailingua, June 17, 2010

OMG, it’s Hanzified English

Taiwanese movie poster in Mandarin for 'Date Night', a.k.a. '約會喔麥尬'

In Taiwan, the new movie Date Night has been given the Mandarin title Yuēhuì o mài gà (約會喔麥尬/约会喔麦尬).

Yuēhuì is simply the word for “date.” The interesting part is “o mài gà” (喔麥尬), which is a Mandarinized form of the English “oh my god.” (I wonder if this, being written in Hanzi despite still being basically English, would pass China’s new need for supposed purity.)

Most people here, especially those younger than about 40, would simply write “oh my god” (or, less frequently, “o my god”) in English in the middle of an otherwise Mandarin text. (I’ll spare everyone the chart of Google searches; but it backs this up.) But brevity is standard in movie titles here, and “喔麥尬” is a lot more compact on a movie poster than “oh my god.” This, however, raises the question of why “喔麥尬” instead of the equally concise “OMG”. I don’t know the answer to that. But the path of lettered words in Mandarin is certainly not without twists and turns.

Like most other uses of Hanzified English, the results are not entirely faithful to the original sounds.

Mandarin’s ou would be a closer phonetic fit than o for the English “oh”.
There’s Ōu (區/区), a surname. But most of the time this Chinese character is pronounced qū (being one of those many Chinese characters with multiple pronunciations), so that certainly wouldn’t work well. There’s ǒu, which has a more clearly phonetic Hanzi (嘔/呕), but which has to do with vomit (ǒutù/嘔吐/呕吐). Another possible choice would be Ōu (歐/欧); but that is associated mainly with Europe and doesn’t get used much as a phonetic component in non-Europe-related loan words outside the word for ohm: ōumǔ (歐姆/欧姆).

Mài (the Mandarin word for wheat), unlike most other Mandarin morphemes pronounced mai (various tones), gets used phonetically in lots of loan words, such as Màidāngláo (McDonald’s/麥當勞/麦当劳), Màijiā (Mecca/麥加/麦加), Dānmài (Denmark/丹麥/丹麦), and Kāmàilóng (Cameroon/喀麥隆/喀麦隆). So its use is to be expected, though semantically there’s no link. And mài is certainly a better fit for the English my than it is for the Mc of McDonald’s, the Mec of Mecca, the mark of Denmark, or the me of Cameroon.

For ga there’s not a lot of choice. 咖 is often seen in the phonetic loan gālí (curry). The biggest problem here is that the same 咖 is also used as kā in a different, common phonetic loan: kāfēi (coffee). There’s ?; but, like ?, it’s not exactly a well-known character.

Anyway, I could go on for a long time listing various possibilities. But the main point is that Chinese characters just don’t do well at this sort of thing.

As for Pinyin, I suppose the orthography could get interesting: o mài gà, o màigà, omài gà, or omàigà. But a Pinyin orthography would probably simply encourage people to write this in the original: oh my god.

BTW, you may wish to try the following experiment. The 尬 in o mài gà is most often seen in writing the word gāngà (尷尬/尴尬), which means awkward/embarrassed. Ask native speakers of Mandarin to write gāngà in Hanzi for you by hand without using a dictionary, a computer, or any other form of assistance. I bet that most people, even those with university degrees, won’t be able to write this common, ordinary word correctly.

And for lagniappe, the character 尬 is also sometimes seen in written Taiwanese as the equivalent of Mandarin’s jiā (加/add). I spotted an example of this just the other day on a cafe sign (in the sense of “buy something and ga something else for a special price”) but didn’t have a camera with me.

How to create Hanyu Pinyin subtitles

Since posting about the Pinyin subtitles for Crouching Tiger, Hidden Dragon and The Story of Stuff I have received several messages inquiring about how someone might make Pinyin subtitles themselves. So I might as well put the answer online.

Although at the present stage of software implementation subtitle conversion isn’t as simple as pushing a button, the process is not particularly difficult, assuming you have a good source text to work from. But this does require some time and the right tools.

The Right Tools

The most important tool is, of course, the one that performs the conversion to Hanyu Pinyin. And it’s crucial to keep in mind that not all Pinyin converters are created equal; in fact, the vast majority of so-called Pinyin converters are best avoided entirely. The world does not need any more texts in the hobbled, poorly written mess that many people erroneously think of as Hanyu Pinyin; but it very much needs texts in real Hanyu Pinyin. So don’t waste your time with a program that doesn’t do a good job of word parsing, etc.

At present the clear front-runner for converting Chinese characters to Hanyu Pinyin texts (real Hanyu Pinyin texts) with a minimum need for user assistance is Key Chinese (Windows and Mac). The demo version is fully functional for 30 days. Key’s considerably less expensive “Hanzi To Pinyin With Tones Conversion Utility” for MS Word texts (also with a 30-day demo) would probably also work well, though I haven’t tried it myself.

Wenlin (Windows and Mac) is another excellent program that can produce properly spelled and word-parsed Hanyu Pinyin. But it requires users to run some disambiguation themselves, which can take a lot of time when you’re talking about something with as much text as a screenplay. Nonetheless, Wenlin’s incorporation of John DeFrancis’s ABC Chinese-English Comprehensive Dictionary makes it a helpful reference when performing post-conversion checks. Also, especially if one does not have Key, Wenlin — even the function-limited but non-expiring demo version — is useful for handling some adjustments (such as removing tone marks or providing a workaround when dealing with programs that don’t handle Chinese characters well).

You’ll also need a Unicode-friendly text editor with good support for regular expressions (to allow wildcard searches). I like EmEditor, which is Windows based. But lots of other programs would work. One could even use MS Word if so inclined.

Finally, having subtitles in an additional language (usually but not necessarily English) is often desirable, not just for others who would use these subtitles but for yourself as you create the Pinyin subtitles. But often the subtitles one may find in Mandarin are not in synch with those in another language. Software can fix this problem. But I don’t have enough experience with this to recommend certain programs over others.

To sum up, the tools I recommend for creating Hanyu Pinyin subtitles are

  1. Key Chinese
  2. Wenlin
  3. EmEditor (or another Unicode-friendly text editor)
  4. a subtitle synchronizer

Actually, just the first one, Key, is sufficient to produce Pinyin subtitles. But in my experience using a combination of all four programs is preferable.

Now it’s time to get down to business.

The Main Steps

  1. acquire source-version subtitles
  2. synchronize subtitle files
  3. identify names of the movie’s characters (dramatis personae)
  4. perform initial conversion of subtitles in Chinese characters to Pinyin
  5. double check the results and perform necessary cleanup
  6. create additional version without tone marks
  7. share your work

1. Acquire subtitles for conversion and reference

At present the most useful site for finding Mandarin subtitles written in Chinese characters is probably Shooter. You may need to try searching for your desired title in both simplified and traditional characters. Also, be aware that movies — especially movies not filmed in Mandarin — often have different names in China, Taiwan, Hong Kong, etc.

You may find it useful to look for subtitles in other languages, too. Shooter can be useful for that, though you may have better luck finding English subtitles at Opensubtitles.org or similar English-language sites.

One can often find different subtitle files for the same movie, so you may wish to examine more than one for quality. Another thing that’s worth keeping in mind: Converting from traditional Chinese characters to simplified Chinese characters is less problematic than vice versa.

2. Synchronize subtitle files

Once you have the files, you should synchronize them with each other according to the directions for the particular program you are using.

If the program you’re using for this chokes on Chinese characters, though, you’ll need to take a couple of extra steps. First, convert the Chinese characters to Unicode numerical character references (NCRs) using either Pinyin Info’s NCR conversion tool or Wenlin (full or demo version). The reason for this is that even synchronizers that screw up “李慕白” should be able to handle the NCR equivalent: “&#26446;&#24917;&#30333;”.

In Wenlin,
Edit –> Make transformed copy –> Encode &#; [decimal]

Take the NCR text and synchronize the files. After you get this taken care of, reconvert to Chinese characters.

In Wenlin,
Edit –> Make transformed copy –> Decode &#;
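
For those who have neither tool at hand, the same round trip can be sketched in a few lines of Python. This is only an illustration of the idea, not what Wenlin or the Pinyin Info tool does internally; the function names encode_ncr and decode_ncr are made up for this example.

```python
import re


def encode_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR such as &#26446;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def decode_ncr(text: str) -> str:
    """Turn decimal NCRs back into the characters they stand for."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    encoded = encode_ncr("李慕白")
    print(encoded)
    print(decode_ncr(encoded))
```

Timestamps and any English text pass through unchanged, since only characters outside the ASCII range are touched.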

3. Identify names of the movie’s characters

You must teach your software to know which strings of Hanzi represent names. For example, it’s crucial for clarity that the character name “李慕白” is written “Lǐ Mùbái” rather than as “lǐ mù bái”. This part takes some time up front. But do not skip this step, because it is not only crucial but will save a lot of trouble in the long run.

Before doing this, however, people may want to refamiliarize themselves with Hanyu Pinyin’s rules for proper nouns (PDF). Note especially what is supposed to be capitalized and what isn’t.

The Mandarin version of Wikipedia is one resource that can be helpful in identifying the names of at least the main characters in the movie. But you’ll want to look for more names and forms than will be listed there. Keep in mind that characters aren’t always addressed by their full names. You need to look for other forms as well (e.g., in Crouching Tiger, Hidden Dragon Li Mubai is sometimes referred to as “Li Mubai” but other times as “Li ye” or simply as “Mubai”) and enter them.

English subtitles can be very useful for locating most proper nouns in the text. (Hooray for word parsing and capitalization of proper nouns!) The following search of an English subtitle file should help pinpoint the location of proper nouns.

find (with “Match Case” and “Use Regular Expressions” checked):

in MS Word, find (with “Use wildcards” checked):
[!\.] [A-Z][a-z]
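
The same kind of search can also be run outside an editor. Below is a rough Python sketch that mirrors the Word wildcard above: it reports capitalized words that follow a space but not a period. The function name is invented for this example, and a lookbehind is used (my choice, not part of the search above) so that two names in a row are both caught.

```python
import re

# A capitalized word preceded by "some character other than a period, then a space".
# The lookbehind keeps the preceding characters out of the match itself.
MID_SENTENCE_CAPITAL = re.compile(r"(?<=[^.] )[A-Z][a-z]+")


def find_proper_noun_candidates(lines):
    """Return (line_number, word) pairs for capitalized words not following a period."""
    found = []
    for number, line in enumerate(lines, start=1):
        for match in MID_SENTENCE_CAPITAL.finditer(line):
            found.append((number, match.group(0)))
    return found


if __name__ == "__main__":
    sample = ["I will ask Li Mubai. He knows.", "Go home."]
    for number, word in find_proper_noun_candidates(sample):
        print(number, word)
```

Like the wildcard search, this only pinpoints candidates. Words after “?” or “!” will show up too, and names at the very start of a line will be missed, so the list still needs a human eye.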

Since you’ve already synchronized your subtitles, you’ll easily be able to find the corresponding point in the Mandarin subtitles by looking at the time the line appears.

As you gather the names, or after you compile the full list, add your findings to the Pinyin converter’s user dictionary. In Key, perform Language –> Add Record, then fill in the Hanzi and Pinyin fields.

4. Perform initial conversion to Pinyin

OK, I know you’re eager to run the conversion and see all of those Hanzi turn into lovely Hanyu Pinyin. But there’s one quick step you need to do first. If you’re using Key Chinese, the program won’t make use of all of those character names you just painstakingly added to the user dictionary unless you first run “linguistic reconstruction” on the subtitles you wish to convert:
Language –> Linguistic Reconstruction

Now you’re ready for the big step:
Language –> Convert to Pinyin

5. Double check the results and perform necessary cleanup

Unfortunately, most Pinyin converters, even the best, tend to be lazy about inserting spaces in some of the places they belong, such as around numeric and alphabetic strings. For example, “自3月22日（星期一）起至5月31日（星期一）” will generally convert to something that looks like this:
“zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)”.
But it should look like this:
“zì 3 yuè 22 rì (Xīngqīyī) qǐ zhì 5 yuè 31 rì (Xīngqīyī)”.

To fix this in your Pinyin text, run the following regular expression in EmEditor. Make sure “Match Case” is not checked.

\1 \2 \3

If you do this in Word, you’ll need to use the following instead in your wildcard search.

\1 \2 \3
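
For anyone who would rather script this cleanup, here is a minimal Python sketch of the same idea: put a space wherever a letter touches a digit. It is my own approximation rather than a copy of the editor searches, and the function name is made up. It assumes that a letter should never sit directly against a digit in your Pinyin text.

```python
import re
import unicodedata

LETTER = r"[^\W\d_]"  # any Unicode letter, including vowels with tone marks


def space_around_digits(text: str) -> str:
    """Insert a space between a letter and a digit, in either order."""
    # Compose tone marks into single characters so they count as letters.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(rf"({LETTER})(\d)", r"\1 \2", text)
    text = re.sub(rf"(\d)({LETTER})", r"\1 \2", text)
    return text


if __name__ == "__main__":
    print(space_around_digits("zì3yuè22rì (Xīngqīyī) qǐ zhì5yuè31rì (Xīngqīyī)"))
```

Subtitle timestamps and sequence numbers are left alone, because in those the digits touch only punctuation, spaces, or other digits.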

The rest of the cleanup work usually involves simply reading through the text, looking for errors, perhaps while listening to the movie.

6. Create additional version without tone marks

If you have Key, this is very easy: Highlight the entire text, then
Format –> Strip Tone Marks.

And you’re done, though because Key keeps u-umlaut as such, if your television or other device doesn’t show the letter ü correctly you may wish to convert “ü” to “v”.
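
If you want to do the same thing in a script, the sketch below shows one way in Python. It is not how Key does it. It simply decomposes each letter, drops the four tone diacritics, and puts the rest back together, which leaves the umlaut on ü alone. The function name and the umlaut_to_v option are my own inventions for this example.

```python
import unicodedata

# Combining macron, acute, caron, and grave: the four Pinyin tone marks.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone_marks(text: str, umlaut_to_v: bool = False) -> str:
    """Remove Pinyin tone marks; optionally write ü as v for devices that lack it."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    result = unicodedata.normalize("NFC", kept)
    if umlaut_to_v:
        result = result.replace("ü", "v").replace("Ü", "V")
    return result


if __name__ == "__main__":
    print(strip_tone_marks("Lǐ Mùbái"))
    print(strip_tone_marks("nǚ lǜ", umlaut_to_v=True))
```

Note that this will also strip acute and grave accents from any non-Pinyin text in the file (French in a bilingual subtitle, say), so run it only on the Pinyin.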

If you don’t have Key or access to another program that can do the same thing as easily, then use a combination of Wenlin (again, even the demo will do what you need) and a text editor. First, paste your Pinyin text into Wenlin. Then select all of the text and perform
Edit –> Make transformed copy… –> Replace tone marks with 1-4

Copy and paste the results into a new document in your text editor. Then run the following search-and-replace. Make certain the “Use Regular Expressions” or “Use Wildcards” box is checked.


replace with:
Then click “Replace All”.
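
The same cleanup can be sketched in Python. I am assuming here that the Wenlin output puts a digit from 1 to 4 directly after the syllable it belongs to (something like “Li3 Mu4bai2”); check your own output before relying on this. The function name is made up for the example.

```python
import re

# A digit 1-4 that comes immediately after a letter is taken to be a tone number.
TONE_DIGIT = re.compile(r"(?<=[^\W\d_])[1-4]")


def remove_tone_digits(text: str) -> str:
    """Delete tone numbers attached to syllables, leaving free-standing numbers alone."""
    return TONE_DIGIT.sub("", text)


if __name__ == "__main__":
    print(remove_tone_digits("Li3 Mu4bai2 zi4 3 yue4 22 ri4"))
```

Real numbers survive only if they are separated from the surrounding words by spaces, which is one more reason to fix the spacing before stripping tones.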

What this looks like in EmEditor:
image showing the search-and-replace dialog box for the above

What this looks like in MS Word:

7. Share your work

It’s much better if people can concentrate on producing new material rather than having to redo things others have already taken care of. So if you make a good Hanyu Pinyin version of something, please let me know.

Pinyin subtitles for ‘The Story of Stuff’

screenshot from the video, showing Pinyin subtitles: Shìde, shìde, shìde: wǒmen quándōu yào huíshōu, kěxī guāng kào huíshōu hái bùgòu.

The Story of Stuff is a 20-minute video on the costs and absurdities of having a culture wrapped up in unchecked consumerism. It gained especially wide attention after the New York Times published a front-page article about it. A related book was released earlier this month.

The entire video can be downloaded freely in high- and low-resolution versions. And now there’s a collection of subtitles of possible interest to many readers of Pinyin News.

The zip file contains seven subtitle files:

  • Hanyu Pinyin with tone marks
  • Hanyu Pinyin without tone marks
  • traditional Chinese characters (Unicode)
  • traditional Chinese characters (Big5)
  • simplified Chinese characters (Unicode)
  • simplified Chinese characters (GB)
  • English

The star of the video, author Annie Leonard, has a lot to get through in just 20 or so minutes, so many people may find it easier, at least at first, to read the Pinyin subtitles that do not include tone marks.

Google Maps switches to Hanyu Pinyin for Taiwan (sloppily)

Until very recently, Google Maps gave street names in Taiwan in Tongyong Pinyin — most of the time, at least. This was the case even for Taipei, which most definitely has long used Hanyu Pinyin, not Tongyong Pinyin. The romanization on Google Maps was really a hodgepodge in the maps of Taiwan. And it’s still kind of a mess; but now it’s at least more consistent — and more consistent in Hanyu Pinyin.

First the good. In Google Maps:

  • Hanyu Pinyin, not Tongyong Pinyin, is now used for street names throughout Taiwan
  • Tone marks are indicated. (Previous maps with Tongyong did not indicate tones.)

Now the bad, and unfortunately there’s a lot of it and it’s very bad indeed:

  • The Hanyu Pinyin is given as Bro Ken Syl La Bles. (Terrible! Also, this is a new style for Google Maps. Street names in Tongyong were styled properly: e.g., Minsheng, not Min Sheng.)
  • The names of MRT stations remain incorrectly presented. For example, what is referred to in all MRT stations and on all MRT maps as “NTU Hospital” is instead referred to in broken Pinyin as “Tái Dà Yī Yuàn” (in proper Pinyin this would be Tái-Dà Yīyuàn); and “Xindian City Hall” (or “Office”; bleah) is marked as “Xīn Diàn Shì Gōng Suǒ” (in proper Pinyin: “Xīndiàn Shìgōngsuǒ” or perhaps “Xīndiàn Shì Gōngsuǒ”). Most but not all MRT stations were already presented this incorrect way (in Hanyu Pinyin rather than Tongyong) in Google Maps.
  • Errors in romanization point to sloppy conversions. For example, an MRT station in Banqiao is labeled Xīn Bù rather than as Xīnpǔ. (埔 is one of those many Chinese characters with multiple Mandarin pronunciations.)
  • Tongyong Pinyin is still used in the names of most cities and townships (e.g., Banciao, not Banqiao).

Screenshot from earlier this evening, showing that Tongyong Pinyin is still being used in Google Maps for some city and district names (e.g., Gueishan, Sinjhuang, Banciao, Jhonghe, Sindian, and Jhongjheng rather than Hanyu Pinyin’s Guishan, Xinzhuang, Banqiao, Zhonghe, Xindian, and Zhongzheng, respectively).
map of Taipei area, with names as shown above

I don’t have any old screenshots of my own available at the moment, so for now I’ll refer you to an image that Fili used in an old post of his. Compare that with this screenshot I took a few minutes ago from Google Maps of the same section of Tainan:

Note especially how the name of the junior high school is presented.

  • Previously “Jian Xing Junior High School”.
  • Now “Jiàn Xìng Jr High School”.

This is typical of how in old maps some things were labeled (poorly) in Hanyu Pinyin. (Words, not bro ken syl la bles, are the basis for Pinyin orthography. This is a big deal, not a minor error.) And now such places are still labeled poorly in Hanyu Pinyin, but with the addition of tone marks.

I’d like to return to the point earlier on sloppy conversions. Surprisingly, 成都路 is given as “Chéng Dōu Road” rather than as “Chéngdū Road”.
screenshot from Google Maps of 'Cheng Dou [sic] Rd', near Taipei's Ximending
Although “Xinpu” might not be the sort of name to be contained in some romanization databases, there is nothing in the least obscure about Chengdu, the name of a city of some 11 million people. Google Translate certainly knows the right thing to do with 成都路:
screenshot from Google Translate, showing how Google will translate '成都路' as 'Chengdu Rd'

But Google Maps doesn’t get this simple point right, which likely points to outsourcing. Why would Google do this? And why wouldn’t it ensure that a better job was done? Because, really, so far the long-overdue conversion to Hanyu Pinyin in Google Maps for Taiwan is something of a botch.

Ba Jin in Pinyin, with audio

illustration of two young men under an umbrella -- from Ba Jin's 'Family'

This bit of news is simply wonderful. As part of Sinolingua’s Abridged Chinese Classic Series, all three volumes in Bā Jīn’s “torrents” trilogy (Jīliú sānbùqǔ / 激流三部曲) are now available in abridged editions in word-parsed Hanyu Pinyin (with Chinese characters underneath), along with a few notes in English and mp3 files of the text being read aloud.

These books would make great material for those who are

  • studying Mandarin
  • trying to memorize Chinese characters
  • learning Hanyu Pinyin
  • wanting to read something in Mandarin that isn’t too damn hard but isn’t a children’s book either
  • looking for something to read in Mandarin that doesn’t require much or even any knowledge of Chinese characters (ABCs and other “overseas Chinese,” take note!)

Through the generosity of the publisher, Pinyin.info now offers sample chapters from each of these three classics of twentieth-century Chinese literature along with audio files of the text being read aloud.

I’m very pleased to offer samples from these books on this site and hope these editions will be enjoyed by many readers worldwide and become standard texts in many classrooms.

books bought in Beijing

cover of a book by Zhou Youguang

I didn’t have any luck finding anything in Sin Wenz (Lādīnghuà Xīn Wénzì / 拉丁化新文字), despite trips to several large used book stores. (Fortunately, the Internet is now providing some leads. Thanks, Brendan and Joel!) But I did find some other books to bring home.

I acquired lots of books by Zhou Youguang, not all of which focus primarily on linguistics:

Other than the Zhou Youguang books, here are my favorite finds of the trip, as they are for the most part in correctly word-parsed Hanyu Pinyin (with Hanzi underneath), along with a few notes in English:

I’ll soon be posting more about the above books with Pinyin, so watch this site for updates. Really, this is gonna be good.

Although this collection of Y.R. Chao says it’s volume 15, it’s actually two books:

  • Zhào Yuánrèn quánjí, dì 15 juàn (赵元任全集第15卷)

Some more titles:

  • Measured Words: The Development of Objective Language Testing, by Bernard Spolsky
  • Pǔtōnghuà shuǐpíng cèshì shíshī gāngyào (普通话水平测试实施纲要). Now with the great smell of beer! Sorry, Brendan, I owe you one (more than one, actually).

I bought the following because their author is Yin Binyong, the scholar primarily responsible for Hanyu Pinyin’s orthography. They are titles from Sinolingua’s series Bógǔtōngjīn xué Hànyǔ cóngshū (“Gems of the Chinese Language through the Ages,” their translation), all of which are in Mandarin (Hanzi) and English, with Pinyin only for the sayings being illustrated:

cover of 'Chinese-English Dictionary of Polyphonic Characters'
cover of 'Putonghua shuiping ceshi shishi gangyao' (普通话水平测试实施纲要)
cover of 'Xinhua pinxie cidian'


And finally:

Of course I already have that one — more than one copy, in fact. But it’s always good to have more than one spare when it comes to one of the two most important books on Pinyin orthography. I really need to follow up on my requests to use excerpts from this book, as it is the only major title missing from my list of romanization-related books (though it’s in Mandarin only).

sign in a Beijing bookstore reading 'Education Theury' [sic]