China’s Cultural Revolution, Pinyin, and other romanizations

Some people have the idea that because during the Cultural Revolution the Red Guards went about destroying much of China’s cultural heritage, they must have attacked Chinese characters and supported Pinyin. This idea is wrong. During that terrible time Pinyin was attacked, like so much else that was good in China.

With the fortieth anniversary of the beginning of the Cultural Revolution upon us, this might be a good time to bring out this selection from The Chinese Language: Fact and Fantasy, by John DeFrancis:

In view of the fact that separate alphabetic treatment for the regionalects has been a virtually tabooed subject since 1949, it comes as a surprise that among the revelations following the downfall of the Gang of Four is an account by Prof. Huang Diancheng of Amoy University of the adaptation of Pinyin to the Southern Min speech of Amoy and its use in the production of anti-illiteracy textbooks and other activities. Huang reports that during the Cultural Revolution people possessing materials in Min alphabetic writing were denounced as “foreign lackeys” and were forced to take the material out to the street, kneel down alongside them, set them afire, and reduce them to ashes. Elsewhere repression of Pinyin in any form was undertaken by xenophobic Red Guards, themselves staunch supporters of character simplification, who tore down street signs written in Pinyin as evidence of subservience to foreigners.

The Nazi-like book-burning episode and other acts against the use of Pinyin are fitting testimony of the repression exercised against activities concerned with fundamental issues in Chinese writing reform. In these actions the positive idea that China should stand on its own feet without demeaning reliance on foreign aid was expressed in its most xenophobic form as a sort of anti-intellectual blood-and-soil nativism that constitutes a danger, still present, of a Chinese-style fascism. The young student storm troopers who sought to humble the old-time intellectuals, far from following Lu Xun in embracing the one system of writing that would have done the most to equalize things between illiterates and all those who had received an education, supported instead the lesser reform of character simplification that might enhance their own position relative to the older generation.

evolution of simplified Chinese characters: dissertation

Stockholm University’s Department of Oriental Languages has just released Long Story of Short Forms: The Evolution of Simplified Chinese Characters (10.4 MB PDF), a Ph.D. dissertation by Roar Bökset.

Here is the abstract:

A script reform was carried out in China between 1955 and 1964 by simplifying the shape of a number of characters. Most of the simplified forms adopted had already been in popular use for a long time before this reform, while a few were invented for the occasion.

One objective of this dissertation is to estimate the proportion of invented forms. To this end, use of simplified variants before 1955 was surveyed. Pre-reform writing turned out to be more heterogeneous than expected. In fact, already Han dynasty (206 BC-AD 220) handwriting differed considerably from the norms set up by contemporary dictionaries and model texts.

One aim of the script reform was to unify writing habits and make them conform better with established norms. To evaluate the Script Reform Committee’s success in this field, this dissertation surveys the use of different unofficial short forms even after the reform. Success turned out to be moderate. Many pre-1955 short variants survived, and, what was worse, new ones emerged after the reform. Particularly confusing was the use of different unofficial short forms in different parts of China. The existence of such local variants was confirmed by extensive reading of signs, advertisements, price tags and wall newspapers in twenty-one provinces, and by interviews with informants at four hundred localities. Results of that survey are displayed on twenty-four maps.

A few years earlier, even Japanese characters had gone through a reform which made many simplified forms official. Some of the new official Japanese forms differed from those which came to be official in China, creating a discrepancy which has at times been lamented. However, this dissertation compares the short forms used in pre-reform Japan with those of pre-reform China, and shows that most of the present discrepancies have roots in differences in Chinese and Japanese writing traditions, which bound the hands of reformers in both countries and enforced the decisions which were eventually made.

OCR and Pinyin texts

[This entry is largely for my own reference. But feel free to read on, especially if you’re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]

What’s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn’t have a “Pinyin” setting. I’d be surprised if any OCR software currently does.

Getting second tones, fourth tones, and umlauts to be read correctly shouldn’t be a big problem, given how the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is a more-or-less regular rendering of vowels with first- and third-tone marks as something else that can be fixed quickly through a search-and-replace procedure.
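For anyone who wants to script that cleanup, here is a minimal sketch in Python of such a search-and-replace pass. The substitution table is a hypothetical illustration, not measured OCR behavior, and `fix_ocr` is just a name I made up; the pairs would need adjusting to whatever a given OCR setting actually produces.

```python
import unicodedata


def nfc(s):
    """Normalize to precomposed form so each accented vowel is one character."""
    return unicodedata.normalize("NFC", s)


# Hypothetical table: characters an OCR engine might emit in place of
# third-tone vowels. These pairings are assumptions for illustration only.
REPLACEMENTS = {
    "ä": "ǎ",
    "ö": "ǒ",
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans({nfc(k): nfc(v) for k, v in REPLACEMENTS.items()})


def fix_ocr(text):
    """Apply every single-character substitution in one pass."""
    return nfc(text).translate(_TABLE)


print(fix_ocr("wö yöu"))  # wǒ yǒu
```

A pass like this only helps where the OCR errors are regular. If one substitute character stands in for both a first-tone and a third-tone vowel, the output would still need proofreading by hand.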

Here’s an image, slightly reduced, of what was being scanned:
[Image: scan of sentences in Pinyin]

Here’s the text:
Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn. Dànshì wǒ yǒu cóng zìjǐ shēnghuó lǐ délái de wǔ ge zhēnshí lìzi, dōu biǎomíng Hànzì bìng bù tèbié biǎoyì.

Here are the results of OCR, with various language settings applied:

DUTCH
WÖ bu shi xuézhë, bù néng yinjing jùdiän. Danshi wö yöu cóng zip shënghuó
li délái de wü ge zhënshí Iizi, döu biäomíng Hanzì bing bu tebié biäoyì.

CATALAN
W6 bu shi xuezh8, bir neng yinjing jhdisn. Danshi w6 y5u cong ziji shenghuó
li delai de wü ge zhenshí lizi, d6u bibmíng Hanzi bing bu tebie bigoyi.

DANISH
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi w6 y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dóu bigoming Hanzi bing bu tebie biaoyi.

FINNISH
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
Ii delai de wii ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

FRENCH
W6 bù shi xuézhë, bù néng yinjing jùdian. Dànshi wO y5u cong ziji shënghu6
li délai de wü ge zhënshi Iizi, dou bigoming Hànzi bing bu tèbié biaoyi.

GERMAN
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

GERMAN (SWISS)
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

ITALIAN
W6 bu shì xuézhe, bù néng yinjing jùdian. Dànshì w6 y5u cong ziji shènghu6
li délai de wii ge zhenshi Iizi, dou bigoming Hànzì bing bu tèbié biaoyì.

NYNORSK
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi wO y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dou biaoming Hanzi bing bu tebie biaoyi.

PORTUGUESE (BRAZILIAN)
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

PORTUGUESE
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

SPANISH
W6 bu shi xuézhe, bU néng yinjing jhdian. Danshi wO y5u cóng ziji shenghuó
li délái de wü ge zhenshí Iizi, dóu bigomíng Hanzi bing bu tebié biaoyi.

There’s no clear winner. The best results, such as they are, appear to come from the Dutch and Portuguese (Brazilian or standard) settings.
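One rough way to put a number on "best" is to score each result against the intended text. The sketch below uses Python's standard `difflib` for that. `REFERENCE` is the intended sentence as I understand it (an assumption, since the scan itself is an image), the two OCR strings are copied from the results above with the line break replaced by a space, and `similarity` is a hypothetical helper name.

```python
import difflib
import unicodedata

# Intended text of the scanned sentences (assumed reconstruction).
REFERENCE = (
    "Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn. Dànshì wǒ yǒu cóng zìjǐ "
    "shēnghuó lǐ délái de wǔ ge zhēnshí lìzi, dōu biǎomíng Hànzì bìng bù "
    "tèbié biǎoyì."
)

# Two of the OCR results listed above.
OCR_OUTPUT = {
    "Dutch": (
        "WÖ bu shi xuézhë, bù néng yinjing jùdiän. Danshi wö yöu cóng zip "
        "shënghuó li délái de wü ge zhënshí Iizi, döu biäomíng Hanzì bing bu "
        "tebié biäoyì."
    ),
    "Spanish": (
        "W6 bu shi xuézhe, bU néng yinjing jhdian. Danshi wO y5u cóng ziji "
        "shenghuó li délái de wü ge zhenshí Iizi, dóu bigomíng Hanzi bing bu "
        "tebié biaoyi."
    ),
}


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


scores = {lang: similarity(REFERENCE, text) for lang, text in OCR_OUTPUT.items()}
for lang, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {score:.3f}")
```

A score like this counts every wrong character equally. For cleanup purposes a setting that makes regular, easily replaced mistakes may be more useful than one with a slightly higher score but unpredictable errors.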

Aborigine legislators should use original names: activist

Aborigine politicians should use their original names, not Han Chinese names, or explain to their constituents why they don’t, the head of an aboriginal group called the Vine Cultural Association stated on Tuesday.

All eight of Taiwan’s legislators holding the seats reserved for Aborigines — Chen Ying, Liao Kuo-tung, Lin Cheng-er, Yang Jen-fu, Kao Chin Su-mei, Kung Wen-chi, Lin Chung-te, Tseng Hua-te — currently officially use “Chinese” names rather than Aborigine ones.

The head of Taiwan’s Council of Indigenous Peoples, however, does use his original name: Walis Pelin.

I’m waiting for someone to get on TV and talk about how few legislators who are Hoklo use Taiwanese rather than Mandarin forms for the romanizations of their names. (I could probably count them all on one hand, even though Taiwan has some 225 legislators.) Same thing for legislators who are Hakka but who don’t use the Hakka forms of their names in romanization.


Cantonese input method for Chinese characters

There’s a new Unicode-based phonetic input method for typing Chinese characters … using Cantonese: CantoInput.

Here’s the author’s description:

What is it?
CantoInput is a freely available, Unicode-based Chinese input method (IME) which allows you to type both traditional and simplified characters using Cantonese romanization. Both the Yale and Jyutping methods are supported. A Mandarin Pinyin mode is also available.

Why does the world need another Chinese input method?
While there already exist excellent phonetic input methods based on Mandarin Pinyin pronunciation, there is a general lack of support for Cantonese. As a Cantonese learner, I was frustrated by the difficulty of typing Chinese, especially Cantonese-specific colloquial characters. Most existing Cantonese input methods require a Chinese version of Windows and operate using non-Unicode encodings such as BIG5 or GB, while non-phonetic methods such as Cangjie have a very steep learning curve. I originally wrote this program for my own personal use but decided to make it freely available since I felt that other Cantonese speakers and learners might also find it useful. It’s still really basic at this time, but hopefully I’ll have time to improve the interface and add more features in the future.

Those interested in trying this out might find the comments on Chinese Forums useful.

May Fourth remembered

Today is the 87th anniversary of the demonstrations in Beijing that marked the beginning of what is now called the May Fourth Movement. What concerns me here is not the surge in Chinese nationalism (something the present-day PRC — and some would say Taiwan, too — could use rather less of) but the literary revolution that largely overthrew the use of Literary Sinitic (Classical Chinese).

This revolution, though, swift and remarkable as it was, unfortunately remains incomplete today. As Yin Binyong put it:

Ever since the beginnings of the May Fourth movement, many scholars — especially those who support the use of alphabetized writing for Chinese — have all advocated as the main goal of the modern Chinese language standardization movement that spoken and written Chinese should be the same. Unfortunately, this goal has remained primarily a subjective aspiration; as long as Chinese characters continue to be the sole writing system in China, this goal can never be realized. Despite the fact that literary Chinese is no longer used, nevertheless it has been replaced by a half-literary, half-vernacular style of writing, rather than a style based solely on the spoken language.

Even so, the literary movement should not be underestimated. The changes brought, for good or ill, by the introduction several decades later of “simplified” Chinese characters are practically nothing compared with the impact of the overall change from Literary Sinitic to vernacular Mandarin.

A good source of information on the literary aspect of the May Fourth Movement is The Chinese Renaissance, by Hu Shih (Hú Shì, 胡適), one of the main figures in this movement.

Finally, I’d like to direct people to Languagehat’s post yesterday on the somewhat analogous situation with classical Arabic and Arabic vernaculars, a subject I’d love to learn more about.

new MRT signage

David has posted on the inconsistent use of Tongyong Pinyin in the Taipei-area MRT system. I’ve already put a comment there, so I’ll not duplicate everything here.

I spend a lot of time complaining about signage, and my experiences in trying to get some errors in the MRT system corrected have, predictably, been frustrating. But there is something I do really like: the font for the MRT signage. (See the photos with David’s post.) Does anyone recognize it?

For those of you not in Taiwan, the MRT is the Metropolitan Rapid Transit system for the Taipei area. Most of the system takes the form of a subway. One line, however, is elevated, as is a section of a different line (which also runs on ground level for several miles).

Wenlin: ‘software for learning Chinese’

I get a lot of questions about how to do some sort of conversion involving Chinese characters. Most of the time, my answer is something like, “Get Wenlin. Even the free, non-expiring demo version (4 MB) will do what you need — and a lot more.”

For those of you who aren’t familiar with Wenlin, Random Stuff That Matters has posted a five-minute movie (with sound) of Wenlin in action (14.5 MB).

The range of what Wenlin can do extends far beyond what the movie shows. A lot of people might not notice that even in the demo a wide range of options are available under

  • Edit → Make Transformed Copy

My favorite, which is available only with the full version, is

  • Edit → Make Transformed Copy → Pinyin Transcription

Oh, it is a thing of beauty.

For those of you who have the full version, I thought I’d share a little-known feature of Wenlin: its ability to search for regular expressions.

Let’s say you are trying to remember a chengyu (set phrase) about studying, but all you can recall is that it contains the sound “rubu.” You’re not sure of the characters. You’re not even sure of the tones. First you look up entries beginning with “rubu” in Wenlin’s electronic edition of the ABC Chinese-English Comprehensive Dictionary:

  • List → Words by Pinyin
  • Then enter rubu and hit OK.

This will take you to rùbùfūchū and rúbùshèngyī. But neither of those is what you’re looking for. Now what? Here’s where regular expressions come in handy.

Hit Ctrl+F to search for something within the current page.

In the Find box, enter

  • re=r(u|ū|ú|ǔ|ù)b(u|ū|ú|ǔ|ù)

This will yield:

  • chǒngrǔbùjīng 寵辱不驚[宠–惊] f.e. unmoved by honors/disgrace
  • lèirúbùgān 淚濡不乾[泪–干] f.e. be drowned in tears
  • nièrúbùyán 囁嚅不言[嗫—] f.e. 〈wr.〉 move the mouth without speaking
  • xuérúbùjí 學如不及[学—] f.e. study as if one could never learn enough

Bingo!
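The same kind of search can be tried outside Wenlin. Here is a minimal sketch using Python's `re` module rather than Wenlin's own regex engine; the `re=` prefix is Wenlin's and is dropped here. The first four headwords come from the results above, the fifth is a hypothetical extra entry added only to show a non-match, and `find_matches` is a made-up helper name.

```python
import re
import unicodedata


def nfc(s):
    """Normalize so each tone-marked vowel is a single precomposed character."""
    return unicodedata.normalize("NFC", s)


# "r", then u in any tone, then "b", then u in any tone.
PATTERN = re.compile(nfc("r(u|ū|ú|ǔ|ù)b(u|ū|ú|ǔ|ù)"))

HEADWORDS = [nfc(w) for w in [
    "chǒngrǔbùjīng",
    "lèirúbùgān",
    "nièrúbùyán",
    "xuérúbùjí",
    "rénshānrénhǎi",  # hypothetical extra entry; contains no "rubu" syllables
]]


def find_matches(words, pattern=PATTERN):
    """Return the words that contain the pattern anywhere inside them."""
    return [w for w in words if pattern.search(w)]


for word in find_matches(HEADWORDS):
    print(word)
```

Because the pattern is not anchored to the start of the word, it finds "rubu" in the middle of a headword, which is what the lookup by initial letters could not do.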

The reason for using OR pipes to separate the possibilities instead of putting them together — i.e., the reason for writing (u|ū|ú|ǔ|ù) instead of [uūúǔù] — is that the regex library sees non-ASCII characters as strings of bytes (UTF-8); thus, without the pipes you could end up with extra garbage or not find what you intend to at all. This might be fixed in the next version.
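To see why a bracketed class goes wrong when a regex engine works on UTF-8 bytes, here is a small Python sketch. It only imitates a byte-oriented engine by running `re` over `bytes`; it is not Wenlin's regex library, and all the variable names are my own.

```python
import re
import unicodedata

VOWELS = unicodedata.normalize("NFC", "uūúǔù")
TEXT = unicodedata.normalize("NFC", "xuérúbùjí")
ENCODED = VOWELS.encode("utf-8")  # each tone-marked vowel becomes two bytes

# Character-level matching: the class holds five characters, as intended.
char_class = re.compile("r[%s]b[%s]" % (VOWELS, VOWELS))

# Byte-level matching: inside [...] every byte is a separate member, so the
# class matches one byte at a time, never a whole two-byte vowel.
byte_class = re.compile(b"r[%s]b[%s]" % (ENCODED, ENCODED))

# Byte-level alternation: each alternative is a complete byte sequence.
alternatives = b"|".join(v.encode("utf-8") for v in VOWELS)
byte_alt = re.compile(b"r(?:%s)b(?:%s)" % (alternatives, alternatives))

data = TEXT.encode("utf-8")
print(char_class.search(TEXT).group())             # rúbù
print(byte_class.search(data))                     # None
print(byte_alt.search(data).group().decode("utf-8"))  # rúbù

# The "extra garbage" case: the class matches only the first byte of é,
# leaving half a character in the result.
print(re.search(b"r[%s]" % ENCODED, "ré".encode("utf-8")).group())  # b'r\xc3'
```

With pipes, each alternative is matched as a whole sequence of bytes, so the search behaves the same whether the engine thinks in characters or in bytes.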