typing in Pinyin on a Windows 2000/XP system

Jason Frazier has used the free Microsoft Keyboard Layout Creator to devise a keyboard method for entering Pinyin texts with tone marks. This will work on Windows 2000 and XP systems.

Basically, to type a vowel with a tone mark, first press the key corresponding to the tone you want and then the vowel (or “v” for ü). Many may find this method preferable to using an online tool that converts Pinyin tone numbers to tone marks (my own online converter being desperately in need of an update) or a separate program such as Wenlin (or its free but tremendously useful demo version).
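For the curious, here is a rough Python sketch of the idea behind this kind of "tone key first, then vowel" entry. It is not Jason's layout: the digit keys 1 through 4 and the `compose` function are hypothetical stand-ins, since the actual keys are described on his page. The sketch simply attaches a Unicode combining mark to the vowel and normalizes the result to a single precomposed character.

```python
import unicodedata

# Hypothetical tone keys: the digits 1-4 stand in for whatever keys
# the real keyboard layout assigns to the four tones.
TONE_MARKS = {
    "1": "\u0304",  # combining macron (first tone)
    "2": "\u0301",  # combining acute (second tone)
    "3": "\u030c",  # combining caron (third tone)
    "4": "\u0300",  # combining grave (fourth tone)
}

def compose(tone_key, vowel):
    """Return the vowel with the tone mark selected by tone_key."""
    # "v" is the usual stand-in for u-umlaut.
    base = "\u00fc" if vowel == "v" else vowel
    # NFC folds base letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", base + TONE_MARKS[tone_key])

print(compose("1", "a"), compose("3", "v"), compose("4", "e"))
```

A real keyboard layout does this inside the operating system, of course; the point here is only that the tone key selects a diacritic and the following key selects the letter that carries it.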

To download and install this Pinyin-entry tool, follow the directions on Jason’s Web page. I’ve added a screenshot below to help clarify part of the installation process.

[image: screenshot of method to add pinyin keyboard layout]

kanji conversions cause crash

This is a weird one: Sharp Corp. has acknowledged that more than 10 million of its cell phones have a “software glitch that disables the handsets when certain hiragana phrases are converted into kanji when writing e-mail.”

The phrases known to freeze the phones are: “mirare makuccha,” which roughly means “people’s eyes were fixed on me,” and “kazega naori kaketa,” meaning, “I was recovering from a cold,” according to Nikkei Net, a Web-based business and information technology news site.

I’m not sure I could come up with a proper comment on this even if I didn’t have a bad case of jet lag.

OCR and Pinyin texts

[This entry is largely for my own reference. But feel free to read on, especially if you’re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]

What’s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn’t have a “Pinyin” setting. I’d be surprised if any OCR software currently does.

Getting second tones, fourth tones, and umlauts to be read correctly shouldn’t be a big problem, given how the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is a more-or-less regular rendering of vowels with first- and third-tone marks as something else that can be fixed quickly through a search-and-replace procedure.

Here’s an image, slightly reduced, of what was being scanned:
[image: scan of sentences in Pinyin]

Here’s the text:
Wǒ bù shì xuézhě, bù néng yǐnjīng jùdiǎn. Dànshì wǒ yǒu cóng zìjǐ shēnghuó lǐ délái de wǔ ge zhēnshí lìzi, dōu biǎomíng Hànzì bìng bù tèbié biǎoyì.

Here are the results of OCR, with various language settings applied:

DUTCH
WÖ bu shi xuézhë, bù néng yinjing jùdiän. Danshi wö yöu cóng zip shënghuó
li délái de wü ge zhënshí Iizi, döu biäomíng Hanzì bing bu tebié biäoyì.

CATALAN
W6 bu shi xuezh8, bir neng yinjing jhdisn. Danshi w6 y5u cong ziji shenghuó
li delai de wü ge zhenshí lizi, d6u bibmíng Hanzi bing bu tebie bigoyi.

DANISH
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi w6 y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dóu bigoming Hanzi bing bu tebie biaoyi.

FINNISH
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
Ii delai de wii ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

FRENCH
W6 bù shi xuézhë, bù néng yinjing jùdian. Dànshi wO y5u cong ziji shënghu6
li délai de wü ge zhënshi Iizi, dou bigoming Hànzi bing bu tèbié biaoyi.

GERMAN
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

GERMAN (SWISS)
WÖ bu shi xuezhe, bU neng yinjing jiidiän. Danshi wö yöu cong ziji shGnghu6
li delai de wü ge zhenshi Iizi, döu biäoming Hanzi bing bu tebie biäoyi.

ITALIAN
W6 bu shì xuézhe, bù néng yinjing jùdian. Dànshì w6 y5u cong ziji shènghu6
li délai de wii ge zhenshi Iizi, dou bigoming Hànzì bing bu tèbié biaoyì.

NYNORSK
W6 bu shi xuezhe, bU neng yinjing jhdian. Danshi wO y5u cong ziji shGnghu6
li delai de wii ge zhenshi Iizi, dou biaoming Hanzi bing bu tebie biaoyi.

PORTUGUESE (BRAZILIAN)
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

PORTUGUESE
WÕ bu shi xuézhe, bU néng yinjing jùdiãn. Danshi wõ yõu cóng ziji shènghuó
li délái de wü ge zhenshí Iizi, dõu biãomíng Hanzi bing bu tèbié biãoyi.

SPANISH
W6 bu shi xuézhe, bU néng yinjing jhdian. Danshi wO y5u cóng ziji shenghuó
li délái de wü ge zhenshí Iizi, dóu bigomíng Hanzi bing bu tebié biaoyi.

There’s no clear winner. The best results, such as they are, appear to come from the Dutch and Portuguese (Brazilian or standard) settings.
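As a rough illustration of the search-and-replace cleanup mentioned earlier, here is a small Python sketch aimed at the Portuguese output. The mapping table and the `fix_tildes` name are my own assumptions, not part of any OCR package. It treats every tilde vowel as a third-tone vowel, which is only mostly right: in the output above, "dõu" should really be "dōu", and vowels that lost their marks entirely (as in "yinjing") would still need fixing by hand.

```python
# Hypothetical cleanup table for the Portuguese OCR output shown above.
# Assumption: a tilde vowel stands for a third-tone vowel. This is not
# always true ("dõu" should be "dōu"), so a manual pass is still needed.
TILDE_TO_THIRD_TONE = str.maketrans({
    "\u00e3": "\u01ce",  # a with tilde -> a with caron
    "\u00f5": "\u01d2",  # o with tilde -> o with caron
    "\u00c3": "\u01cd",  # capital A with tilde -> capital A with caron
    "\u00d5": "\u01d1",  # capital O with tilde -> capital O with caron
})

def fix_tildes(text):
    """Replace tilde vowels with their third-tone counterparts."""
    return text.translate(TILDE_TO_THIRD_TONE)

sample = "Danshi w\u00f5 y\u00f5u c\u00f3ng ziji, d\u00f5u bi\u00e3om\u00edng Hanzi"
print(fix_tildes(sample))
```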

Cantonese input method for Chinese characters

There’s a new Unicode-based phonetic input method for inputting Chinese characters … using Cantonese: CantoInput.

Here’s the author’s description:

What is it?
CantoInput is a freely available, Unicode-based Chinese input method (IME) which allows you to type both traditional and simplified characters using Cantonese romanization. Both the Yale and Jyutping methods are supported. A Mandarin Pinyin mode is also available.

Why does the world need another Chinese input method?
While there already exist excellent phonetic input methods based on Mandarin Pinyin pronunciation, there is a general lack of support for Cantonese. As a Cantonese learner, I was frustrated by the difficulty of typing Chinese, especially Cantonese-specific colloquial characters. Most existing Cantonese input methods require a Chinese version of Windows and operate using non-Unicode encodings such as BIG5 or GB, while non-phonetic methods such as Cangjie have a very steep learning curve. I originally wrote this program for my own personal use but decided to make it freely available since I felt that other Cantonese speakers and learners might also find it useful. It’s still really basic at this time, but hopefully I’ll have time to improve the interface and add more features in the future.

Those interested in trying this out might find the comments on Chinese Forums useful.

Wenlin: ‘software for learning Chinese’

I get a lot of questions about how to do some sort of conversion involving Chinese characters. Most of the time, my answer is something like, “Get Wenlin. Even the free, non-expiring demo version (4 MB) will do what you need — and a lot more.”

For those of you who aren’t familiar with Wenlin, Random Stuff That Matters has posted a five-minute movie (with sound) of Wenlin in action (14.5 MB).

The range of what Wenlin can do extends far beyond what the movie shows. A lot of people might not notice that even in the demo a wide range of options are available under

  • Edit → Make Transformed Copy

My favorite, which is available only with the full version, is

  • Edit → Make Transformed Copy → Pinyin Transcription

Oh, it is a thing of beauty.

For those of you who have the full version, I thought I’d share a little-known feature of Wenlin: its ability to search for regular expressions.

Let’s say you are trying to remember a chengyu (set phrase) about studying, but all you can recall is that it contains the sound “rubu.” You’re not sure of the characters. You’re not even sure of the tones. First you look up entries beginning with “rubu” in Wenlin’s electronic edition of the ABC Chinese-English Comprehensive Dictionary:

  • List → Words by Pinyin
  • Then enter rubu and hit OK.

This will take you to rùbùfūchū and rúbùshèngyī. But neither of those is what you’re looking for. Now what? Here’s where regular expressions come in handy.

Hit Ctrl+F to search for something within the current page.

In the Find box, enter

  • re=r(u|ū|ú|ǔ|ù)b(u|ū|ú|ǔ|ù)

This will yield:

  • chǒngrǔbùjīng 寵辱不驚[宠–惊] f.e. unmoved by honors/disgrace
  • lèirúbùgān 淚濡不乾[泪–干] f.e. be drowned in tears
  • nièrúbùyán 囁嚅不言[嗫—] f.e. 〈wr.〉 move the mouth without speaking
  • xuérúbùjí 學如不及[学—] f.e. study as if one could never learn enough

Bingo!
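If you don't have Wenlin handy, the same kind of search can be tried out in Python. This is only a sketch: the `ENTRIES` list is a tiny hypothetical stand-in for a real dictionary, and `find_rubu` is a name I made up. Unlike the in-page search above, it also returns the two words that begin with "rubu," since they are in the list too.

```python
import re
import unicodedata

def nfc(text):
    """Normalize to precomposed characters so that e.g. u-acute is one code point."""
    return unicodedata.normalize("NFC", text)

# The same alternation as in the Wenlin search, minus the "re=" prefix.
PATTERN = re.compile(nfc("r(u|ū|ú|ǔ|ù)b(u|ū|ú|ǔ|ù)"))

# A tiny stand-in word list (hypothetical); Wenlin searches its own dictionary.
ENTRIES = [
    "rùbùfūchū",
    "rúbùshèngyī",
    "chǒngrǔbùjīng",
    "lèirúbùgān",
    "nièrúbùyán",
    "xuérúbùjí",
    "xuéxí",
]

def find_rubu(entries):
    """Return the entries that contain "rubu" in any combination of tones."""
    return [e for e in map(nfc, entries) if PATTERN.search(e)]

for word in find_rubu(ENTRIES):
    print(word)
```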

The reason for using OR pipes to separate the possibilities instead of putting them together — i.e., the reason for writing (u|ū|ú|ǔ|ù) instead of [uūúǔù] — is that the regex library sees non-ASCII characters as strings of bytes (UTF-8); thus, without the pipes you could end up with extra garbage or not find what you intend to at all. This might be fixed in the next version.
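The byte problem can be reproduced in Python by deliberately matching on UTF-8 bytes rather than on text. This is a sketch of the general pitfall, not of Wenlin's actual regex library, whose internals I'm only assuming behave similarly. In a byte-oriented pattern, a bracketed class is a set of single bytes, so a two-byte vowel such as "ú" can never match it as a unit, while the piped alternatives match whole byte sequences.

```python
import re
import unicodedata

def nfc(text):
    """Normalize to precomposed characters before encoding."""
    return unicodedata.normalize("NFC", text)

word = nfc("xuérúbùjí")
data = word.encode("utf-8")

# Byte-oriented patterns: the bracketed class becomes a set of single bytes,
# while each piped alternative stays a complete byte sequence.
byte_class = re.compile(nfc("r[uūúǔù]b").encode("utf-8"))
byte_alt = re.compile(nfc("r(u|ū|ú|ǔ|ù)b").encode("utf-8"))

# Text-oriented pattern: here the class really is a set of characters.
text_class = re.compile(nfc("r[uūúǔù]b"))

print("bytes, class:", bool(byte_class.search(data)))
print("bytes, pipes:", bool(byte_alt.search(data)))
print("text, class: ", bool(text_class.search(word)))
```

A regex engine that works on characters rather than bytes doesn't need the pipes, which is presumably what a fix would amount to.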

Windows computer systems and Pinyin input of Chinese characters

I often get messages from people asking how to use Hanyu Pinyin to input Chinese characters on their English-language Windows systems. But the most I’ve ever added to my site on this topic is a brief page on using Pinyin to type Chinese characters on a U.S. English Windows 2000 system. Fortunately for everyone, now there’s Pinyin Joe’s Chinese computing resources, which explains in user-friendly detail how to set up Western-language Windows XP computers to input Chinese characters using Pinyin and even zhuyin fuhao. I certainly don’t recommend using zhuyin, but it’s nice to know that the information on how to type it (both by itself and for character input) is available and presented so clearly.

The site covers a few other areas as well. Check it out. Pinyin Joe’s also promises to cover Vista once Microsoft finally releases it.

Another good place to ask related questions is Forumosa’s technology forum, especially within the thread on Hanyu Pinyin input for XP.

if people keep using Pinyin input, China will die, says Wubi-input inventor

Wang Yongmin (Wáng Yǒngmín, 王永民), the developer of the much-hyped “Wubi” input method for Chinese characters, seems to get a bit more shrill each time he has a chance to make it into the papers. The Wubi Chinese character input method works by assembling characters based on the shapes of elements within characters.

Here’s something from a recent rant:

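Recently, Professor Wang Yongmin, the inventor of the Wubi input method, expressed the following view during a lecture at the Graduate School of the Chinese Academy of Sciences.

Wang Yongmin holds that the form of a Chinese character is its “body” and the sound of a character is its “clothing”; to “discard the form and keep the sound” is to “give up the body and take the clothing.” Pinyin input does away with direct thought about, and use of, the elements from which characters are built. As a result, Chinese characters will inevitably perish in both form and spirit, and the cultural genes inherent in the characters themselves will be lost entirely.

Wang Yongmin holds that, in cultural terms, the great revival of the Chinese nation is also the great revival of Chinese-character culture; without Chinese characters, there would be no Chinese nation. He pointed out that the relationship between Chinese characters as primary and Hanyu Pinyin as auxiliary was settled long ago.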

source: Wáng Yǒngmín: Pīnyīn shūrù shì Hànzì wénhuà de jué [fen]mù jī[qi] (王永民:拼音输入是汉字文化的掘墓机), Science Web, March 17, 2006

Unicode in Japan

No-sword links to an interesting page titled Unicode in Japan: Guide to a technical and psychological struggle. There’s a lot of useful information in this.

The Web page also touches on script reform in postwar Japan; for the full story, see Literacy and Script Reform in Occupation Japan: Reading Between the Lines, by J. Marshall Unger. Pinyin Info offers a chapter-long selection from this book.

Other pages on that site include a Unicode tutorial, which is billed as “a page of Unicode terms, FAQs, and mistakes.”

My pet peeve about Unicode is its continuing, incorrect reference to Chinese characters as “ideographs.”