Unicode in Japan

Posted on Friday, March 10, 2006 by Pinyin Info

No-sword links to an interesting page titled Unicode in Japan: Guide to a technical and psychological struggle. It contains a lot of useful information.

The Web page also touches on script reform in postwar Japan; for the full story, see Literacy and Script Reform in Occupation Japan: Reading Between the Lines, by J. Marshall Unger. Pinyin Info offers a chapter-long selection from this book.

Other pages on that site include a Unicode tutorial, which is billed as “a page of Unicode terms, FAQs, and mistakes.”

My pet peeve about Unicode is its continuing, incorrect reference to Chinese characters as “ideographs.”

Taiwan premier backs adoption of common years

Posted on Saturday, February 25, 2006 by Pinyin Info

The ROC system for dating years, the source of Taiwan’s approaching Y1C problem, is in the news.

Premier Su Tseng-chang on Friday told lawmakers that he supported Taiwan fully adopting the Common Era system of numbering years, which is used almost universally. Currently, the year 2006 is referred to in Taiwan as the year 95.

Wang Xuan, innovator in Chinese-character technology, dies at 69

Posted on Tuesday, February 14, 2006 by Pinyin Info

Wáng Xuǎn (王选), an important figure in technology related to the printing of Chinese characters with computers, has died at the age of 69.

In 2001 he was awarded the Supreme Scientific and Technological Award, China’s highest award for achievement in the field of science.

sources:

From Founts to Fonts, People’s Daily, October 18, 1996 (interesting comments on fonts, but Wang unfortunately repeats a common misunderstanding about Pinyin)
Wang Xuan, top IT expert dies at 69, Shanghai Daily, February 13, 2006
Newsmaker: Wang Xuan, Winner of Top Sci-Tech Award, People’s Daily, February 2, 2002

URLs, Chinese characters, and the Roman alphabet

Posted on Sunday, February 5, 2006 by Pinyin Info

In Will China Build a Separate Internet? John Yunker, citing Naseem Javed’s When Will The Internet Be Divided Among Nations?, states, “Naseem does raise a very important point — for Chinese speakers, the Internet is far from user-friendly. The major obstacle is the URL, which is still limited to ASCII (Latin) characters.”

I don’t see where Naseem Javed made that particular point — but no matter. I just want to note that URLs in ASCII do not present an obstacle to Internet users in China. After all, the Roman alphabet (specifically, Pinyin) is what most people use to enter Chinese characters on computers in the first place. And even those in China who don’t use Pinyin to input Chinese characters are perfectly capable of using their, yes, QWERTY keyboards to type the ASCII in URLs, the Roman alphabet having been taught for decades to every schoolchild in China (at least to those now literate enough to use the Internet in the first place).

On the other hand, having to enter Chinese-character URLs would be an obstacle to most of the world’s population.

Those looking to argue that ASCII URLs could be an obstacle would do better to look to Russia, Greece, or Saudi Arabia.

The folks at ICANN and IETF are working to upgrade the DNS to Unicode, but this will take time. There is a workaround in use that lets Web users enter Chinese characters in a URL, which is then transformed behind the scenes into ASCII characters (an encoding known as “Punycode”), but I’m not sure how widely used this system currently is.

IE7 is supposed to have good support for Punycode. Now if only IE would finally get CSS right….

Here’s an example of Punycode: 拼音 is xn--muuy29i, according to an open-source Punycode converter. Thus, http://拼音.pinyin.info and http://xn--muuy29i.pinyin.info should both lead to the same page. And I would hope that the address bar in the browser would read http://拼音.pinyin.info instead of the xn--muuy29i ASCII version.
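For anyone who would rather check a conversion locally than rely on an online converter, here is a minimal sketch in Python, whose standard library includes `punycode` and `idna` codecs. The function names are illustrative only, and the built-in `idna` codec follows the older IDNA 2003 rules, so treat this as a quick check under those assumptions rather than a complete implementation.

```python
def to_ascii_hostname(hostname):
    """Convert a hostname to its ASCII ("xn--") form, label by label."""
    # The "idna" codec splits on dots, Punycode-encodes each non-ASCII
    # label, and adds the "xn--" prefix; ASCII labels pass through unchanged.
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(hostname):
    """Reverse the conversion so the address can be shown in characters."""
    return hostname.encode("ascii").decode("idna")


print(to_ascii_hostname("拼音.pinyin.info"))
print(to_unicode_hostname("xn--muuy29i.pinyin.info"))
# The bare Punycode for a single label, without the "xn--" prefix:
print("拼音".encode("punycode").decode("ascii"))
```

The first line printed should match the xn--muuy29i address given above, and the second should turn it back into the address with Chinese characters.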

If you add a comment on how well the Punycode tests work for you, please mention your computer’s operating system and browser. (I’m using Win2K and Opera 8.51, and both http://拼音.pinyin.info and http://xn--muuy29i.pinyin.info work fine.)

Firefox extensions for Mandarin Chinese texts

Posted on Tuesday, January 10, 2006 by Pinyin Info

Although my favorite Web browser remains Opera (which is now free), I recognize that Firefox (which has always been free) has some nice things going for it, especially its wide range of extensions.

At least two of these extensions might be of special interest to readers of this site: Translate, which will translate a Web page from Mandarin Chinese (as well as lots of other languages) into English (more or less), and the Adso GreaseMonkey Script, which provides Pinyin and English annotation for Chinese characters.

First, Translate, which is the cat’s pajamas. I don’t know how I survived without it.

Using Firefox, install Translate. (If that link has expired, find the installation through the home page of Gravelog.)
- Firefox will likely block your installation at first, which is a good thing. (Safety first.)
- Look for this message in a bar near the top of your browser window: “To protect your computer, Firefox prevented this site (ctomer.com) from installing software on your computer.”
- Click on the “Edit Options” button in the same bar (near the top right of your screen).
- A pop-up box will appear. Click on “Allow” and then “Close”.

Restart Firefox.

Try out the extension by going to a Web page with text in Chinese characters.

From the Firefox menu, choose Tools --> Translate --> Translate from Chinese-simp[lified] (or Tools --> Translate --> Translate from Chinese-trad[itional], as appropriate). The translated Web page will appear in a few moments.

If you want to translate just a portion of the text on a Web page, or if Babel Fish chokes on the text of the entire Web page and you need an alternate approach, simply use your mouse to select the text you’re interested in. Next, right click and select Translate --> From Chinese-simp (or Translate --> From Chinese-trad, as appropriate). Note: The translation will appear in a new tab, so don’t sit around waiting for the translation to appear in the same tab you’ve been working in.

Translate also handles Japanese, Korean, French, German, Spanish, Italian, Dutch, Portuguese, Greek, and Russian.

A related but less effective extension is gtranslate, which handles limited amounts of text in simplified but not traditional characters.

Now let’s examine the Adso GreaseMonkey Script.

Install Firefox or upgrade to version 1.5.
Using Firefox, install Greasemonkey. (If that link has expired, find the installation through the main Greasemonkey page.)
- Firefox will likely block your installation at first, which is a good thing. (Safety first.)
- Look for this message in a bar near the top of your browser window: “To protect your computer, Firefox prevented this site (greasemonkey.mozdev.org) from installing software on your computer.”
- Click on the “Edit Options” button in the same bar (near the top right of your screen).
- A pop-up box will appear. Click on “Allow” and then “Close”.
Restart Firefox.
Install the Adso GreaseMonkey Script.
- Look for this message in a bar near the top of your browser window: “This is a Greasemonkey user script. Click Install to start using it.”
- Click the “Install” button in the same bar (near the top right of your screen).

Try it out by going to a Web page with text in Chinese characters.

To activate the script, press “a”.
Click on or highlight the Chinese characters you’re interested in seeing the Pinyin for.
Move your mouse over the Chinese characters in the pop-up box; the Pinyin will appear.
(Screenshot of how this pop-up looks.)

To deactivate the script, press any other key.

For more information, see the Firefox Plugin: Chinese text annotation thread on Chinese-forums.com.

Of related interest is the Rikai Web page converter.

Chinese characters, Pinyin, and computers

Posted on Sunday, January 8, 2006 by Pinyin Info

Recently added to my list of recommended readings: Characters and Computers, edited by Victor H. Mair and Yongquan Liu. Although this collection was published in 1991 and thus no longer represents the state of the art, the issues raised here remain relevant.

Of particular interest, at least where Pinyin is concerned, is the important essay Pinyin-to-Chinese Character Computer Conversion Systems and the Realization of Digraphia in China, by Yin Binyong, who has also written the books on Pinyin orthography: Chinese Romanization: Pronunciation and Orthography and the Xinhua Pinxie Cidian. The complete text of this substantial essay (nearly 6,000 words) is available here on Pinyin Info. I strongly encourage everyone to read this.

Here are the subject headings:

The Three Stages in the Development of Pinyin-to-Chinese Character Computer Conversion Systems
The Theoretical Contribution of the Pinyin-to-Chinese Character Conversion System to the Realization of Digraphia in China
Practical Contributions of Pinyin-to-Chinese Character Conversion Systems to Digraphia in China
1. Can alphabetized Chinese take the road of “pinyin pictophonetic characters”?
2. What is an appropriate way to handle the representation of tones in a Pinyin-based writing system?
3. How to solve the problem of homonyms in alphabetized (Pinyin) Chinese writing?
Directions for the Future

Taiwan’s Y1C problem

Posted on Monday, January 2, 2006 by Pinyin Info

So, how did you ring in the year 95?

Yes, 95. Taiwan continues to make official use of a calendar tied to the founding of the Republic of China on January 1, 1912. That day began year 1.

For anyone doing a double take, that’s the Republic of China, better known these days as “Taiwan,” though Taiwan wasn’t a part of China in 1912. (And plenty of people would argue it’s not part of China now.) The People’s Republic of China was founded on October 1, 1949. National day in Taiwan, however, is marked not on January 1 but October 10, to commemorate the 1911 revolution that overthrew the Qing dynasty.

This everything-begins-again-with-me dating system, which reflects the habits of the imperial dynasties the ROC was supposed to have eliminated, isn’t just a quaint local custom. Its continued use is heading Taiwan toward its very own type of Y2K problem. In just a few years, when the ROC reaches the age of 100 and has to jump to three-digit years, Taiwan will likely experience what I like to call the Y1C problem. (Yes, I know: I’m mixing systems in that C represents hundred in a system that uses M, not K, for “thousand.” But that’s the best I could come up with. I’m open to suggestions for catchy but correct names.)
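To make the problem concrete, here is a rough sketch in Python of the year conversion and of one way it could go wrong. The function names and the two-digit year field are hypothetical, chosen only for illustration; they do not describe any particular system in Taiwan.

```python
ROC_EPOCH_OFFSET = 1911  # ROC year 1 is 1912, so ROC year = Gregorian year - 1911


def to_roc_year(gregorian_year):
    """Convert a Gregorian year to an ROC year."""
    if gregorian_year < 1912:
        raise ValueError("the ROC calendar starts in 1912")
    return gregorian_year - ROC_EPOCH_OFFSET


def store_in_two_digit_field(roc_year):
    """Mimic a hypothetical record layout that keeps only two digits of the year."""
    return f"{roc_year % 100:02d}"


for year in (1912, 2006, 2010, 2011):
    roc = to_roc_year(year)
    print(year, roc, store_in_two_digit_field(roc))
```

In a layout like this one, the years 95 and 99 are stored intact, but the year 100 is stored as 00, which sorts before 99 and cannot be told apart from a much earlier year.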

As far as I know, nothing is being done yet to address this. Slow are the wheels of Taiwan’s bureaucracy. To give an example of this, the Y2K problem certainly did not lack publicity, outrageous hype even; yet in 2005 the high-profile English Web site of the Office of the President gave the year as being “105” (that is, 2005 minus 1900). About six weeks ago, when I gave a presentation to officials in charge of various government agencies’ Internet departments, listing some of the things wrong with the Taiwan government’s English-language Web sites, I specifically brought up the example of the presidential office’s howler.

I took it as a good sign that today, when I checked that site again, I saw the year given as 2006. But then I glanced at the Mandarin version of the same site. The year there: 106.

Before the year 100 comes in 2011, somebody remind me to find a bank outside Taiwan for what little money I have.

table-free CSS method for interlinear texts on Web pages

Posted on Thursday, December 29, 2005 by Pinyin Info

The interlinear version of the Scriptures is the prototype or ideal of all translation.
— Walter Benjamin

Hebrew-English interlinear text of part of Genesis 11 (Tower of Babel)

Interlinear texts are probably familiar to most who have studied a foreign language. Interlinear texts on the Web, however, tend to be in the form of tables. And, like most other fans of CSS, I tend to cringe at the word “table.” Moreover, text within tables doesn’t wrap to different window sizes.

I am generally opposed to the practice of displaying texts in both pinyin and Chinese characters interlinearly as opposed to en face. Pinyin was not designed to be an annotation system for Chinese characters but to be a full writing system (orthography) for modern Mandarin. Many if not most people, however, are misinformed about this basic point. Consequently, I try to avoid presenting pinyin in a way that could reinforce the mistaken notion that it is a supplement to characters rather than an independent system. Nevertheless, I recognize that interlinear texts can be useful in some circumstances. Moreover, perhaps others can make less problematic use of an interlinear technique for displaying other languages and scripts.

About six months ago I started to work out a standards-compliant, table-free method for displaying Chinese characters and pinyin interlinearly on Web pages. As is so often the case, once I figured out the basics I became distracted by something else and never finished. A recent request for a way to display ruby text with pinyin, however, has prompted me to present some of my ideas on this in case others might find them useful and produce something with them. And, at any rate, CSS3’s ruby text feature isn’t likely to be implemented by the major browsers anytime soon.

The fundamental approach of the method I recommend is to put individual words/phrases and their pinyin/character equivalent in floated div tags and use CSS to make everything look right. Unfortunately, the method isn’t semantically correct because it uses div and p tags for individual words rather than true blocks of text; but I don’t see that as a big enough problem to resort to the trouble of putting all this into XML. YMMV.

This is adapted from a thumbnail-captioning method detailed on A List Apart.

Floated elements, of course, need to have declared widths. But this gets tricky because words are of various widths. It’s not enough, either, to set widths based on the number of letters or Chinese characters within a block, because letters themselves vary in width.

The five-letter syllable “chong,” for example, is wider than the five-letter “liang” because the letters l and i are thinner than any of the letters in chong — at least in most fonts. And the widths of pinyin elements do not correspond to the widths of Chinese characters.

With Chinese characters the situation is for the most part different. Note that 哩哩啦啦 and 爽爽快快 take the same amount of horizontal space to write:

哩哩啦啦
爽爽快快

The same, however, is not true of their Pinyin equivalents:

līlīlālā
shuǎngshuǎngkuàikuài

One way to deal with this is “headline counting,” which is an old method copy editors use to help make headlines fit within allotted spaces. Under this system, letters, numbers, and punctuation marks are given different values, based on their approximate width. Here are the values under one headline-counting method:

count value	applicable letters, numbers, punctuation marks
0.5	flitj.,:;!
1.0	abcdeghknopqrsuvxyz[space]I1-[vowels, including i, with tone marks]
1.5	mwABCDEFGHJKLNOPQRSTUVXYZ234567890$?
2.0	MW[em dash]

Thus, “pinyin” would have a count of 5, but “Pinyin” would have a count of 5.5. And “Hanyu Pinyin” would have a count of 12.
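As a first step toward automating this, the counting itself can be sketched in a few lines of Python. The values come straight from the table above; the function name, and the choice to count any character missing from the table (an apostrophe, for instance) as 1.0, are assumptions made for the sake of the example.

```python
import unicodedata

# Count values copied from the headline-counting table above.
COUNT_TABLE = {
    0.5: "flitj.,:;!",
    1.0: "abcdeghknopqrsuvxyz I1-",
    1.5: "mwABCDEFGHJKLNOPQRSTUVXYZ234567890$?",
    2.0: "MW\u2014",  # \u2014 is the em dash
}
CHAR_VALUES = {ch: value for value, chars in COUNT_TABLE.items() for ch in chars}


def has_diacritic(ch):
    """True if the character decomposes into a base letter plus combining marks."""
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))


def headline_count(text):
    """Return the headline count for a piece of romanized text."""
    total = 0.0
    # Compose first so that each tone-marked vowel is a single character.
    for ch in unicodedata.normalize("NFC", text):
        if has_diacritic(ch):
            total += 1.0  # vowels with tone marks, including i, count 1.0
        else:
            total += CHAR_VALUES.get(ch, 1.0)  # assumption: unlisted characters count 1.0
    return total


for word in ("pinyin", "Pinyin", "Hanyu Pinyin", "wèishénme"):
    print(word, headline_count(word))
```

The first three results should agree with the counts worked out by hand above.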

To have the text spaced as attractively as possible, counts would also need to be performed for the Chinese characters and then checked against the count for the romanized text to make sure the larger value is used. This is because counts for Pinyin words could result in widths being set smaller than required, such as in the case of lí’è, which is thinner than 罹厄 unless the characters are made to be unusually small relative to the romanization. Deriving a count for the width of Chinese characters, however, is easy, because in most cases they can safely be treated as if they all took the same amount of horizontal space. The value assigned for the counting of Chinese characters would depend on how large you want to make them in relation to the pinyin.
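The comparison described above could be sketched in Python along these lines. The function name and the default of 2.0 per Chinese character are hypothetical; the right value depends on how large the characters are set relative to the pinyin. The romanization count is passed in as a number that has already been worked out.

```python
def pair_count(romanization_count, characters, count_per_character=2.0):
    """Return the count that should set the width of one word block.

    romanization_count: headline count already worked out for the Pinyin.
    characters: the Chinese-character text for the same word.
    count_per_character: hypothetical value; raise it to make the
        characters larger relative to the Pinyin.
    """
    # Every character is treated as taking the same horizontal space.
    character_count = len(characters) * count_per_character
    return max(romanization_count, character_count)


# The Pinyin is the wider of the two here, so its count wins.
print(pair_count(9.5, "為什麼"))
# Here the characters are wider (taking the Pinyin count as roughly 3).
print(pair_count(3.0, "罹厄"))
```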

Next, assign a CSS class to the relevant div. I’ve named the classes according to the counts (multiplied by 10). The base text goes inside a paragraph tag. Thus, to put “wèishénme” over “為什麼” would require the following code:

<div class="count95"> wèishénme
<p>為什麼</p> </div>
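Writing these divs by hand gets tedious, so here is a small Python sketch that produces the same markup from a word pair and its count. The function name is made up for illustration, and the count is assumed to have been calculated already.

```python
from html import escape


def interlinear_div(top_text, base_text, count):
    """Build the markup for one word: top text in the div, base text in a <p>."""
    # Class names are the count multiplied by 10, e.g. 9.5 -> "count95".
    class_name = f"count{round(count * 10)}"
    return (
        f'<div class="{class_name}"> {escape(top_text)}\n'
        f"<p>{escape(base_text)}</p> </div>"
    )


print(interlinear_div("wèishénme", "為什麼", 9.5))
```

Swapping the first two arguments puts the Chinese characters on top instead. Counts above 9.5 would produce class names for which the CSS has no rules yet, so those classes would need to be added.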

The main thing requiring attention is coming up with the correct width for each class. In the CSS for this example, I’ve rounded up counts so that two different classes can have the same width. In a finished version, perhaps they should be given separate widths or the pairs of classes should be combined to make for simpler code.

Here’s the CSS:

   .interlinear div     {
        margin-right: 0.2em;    /* FOR THE SPACES BETWEEN WORDS */
        height: 4.0em;          /* TO KEEP LINES FROM OVERLAPPING */
        }
   .count20, .count25      {
        width: 1.5em;
        }
   .count30, .count35      {
        width: 2.0em;
        }
   .count40, .count45      {
        width: 2.5em;
        }
   .count50, .count55      {
        width: 3.0em;
        }
   .count60, .count65      {
        width: 3.2em;
        }
   .count70, .count75      {
        width: 3.5em;
        }
   .count80, .count85      {
        width: 4.0em;
        }
   .count90, .count95      {
        width: 4.5em;
        }


   .interlinear p   {
        font-size: 100%;
        margin-top: 0.3em;
        line-height: 1em;
        }


  /* ++++++++++++++++++++++ */
  /* the CSS below this point probably does not need to be adjusted */
  /* except to add more 'countXX' classes for longer words  */
  /* ++++++++++++++++++++++ */

   .interlinear div.spacer {
        clear: both;
        height: 0;
        }
  .count20, .count25, .count30, .count35, .count40, .count45, 
  .count50, .count55, .count60, .count65, .count70, .count75, 
  .count80, .count85, .count90, .count95 {
        float: left;
        text-align: center;
        }
  .interlinear p   {
        text-align: center;
        font-family: serif;
        font-size: 100%;
        }
   .interlinear {
        font-family: serif;
        font-size: 100%;
        }

Note the unfortunate but likely necessary use of spacer divs to separate paragraphs by clearing the floated elements. In the HTML these divs take the following form:

<div class="spacer">   </div>

Here’s some of this in action:

Here’s some interlinear text with Pinyin above Chinese characters

Duìmiàn  de  nǚhái  kàn  guòlai,  kàn  guòlai,  kàn  guòlai.
對面  的  女孩  看  過來，  看  過來，  看  過來.

Zhèlǐ  de  biǎoyǎn  hěn  jīngcǎi.
這裡  的  表演  很  精彩.

Qǐng  bùyào  jiǎzhuāng  bùlǐbùcǎi.
請  不要  假裝  不理不睬.

Here’s some interlinear text with Chinese characters above Pinyin

對面  的  女孩  看  過來，  看  過來，  看  過來.
Duìmiàn  de  nǚhái  kàn  guòlai,  kàn  guòlai,  kàn  guòlai.

這裡  的  表演  很  精彩.
Zhèlǐ  de  biǎoyǎn  hěn  jīngcǎi.

請  不要  假裝  不理不睬.
Qǐng  bùyào  jiǎzhuāng  bùlǐbùcǎi.

So, does anyone have suggestions for improving this or know how to program a way to automate the process as much as possible?

Pinyin News

news and discussions mainly related to Chinese characters and romanization
