indicator of character frequency: a suggestion for programmers

It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.

Here’s a sentence from a recent CCP rant news item that can serve as an example:

非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)

Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into different classes, based on their frequencies of usage and applied different colors to each class:

  • character frequency: 1-100 (class i-c)
  • character frequency: 101-500 (class c-d)
  • character frequency: 501-1000 (class d-m)
  • character frequency: 1001-1500 (class m-md)
  • character frequency: 1501-2000 (class md-mm)
  • character frequency: beyond 2000 (class mmplus)

So the sample sentence would look like this:

的“语”电视

(Those of you reading this through RSS may need to visit the site to see what I’m talking about.)

The coding I used looks like this, though other approaches are possible:

<span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….

I added titles to make this more accessible.
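
For anyone tempted to try this, the markup step might be sketched in Python along the following lines. This is only a sketch under stated assumptions: the `ranks` dictionary (character to frequency rank, with 1 as the most common) stands in for a real frequency list, the sample ranks are made up for illustration, and only the basic CJK Unified Ideographs block is treated as Chinese characters. The class names and titles follow the scheme above.

```python
import html

# (upper rank bound, CSS class, title) for each band in the list above.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
LAST_BAND = ("mmplus", "2001+")


def is_hanzi(ch):
    """Rough test: only the basic CJK Unified Ideographs block is checked."""
    return "\u4e00" <= ch <= "\u9fff"


def band_for(rank):
    """Return (css_class, title) for a rank; None (unlisted) falls in the last band."""
    if rank is not None:
        for upper, css_class, title in BANDS:
            if rank <= upper:
                return css_class, title
    return LAST_BAND


def mark_up(text, ranks):
    """Wrap each Chinese character in a span; escape and pass through the rest."""
    out = []
    for ch in text:
        if is_hanzi(ch):
            css_class, title = band_for(ranks.get(ch))
            out.append(f'<span class="{css_class}" title="{title}">{ch}</span>')
        else:
            out.append(html.escape(ch))
    return "".join(out)


# Hypothetical ranks for illustration only; a real tool would load a full list.
SAMPLE_RANKS = {"非": 250, "的": 1}

print(mark_up("非驴", SAMPLE_RANKS))
```

Characters missing from the list are simply treated as falling beyond 2000, which seems a reasonable default for a tool aimed at learners.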

Perhaps adding a summary would be useful:

1-100              24.6%
101-500           42.1%
501-1000          8.8%
1001-1500         14.0%
1501-2000          1.8%
2001+              8.8%
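
A summary like this could be generated by a few more lines of Python. The figures above are consistent with counting every occurrence of a character among the sample sentence's 57 characters (14 of 57 is 24.6%, for example), so this sketch counts occurrences rather than distinct characters. As before, the `ranks` dictionary and the sample ranks in it are assumptions for illustration, not real frequency data.

```python
from collections import Counter

# Upper rank bound and label for each band, matching the summary above.
BANDS = [(100, "1-100"), (500, "101-500"), (1000, "501-1000"),
         (1500, "1001-1500"), (2000, "1501-2000")]
LAST_LABEL = "2001+"


def band_label(rank):
    """Label for a frequency rank; unlisted characters (None) count as 2001+."""
    if rank is not None:
        for upper, label in BANDS:
            if rank <= upper:
                return label
    return LAST_LABEL


def summarize(text, ranks):
    """Return {label: percent of Chinese-character occurrences in that band}."""
    # Only the basic CJK Unified Ideographs block is counted.
    hanzi = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    counts = Counter(band_label(ranks.get(ch)) for ch in hanzi)
    labels = [label for _, label in BANDS] + [LAST_LABEL]
    total = len(hanzi)
    return {label: round(100 * counts[label] / total, 1) if total else 0.0
            for label in labels}


# Hypothetical ranks for illustration only.
SAMPLE_RANKS = {"的": 1, "非": 250}

for label, pct in summarize("非驴的", SAMPLE_RANKS).items():
    print(f"{label:<12}{pct:>6.1f}%")
```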

This approach could also be used for Japanese, for example to highlight all kanji not included in the Jōyō kanji or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.

So, would anyone like to try coming up with one of these? Or has it been done already?

one possible resource:

Orientalism and Chinese characters: the case of ‘busyness’

Professor Victor H. Mair has sent me another piece along the lines of his popular essay danger + opportunity ≠ crisis.

The new piece discusses a misinterpretation of the nature of the Chinese character for máng (“busy”).

Since the entire essay is just a few paragraphs long, I won’t excerpt from it here but simply encourage everyone to read the whole thing: busyness ≠ heart + killing.

For related examples of this fanciful approach to etymology that Mair exposes, see misunderstandings of biblical proportions. And for a detailed explanation of how Chinese characters really do function, see Chinese.

Adso now available for download

David Lancashire’s wonderful Adso — which I tend to use primarily for conversions into Pinyin (under Style, select Pinyin) but which can handle much, much more — is now available for download as a Unix binary. A Windows version is expected soon.

This is fully-featured non-crippleware and should run on most modern linux distributions. To my knowledge, it is also the first reasonably-functional and freely-downloadable machine translation and NLP engine in the world.

If I were even half the programmer I ought to be, I’d snap this up in an instant.

Do Chinese characters save paper?

A common claim about Chinese characters (Hanzi) is that they take less space than alphabetic systems and so using them “saves paper.” After all, there aren’t spaces between words when writing in Chinese characters, and Chinese characters handle entire syllables rather than having to spell them out letter by letter. So this claim would seem to be self-evident. But things don’t always work out as expected.

cover of 'Did Adam and Eve Have Navels?' by Martin Gardner
cover of the Mandarin translation of 'Did Adam and Eve Have Navels?': 愛迪生,你被騙了!:你必須打破的27個科學迷思

A few weeks ago I was browsing the shelves of the enormous, wonderful Eslite bookstore near Taipei City Hall. (Nobody seems quite sure how the so-called English name of this chain is supposed to be pronounced, so many foreigners here prefer the Mandarin name: Chéngpǐn (誠品).) In many of the store’s sections, English-language originals and their translations into Mandarin are shelved right next to each other. So, after looking at a science book in English I pulled out the Mandarin Chinese translation of the same work and browsed through it. While I was doing so, I noticed something unexpected: the Mandarin version was longer than the English-language original.

This sparked my interest, so I pulled out some more paired titles, more or less at random, off the shelves for the purpose of comparison.

I did my best to keep the comparisons fair. In almost all of the cases I compared pairs of trade paperbacks: standard trade paperbacks in English with standard trade paperbacks in Mandarin.

Also, I didn’t count the pages taken up by indexes, since none of the translations into Mandarin had indexes. (Alphabets win hands down over Chinese characters when it comes to creating and using indexes, and I saw no reason to penalize the English books for this by counting pages that the ones in Chinese characters didn’t have the equivalent of.)

In addition, I avoided old books, since I wanted to be fairly sure the Mandarin Chinese translations were from the same English text as I was looking at. (I do, however, have one book written in German and translated into English. I didn’t check to see if the Mandarin version was done from the German original or the English translation.)

Of course, comparing across scripts and languages is certainly not the same as comparing simply across scripts (Hanzi vs. Hanyu Pinyin); but one does what one can.

Later, while supplementing my survey at the Eslite bookstore on Dunhua South Road, I noticed an error in my original method: I had forgotten to check where in the book page 1 fell. Many (but not all) English-language books mark the first page of the first chapter as page 1; many (but not all) books printed in Taiwan, however, include the front matter in their pagination, which leads to the first page of the first chapter being page 10 or so. So to help compensate for my oversight, it might be fair to subtract 10 pages from the Mandarin versions of those titles below followed by an asterisk. (The ones without an asterisk are those I examined most recently, and more carefully.)

Here are the results of my admittedly brief and unscientific survey:

Chronicles, Vol. 1, by Bob Dylan
English: 291 pp.
Mandarin in Hanzi: 295 pp.

Collapse, by Jared Diamond
English: 560 pp.
Mandarin in Hanzi: 609 pp.

The Death of Vishnu, by Manil Suri
English: 283 pp.
Mandarin in Hanzi: 287 pp.

Deep Simplicity: Bringing Order to Chaos and Complexity*, by John Gribbin
English: 235 pp.
Mandarin in Hanzi: 255 pp.

Did Adam and Eve Have Navels?: Debunking Pseudoscience*, by Martin Gardner
English: 310 pp.
Mandarin in Hanzi: 367 pp.

The Elegant Universe*, by Brian Greene
English: 428 pp.
Mandarin in Hanzi: 463 pp.

The Enigma of Arrival, by V.S. Naipaul
English: 350 pp.
Mandarin in Hanzi: 422 pp.

Harry Potter and the Half-Blood Prince, by J.K. Rowling
English: 607 pp. (hardback)
Mandarin in Hanzi: 716 pp.

Laboratory Earth*, by Stephen H. Schneider
English: 169 pp.
Mandarin in Hanzi: 227 pp.

The Long Tail, by Chris Anderson
English: 226 pp. (hardback, slightly larger than the Mandarin trade paperback)
Mandarin in Hanzi: 313 pp. (written left to right)

Perfume*, by Patrick Süskind
English: 255 pp. (translation from German)
Mandarin in Hanzi: 278 pp.

Tough Choices, by Carly Fiorina
English: 309 pp.
Mandarin in Hanzi: 341 pp.

Vernon God Little, by D.B.C. Pierre
English: 275 pp. (mass market paperback)
Mandarin in Hanzi: 325 pp.

In every instance, the books in Chinese characters are longer than those in English. Moreover, the pages in the Mandarin-language trade paperbacks are somewhat larger than those in the English-language trade paperbacks. So that’s even more paper consumed by the books written in Chinese characters.
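
To put rough numbers on the survey, the page counts listed above can be run through a short Python sketch. It applies the suggested 10-page deduction to the Mandarin editions of the asterisked titles; that deduction is only the rough correction proposed earlier, and the sketch makes no allowance for differences in page size or binding.

```python
# (title, English pages, Mandarin pages, asterisked in the list above)
SURVEY = [
    ("Chronicles, Vol. 1", 291, 295, False),
    ("Collapse", 560, 609, False),
    ("The Death of Vishnu", 283, 287, False),
    ("Deep Simplicity", 235, 255, True),
    ("Did Adam and Eve Have Navels?", 310, 367, True),
    ("The Elegant Universe", 428, 463, True),
    ("The Enigma of Arrival", 350, 422, False),
    ("Harry Potter and the Half-Blood Prince", 607, 716, False),
    ("Laboratory Earth", 169, 227, True),
    ("The Long Tail", 226, 313, False),
    ("Perfume", 255, 278, True),
    ("Tough Choices", 309, 341, False),
    ("Vernon God Little", 275, 325, False),
]

FRONT_MATTER = 10  # pages deducted from asterisked Mandarin editions


def adjusted(mandarin, starred):
    """Mandarin page count after the rough front-matter correction."""
    return mandarin - FRONT_MATTER if starred else mandarin


def percent_longer(english, mandarin):
    """How much longer the Mandarin edition is, as a percentage of the English."""
    return 100 * (mandarin - english) / english


for title, english, mandarin, starred in SURVEY:
    m = adjusted(mandarin, starred)
    print(f"{title:<40}{english:>5}{m:>5}{percent_longer(english, m):>7.1f}%")

total_en = sum(e for _, e, _, _ in SURVEY)
total_zh = sum(adjusted(m, s) for _, _, m, s in SURVEY)
print(f"{'All titles':<40}{total_en:>5}{total_zh:>5}"
      f"{percent_longer(total_en, total_zh):>7.1f}%")
```

Even with the deduction, every Mandarin edition in the list remains longer than its English counterpart, and the totals come to 4,848 pages against 4,298, or roughly 13 percent more.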

Although I certainly do not believe that all pairs of books in English and Mandarin translation follow this pattern, a pattern this very much appears to be.

My guess would be that books printed in China would have fewer pages than those printed in Taiwan. (Anyone want to check some of the above titles? Or does anyone have pairs of other titles in unexpurgated editions?) In general, books in China simply aren’t designed and printed with the same degrees of competency, attention, and concern for the reader as books in Taiwan — not to mention books in the United States and Britain. (Or have things changed very much in this regard since I lived in China?) So, among other factors, the characters tend to be smaller, along with the leading and the margins.

And then there’s the fact that translations in China sometimes omit sentences or entire sections, especially if they are deemed “sensitive.” (I doubt, however, that the books I examined suffered from Beijing’s censors.)

Also, China’s left-to-right format might have an advantage over Taiwan’s predominant top-to-bottom style in terms of space.

rice pizza = ‘mizza’

advertising photo of Pizza Hut's rice pizza; the copy reads '米zza 超ㄏㄤ美味新鮮fun'

Something written with three different scripts (Chinese characters, zhuyin, and the roman alphabet) is very much the sort of thing that attracts my attention, as is a product that mixes scripts in its name. So this ad for a new product from Taiwan’s Pizza Hut definitely caught my eye, though it did not inspire me to actually taste the item being touted, which is a rice pizza. (Generally, I do not care for pizzas with Taiwanese characteristics, such as those with peas, corn, or squid. For that matter, I don’t even like pineapple on pizza.)

The name for this rice pizza, “米zza” (mǐzza), is a portmanteau, using two different languages and two different scripts, no less. 米 is the Chinese character for mǐ, which is used mainly in rice- and other grain-associated words. The second part of the word comes, of course, from “pizza.”

Let’s move on to the slogan:

米zza 超ㄏㄤ美味 新鮮fun

In romanization, this is

mǐzza: chāo hāng měiwèi — xīnxiān fun

Here we have Chinese characters (米, 超, 美味新鮮), zhuyin (ㄏㄤ), and the Roman alphabet (zza, fun). Three scripts in just one line! (Yes, yes, I know that a line in written Japanese will often have just as many scripts, if not more; but this is Mandarin.)

The zhuyin, ㄏㄤ, represent hāng, a new slang word that, according to several people I have asked, has appeared within the last five years at most. It means “hot” in the sense of “extremely popular right now.”

Also, there’s a possibility that the English word “fun” is meant to echo the Mandarin fàn (飯 / 饭 / “rice”). Such puns across languages are not uncommon here, especially in local Internet slang.

So, the whole slogan might be translated as “Rice pizza: the super-’hot’ delicious food — fresh, new fun.” Sorry, that’s not a very good translation; it works better in Mandarin.

I predict such portmanteaux and mixing will be increasingly common here in Taiwan, where code switching is a way of life for many people. “Mǐzza” could be the wave of the future — just not the culinary future, I hope.

source: Taiwan Pizza Hut menu page, accessed January 30, 2007

some common character slips in China

Joel of Danwei has translated the gist of a list of the top errors in Mandarin use for 2006, as submitted by the readers of Yǎowénjiáozì (咬文嚼字), a magazine in China. (Yǎowénjiáozì is tricky to translate. Maybe “Pedantry,” though that sounds a bit harsh.)

I’ve reproduced the errors relating specifically to character use (7 out of 10), making the characters larger in order to help make the distinctions clearer. See Joel’s post for details.

  1. (xiàng) instead of (xiàng)
  2. 丙戍年 (bǐng shù nián) instead of 丙戌年 (bǐngxū nián)
  3. 神州[六号] (Shénzhōu [liù hào]) instead of 神舟[六号] (Shén Zhōu [liù hào]) (Those responsible for naming the spacecraft, however, certainly intended the name to remind people of “the Divine Land” (Shénzhōu, 神州, i.e. China).)
  4. () instead of ()
  5. 美發 (měi fā) instead of 美髮 (měifà) (The characters 發 (fā) and 髮 (fà) were both given the simplified form of 发, so people in China often end up with the wrong character when trying to use the traditional form of 发.)
  6. 启示 (qǐshì) instead of 启事 (qǐshì)
  7. 哈蜜瓜 (hā mì guā) instead of 哈密瓜 (Hāmìguā)

sources:

Pinyin Info 1, Condoleezza Rice 0

U.S. Secretary of State Condoleezza Rice has joined Al Gore, John F. Kennedy, and other prominent U.S. politicians in spreading the crisis/opportunity myth. Fortunately, though, Glenn Kessler of the Washington Post found Victor H. Mair’s essay danger + opportunity ≠ crisis here on Pinyin Info:

At one point, Rice said that the difficult circumstances in the Middle East could represent opportunity. “I don’t read Chinese but I am told that the Chinese character for crisis is wei-ji, which means both danger and opportunity,” she said in Riyadh. “And I think that states it very well. We’ll try to maximize the opportunity.”

But Victor H. Mair, a professor of Chinese at the University of Pennsylvania, has written on the Web site https://pinyin.info, a guide to the Chinese language, that “a whole industry of pundits and therapists has grown up around this one grossly inaccurate formulation.” He said the character “ji” actually means “incipient moment” or a “crucial point.” Thus, he said, a wei-ji “is indeed a genuine crisis, a dangerous moment, a time when things start to go awry.”

sources and further readings:

language reformer Qian Xuantong remembered

photo of Qian Xuantong (Ch'ien Hsuan-t'ung)

Two days ago was the 68th anniversary of the death of Qian Xuantong (Qián Xuántóng / 錢玄同 / 钱玄同 / Ch’ien Hsüan-t’ung) (1887–1939), a phonetician, philologist, and professor of literature at Peking University. Although he isn’t well known today, Qian was an important contributor to the reforms associated with the May 4 movement. He also helped renew debate about script reform in China.

Just about the time that the National Phonetic Alphabet succeeded in gaining ascendancy over the Mandarin Alphabet and other schemes, the evolution of literary and political movements into a new stage gave rise to renewed consideration of the roman alphabet as the basis for reform of the Chinese written language.

What seems to have initiated the new stage of discussion was a letter written in March 1918 by Ch`ien Hsüan-t`ung, a well-known philologist and professor of literature at National Peking University, to Ch`en Tu-hsiu, who at the time was editor of La Jeunesse, the leading organ of young Chinese intellectuals, and who soon afterward became one of the founders of the Chinese Communist Party. In his letter Ch`ien Hsüan-t`ung expressed approval of Ch`en Tu-hsiu’s demand for a break with the Confucian ideology which had dominated Chinese life for more than two thousand years, and also offered his idea as to how this was to be carried out. “If you want to abolish Confucianism,” he said, “you must first abolish the Chinese script.” To his mind there was little of value in Chinese literature, 99.9 per cent of which he dismissed as merely transmitting Confucian ideology and Taoist mythology.

It seemed to Ch`ien that the ideographic [sic] script could not be adapted to the needs of modern China. He also saw no solution in the attempts which had thus far been made to apply a phonetic system of writing to Chinese. Indeed, it appeared to him that it would be impossible to apply a phonetic system of writing to Chinese at all. These views also led him to the conclusion, reached earlier by Wu Chih-hui and others, that Chinese writing itself would have to be abandoned and replaced by Esperanto.

I seem to remember that someone in Japan was driven to distraction by that country’s orthography and made a similar proposal about switching from Japanese to Esperanto. Or am I imagining that?

At any rate, others soon convinced Qian of the error of his ways, and before long he was a strong supporter of romanization, as were many others of his generation, including Lu Xun. By the way, Qian was the one who convinced Lu Xun to start writing stories. That alone should be enough to make the world forever grateful to him.

I strongly recommend the first of the readings below, from which the above quote was taken. It’s interesting reading.

sources: