indicator of character frequency: a suggestion for programmers

It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.

Here’s a sentence from a recent CCP rant news item that can serve as an example:

非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)

Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into classes based on their frequency of use and applied a different color to each class:

  • character frequency: 1-100 (class i-c)
  • character frequency: 101-500 (class c-d)
  • character frequency: 501-1000 (class d-m)
  • character frequency: 1001-1500 (class m-md)
  • character frequency: 1501-2000 (class md-mm)
  • character frequency: beyond 2000 (class mmplus)
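For any programmer tempted to try this, the banding above could be sketched roughly as follows. This is only a minimal sketch: the function name `frequency_class` is hypothetical, and treating a character with no known rank as "beyond 2000" is my assumption, not something a frequency list would dictate.

```python
from bisect import bisect_left

# Upper bound of each band, paired by position with the class names listed above.
BAND_LIMITS = [100, 500, 1000, 1500, 2000]
BAND_CLASSES = ["i-c", "c-d", "d-m", "m-md", "md-mm", "mmplus"]


def frequency_class(rank):
    """Map a 1-based frequency rank to a class name.

    A rank of None (character missing from the frequency list) is
    treated as "beyond 2000"; that is an assumption of this sketch.
    """
    if rank is None:
        return "mmplus"
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # bisect_left finds the first band whose upper bound is >= rank.
    return BAND_CLASSES[bisect_left(BAND_LIMITS, rank)]
```

Adding or moving a band would then only mean editing the two lists.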

So the sample sentence would look like this:

非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。

(Those of you reading this through RSS may need to visit the site to see what I’m talking about.)

The coding I used looks like this, though other approaches are possible:

<span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….

I added titles to make this more accessible.
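Generating that markup automatically might look something like the sketch below. Everything named here is hypothetical: `mark_up`, `band_for`, and `is_han` are my own illustrative names, the `ranks` dictionary stands in for a real frequency list, and the rank of 300 given to 非 is a placeholder chosen only so that it falls in the 101-500 band shown above. The Han test covers only the basic CJK Unified Ideographs block, which is a simplification.

```python
from html import escape

# (upper bound, class name, title text) for each band; anything else is mmplus.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
FALLBACK = ("mmplus", "2001+")


def is_han(ch):
    """Rough test: basic CJK Unified Ideographs block only (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def band_for(rank):
    """Return (class, title) for a rank; unknown ranks fall into the last band."""
    if rank is not None:
        for limit, css_class, title in BANDS:
            if rank <= limit:
                return css_class, title
    return FALLBACK


def mark_up(text, ranks):
    """Wrap each Han character in a span; leave everything else as plain text."""
    parts = []
    for ch in text:
        if is_han(ch):
            css_class, title = band_for(ranks.get(ch))
            parts.append(
                f'<span class="{css_class}" title="{title}">{escape(ch)}</span>'
            )
        else:
            parts.append(escape(ch))
    return "".join(parts)


# Placeholder rank, not a measured one; 驴 is left out so it lands in 2001+.
placeholder_ranks = {"非": 300}
print(mark_up("非驴", placeholder_ranks))
# <span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>
```

Punctuation and Latin text pass through unmarked, so only the characters themselves get colored.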

Perhaps adding a summary would be useful:

1-100        24.6%
101-500      42.1%
501-1000      8.8%
1001-1500    14.0%
1501-2000     1.8%
2001+         8.8%
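A summary like this could be produced along the following lines. The figures above are consistent with counting every occurrence among the sentence's 57 characters (so repeated characters count each time), and the sketch does the same. The names `summarize`, `band_label`, and `format_summary` are hypothetical, as is the `ranks` dictionary, which stands in for whatever frequency list a real tool would load.

```python
from collections import Counter

# (upper bound, label) for each band; anything else is reported as 2001+.
BANDS = [
    (100, "1-100"),
    (500, "101-500"),
    (1000, "501-1000"),
    (1500, "1001-1500"),
    (2000, "1501-2000"),
]
LABELS = [label for _, label in BANDS] + ["2001+"]


def band_label(rank):
    """Return the band label for a rank; unknown ranks count as 2001+."""
    if rank is not None:
        for limit, label in BANDS:
            if rank <= limit:
                return label
    return "2001+"


def summarize(text, ranks):
    """Percentage of Han character occurrences in each band, in band order."""
    # Basic CJK Unified Ideographs block only; punctuation is ignored.
    han = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    counts = Counter(band_label(ranks.get(ch)) for ch in han)
    total = len(han)
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label in LABELS
    }


def format_summary(summary):
    """Render the summary as aligned plain-text rows."""
    return "\n".join(f"{label:<12}{pct:5.1f}%" for label, pct in summary.items())
```

Counting distinct characters instead of occurrences would be an equally reasonable choice; it would only require wrapping `han` in `set()`.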

This approach could also be used for Japanese: for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.

So, would anyone like to try coming up with one of these? Or has it been done already?

one possible resource:

Orientalism and Chinese characters: the case of ‘busyness’

Professor Victor H. Mair has sent me another piece along the lines of his popular essay danger + opportunity ≠ crisis.

The new piece discusses a misinterpretation of the nature of the Chinese character for máng (“busy”).

Since the entire essay is just a few paragraphs long, I won’t excerpt from it here but simply encourage everyone to read the whole thing: busyness ≠ heart + killing.

For related examples of this fanciful approach to etymology that Mair exposes, see misunderstandings of biblical proportions. And for a detailed explanation of how Chinese characters really do function, see Chinese.

Adso now available for download

David Lancashire’s wonderful Adso — which I tend to use primarily for conversions into Pinyin (under Style, select Pinyin) but which can handle much, much more — is now available for download as a Unix binary. A Windows version is expected soon.

This is fully-featured non-crippleware and should run on most modern Linux distributions. To my knowledge, it is also the first reasonably-functional and freely-downloadable machine translation and NLP engine in the world.

If I were even half the programmer I ought to be, I’d snap this up in an instant.