indicator of character frequency: a suggestion for programmers

It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.

Here’s a sentence from a recent CCP rant news item that can serve as an example:

非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)

Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into different classes, based on their frequencies of usage and applied different colors to each class:

  • character frequency: 1-100 (class i-c)
  • character frequency: 101-500 (class c-d)
  • character frequency: 501-1000 (class d-m)
  • character frequency: 1001-1500 (class m-md)
  • character frequency: 1501-2000 (class md-mm)
  • character frequency: beyond 2000 (class mmplus)

So the sample sentence would look like this:

非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。

(Those of you reading this through RSS may need to visit the site to see what I’m talking about.)

The coding I used looks like this, though other approaches are possible:

<span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….

I added titles to make this more accessible.
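
For any programmers tempted to take this up, here is a minimal sketch in Python of how such markup might be generated. It assumes a lookup table that maps each character to its frequency rank; the two ranks in `SAMPLE_RANKS` are placeholders I made up so that the characters land in the classes shown above, not real data. The function names are likewise just for illustration, and treating unranked characters as "2001+" is a design choice rather than a requirement.

```python
from html import escape

# Upper rank bound, CSS class, and title text for each band listed in the post.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
FALLBACK = ("mmplus", "2001+")


def band_for(rank):
    """Return (css_class, title) for a 1-based rank; None means unranked."""
    if rank is not None:
        for upper, css_class, title in BANDS:
            if rank <= upper:
                return css_class, title
    return FALLBACK


def is_han(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def mark_up(text, ranks):
    """Wrap each Chinese character in a span; escape everything else."""
    out = []
    for ch in text:
        if is_han(ch):
            css_class, title = band_for(ranks.get(ch))
            out.append(f'<span class="{css_class}" title="{title}">{ch}</span>')
        else:
            out.append(escape(ch))
    return "".join(out)


# Hypothetical ranks, chosen only to fall in the bands the post shows.
SAMPLE_RANKS = {"非": 300, "驴": 2500}
print(mark_up("非驴", SAMPLE_RANKS))
```

A real tool would load the rank table from a frequency list and would probably need to cover the CJK extension blocks as well.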

Perhaps adding a summary would be useful:

1-100        24.6%
101-500      42.1%
501-1000      8.8%
1001-1500    14.0%
1501-2000     1.8%
2001+         8.8%
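
A summary like this is simple to compute once each character has a rank. The sketch below is one possible way to do it in Python. The ranks in `PLACEHOLDER_RANKS` are invented: they only reproduce the per-band counts (14, 24, 5, 8, 1, 5) that the percentages above imply for the 57 characters of the sample sentence, and they are not the real ranks of those characters.

```python
from collections import Counter

# Upper rank bound and label for each band used in the post.
BAND_LIMITS = [
    (100, "1-100"),
    (500, "101-500"),
    (1000, "501-1000"),
    (1500, "1001-1500"),
    (2000, "1501-2000"),
]
LABELS = [label for _, label in BAND_LIMITS] + ["2001+"]


def band_label(rank):
    """Return the band label for a 1-based rank; None means unranked."""
    if rank is not None:
        for upper, label in BAND_LIMITS:
            if rank <= upper:
                return label
    return "2001+"


def summarize(ranks_in_text):
    """Take one rank (or None) per character; return (label, count, percent) rows."""
    counts = Counter(band_label(rank) for rank in ranks_in_text)
    total = sum(counts.values())
    rows = []
    for label in LABELS:
        percent = round(100 * counts[label] / total, 1) if total else 0.0
        rows.append((label, counts[label], percent))
    return rows


# Invented ranks that merely reproduce the band counts implied by the post.
PLACEHOLDER_RANKS = (
    [50] * 14 + [300] * 24 + [750] * 5 + [1250] * 8 + [1750] * 1 + [2500] * 5
)

for label, count, percent in summarize(PLACEHOLDER_RANKS):
    print(f"{label:<12}{percent:>5.1f}%")
```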

This approach could also be used for Japanese: for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.

So, would anyone like to try coming up with one of these? Or has it been done already?

one possible resource:

13 thoughts on “indicator of character frequency: a suggestion for programmers”

  1. I think that’s a great idea; however, I’d like to see it implemented as a Firefox plug-in. One click and then all the characters on the page are highlighted or marked somehow to indicate their frequency.

  2. You probably already know that Wenlin provides frequency for every character it has (which perhaps served to spark this idea). It certainly doesn’t have the visual immediacy or ease of use that your idea has, in any case.

  3. Ben: Thanks for mentioning this. I like that feature of Wenlin very much. In fact, I used it for the frequency numbers for the sample sentence. One minor note: Wenlin doesn’t give frequencies for all of the characters in its database, just the top 3,000. Usage statistics are less reliable past that point. Unfortunately, though, that doesn’t mean many such characters don’t have to be memorized….

  4. I’ve heard of the official lists for basic- and college-level character knowledge (and recognizing of course that no one counts the characters they know, probably because language isn’t actually made up of characters), but is there a list that shows the characters a person must be proficient in to function in all but the most technical environments? Do Big5 or Microsoft provide anything like this?

  5. Great, now we just need a Firefox extension. That, plus one of the Chinese pop-up dictionaries, is a great learning tool!

  6. Nice idea. I’ll see what I can do about adding this functionality to Adso as well. Two general suggestions:

    * consistent, machine-parsable/machine-generatable class names (“swofford1”, “swofford2”… etc rather than “i-c”, “i-d”). This is probably going to be more flexible as it won’t lock us down to a set number of classes. Whatever software generates the output could produce as fine-grained a hierarchy as required, with as many classes as necessary.
    * complementing point one, it would be nice to let the user choose colours along a spectrum, and automatically generate the necessary colours for the markup depending on the number of classes selected. I find the characters hard to read against a red backdrop, so would probably opt for something a little softer. Maybe a gradation from light yellow to light grey or something similar.

    @digchinese — nice site design.
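
    The two suggestions in this comment could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names are made up, the “swofford” prefix comes from the comment itself, and the two endpoint colours are the standard CSS values for lightyellow and lightgrey, picked as one reading of “light yellow to light grey”.

```python
def class_names(n, prefix="swofford"):
    """Generate numbered class names: swofford1, swofford2, ..."""
    return [f"{prefix}{i}" for i in range(1, n + 1)]


def gradient(start, end, n):
    """Linearly interpolate n RGB colours from start to end, as hex strings."""
    colours = []
    for i in range(n):
        # Position along the gradient, from 0.0 (start) to 1.0 (end).
        t = i / (n - 1) if n > 1 else 0.0
        rgb = tuple(round(s + (e - s) * t) for s, e in zip(start, end))
        colours.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return colours


def stylesheet(n, start=(255, 255, 224), end=(211, 211, 211)):
    """Build one CSS background rule per frequency class."""
    rules = []
    for name, colour in zip(class_names(n), gradient(start, end, n)):
        rules.append(f".{name} {{ background-color: {colour}; }}")
    return "\n".join(rules)


# Six classes, matching the number of bands in the post.
print(stylesheet(6))
```

    Changing the argument to `stylesheet` is all it takes to get a finer or coarser set of classes, which is the flexibility the first suggestion is after.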

  7. It would be 1,000 times more useful (but of course more difficult) to do this with entire words instead of characters. Some characters appear to be very frequent only because they are elements in many different, less frequent words; they are not often used by themselves, and this would throw off the frequency count. Just as in English it is not so useful to know which letters are the most frequent, but it’s infinitely useful to an ESL learner to know which words are more frequent.

    Over 80% of Chinese words are two characters. One-character words and words of three or more characters make up the rest. Also, the majority of extremely high frequency words are one-character words that everyone knows anyway. Maybe those shouldn’t be highlighted at all.
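
    Marking up words rather than characters first requires splitting the text into words, since Chinese is written without spaces. One simple technique is greedy longest-match segmentation against a word list, sketched below in Python. The tiny `LEXICON` is a hypothetical stand-in for a real dictionary, and the function name and `max_len` parameter are illustrative. Real segmenters handle ambiguity much better than this.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini word list covering one clause of the sample sentence.
LEXICON = {"迅速", "向着", "其它", "媒体", "渗透"}
print(segment("正迅速向着其它媒体渗透", LEXICON))
```

    Each resulting word could then be looked up in a word-frequency list and wrapped in a span, just as with single characters.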
