It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.
Here’s a sentence from a recent CCP rant news item that can serve as an example:
非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)
Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into different classes based on their frequency of use, and applied a different color to each class:
- character frequency: 1-100 (class i-c)
- character frequency: 101-500 (class c-d)
- character frequency: 501-1000 (class d-m)
- character frequency: 1001-1500 (class m-md)
- character frequency: 1501-2000 (class md-mm)
- character frequency: beyond 2000 (class mmplus)
So the sample sentence would look like this:
非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Those of you reading this through RSS may need to visit the site to see what I’m talking about.)
The coding I used looks like this, though other approaches are possible:
<span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….
I added titles to make this more accessible.
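For anyone who wants to try generating this markup, here is a minimal Python sketch of one possible approach. The band boundaries, class names, and titles come from the list above; the function names and the rank table are my own assumptions. In particular, the two ranks in `HYPOTHETICAL_RANKS` are placeholders chosen only to land in the bands shown in the sample markup, not real frequency data.

```python
from bisect import bisect_left
from html import escape

# Upper bounds of the rank bands listed above, with matching class names and titles.
BOUNDS = [100, 500, 1000, 1500, 2000]
CLASSES = ["i-c", "c-d", "d-m", "m-md", "md-mm", "mmplus"]
TITLES = ["1-100", "101-500", "501-1000", "1001-1500", "1501-2000", "2001+"]

# Placeholder ranks for illustration only (assumption, not real data).
HYPOTHETICAL_RANKS = {"非": 300, "驴": 2500}


def band_index(rank):
    """Return the index of the band a frequency rank falls in.

    A rank of None (a character with no usable frequency figure)
    is treated as belonging to the last band.
    """
    if rank is None:
        return len(BOUNDS)
    return bisect_left(BOUNDS, rank)


def mark_up(text, ranks):
    """Wrap every character found in `ranks` in a <span>; leave the rest as-is."""
    parts = []
    for ch in text:
        if ch in ranks:
            i = band_index(ranks[ch])
            parts.append(
                f'<span class="{CLASSES[i]}" title="{TITLES[i]}">{escape(ch)}</span>'
            )
        else:
            # Punctuation and anything missing from the table pass through unmarked.
            parts.append(escape(ch))
    return "".join(parts)
```

With those placeholder ranks, `mark_up("非驴", HYPOTHETICAL_RANKS)` produces the same two spans as the sample above. A real tool would load a full frequency list in place of the two-entry table.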
Perhaps adding a summary would be useful:
- 1-100: 24.6%
- 101-500: 42.1%
- 501-1000: 8.8%
- 1001-1500: 14.0%
- 1501-2000: 1.8%
- 2001+: 8.8%
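A summary like this is simple to compute once each character has a band label. The sketch below is only an illustration; the function name `band_summary` is my own. The per-band counts in the usage note are not stated anywhere above: they are inferred from the percentages, on the assumption that the sample sentence has 57 characters excluding punctuation.

```python
from collections import Counter


def band_summary(labels, order):
    """Return (band, percentage) pairs, rounded to one decimal place.

    `labels` holds one band label per character; `order` fixes the
    output order and makes sure empty bands still appear.
    """
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return [(band, 0.0) for band in order]
    return [(band, round(100 * counts[band] / total, 1)) for band in order]
```

Feeding it 14, 24, 5, 8, 1, and 5 labels for the six bands (57 in all) gives back the figures above. Because each figure is rounded separately, they add up to 100.1 rather than exactly 100.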
This approach could also be used for Japanese: for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.
So, would anyone like to try coming up with one of these? Or has it been done already?
one possible resource:
Kai Carver said
Excellent idea!
It would be a nice option in the adsotrans site
http://textbook.adsotrans.com/?q=node/16
The color red seems a little dark.
Shaun said
I think that’s a great idea; however, I’d like to see it implemented as a Firefox plug-in. One click, and all the characters on the page are highlighted or marked somehow to indicate their frequency.
Ben L. said
You probably already know that Wenlin provides frequency for every character it has (which perhaps served to spark this idea). It certainly doesn’t have the visual immediacy or ease of use that your idea has, in any case.
Chris said
Sounds like work! Pretty cool idea, but how useful would it be, really?
site admin said
Ben: Thanks for mentioning this. I like that feature of Wenlin very much. In fact, I used it for the frequency numbers for the sample sentence. One minor note: Wenlin doesn’t give frequencies for all of the characters in its database, just the top 3,000. Usage statistics are less reliable past that point. Unfortunately, though, that doesn’t mean many such characters don’t have to be memorized….
Ben L. said
I’ve heard of the official lists for basic- and college-level character knowledge (and recognizing of course that no one counts the characters they know- probably because language isn’t actually made up of characters), but is there a list that shows the characters a person must be proficient in to function in all but the most technical environments? Do Big5 or Microsoft provide anything like this?
digchinese.com said
http://digchinese.com/tools/charfreq
site admin said
Wow, that was fast! Nice work, digchinese.com. I’m glad you added support for both traditional and simplified characters. I look forward to seeing what else you’ll be putting up there.
firefox estension said
Great, now we just need a Firefox extension. That, plus one of the Chinese pop-up dictionaries, is a great learning tool!
trevelyan said
Nice idea. I’ll see what I can do about adding this functionality to Adso as well. Two general suggestions:
* consistent, machine-parsable/machine-generatable class names (“swofford1”, “swofford2”… etc rather than “i-c”, “i-d”). This is probably going to be more flexible as it won’t lock us down to a set number of classes. Whatever software generates the output could produce as fine-grained a hierarchy as required, with as many classes as necessary.
* complementing point one, it would be nice to let the user choose colours along a spectrum, and automatically generate the necessary colours for the markup depending on the number of classes selected. I find the characters hard to read against a red backdrop, so would probably opt for something a little softer. Maybe a gradation from light yellow to light grey or something similar.
@digchinese — nice site design.
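The two suggestions above (numbered class names plus colours generated along a spectrum) could be sketched roughly as follows. This is only an illustration under my own assumptions: the function names are invented, the colours are applied as background colours, the endpoints are the standard CSS named colours lightyellow (#FFFFE0) and lightgray (#D3D3D3), and class 1 is taken to be the most frequent band.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two (r, g, b) tuples; t runs from 0 to 1."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def gradient_css(n_classes, prefix="swofford",
                 start=(0xFF, 0xFF, 0xE0),   # CSS lightyellow
                 end=(0xD3, 0xD3, 0xD3)):    # CSS lightgray
    """Build one CSS rule per class, numbered from 1, shading from start to end."""
    rules = []
    for i in range(n_classes):
        # With a single class there is nothing to interpolate, so use the start colour.
        t = i / (n_classes - 1) if n_classes > 1 else 0.0
        r, g, b = lerp_color(start, end, t)
        rules.append(
            f".{prefix}{i + 1} {{ background-color: #{r:02x}{g:02x}{b:02x}; }}"
        )
    return "\n".join(rules)
```

Calling `gradient_css(6)` would give six rules, `.swofford1` through `.swofford6`, so the number of bands becomes a parameter rather than something fixed in the stylesheet.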
benitez said
Excellent!! A wonderful tool.
thank you very much!!
刘è“地 said
It would be 1,000 times more useful (but of course more difficult) to do this with entire words instead of characters. Some characters appear to be very frequent only because they are elements of many different, less frequent words; they are not often used by themselves, and this would throw off the frequency count. It's the same in English: it is not so useful to know which letters are the most frequent, but it’s infinitely useful to an ESL learner to know which words are more frequent.
Over 80% of Chinese words are two characters. One-character words and words of three or more characters make up the rest. Also, the majority of extremely high-frequency words are one-character words that everyone knows anyway. Maybe those shouldn’t be highlighted at all.
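Working with words means first splitting the text into words, since Chinese is written without spaces. One simple, well-known technique is forward maximum matching: at each position, take the longest dictionary entry that fits. The sketch below is a rough illustration, not a serious segmenter. The function name and `max_len` default are my own assumptions, and the two-word lexicon in the usage note is a placeholder taken from the sample sentence; a real tool would need a full word list and better handling of ambiguity.

```python
def segment(text, lexicon, max_len=4):
    """Split text into words by forward maximum matching.

    At each position, try the longest candidate first and fall back to a
    single character when no lexicon entry matches.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

For example, with the placeholder lexicon `{"报纸", "电视"}`, `segment("报纸电视等", ...)` splits the text into 报纸, 电视, and 等. The resulting words could then be looked up in a word-frequency list instead of a character-frequency list.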
Leszek Gagala said
digchinese.com is a hero!
It would be cool to have something like this for Japanese too!