It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.
Here’s a sentence from a recent CCP rant news item that can serve as an example:
非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)
Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into classes based on their frequency of use, and applied a different color to each class:
- character frequency: 1-100 (class i-c)
- character frequency: 101-500 (class c-d)
- character frequency: 501-1000 (class d-m)
- character frequency: 1001-1500 (class m-md)
- character frequency: 1501-2000 (class md-mm)
- character frequency: beyond 2000 (class mmplus)
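For anyone who wants to try this, the banding above could be expressed as a small lookup. This is only a sketch: the function name `band_for_rank` is made up for illustration, it assumes you already have a 1-based frequency rank for each character from some frequency list, and treating a character that is missing from the list as "beyond 2000" is my assumption rather than anything the classes above require.

```python
from typing import Optional, Tuple

# (upper bound of the rank band, CSS class, title text), copied from the list above.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
BEYOND = ("mmplus", "2001+")


def band_for_rank(rank: Optional[int]) -> Tuple[str, str]:
    """Return (css_class, title) for a 1-based frequency rank.

    Assumption: rank=None means the character is not in the frequency
    list at all, and it is treated the same as a rank beyond 2000.
    """
    if rank is None:
        return BEYOND
    if rank < 1:
        raise ValueError("frequency ranks are expected to start at 1")
    for upper, css_class, title in BANDS:
        if rank <= upper:
            return css_class, title
    return BEYOND
```

The class names read as Roman numerals for the band edges (i = 1, c = 100, d = 500, m = 1000, and so on), so adding a finer band only means adding one more row to `BANDS`.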
So the sample sentence would look like this:
非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
(Those of you reading this through RSS may need to visit the site to see what I’m talking about.)
The coding I used looks like this, though other approaches are possible:
<span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….
I added titles to make this more accessible.
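Generating that markup automatically might look something like the sketch below. The names `mark_up` and `is_han` are hypothetical, the `ranks` dictionary stands in for a real frequency list that you would have to supply, and the Han check only covers the basic CJK Unified Ideographs block, which is a simplification. Punctuation and other non-Han characters are passed through without a span.

```python
import html
from bisect import bisect_left

# Band edges, classes, and titles as used in the post.
UPPER_BOUNDS = [100, 500, 1000, 1500, 2000]
CLASSES = ["i-c", "c-d", "d-m", "m-md", "md-mm", "mmplus"]
TITLES = ["1-100", "101-500", "501-1000", "1001-1500", "1501-2000", "2001+"]


def is_han(ch: str) -> bool:
    # Simplification: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def mark_up(text: str, ranks: dict) -> str:
    """Wrap each Han character in a span whose class reflects its rank.

    `ranks` maps a character to its 1-based frequency rank. Characters
    missing from `ranks` are assumed to fall in the 2001+ band.
    """
    parts = []
    for ch in text:
        if not is_han(ch):
            parts.append(html.escape(ch))  # punctuation, Latin text, etc.
            continue
        rank = ranks.get(ch)
        # bisect_left finds the first band whose upper bound is >= rank.
        i = len(UPPER_BOUNDS) if rank is None else bisect_left(UPPER_BOUNDS, rank)
        parts.append(f'<span class="{CLASSES[i]}" title="{TITLES[i]}">{ch}</span>')
    return "".join(parts)


# Placeholder rank, chosen only so that 非 lands in the 101-500 band.
print(mark_up("非驴", {"非": 300}))
```

The colors themselves would still come from a stylesheet with one rule per class.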
Perhaps adding a summary would be useful:
1-100 24.6%
101-500 42.1%
501-1000 8.8%
1001-1500 14.0%
1501-2000 1.8%
2001+ 8.8%
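A summary like this is just a tally of characters per band divided by the total. Here is a minimal sketch, assuming each character has already been assigned one of the band titles used above; the function name `summarize` is made up for illustration.

```python
from collections import Counter

TITLES = ["1-100", "101-500", "501-1000", "1001-1500", "1501-2000", "2001+"]


def summarize(band_titles):
    """Return {band title: percentage of characters}, rounded to one decimal."""
    counts = Counter(band_titles)
    total = sum(counts.values())
    if total == 0:
        return {title: 0.0 for title in TITLES}
    return {title: round(100 * counts[title] / total, 1) for title in TITLES}


# Four characters: two in the top band, one in 101-500, one beyond 2000.
print(summarize(["1-100", "1-100", "101-500", "2001+"]))
```

The sample sentence contains 57 characters, and the figures above are consistent with per-band counts of 14, 24, 5, 8, 1, and 5, though those counts are inferred from the percentages rather than taken from the original markup. Because each figure is rounded separately, the column adds up to 100.1% rather than exactly 100%.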
This approach could also be used for Japanese: for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.
So, would anyone like to try coming up with one of these? Or has it been done already?
one possible resource: