UTF-8 Unicode vs. other encodings over time

Some eight years ago UTF-8 (Unicode) became the most used encoding on Web pages. At the time, though, it was used on only about 26% of Web pages, so it had a plurality but not an absolute majority.

Graph showing growth of the UTF-8 encoding

By the beginning of 2010 Unicode was rapidly approaching use on half of Web pages.
graph showing a steep rise in the use of UTF-8 and a steep decline in other major encodings

In 2012 the trends were holding up.
UTF-8_website_use_2001-2012

Note that the 2008 crossover point appears different in the latter two Google graphs, which is why I’m showing all three graphs rather than just the third.

A different source (with slightly different figures) provides us with a look at the situation up to the present, with UTF-8 now on 85% of Web pages. Expansion of UTF-8 is slowing somewhat. But that may be due largely to the continuing presence of older websites in non-Unicode encodings rather than lots of new sites going up in encodings other than UTF-8.
growth in Unicode UTF-8 encoding on Web pages, 2010-2015

Here’s the same chart, but focusing on encodings (other than UTF-8) that use Chinese characters, so the percentages are relatively low.
asian_language_encodings_2010-2015

And here’s the same as the above, but with the results for individual languages combined.
asian_language_encodings_2010-2015_by_language

By the way, Pinyin.info has been in UTF-8 since the site began way back in 2001. The reason that Chinese characters and Pinyin with tone marks appear scrambled within Pinyin News is that a hack caused the WordPress database to be set to Swedish (latin1_swedish_ci), of all things. And I haven’t been able to get it fixed; so just for the time being I’ve given up trying. One of these days….
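
The kind of scrambling described above is easy to reproduce. What follows is a minimal Python sketch, not a diagnosis of what actually happened to the WordPress database. It assumes the damage comes from UTF-8 bytes being read back as Latin-1, and it uses Python's "latin-1" codec as a stand-in for MySQL's latin1 character set. The sample string and the function names (scramble, repair) are invented for illustration.

```python
def scramble(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo scramble(): recover the original bytes, then read them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


# Sample text (an assumption for this sketch): Pinyin with tone marks plus two characters.
sample = "Hànyǔ Pīnyīn 拼音"

garbled = scramble(sample)
restored = repair(garbled)

# Each non-ASCII character becomes two or three Latin-1 characters.
print(repr(garbled))
# The round trip recovers the text only if no bytes were lost along the way.
print(restored == sample)
```

If text has been scrambled more than once, or if some bytes were dropped or replaced in storage, a single round trip like this will not be enough.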


Google Translate and romaji revisited

OK, Google has improved its Pinyin converter some, though it still fails in important areas. So that’s the present situation for Google and Mandarin.

How about for Google and Japanese?

Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures generously agreed to reexamine Google’s performance in conversions to rōmaji (Japanese written in romanization).

Below is his latest evaluation.

For his initial analysis (in December 2009), see Google Translate and rōmaji.

I ran the test passage through Google Translate again. There’s some improvement, but it’s still pretty mediocre.

Original (left) and Google Translate romanization (right):
6日午後4時35分ごろ、東京都千代田区皇居外苑の都道(内堀通り)の二重橋前交差点で、中国からの観光客の40代の男性が乗用車にはねられ、全身を強く打って間もなく死亡した。車は歩道に乗り上げて歩いていた男性(69)もはね、男性は頭を強く打って意識不明の重体。丸の内署は、運転していた東京都港区白金3丁目、会社役員高橋延拓容疑者(24)を自動車運転過失傷害の疑いで現行犯逮捕し、容疑を同致死に切り替えて調べている。 6-Nichi gogo 4-ji 35-fun-goro, Tōkyō-to Chiyoda-ku Kōkyogaien no todō (uchibori-dōri) no Nijūbashi zen kōsaten de, Chūgoku kara no kankō kyaku no 40-dai no dansei ga jōyōsha ni hane rare, zenshin o tsuyoku Utte mamonaku shibō shita. Kuruma wa hodō ni noriagete aruite ita dansei (69) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jūtai. Marunouchi-sho wa, unten shite ita Tōkyō-to Minato-ku hakkin 3-chōme, kaisha yakuin Takahashi nobe Tsubuse yōgi-sha (24) o jidōsha unten kashitsu shōgai no utagai de genkō-han taiho shi, yōgi o dō chishi ni kirikaete shirabete iru.
 同署によると、死亡した男性は横断歩道を歩いて渡っていたところを直進してきた車にはねられた。車は左に急ハンドルを切り、車道と歩道の境に置かれた仮設のさくをはね上げ、歩道に乗り上げたという。さくは歩道でランニングをしていた男性(34)に当たり、男性は両足に軽いけが。 Dōsho ni yoru to, shibō shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin shite kita kuruma ni hane rareta. Kuruma wa hidari ni kyū handoru o kiri, shadō to hodō no sakai ni oka reta kasetsu no saku o haneage, hodō ni noriageta toyuu. Saku wa hodō de ran’ningu o shite ita dansei (34) niatari, dansei wa ryōashi ni karui kega.
 同署は、死亡した男性の身元確認を進めるとともに、当時の交差点の信号の状況を調べている。 Dōsho wa, shibō shita dansei no mimoto kakunin o susumeru totomoni, tōji no kōsaten no shingō no jōkyō o shirabete iru.
 現場周辺は東京観光のスポットの一つだが、最近はジョギングを楽しむ人も増えている。 Genba shūhen wa Tōkyō kankō no supotto no hitotsudaga, saikin wa jogingu o tanoshimu hito mo fuete iru.

Notes:

  • The use of numerals dodges a plethora of errors, but “6-Nichi” is still wrong for Muika.
  • Lots of correct capitalizations have been added, but “uchibori” was missed and “Utte” capitalized by mistake.
  • Some false spaces or lack of spaces persist: “hane rare”, “oka reta”; “hitotsudaga” and “niatari” were correctly hitotsu da ga and ni atari in the original test.
  • Names still get butchered (“hakkin” for Shirogane, “nobe Tsubuse” for Nobuhiro).
  • The needless apostrophe in “ran’ningu” is still there.
  • Interestingly, “toyuu” is a new error: it should be to iu.
  • There’s evidence of some attempt to use hyphens, but why not in “kankō kyaku” or “Nijūbashi zen”?

So, to update: Google gets kudos for conscientiousness, but I stick by my original comments.

For more by Prof. Unger, see Pinyin.info’s recommended readings, which include selections from The Fifth Generation Fallacy: Why Japan Is Betting Its Future on Artificial Intelligence, Literacy and Script Reform in Occupation Japan: Reading Between the Lines, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original (left) and Google Translate romanization (right):
6日午後4時35分ごろ、東京都千代田区皇居外苑の都道(内堀通り)の二重橋前交差点で、中国からの観光客の40代の男性が乗用車にはねられ、全身を強く打って間もなく死亡した。車は歩道に乗り上げて歩いていた男性(69)もはね、男性は頭を強く打って意識不明の重体。丸の内署は、運転していた東京都港区白金3丁目、会社役員高橋延拓容疑者(24)を自動車運転過失傷害の疑いで現行犯逮捕し、容疑を同致死に切り替えて調べている。 roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .
 同署によると、死亡した男性は横断歩道を歩いて渡っていたところを直進してきた車にはねられた。車は左に急ハンドルを切り、車道と歩道の境に置かれた仮設のさくをはね上げ、歩道に乗り上げたという。さくは歩道でランニングをしていた男性(34)に当たり、男性は両足に軽いけが。 dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega .
 同署は、死亡した男性の身元確認を進めるとともに、当時の交差点の信号の状況を調べている。 dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .
 現場周辺は東京観光のスポットの一つだが、最近はジョギングを楽しむ人も増えている。 genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate (left) and human transcriber (right):
roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jūtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru . Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jūtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.
dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega . Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.
dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru . Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.
genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru . Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the right-hand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to suit romanized text, and paragraph indentations have been restored.
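
A few of the changes listed above are mechanical enough to script, while the rest need knowledge of the words themselves. As a rough sketch only, the Python below handles items 6, 7, and 8 plus the sentence-initial capitals from item 4. The function name tidy and the regular expressions are assumptions made for illustration, not anything Google or the transcriber used, and the word kin’en in the last line is an outside example of an apostrophe that must be kept.

```python
import re

# Vowels (plain and long) and y: an apostrophe after n is needed only before these.
NEEDS_APOSTROPHE = "aiueoyāīūēō"


def tidy(text: str) -> str:
    """Apply a few purely mechanical clean-ups to Google-style romanization."""
    # Item 6: write the particle as "o" rather than obsolete "wo".
    text = re.sub(r"\bwo\b", "o", text)
    # Item 7: drop apostrophes (typographic or straight) after n where not needed.
    text = re.sub(r"n[\u2019'](?![" + NEEDS_APOSTROPHE + "])", "n", text)
    # Item 8: remove stray spaces before , . ) and after (.
    text = re.sub(r"\s+([,.)])", r"\1", text)
    text = re.sub(r"\(\s+", "(", text)
    # Item 4 (in part): capitalize the first letter of each sentence.
    text = re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


raw = ("saku wa hodō de ran\u2019ningu wo shi te i ta dansei ( san yon ) "
       "ni atari , dansei wa ryōashi ni karui kega .")
print(tidy(raw))
# Saku wa hodō de ranningu o shi te i ta dansei (san yon) ni atari, dansei wa ryōashi ni karui kega.

print(tidy("kin\u2019en"))  # the apostrophe before a vowel is kept
```

Note what the sketch leaves untouched: “shi te i ta”, the numeral reading “san yon”, and uncapitalized proper nouns all remain, because fixing them requires identifying the words.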

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders — the reasons were well understood and described long ago (e.g. in Unger 1987) — and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.1

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

Journal issue focuses on romanization

cover of this issue of the Journal of the Royal Asiatic Society
The most recent issue of the Journal of the Royal Asiatic Society of Great Britain and Ireland (third series, volume 20, part 1, January 2010) features the following articles on romanization movements and script reforms.

  • Editorial Introduction: Romanisation in Comparative Perspective, by İlker Aytürk
  • The Literati and the Letters: A Few Words on the Turkish Alphabet Reform, by Laurent Mignon
  • Alphabet Reform in the Six Independent ex-Soviet Muslim Republics, by Jacob M. Landau
  • Politics of Romanisation in Azerbaijan (1921–1992), by Ayça Ergun
  • Romanisation in Uzbekistan Past and Present, by Mehmet Uzman
  • Romanisation of Bengali and Other Indian Scripts, by Dennis Kurzon
  • The Rōmaji movement in Japan, by Nanette Gottlieb
  • Postscript from the JRAS Editor, Sarah Ansari

Unfortunately, none of these cover any Sinitic languages or the case of Vietnam. And Gottlieb’s take on rōmaji is certainly more conservative than Unger’s. But I expect this will all make for interesting reading.

I am able to view all of the articles on my system. But perhaps others will run up against a subscription wall.

I thank Victor H. Mair for drawing this publication to my attention.

kanji scandal

The Kyoto-based Japan Kanji Aptitude Testing Foundation, the group behind the Kanji of the Year announcement and the organizer of Japan’s well-attended kanji aptitude tests, is registered as a public-interest corporation, which means that it is not supposed to generate profits greater than it needs to operate (much like a non-profit organization in the United States). On March 10, however, Japan’s Ministry of Education stepped in, saying that the foundation was making too much money and needed to overhaul its operations.

How much money are we talking about?

The foundation racked up profits of ¥880 million [US$8.8 million] in fiscal 2006 and ¥660 million in fiscal 2007. The value of its assets increased from ¥5 billion at the end of fiscal 2004 to ¥7.35 billion at the end of fiscal 2007. It would not be far-fetched to say that the foundation has created a kanji business. Kanken became a registered trademark. In fiscal 2007 alone, the foundation sold some 1.5 million copies of books. It is also providing kanji-related questions to TV shows.

But there are more problems than just how much money the foundation makes. It has been funneling money into companies controlled by the foundation’s director and his son, the deputy director. “In fiscal 2007, commissions to these companies amounted to 2.48 billion yen [US$24.9 million], accounting for about 40 percent of the foundation’s annual expenditures,” the Asahi Shimbun reported.

Moreover, it appears the companies did little work for the large amount of money they received.

The Ministry of Education has warned the foundation before, with not much in the way of results. The foundation is to report back to the ministry by April 15. Given how entrenched the foundation is within Japan, I don’t expect much to change.


early Chinese tattoos

As my friend Tian of Hanzi Smatter continues to document, some people, Westerners especially, remain keen on having themselves tattooed with Chinese characters — even if they can’t read them. I doubt, though, that many are aware of China’s historical traditions in tattooing. As Carrie E. Reed notes in Early Chinese Tattoo (2.9 MB PDF), which is the latest reissue from Sino-Platonic Papers, “it appears that the practice of tattoo (other than the penal use) never achieved any level of general acceptance or widespread use among most parts of ancient Chinese society of any era.”

Yes, penal use: In early China tattooing was a common way of branding criminals. Often such tattoos were standard designs, such as circles. But sometimes they contained text.

Here’s something from Reed’s discussion of the Yuan dynasty’s legal code:

In the section on illicit sexual relationships we read that, in general, on the first offense the adulterous couple will be separated, but if they are “caught in the act” a second time, the man (it is not clear if the woman is tattooed as well) will be tattooed on the face with the words “committed licentious acts two times” (犯姦二度) and banished. Numerous examples are given to illustrate this type of punishment.

Reed examines and translates many texts describing tattoos.

Some of the terms encountered in these early texts are (with a literal translation given in parentheses) qing 黥 (to brand, tattoo), mo 墨 (to ink), ci qing 刺青 (to pierce [and make] blue-green), wen shen 文身 (to pattern the body), diao qing 雕青 (to carve and [make] blue-green), ju yan 沮顏 (to injure the countenance), wen mian 文面 (to pattern the face), li mian 剺面 (to cut the face), hua mian 畫面 (to mark the face), lou shen 鏤身 (to engrave the body), lou ti 鏤體 (same), xiu mian 繡面 (to embroider [or ornament] the face), ke nie 刻涅 (to cut [and] blacken), nie zi 涅字 (to blacken characters), ci zi 刺字 (to pierce characters), and so on. These terms are sometimes used together, and there are numerous further variations. In general, if the tattooing of characters (字) appears in the term, it refers to punishment, but this is certainly not true in every case. Likewise, if a term literally meaning “to ornament” or “decorate” is used, it does not necessarily mean that the tattoo was done voluntarily or for decorative purposes.

All of the types of tattoo, except perhaps for the figurative and textual, are usually described as inherently opprobrious; people bearing them are stigmatized as impure, defiled, shameful or uncivilized. There does not ever seem to have been a widespread acceptance of tattoo of any type by the “mainstream” society; this was inevitable, partly due to the early and long-lasting association of body marking with peoples perceived as barbaric, or with punishment and the inevitable subsequent ostracism from the society of law-abiding people. Another reason, of course, is the Confucian belief that the body of a filial person is meant to be maintained as it was given to one by one’s parents.

This was first published in June 2000 as issue no. 103 of Sino-Platonic Papers. Although the work contains no illustrations, it does feature copious translations of texts describing tattoos or relating tales about them.

Compensation for kanji-input basic technology subject of lawsuit

A Japanese man who says he invented the technology behind the context-based conversion of a sentence written solely in kana into one in both kanji and kana, as well as another related technology, filed suit against Toshiba on December 7, seeking some US$2.3 million in compensation from his former employer.

Shinya Amano, a professor at Shonan Institute of Technology, said in a written complaint that although the firm received patents for the technologies in conjunction with him and three others and paid him tens of thousands of yen annually in remuneration, he actually developed the technologies alone.

Amano is claiming 10 percent of an estimated ¥2.6 billion in profit Toshiba made in 1996 and 1997 — much higher than the roughly ¥230,000 he was actually awarded for the work over the two-year span.

His claim is believed valid, taking into account the statute of limitations and the terms of the patents.

“This is not about the sum of the money — I filed the suit for my honor,” Amano said in a press conference after bringing the case to the Tokyo District Court.

“Japan is a technology-oriented country, but engineers are treated too lightly here,” he said.

Toshiba said through its public relations office that it believes it paid Amano fair compensation in line with company policy. The company declined to comment on the lawsuit before receiving the complaint in writing.

Amano claims that he invented the technology that converts a sentence composed of kana alone into a sentence composed of both kanji and kana by assessing its context, and another technology needed to prioritize kanji previously used in such conversions.

Using theories of artificial intelligence, the two technologies developed in 1977 and 1978 are still used today in most Japanese word-processing software, he said.

source: Word-processor inventor sues Toshiba over redress, Kyodo News, via Japan Times, December 9, 2007

stroke counts: Taiwan vs. China

One of the myths about Chinese characters is that for each character there is One True Way and One True Way Only for it to be written, with a specific number of specific strokes in a certain specific and invariable order. Generally speaking, characters are indeed taught with standard stroke orders with certain numbers of strokes (the patterns help make it less difficult to remember how characters are written) — but these can vary from place to place, though the characters may look the same. Moreover, people often write characters in their own fashion, though they may not always be aware of this.

Michael Kaplan of Microsoft recently examined the stroke data from standards bodies in China for all 70,195 “ideographs” [sic] in Unicode 5.0 and compared it against “the 54,195 ideographs for which stroke count data was provided by Taiwan standards bodies” to see how much of a difference there was in the stroke counts for the characters that both sides provided data for.

(I’m a bit surprised the two sides have compiled such extensive lists, and I’d love to see them. But that’s another matter.)

He found that 9,768 of these characters (18 percent) have different stroke counts between the two standards, with 9,045 characters differing by 1 stroke, 675 characters by 2 strokes, 44 characters by 3 strokes, 2 characters by 4 strokes, 1 character by 5 strokes, and 1 character by 6 strokes.
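
The figures above can be checked with a little arithmetic. The Python sketch below assumes that the 54,195 characters with Taiwan data are the base for the 18 percent figure; the variable names are made up for illustration.

```python
# Number of characters whose stroke counts differ, keyed by the size of the difference.
differing_by = {1: 9045, 2: 675, 3: 44, 4: 2, 5: 1, 6: 1}

# Assumed base: the characters for which both sides provided stroke counts.
compared = 54195

total_differing = sum(differing_by.values())
share_differing = total_differing / compared
share_off_by_one = differing_by[1] / total_differing

print(total_differing)             # 9768
print(f"{share_differing:.1%}")    # 18.0%
print(f"{share_off_by_one:.1%}")   # 92.6%
```

So the breakdown does add up to 9,768, and the great majority of the mismatches are off by a single stroke.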

Note: This is about stroke counts of matching characters, not about differing stroke counts for traditional and “simplified” characters — e.g., not ? (11 strokes) vs ? (8 strokes).

So, is this a case of chabuduoism, or of truly differing standards? The answer is not yet fully clear; but be sure to read Kaplan’s post and the comments there.

sources and additional info: