Compensation for kanji-input basic technology subject of lawsuit

A Japanese man who says he invented the technology behind the context-based conversion of a sentence written solely in kana into one in both kanji and kana, as well as another related technology, filed suit against Toshiba on December 7, seeking some US$2.3 million in compensation from his former employer.

Shinya Amano, a professor at Shonan Institute of Technology, said in a written complaint that although the firm received patents for the technologies in conjunction with him and three others and paid him tens of thousands of yen annually in remuneration, he actually developed the technologies alone.

Amano is claiming 10 percent of an estimated ¥2.6 billion in profit Toshiba made in 1996 and 1997 — much higher than the roughly ¥230,000 he was actually awarded for the work over the two-year span.

His claim is believed valid, taking into account the statute of limitations and the terms of the patents.

“This is not about the sum of the money — I filed the suit for my honor,” Amano said in a press conference after bringing the case to the Tokyo District Court.

“Japan is a technology-oriented country, but engineers are treated too lightly here,” he said.

Toshiba said through its public relations office that it believes it paid Amano fair compensation in line with company policy. The company declined to comment on the lawsuit before receiving the complaint in writing.

Amano claims that he invented the technology that converts a sentence composed of kana alone into a sentence composed of both kanji and kana by assessing its context, and another technology needed to prioritize kanji previously used in such conversions.

Using theories of artificial intelligence, the two technologies developed in 1977 and 1978 are still used today in most Japanese word-processing software, he said.

source: Word-processor inventor sues Toshiba over redress, Kyodo News, via Japan Times, December 9, 2007

stroke counts: Taiwan vs. China

One of the myths about Chinese characters is that for each character there is One True Way and One True Way Only for it to be written, with a specific number of specific strokes in a certain specific and invariable order. Generally speaking, characters are indeed taught with standard stroke orders with certain numbers of strokes (the patterns help make it less difficult to remember how characters are written) — but these can vary from place to place, though the characters may look the same. Moreover, people often write characters in their own fashion, though they may not always be aware of this.

Michael Kaplan of Microsoft recently examined the stroke data from standards bodies in China for all 70,195 “ideographs” [sic] in Unicode 5.0 and compared it against “the 54,195 ideographs for which stroke count data was provided by Taiwan standards bodies” to see how much of a difference there was in the stroke counts for the characters that both sides provided data for.

(I’m a bit surprised the two sides have compiled such extensive lists, and I’d love to see them. But that’s another matter.)

He found that 9,768 of these characters (18 percent) have different stroke counts between the two standards, with 9,045 characters differing by 1 stroke, 675 characters by 2 strokes, 44 characters by 3 strokes, 2 characters by 4 strokes, 1 character by 5 strokes, and 1 character by 6 strokes.

Note: This is about stroke counts of matching characters, not about differing stroke counts for traditional and “simplified” characters — e.g., not 國 (11 strokes) vs 国 (8 strokes).
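The breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the post, and using the 54,195 characters with Taiwan data as the denominator is an assumption, although it is the one that reproduces the quoted 18 percent.

```python
# Characters whose stroke counts differ, keyed by the size of the difference.
# All figures are taken from the post's summary of Kaplan's comparison.
diffs_by_strokes = {1: 9045, 2: 675, 3: 44, 4: 2, 5: 1, 6: 1}

# Assumption: the comparison covers the 54,195 characters with Taiwan data.
compared = 54195

total_differing = sum(diffs_by_strokes.values())
share_percent = 100 * total_differing / compared

print(f"{total_differing} of {compared} differ ({share_percent:.1f}%)")
for strokes, count in diffs_by_strokes.items():
    print(f"  off by {strokes}: {count} ({100 * count / total_differing:.2f}% of differences)")
```

The six categories add up to the 9,768 total, and the large majority of the differences are a single stroke.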

So, is this a case of chabuduoism, or of truly differing standards? The answer is not yet fully clear; but be sure to read Kaplan’s post and the comments there.

variant Chinese characters and Unicode

A submission to the Unicode Consortium’s Ideographic [sic] Variation Database for the “Combined registration of the Adobe-Japan1 collection and of sequences in that collection” is available for review through November 25. This submission, PRI 108, is a revision of PRI 98.

This set “enumerates 23,058 glyphs” and contains 14,664 tetragraphs (Chinese characters / kanji). About three quarters of Unicode pertains to Chinese characters.

Two sets of charts are available: the complete one (4.4 MB PDF), which shows all the submitted sequences, and the partial one (776 KB PDF), which shows “only the characters for which multiple sequences are submitted.”

Below is a more or less random sample of some of the tetragraphs.

Initially I was going to combine this announcement with a rant against Unicode’s continued misuse of the term “ideographic.” But I’ve decided to save that for a separate post.

sample image of some of the kanji variants in the proposal

names of love hotels in macho kanji and other scripts

Donald Richie’s recent review of Japanese Love Hotels: A Cultural History, by Sarah Chaplin, has the following interesting section:

The contemporary love hotel is now much more kawaii (cute) than kinky.

Among the reasons offered for this is that there has been something of a power shift in love-hotel choice. It used to be the male half that decided. Back then the places had hopeful macho monikers — Empire, Rex, King. Then the female half began to choose. Love hotels started calling themselves “fashion hotels” or “boutique hotels,” and began to have lavish lobbies with theme-shops, colors like beige and lavender, and decor like Laura Ashley.

This change can be documented in the Meguro Emperor (still in Meguro), which began in 1973 as a he-man fort before it slowly metamorphosed into a romantic Disneyland castle. The interior has been several times revised to segue from male- to female-friendly. Even the name has changed. It is now Gallery Hotel.

In most love hotels “macho” kanji has been replaced by “feminine” hiragana, trendy katakana or, more often, romaji, that romanized script that carries no male/female associations at all.

source: It’s ladies first now in Japanese love hotels, Japan Times, August 26, 2007

Japanese and attitudes toward kanji

Ken of What Japan Thinks has helpfully translated into English the results of a recent poll of 1,010 Japanese adults on their attitudes about kanji ability.

A total of 95 percent of those polled said they believe the kanji ability of elementary and middle school children is “undesirably low.” Of those giving this response, 56 percent associated the problem with a drop in school education levels.

A slight majority (52 percent) of all those polled reported a lack of confidence in their own kanji ability.

Here are the questions. For the responses, see the translation or the poll results in Japanese (『漢字力』などに関する調査, Goo Research, June 27, 2007):

  • Do you feel that elementary and middle school children’s kanji ability is sufficient?
    • It’s undesirably low
      • Why do you think that?
    • It’s not a problem
      • Why do you think that?
  • Do you have confidence in your own kanji ability?
    • Yes
    • No
      • Why don’t you have confidence in your own kanji ability?
  • What do you do when you cannot produce a kanji character?

    Japanese literacy–an SPP reissue

    Here’s another re-release from the archives of Sino-Platonic Papers: Computers and Japanese Literacy: Nihonzin no Yomikaki Nôryoku to Konpyûta, by J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures. The link above is to the PDF version (1.2 MB), which reproduces the original exactly.

    This is a parallel text in Japanese (in romanization) and English, so if any of you want to practice reading romaji, here’s your chance.

    The English text alone is available in HTML: Computers and Japanese Literacy.

    The essay touches on many of the themes Unger explores in depth in his books, all of which have excerpts available here on Pinyin Info: The Fifth Generation Fallacy, Literacy and Script Reform in Occupation Japan, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

    Here is the opening, in both English and Japanese (in romanization).

    Watakusi wa saikin, gendai no konpyûta siyô to Nihongo ni tuite kenkyu site orimasu. Gengogakusya mo konpyûta no nôryoku ya mondaiten ni tuite iken o happyo suru sekinin ga aru to omou kara desu.

    I am currently engaged in research on contemporary computer usage and the Japanese language. Linguists too, I believe, have a responsibility to present their views on the potentials and problems of computers.

    Sate, Amerika no zen- Kôsei Kyôiku tyôkan, John Gardner-si no kotoba de hazimetai to omoimasu. Sore wa “aizyô nasi no hihan to hihan nasi no aizyô (Eigo de iu to, “unloving criticism and uncritical love”) to iu kotoba desu. Gardner-si wa, Amerikazin no aikokusyugi ni tuite Amerika o sukosi de mo hihan site wa ikenai to syutyô suru hito wa kangaetigai da, aizyô nasi ni syakai ya bunka no ketten o hihan bakari suru koto wa motiron warui keredo, hihan sore zitai o kiratte kokusuisyugi o susumeru koto mo syôrai no tame ni yoku nai, to iimasita. Kono koto wa bokoku igai no syakai to bunka ni tai suru baai de mo onazi de wa nai desyô ka? Gengogakusya ya rekisigakusya mo “aizyô nasi no hihan to hihan nasi no aizyô” to iu ryôkyokutan o sakeru yô ni sita hô ga ii to omou no desu. Watakusi wa Nihon no gengo to bunka o senmon ni site, Nihon ni tai site aizyô o motte orimasu kara koso, Nihongo no hyôkihô ya Nihonzin no yomikaki nôryoku ni tuite no teisetu o mondai ni site iru wake desu. Iwayuru zyôhôka syakai no zidai ni hairi, ippan no hitobito ga pasokon ya wâpuro o kozin-yô ni tukau yô ni naru ni turete, nettowâku tûsin, kyôiku-yô sohutowea, sôzôteki na puroguramingu nado ga yôkyû sarete kite iru desyô. Mosi sono konpon ni aru yomikaki nôryoku no henka to genzyô o gokai sureba, gôriteki na konpyûta siyôhô o kaihatu dekinai darô to omou kara desu.

    Let me begin by quoting the former U.S. Secretary of Health, Education, and Welfare, John Gardner. I am thinking of his phrase “unloving criticism and uncritical love.” By this, he meant that it was wrong for proponents of American patriotism to oppose even the slightest criticism of the United States: although it is bad to dwell unsympathetically on finding fault with social and cultural shortcomings, it is equally bad for the future of society to advance nationalism and eschew all criticism. I think that this is also true when considering foreign societies and cultures. Linguists and historians would do well to avoid the twin extremes of “unloving criticism and uncritical love.” As someone professionally involved with the language and culture of Japan, I have an affection for the country, but for that very reason, I wish to call into question the accepted theory of Japanese script and literacy. As we enter the age of the so-called informational society, and as more and more ordinary people begin to use computers on an individual basis, demands on network communications, educational software, creative programming, and so on, will steadily increase. Unless we understand the present situation and history of literacy, which underlies all these applications, we cannot hope to develop a rational basis for computer usage.

    Sate, hyôi mozi to iu kotoba wa Nihongo ni tuite no hon ni yoku dete imasu kara kokugogaku no yôgo da to itte mo ii hodo desu ga, hyôi mozi to iu mono wa zissai ni sonzai site iru desyô ka? Kyakkanteki ni kangaete miru to, dono gengo mo konponteki ni wa hanasu mono desu. Mozi wa syakaiteki, rekisiteki na men ga arimasu ga, mozi wa kotoba no imi no moto de wa arimasen. Tatoeba, itizi mo yomenai mômoku no hito de mo, hoka no syôgai ga nai kagiri, bokokugo ga kanzen ni hanaseru yô ni narimasu. Sitagatte, hanasi-kotoba to wa mattaku kankei ga nai mozi nado to iu mono wa muimi na gainen desu. Gengo no imi wa gengo no kôzô kara hassei si, mozi wa sono han’ei de sika nai wake desu. Kore wa toku ni kore kara no konpyûta o kangaeru toki ni wasurete wa ikemasen….

    The term “ideographic characters” appears so often in books on the Japanese language that one might say it has become a stock phrase of Japanese linguistics. I wonder, however, whether such things as “ideographs” actually exist. When examined objectively, all languages are fundamentally speech. Characters are not the source of the meanings of words, although they do have their social and historical aspects. For example, blind people who cannot read a single character can nonetheless speak their native tongues perfectly, unless they suffer from some other handicap. The very idea of characters totally divorced from speech is therefore meaningless. For the meaning of language emerges from the structure of language, of which writing is merely a reflection. It is particularly important that we not forget this when we consider the computers of the future….

    This was first published in January 1988 as issue no. 6 of Sino-Platonic Papers.

    indicator of character frequency: a suggestion for programmers

    It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.

    Here’s a sentence from a recent CCP rant news item that can serve as an example:

    非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
    (Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)

    Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into different classes based on their frequencies of usage, and applied a different color to each class:

    • character frequency: 1-100 (class i-c)
    • character frequency: 101-500 (class c-d)
    • character frequency: 501-1000 (class d-m)
    • character frequency: 1001-1500 (class m-md)
    • character frequency: 1501-2000 (class md-mm)
    • character frequency: beyond 2000 (class mmplus)

    So the sample sentence would look like this:

    的“语”电视

    (Those of you reading this through RSS may need to visit the site to see what I’m talking about.)

    The coding I used looks like this, though other approaches are possible:

    <span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….

    I added titles to make this more accessible.
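A minimal sketch of how such a markup tool might work is given below, in Python. It assumes a lookup table that maps each character to its frequency rank; the table used here (`toy_ranks`) is a hypothetical placeholder with illustrative values, and the function names are likewise invented for this example. The class names and titles follow the scheme described above, and any character missing from the table is treated as falling beyond 2000.

```python
from html import escape

# Upper bound of each frequency band, with the class name and title used above.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
BEYOND = ("mmplus", "2001+")


def band_for_rank(rank):
    """Return (css_class, title) for a 1-based rank; None means the character is unlisted."""
    if rank is not None:
        for upper, css_class, title in BANDS:
            if rank <= upper:
                return css_class, title
    return BEYOND


def is_han(ch):
    """Rough test covering only the main CJK block and Extension A."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def mark_up(text, ranks):
    """Wrap each Chinese character in a span; pass other characters through escaped."""
    parts = []
    for ch in text:
        if is_han(ch):
            css_class, title = band_for_rank(ranks.get(ch))
            parts.append(f'<span class="{css_class}" title="{title}">{ch}</span>')
        else:
            parts.append(escape(ch))
    return "".join(parts)


# Hypothetical ranks for illustration only; a real tool would load a frequency list.
toy_ranks = {"非": 300, "的": 1}
print(mark_up("非驴的", toy_ranks))
```

With these placeholder ranks, 非 lands in class c-d and 驴 (absent from the table) in class mmplus, matching the two spans shown above. The colors themselves would come from a stylesheet keyed on the class names.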

    Perhaps adding a summary would be useful:

    1-100        24.6%
    101-500      42.1%
    501-1000      8.8%
    1001-1500    14.0%
    1501-2000     1.8%
    2001+         8.8%
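A summary like this could be generated from the same rank data. The sketch below is one possible approach, with hypothetical function names. The per-band counts in `sample_ranks` (14, 24, 5, 8, 1, and 5) are not stated in the post; they are inferred from the percentages and from the 57 characters in the sample sentence, and the ranks themselves are placeholders chosen only to fall in the right bands.

```python
from bisect import bisect_left
from collections import Counter

UPPER_BOUNDS = [100, 500, 1000, 1500, 2000]
LABELS = ["1-100", "101-500", "501-1000", "1001-1500", "1501-2000", "2001+"]


def label_for_rank(rank):
    """Map a 1-based frequency rank to its band; None (unlisted) goes in the last band."""
    if rank is None:
        return LABELS[-1]
    return LABELS[bisect_left(UPPER_BOUNDS, rank)]


def summarize(ranks):
    """Return {band label: percentage of characters}, rounded to one decimal place."""
    counts = Counter(label_for_rank(rank) for rank in ranks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}


# Placeholder ranks: one value per character, chosen to land in each band.
# The band counts are inferred from the published percentages (57 characters in all).
sample_ranks = [1] * 14 + [101] * 24 + [501] * 5 + [1001] * 8 + [1501] * 1 + [2001] * 5

for label, percent in summarize(sample_ranks).items():
    print(f"{label:<12}{percent:>5.1f}%")
```

Run on these placeholder ranks, the function reproduces the six percentages in the table above.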

    This approach could also be used for Japanese — for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.

    So, would anyone like to try coming up with one of these? Or has it been done already?

    one possible resource: