variant Chinese characters and Unicode

A submission to the Unicode Consortium’s Ideographic [sic] Variation Database for the “Combined registration of the Adobe-Japan1 collection and of sequences in that collection” is available for review through November 25. This submission, PRI 108, is a revision of PRI 98.

This set “enumerates 23,058 glyphs” and contains 14,664 tetragraphs (Chinese characters / kanji). About three quarters of Unicode pertains to Chinese characters.

Two sets of charts are available: the complete one (4.4 MB PDF), which shows all the submitted sequences, and the partial one (776 KB PDF), which shows “only the characters for which multiple sequences are submitted.”
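For anyone unfamiliar with what a "sequence" is here: in the Ideographic Variation Database, each one is a base character followed by a variation selector from the range U+E0100 to U+E01EF. Below is a minimal Python sketch of that structure. The function names are my own, and the base character U+845B is only an illustration; check the charts for which sequences are actually in the submission.

```python
# Variation selectors used in ideographic variation sequences: U+E0100..U+E01EF.
VS_FIRST, VS_LAST = 0xE0100, 0xE01EF


def make_ivs(base, selector_index):
    """Return base plus the selector_index-th (0-based) selector in the range."""
    if len(base) != 1:
        raise ValueError("base must be a single character")
    code = VS_FIRST + selector_index
    if not VS_FIRST <= code <= VS_LAST:
        raise ValueError("selector index out of range")
    return base + chr(code)


def split_ivs(seq):
    """Return (base, selector code point) if seq has the IVS shape, else None."""
    if len(seq) == 2 and VS_FIRST <= ord(seq[1]) <= VS_LAST:
        return seq[0], ord(seq[1])
    return None


# Illustrative only: U+845B followed by the first selector, U+E0100.
seq = make_ivs("\u845b", 0)
print([f"U+{ord(c):04X}" for c in seq])
```

Software that does not know about variation selectors should simply show the base character, so the text stays readable even without the specific glyph.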

Below is a more or less random sample of some of the tetragraphs.

Initially I was going to combine this announcement with a rant against Unicode’s continued misuse of the term “ideographic.” But I’ve decided to save that for a separate post.

sample image of some of the kanji variants in the proposal

Indian influence on Chinese popular literature: a bibliography

Sino-Platonic Papers has rereleased for free another book-length back issue: A Partial Bibliography for the Study of Indian Influence on Chinese Popular Literature (10.8 MB PDF), by Victor H. Mair.

Here are the contents:

  • Journals and Works Referred to in Abbreviated Fashion
  • Catalogs of Tun-huang Manuscripts and Bibliographies of Studies on Them
  • Chinese Studies, Texts, Translations, and Dictionaries
  • Japanese and Korean Studies, Texts, Translations, and Dictionaries; Southeast Asian Sinitic Dictionaries
  • South and Southeast Asian and Buddhicized Central Asian Texts, Translations, and Dictionaries (Includes Indic, Tibetan, Uighur, Indonesian, etc.)
  • Near and Middle Eastern Texts, Translations, and Dictionaries
  • Studies and Texts in European Languages (Other than Translations from the Above Groups)
  • Films, Performances, Lectures, Unpublished Manuscripts, and Personal Communications
  • Articles and Books Not Seen

The introduction is also online in quick-loading HTML format.

This was first published in March 1987 as issue no. 3 of Sino-Platonic Papers.

names of love hotels in macho kanji and other scripts

Donald Richie’s recent review of Japanese Love Hotels: A Cultural History, by Sarah Chaplin, has the following interesting section:

The contemporary love hotel is now much more kawaii (cute) than kinky.

Among the reasons offered for this is that there has been something of a power shift in love-hotel choice. It used to be the male half that decided. Back then the places had hopeful macho monikers — Empire, Rex, King. Then the female half began to choose. Love hotels started calling themselves “fashion hotels” or “boutique hotels,” and began to have lavish lobbies with theme-shops, colors like beige and lavender, and decor like Laura Ashley.

This change can be documented in the Meguro Emperor (still in Meguro), which began in 1973 as a he-man fort before it slowly metamorphosed into a romantic Disneyland castle. The interior has been several times revised to segue from male- to female-friendly. Even the name has changed. It is now Gallery Hotel.

In most love hotels “macho” kanji has been replaced by “feminine” hiragana, trendy katakana or, more often, romaji, that romanized script that carries no male/female associations at all.

source: It’s ladies first now in Japanese love hotels, Japan Times, August 26, 2007

Japanese and attitudes toward kanji

Ken of What Japan Thinks has helpfully translated into English the results of a recent poll of 1,010 Japanese adults on their attitudes about kanji ability.

A total of 95 percent of those polled said they believe the kanji ability of elementary and middle school children is “undesirably low.” Of those giving this response, 56 percent associated the problem with a drop in school education levels.

A slight majority (52 percent) of all those polled reported a lack of confidence in their own kanji ability.

Here are the questions. For the responses, see the translation or the poll results in Japanese (『漢字力』などに関する調査, Goo Research, June 27, 2007):

  • Do you feel that elementary and middle school children’s kanji ability is sufficient?
    • It’s undesirably low
      • Why do you think that?
    • It’s not a problem
      • Why do you think that?
  • Do you have confidence in your own kanji ability?
    • Yes
    • No
      • Why don’t you have confidence in your own kanji ability?
  • What do you do when you cannot produce a kanji character?

    Google’s new ‘cross-language information retrieval’

    Google has just added a “cross-language information retrieval” (CLIR) function to Google Translate.

    Here is how Google describes it:

    Now, you can search for something in your own language (for example, English) and search the web in another language (for example, French). If you’re looking for wine tasting events in Bordeaux while on vacation in France, just type “wine tasting events in Bordeaux” into the search box on the “Search results” tab on Google Translate. You’ll then get French search results and a (machine) translation of these search results into English. Similarly, an Arabic speaker could look for restaurants in New York, by searching for “???? ???????”; or a Chinese speaker could look for documents on machine learning on the English web by looking for “????”.

    These are the languages available. For now they do not work in all combinations but mainly to or from English. (German and French are the only languages listed that can work with each other rather than English.)

    • Arabic
    • English
    • French
    • German
    • Italian
    • Japanese
    • Korean
    • Mandarin (in traditional characters)
    • Mandarin (in simplified characters)
    • Portuguese
    • Russian
    • Spanish

    Japanese literacy–an SPP reissue

    Here’s another re-release from the archives of Sino-Platonic Papers: Computers and Japanese Literacy: Nihonzin no Yomikaki Nôryoku to Konpyûta, by J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures. The link above is to the PDF version (1.2 MB), which reproduces the original exactly.

    This is a parallel text in Japanese (in romanization) and English, so if any of you want to practice reading romaji, here’s your chance.

    The English text alone is available in HTML: Computers and Japanese Literacy.

    The essay touches on many of the themes Unger explores in depth in his books, all of which have excerpts available here on Pinyin Info: The Fifth Generation Fallacy, Literacy and Script Reform in Occupation Japan, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

    Here is the opening, in both English and Japanese (in romanization).

    Watakusi wa saikin, gendai no konpyûta siyô to Nihongo ni tuite kenkyu site orimasu. Gengogakusya mo konpyûta no nôryoku ya mondaiten ni tuite iken o happyo suru sekinin ga aru to omou kara desu.

    I am currently engaged in research on contemporary computer usage and the Japanese language. Linguists too, I believe, have a responsibility to present their views on the potentials and problems of computers.

    Sate, Amerika no zen-Kôsei Kyôiku tyôkan, John Gardner-si no kotoba de hazimetai to omoimasu. Sore wa “aizyô nasi no hihan to hihan nasi no aizyô” (Eigo de iu to, “unloving criticism and uncritical love”) to iu kotoba desu. Gardner-si wa, Amerikazin no aikokusyugi ni tuite Amerika o sukosi de mo hihan site wa ikenai to syutyô suru hito wa kangaetigai da, aizyô nasi ni syakai ya bunka no ketten o hihan bakari suru koto wa motiron warui keredo, hihan sore zitai o kiratte kokusuisyugi o susumeru koto mo syôrai no tame ni yoku nai, to iimasita. Kono koto wa bokoku igai no syakai to bunka ni tai suru baai de mo onazi de wa nai desyô ka? Gengogakusya ya rekisigakusya mo “aizyô nasi no hihan to hihan nasi no aizyô” to iu ryôkyokutan o sakeru yô ni sita hô ga ii to omou no desu. Watakusi wa Nihon no gengo to bunka o senmon ni site, Nihon ni tai site aizyô o motte orimasu kara koso, Nihongo no hyôkihô ya Nihonzin no yomikaki nôryoku ni tuite no teisetu o mondai ni site iru wake desu. Iwayuru zyôhôka syakai no zidai ni hairi, ippan no hitobito ga pasokon ya wâpuro o kozin-yô ni tukau yô ni naru ni turete, nettowâku tûsin, kyôiku-yô sohutowea, sôzôteki na puroguramingu nado ga yôkyû sarete kite iru desyô. Mosi sono konpon ni aru yomikaki nôryoku no henka to genzyô o gokai sureba, gôriteki na konpyûta siyôhô o kaihatu dekinai darô to omou kara desu.

    Let me begin by quoting the former U.S. Secretary of Health, Education, and Welfare, John Gardner. I am thinking of his phrase “unloving criticism and uncritical love.” By this, he meant that it was wrong for proponents of American patriotism to oppose even the slightest criticism of the United States: although it is bad to dwell unsympathetically on finding fault with social and cultural shortcomings, it is equally bad for the future of society to advance nationalism and eschew all criticism. I think that this is also true when considering foreign societies and cultures. Linguists and historians would do well to avoid the twin extremes of “unloving criticism and uncritical love.” As someone professionally involved with the language and culture of Japan, I have an affection for the country, but for that very reason, I wish to call into question the accepted theory of Japanese script and literacy. As we enter the age of the so-called informational society, and as more and more ordinary people begin to use computers on an individual basis, demands on network communications, educational software, creative programming, and so on, will steadily increase. Unless we understand the present situation and history of literacy, which underlies all these applications, we cannot hope to develop a rational basis for computer usage.

    Sate, hyôi mozi to iu kotoba wa Nihongo ni tuite no hon ni yoku dete imasu kara kokugogaku no yôgo da to itte mo ii hodo desu ga, hyôi mozi to iu mono wa zissai ni sonzai site iru desyô ka? Kyakkanteki ni kangaete miru to, dono gengo mo konponteki ni wa hanasu mono desu. Mozi wa syakaiteki, rekisiteki na men ga arimasu ga, mozi wa kotoba no imi no moto de wa arimasen. Tatoeba, itizi mo yomenai mômoku no hito de mo, hoka no syôgai ga nai kagiri, bokokugo ga kanzen ni hanaseru yô ni narimasu. Sitagatte, hanasi-kotoba to wa mattaku kankei ga nai mozi nado to iu mono wa muimi na gainen desu. Gengo no imi wa gengo no kôzô kara hassei si, mozi wa sono han’ei de sika nai wake desu. Kore wa toku ni kore kara no konpyûta o kangaeru toki ni wasurete wa ikemasen….

    The term “ideographic characters” appears so often in books on the Japanese language that one might say it has become a stock phrase of Japanese linguistics. I wonder, however, whether such things as “ideographs” actually exist. When examined objectively, all languages are fundamentally speech. Characters are not the source of the meanings of words, although they do have their social and historical aspects. For example, blind people who cannot read a single character can nonetheless speak their native tongues perfectly, unless they suffer from some other handicap. The very idea of characters totally divorced from speech is therefore meaningless. For the meaning of language emerges from the structure of language, of which writing is merely a reflection. It is particularly important that we not forget this when we consider the computers of the future….

    This was first published in January 1988 as issue no. 6 of Sino-Platonic Papers.

    reviews of books related to China and linguistics

    Sino-Platonic Papers has just released online its first compilation of book reviews. Here is a list of the books discussed. (Note: The links below do not lead to the reviews but to other material.)

    Invited Reviews

    • J. Marshall Unger, The Fifth Generation Fallacy. Reviewed by Wm. C. Hannas
    • Rejoinder by J. Marshall Unger
    • Hashimoto Mantaro, Suzuki Takao, and Yamada Hisao. A Decision for the Chinese Nations: Toward the Future of Kanji (Kanji minzoku no ketsudan: Kanji no mirai ni mukete). Reviewed by Wm. C. Hannas
    • S. Robert Ramsey. The Languages of China. Reviewed by Wm. C. Hannas
    • James H. Cole, Shaohsing. Reviewed by Mark A. Allee
    • Henry Hung-Yeh Tiee, A Reference Grammar of Chinese Sentences. Reviewed by Jerome L. Packard

    Reviews by the Editor

    • David Pollack, The Fracture of Meaning
    • Jerry Norman, Chinese
    • N. H. Leon, Character Indexes of Modern Chinese
    • Shiu-ying Hu, comp., An Enumeration of Chinese Materia Medica
    • Donald M. Ayers, English Words from Latin and Greek Elements
    • Chen Gang, comp., A Dictionary of Peking Colloquialisms (Beijing Fangyan Cidian)
    • Dominic Cheung, ed. and tr., The Isle Full of Noises
    • Jonathan Chaves, ed. and tr., The Columbia Book of Later Chinese Poetry
    • Philip R. Bilancia, Dictionary of Chinese Law and Government
    • Charles O. Hucker, A Dictionary of Official Titles in Imperial China
    • Robert K. Logan, The Alphabet Effect
    • Liu Zhengtan, Gao Mingkai, et al., comp., A Dictionary of Loan Words and Hybrid Words in Chinese (Hanyu Wailai Cidian)
    • The Mandarin Daily Dictionary of Loan Words (Guoyu Ribao Wailaiyu Cidian)
    • Shao Xiantu, Zhou Dingguo, et al., comp., A Dictionary of the Origins of Foreign Place Names (Waiguo Diming Yuyuan Cidian)
    • Tsung-tung Chang, Metaphysik, Erkenntnis und Praktische Philosophie um Chuang-Tzu
    • Irene Bloom, trans., ed., and intro., Knowledge Painfully Acquired: The K’un-chih chi of Lo Ch’in-shun
    • Research Institute for Language Pedagogy of the Peking College of Languages, comp., Frequency Dictionary of Words in Modern Chinese (Xiandai Hanyu Pinlyu Cidian)
    • Liu Yuan, chief compiler, Word List of Modern Mandarin (Xiandai Hanyu Cibiao)
    • The Editing Group of A New English-Chinese Dictionary, comp., A New English-Chinese Dictionary
    • BBC External Business and Development Group, Everyday Mandarin

    This is SPP no. 8, from February 1988. The entire text is now online as a 4.2 MB PDF.

    indicator of character frequency: a suggestion for programmers

    It occurred to me the other day that many people, especially language learners, might find it useful to have a tool that would take text written in Chinese characters and mark it up according to the frequency of use of the individual characters within.

    Here’s a sentence from a recent CCP rant news item that can serve as an example:

    非驴非马的“网语”不再满足于偏安网络一隅,正迅速向着其它媒体渗透,因而加剧了报纸电视等文字语言的混乱,玷污了汉语言文化的纯洁。
    (Fēilǘfēimǎ de “wǎng yǔ” bùzài mǎnzú yú piān’ān wǎngluò yīyú, zhèng xùnsù xiàngzhe qítā méitǐ shèntòu, yīn’ér jiājù le bàozhǐ diànshì děng wénzì yǔyán de hùnluàn, diànwū le Hànyǔ yán wénhuà de chúnjié.)

    Predictably, many of the characters here are extremely common. Others, however, would not even be covered under China’s definition of literacy. I’ve separated these characters into different classes, based on their frequencies of usage and applied different colors to each class:

    • character frequency: 1-100 (class i-c)
    • character frequency: 101-500 (class c-d)
    • character frequency: 501-1000 (class d-m)
    • character frequency: 1001-1500 (class m-md)
    • character frequency: 1501-2000 (class md-mm)
    • character frequency: beyond 2000 (class mmplus)

    So the sample sentence would look like this:

    [the sample sentence, with each character colored according to its frequency class]

    (Those of you reading this through RSS may need to visit the site to see what I’m talking about.)

    The coding I used looks like this, though other approaches are possible:

    <span class="c-d" title="101-500">非</span><span class="mmplus" title="2001+">驴</span>….

    I added titles to make this more accessible.
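To make the idea a bit more concrete, here is a rough Python sketch of how such markup might be generated. It assumes a lookup table that maps each character to its frequency rank; the two ranks below are placeholders I made up so that the output matches the snippet above, and the function names are my own invention. A real tool would load the ranks from a published frequency list.

```python
from html import escape

# (upper bound of band, class name, title text), following the classes listed above.
BANDS = [
    (100, "i-c", "1-100"),
    (500, "c-d", "101-500"),
    (1000, "d-m", "501-1000"),
    (1500, "m-md", "1001-1500"),
    (2000, "md-mm", "1501-2000"),
]
LAST_BAND = ("mmplus", "2001+")


def band_for(rank):
    """Return (class name, title) for a 1-based frequency rank."""
    for upper, cls, title in BANDS:
        if rank <= upper:
            return cls, title
    return LAST_BAND


def mark_up(text, ranks):
    """Wrap every character that has a known rank in a span; pass the rest through."""
    pieces = []
    for ch in text:
        rank = ranks.get(ch)
        if rank is None:
            pieces.append(escape(ch))  # punctuation, spaces, unranked characters
        else:
            cls, title = band_for(rank)
            pieces.append(f'<span class="{cls}" title="{title}">{escape(ch)}</span>')
    return "".join(pieces)


# Placeholder ranks for illustration only, not real frequency data.
SAMPLE_RANKS = {"非": 300, "驴": 2500}
print(mark_up("非驴", SAMPLE_RANKS))
```

In this sketch, characters missing from the table are left unmarked. A real tool might prefer to treat any unlisted Chinese character as belonging to the rarest class.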

    Perhaps adding a summary would be useful:

    1-100        24.6%
    101-500      42.1%
    501-1000      8.8%
    1001-1500    14.0%
    1501-2000     1.8%
    2001+         8.8%
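A summary like this could be computed along the following lines. This is only a sketch: it assumes the same kind of character-to-rank table, and the ranks in the example are placeholders rather than real data (apart from 的, which is generally listed as the most frequent character).

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the first five bands, and a label for each of the six bands.
BOUNDS = [100, 500, 1000, 1500, 2000]
LABELS = ["1-100", "101-500", "501-1000", "1001-1500", "1501-2000", "2001+"]


def band_label(rank):
    """Return the band label for a 1-based frequency rank."""
    return LABELS[bisect_left(BOUNDS, rank)]


def summarize(text, ranks):
    """Return {band label: percent} over the characters of text that have a rank."""
    counts = Counter(band_label(ranks[ch]) for ch in text if ch in ranks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}


# Placeholder ranks; 马 is deliberately left out to show that unranked characters are skipped.
sample_ranks = {"的": 1, "非": 300, "驴": 2500}
for label, percent in summarize("非驴非马的", sample_ranks).items():
    print(f"{label:<12}{percent:>5}%")
```

Note that the percentages are taken over ranked characters only, so punctuation does not distort the totals.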

    This approach could also be used for Japanese — for example, to highlight all kanji not included in the Jōyō kanji, or to highlight different sets of the Kyōiku kanji. For that matter, it could also be applied to written words in English or other languages that use alphabets, though conjugations, plurals, and the like would complicate matters.
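For the Japanese case, the check is set membership rather than frequency rank. Here is a small Python sketch under some loud assumptions: the "allowed" set is a two-character stand-in and not the real Jōyō list, the class name is made up, and the kanji test only covers the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def is_kanji(ch):
    """Rough test: is ch in the main CJK Unified Ideographs block?"""
    return "\u4e00" <= ch <= "\u9fff"


def flag_outside_set(text, allowed, cls="not-listed"):
    """Wrap each kanji that is not in the allowed set in a span; leave the rest alone."""
    pieces = []
    for ch in text:
        if is_kanji(ch) and ch not in allowed:
            pieces.append(f'<span class="{cls}">{ch}</span>')
        else:
            pieces.append(ch)
    return "".join(pieces)


# Stand-in set for illustration; a real tool would load a full official list.
ALLOWED = {"字", "読"}
print(flag_outside_set("漢字を読む", ALLOWED))
```

Kana pass through untouched because they fall outside the block being tested.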

    So, would anyone like to try coming up with one of these? Or has it been done already?

    one possible resource: