A submission to the Unicode Consortium’s Ideographic [sic] Variation Database for the “Combined registration of the Adobe-Japan1 collection and of sequences in that collection” is available for review through November 25. This submission, PRI 108, is a revision of PRI 98.
This set “enumerates 23,058 glyphs” and contains 14,664 tetragraphs (Chinese characters / kanji). About three quarters of Unicode pertains to Chinese characters.
Two sets of charts are available: the complete one (4.4 MB PDF), which shows all the submitted sequences, and the partial one (776 KB PDF), which shows “only the characters for which multiple sequences are submitted.”
Below is a more or less random sample of some of the tetragraphs.
Initially I was going to combine this announcement with a rant against Unicode’s continued misuse of the term “ideographic.” But I’ve decided to save that for a separate post.
I don’t understand the motivation for having multiple code points for what, it seems to me, are simply typographic variants of the same character. Shouldn’t these distinctions be manifested in the individual font renderings rather than separate encodings? Won’t separate code points make it more difficult to match the characters in a search string with characters in a searched text?
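On the search question: the submission registers variation *sequences* (a base character followed by a variation selector from the supplement block, U+E0100–U+E01EF) rather than wholly new code points for the base characters. A minimal sketch in Python, using an illustrative base character not necessarily taken from the submission:

```python
# Sketch: how an ideographic variation sequence (IVS) is built.
# The base character (U+845B) and selector (VS17) are illustrative
# choices, not claims about any particular Adobe-Japan1 sequence.
base = "\u845b"          # a CJK unified ideograph
vs17 = "\U000e0100"      # VARIATION SELECTOR-17
ivs = base + vs17

print(len(ivs))          # 2 code points: base + selector
print(base in ivs)       # True: a naive search for the base still matches
```

Because the base code point is still present in the text, a plain substring search for the base character continues to match, though selector-aware matching is needed to compare the variants themselves.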
Is this something we should be commenting on to the Unicode Consortium? What are the pluses and minuses of this proposal?
I agree with you, Zev. Throughout history the forms of Chinese characters have kept changing, due either to scribal errors or to individual creativity. What is shown in the complete and partial sets above is most probably just two of many possible variations. From my point of view, there is really no need to try to record the evolutionary history of a language.
To show how many variations a Chinese character can have, I’ve uploaded an image of character variations; hopefully the link works:
At least 5 or 6 variations can be seen for the single character 酒 (wine).
In my opinion, it would make more sense to include an entirely different set for different “fonts” (this term probably is not quite correct), for example, ?? (formal script?), ?? (flow script?), ?? (running script?), ??, ??, etc. The differences between the ‘complete’ and ‘partial’ sets mentioned in this post are all within the ?? class. That is, it represents only a very small part of Chinese character variation within the same group.
I decompose Unicode variants e.g.,
$ perl -C -nwe 'use Unicode::Normalize q(decompose); print decompose($_)'
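For comparison, a sketch of the same idea in Python with the standard unicodedata module. It also shows that canonical decomposition leaves a variation selector in place, since variation selectors (and CJK unified ideographs) have no canonical decomposition mappings; the base character here is illustrative:

```python
import unicodedata

# Illustrative ideographic variation sequence: base character + VS17.
ivs = "\u845b\U000e0100"

# NFD applies canonical decomposition; neither the unified ideograph
# nor the variation selector has a decomposition, so nothing changes.
decomposed = unicodedata.normalize("NFD", ivs)
print(decomposed == ivs)   # True: NFD does not strip the selector
```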
This is a crazy Adobe-Japan proposal: registering their minute typographic variants of Kanji characters as variants of the base characters, using variation selector characters to pseudo-encode them.
For details, you may see here
I’ve been rethinking this question since my original posted comment and I do see a valid motivation for having separate code points in Unicode for even small variants of CJK characters. In the vast majority of situations, it makes sense to use the basic CJK codepoints and to manifest particular typographic forms by means of different fonts. This makes searching and comparison of texts easier.
But there also needs to be a way to TALK about character variants using Unicode-encoded text. In other words, there needs to be a way, in very specialized circumstances, to encode text like “the character ? has a variant form ?” in a way that is font-independent. This can only be done if the two variants have distinct code points within Unicode.
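A search tool can get the best of both worlds by stripping variation selectors before comparing text. A minimal sketch, where strip_variation_selectors is a hypothetical helper and the selector ranges are the standard ones (U+FE00–U+FE0F and U+E0100–U+E01EF):

```python
def strip_variation_selectors(s):
    """Remove variation selectors so texts compare by base characters only.

    Hypothetical helper for illustration; covers the basic selectors
    (U+FE00..U+FE0F) and the supplement (U+E0100..U+E01EF).
    """
    return "".join(
        ch for ch in s
        if not (0xFE00 <= ord(ch) <= 0xFE0F
                or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

# Two sequences with the same (illustrative) base but different selectors.
a = "\u845b\U000e0100"   # base + VS17
b = "\u845b\U000e0101"   # base + VS18
print(a == b)                                                        # False
print(strip_variation_selectors(a) == strip_variation_selectors(b))  # True
```

So the variant distinction is preserved for the specialized “talk about variants” case, while ordinary searching can ignore it.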
So I now think the proposal is sensible. But the variant forms should seldom be used.