Pinyin-to-Chinese Character Computer Conversion Systems and the Realization of Digraphia in China
by YIN Binyong
Institute of Applied Linguistics, Chinese Academy of Social Sciences
The concept of digraphia in China, that is, the “peaceful co-existence” of two parallel writing systems, Chinese characters and Hanyu Pinyin, has been in existence for a long time; Professor John DeFrancis suggested this term in 1984, but so far this has remained only a proposal. Questions such as whether this idea of “digraphia” in China is realizable, and what steps would be necessary to put it into practice, still require scientific evidence and practical testing. I believe that the successful development of Hanyu Pinyin-to-Chinese character computer conversion systems (“P-H systems”) has in fact answered almost all of the major questions concerning digraphia in China. These P-H computer conversion systems not only prove the feasibility of digraphia in China, but also concretely indicate the basic outlines for Hanyu Pinyin, settling the ceaseless arguments over various technical questions about Hanyu Pinyin. Thus it may be said that Hanyu Pinyin-to-Chinese character computer conversion systems have made some important contributions both in theory and in practice towards the realization of digraphia in China. In this paper, I will discuss this topic under three headings: the history of the development of Pinyin-to-Chinese character conversion systems, their theoretical contribution, and their practical contributions to the realization of digraphia in China.
I. The Three Stages in the Development of Pinyin-to-Chinese Character Computer Conversion Systems
Experiments with Pinyin-to-Chinese character computer conversion systems began at the beginning of the 1980s. It is now clear that this development must go through three stages. These three stages are identical with the principles for the development of digraphic systems. The first stage may be called the “syllable-based conversion” stage, in which each Pinyin syllable which is input is converted to a Chinese character output. For example, if the syllable “feng” is input, we get the characters 風, 蜂, 峰, 瘋, 逢, 縫, etc. Obviously this approach was influenced by the theory that “Chinese is a monosyllabic language”; there are still many people who subscribe to that theory. The biggest problem for such syllable-based conversion systems is of course that there are too many homophonic characters. There have been two types of input systems designed to solve this problem. One way is to add a character-distinguishing symbol after each Hanyu Pinyin syllable as its input, so that this method is effectively just a way of spelling Chinese characters. The other method is to have all the homophonous characters appear together on the screen and let the user choose the one desired. The first method was abandoned quite early, primarily because remembering all the distinguishing symbols for so many different characters was too great a load on users’ memories, and the second was generally adopted. However, the second method of conversion also has many shortcomings. Perhaps the biggest is that the rate of output is very slow. Therefore the syllable-based conversion systems were quickly replaced by the “whole word conversion” system, and at present the syllable-based conversion system is used only as a supplemental system.
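The second method described above, displaying all the homophonous characters and letting the user choose, can be sketched as follows. This is only an illustration: the table `CANDIDATES`, the function name `convert_syllable`, and the order of the candidates are assumptions made for the example, not features of any particular system.

```python
# Illustrative candidate table: one toneless Pinyin syllable maps to many characters.
# Only the "feng" example from the text is included; a real table would be far larger.
CANDIDATES = {
    "feng": ["風", "蜂", "峰", "瘋", "逢", "縫"],
}

def convert_syllable(syllable, choice):
    """Return the character the user picked (0-based index) from the homophone list."""
    options = CANDIDATES.get(syllable, [])
    if not options:
        raise KeyError(f"no characters known for syllable {syllable!r}")
    return options[choice]

# The whole list must be shown and a choice made for every single syllable.
print(CANDIDATES["feng"])
print(convert_syllable("feng", 1))  # the user picks the second entry: 蜂
```

Because every syllable requires a separate selection step, the slowness of this approach follows directly from its design.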
The second stage may be called “whole words and phrase-based conversion systems”, in which entire spoken words or set phrases are entered as input and Chinese characters appear as output. For example, if we input “fengguang”, this system will convert it into the Chinese characters 風光; if we input the phrase “fengherinuan”, meaning “warm and sunny weather”, the Chinese character output will be 風和日暖, etc. Obviously, this approach does not follow the outdated myth of Chinese as a monosyllabic language, but is rather based on the realities of the present-day spoken language, and thus it has achieved some significant results. Nowadays, both in China and abroad, most Pinyin-to-Chinese character conversion systems are based on this principle. By taking whole words and phrases as the basic input units, the rate of confusion of homophonic characters has been greatly reduced. Experiments show that the accuracy rate of this method without any indication of tones is more than ninety percent (that is, the number of incorrect homophonous characters is less than ten percent). The accuracy rate with partial tone marking included is more than ninety-five percent (that is, less than five percent of incorrect homophonous characters). This type of whole word conversion input system has already shown its practical value in actual use. In Beijing, some writers have begun to use such systems to write novels, and scientific and technical workers are using them to translate long scientific works with quite satisfactory results.
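A minimal sketch of word-based conversion, assuming a small lexicon keyed by toneless Pinyin words, is given below. The names `LEXICON` and `convert_words` are hypothetical, the entries are taken from examples in this paper, and placing the most frequent candidate first is an assumption about how such a lexicon might be ordered.

```python
# Hypothetical word lexicon: toneless Pinyin word -> candidate character strings,
# with the most frequent candidate listed first.
LEXICON = {
    "fengguang": ["風光"],
    "fengherinuan": ["風和日暖"],
    "renmin": ["人民"],
    "zhidao": ["知道", "指導", "直到"],
}

def convert_words(pinyin_text):
    """Convert space-separated Pinyin words; also report which words were ambiguous."""
    output, ambiguous = [], []
    for word in pinyin_text.split():
        candidates = LEXICON.get(word)
        if candidates is None:
            output.append(word)      # unknown word: leave the Pinyin untouched
            continue
        if len(candidates) > 1:
            ambiguous.append(word)   # would need tone marks, context, or a user choice
        output.append(candidates[0])
    return "".join(output), ambiguous

print(convert_words("renmin zhidao"))  # ('人民知道', ['zhidao'])
```

Most words in this toy lexicon have a single candidate, which is the effect the text describes: taking the word as the input unit removes most homophone choices.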
The third stage is “sentence and paragraph-based conversion”, also called “section and chapter-based conversion”, in which whole sentences, whole paragraphs, or entire chapters of articles are input in Hanyu Pinyin (using words and phrases as the basic input units) in order to obtain Chinese character output. Using sentence and paragraph-based conversion resolves the problem of homophonous words in isolation, increasing the accuracy of output and reducing the rate of incorrect characters. For example, the word “ba” (without tones) has as its two most commonly used homophones the words expressed by the characters 把 and 吧 . In terms of discourse analysis, the coverb 把 can never appear at the end of a sentence, while the particle 吧 only appears at the end of sentences or clauses. For example,
Qing ni ba zazhi song gei wo ba. 請你把雜誌送給我吧
Therefore these two commonly used words can share the same simple Hanyu Pinyin input form and need not be further distinguished; the computer can automatically choose the correct Chinese character according to the discourse context. This is an application of artificial intelligence, and sentence and paragraph-based conversion systems are also more useful in machine translation between two languages than mere word and phrase-based conversion systems. The study of sentence and paragraph-based conversion systems is just beginning; it is the highest level of Pinyin-to-Chinese character conversion and is a logical extension of it. According to recent research, uniting syllable conversion, word- and phrase-based conversion, and sentence- and paragraph-based conversion systems can result in an accuracy rate of more than ninety-seven percent, with less than three percent of wrong characters derived from this Hanyu Pinyin input.
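The positional rule for “ba” can be sketched in a few lines. This is a deliberately simplified illustration: the function name `convert_ba` and the set `CLAUSE_END` are assumptions, only the two words 把 and 吧 are considered, and a real system would need far richer context analysis.

```python
# Punctuation tokens treated as the end of a sentence or clause (an assumption).
CLAUSE_END = {",", ".", "?", "!", ";"}

def convert_ba(tokens):
    """Replace each toneless 'ba' by 吧 at the end of a clause and by 把 elsewhere."""
    result = []
    for i, token in enumerate(tokens):
        if token.lower() != "ba":
            result.append(token)
            continue
        at_end = i == len(tokens) - 1 or tokens[i + 1] in CLAUSE_END
        result.append("吧" if at_end else "把")
    return result

tokens = "Qing ni ba zazhi song gei wo ba .".split()
print(convert_ba(tokens))
# ['Qing', 'ni', '把', 'zazhi', 'song', 'gei', 'wo', '吧', '.']
```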
II. The Theoretical Contribution of the Pinyin-to-Chinese Character Conversion System to the Realization of Digraphia in China
The significance of the present and future development of Pinyin-to-Chinese character conversion systems lies not only in the opening up of an entirely new territory in the area of Chinese information processing, but more importantly in its great contribution to the realization of digraphia in China, both theoretically and practically.
The primary contribution of this type of conversion system is that it has finally provided scientific proof of the feasibility of digraphia in China. For the past one hundred years, people have been endlessly arguing about the “feasibility” of alphabetizing Chinese. The number of materials documenting the history of that debate could be edited into a very thick book, but none of them provided any scientific proof, so it was impossible to judge which arguments were right and which were wrong. The development of the Pinyin-to-Chinese character conversion system has finally provided a definite answer to this debate.
Let us assume that for the writing of contemporary Mandarin Chinese, the system of Chinese characters has an accuracy of one hundred percent. In fact, if we investigate the matter carefully, it can easily be shown that even when we use Chinese characters to write contemporary Mandarin Chinese, the rate of accuracy is not always equal to one hundred percent, because there are many expressions in the spoken language whose written form is uncertain. But it is popularly believed that Chinese characters are one hundred percent accurate, and there is no other widely accepted medium more accurate for conveying contemporary Mandarin Chinese. For the sake of argument, therefore, let us take Chinese characters as our reference point and define the information-conveying accuracy of Chinese characters as 100%.
What is the accuracy of Hanyu Pinyin in conveying information when it is used to write contemporary Chinese? As noted above, using the third level of Pinyin-to-Chinese character conversion, the so-called “sentence and paragraph-based conversion”, an article written in Pinyin and converted mechanically by computer, with the addition of some basic artificial intelligence programming and some additional technical work, can reach an average accuracy of ninety-seven percent (taking Chinese characters as our reference point). Note that computer conversion is only a type of mechanical conversion; the ability of the human mind to convert is of course always much greater than that of a computer. Any text which a computer can convert can always be converted by the human mind, and texts which cannot be converted by computer can still be converted accurately by the human mind. For example, the pinyin configuration renmin can accurately be converted to the Chinese characters 人民 both by the computer and by the human mind, but the configuration zhidao yuanze will not necessarily be converted by the computer into 指導原則 (guiding principles), because the computer may mistake zhidao for the more commonly occurring 知道 (know), while the human mind could convert more accurately from experience. Because of such cases, the accuracy of conversion (taking Chinese characters as our reference point) for computers is at present only ninety-seven percent, while that of the human mind should surpass that figure. Of course, the accuracy of the human mind can also be improved by constant study, in the same way as the accuracy of computer conversion can be increased by the improvement of the design of software programs. Any writing system, if it is used as a tool of expression (not as a work of art, etc.) and if it can accurately convey the data of the language, may be said to be feasible.
If we say that the degree of accuracy of Pinyin-to-Chinese character conversion is A (where A is a relative variable), and the degree of feasibility of alphabetized (Pinyin) Chinese is P, then according to the above analysis, we can arrive at the following formula:
A ≤ P < 1 (1)
A is a relative variable. At present, the highest value for A is 0.97, but with the constant improvement of Pinyin-to-Chinese character conversion systems, it will reach 0.98, 0.99, 0.995 ... gradually approaching 1. P, then, lies in a half-open interval running from A (inclusive) to 1 (exclusive), and this interval will gradually get smaller as A increases.
This formula helps us to understand two important questions:
- The feasibility of a digraphic system for China (that is, of alphabetized Chinese) is extremely great. As we have seen, at present, the feasibility is equal to or greater than 0.97. If the degree of feasibility can be raised to or above 0.99, or 0.995, the realization of digraphia in China will no longer face any great problems from a theoretical point of view. The above formula shows that the feasibility is gradually approaching complete feasibility. Of course, this type of digraphia at the beginning will be limited to computer use only, and will not be a system commonly used throughout Chinese society. But with the gradual spread of personal computers into every home and office, this computer digraphia will spread to society as a whole. Just as the invention of the technology of printing greatly facilitated the spread of the use of the alphabetized writing systems in the West, so the development and spread of digraphia in China will be facilitated by computer technology. Gradually people will begin to realize that two writing tools are better suited to the achievement of China’s modernization than just one, and will not find using Pinyin to be an extra burden.
- Alphabetized Chinese cannot completely replace Chinese characters. Note that the value of P in the above formula will always be less than 1.00; that is, there will always be places where Hanyu Pinyin cannot replace Chinese characters, and therefore Chinese characters must continue to exist for a long time and cannot be abandoned. There are many reasons why Chinese characters will continue to be used for a long period and the issue is very complex, so I will not discuss it further here, but the above formula gives a concrete mathematical expression to this second conclusion.
Another theoretical contribution of the development of the Pinyin-to-Chinese conversion system is that it reveals the objective principle according to which the development of digraphia in China must evolve simultaneously with the standardization of the modern Chinese spoken language.
Ever since the beginnings of the May Fourth movement, many scholars -- especially those who support the use of alphabetized writing for Chinese -- have all advocated as the main goal of the modern Chinese language standardization movement that spoken and written Chinese should be the same. Unfortunately, this goal has remained primarily a subjective aspiration; as long as Chinese characters continue to be the sole writing system in China, this goal can never be realized. Despite the fact that literary Chinese is no longer used, nevertheless it has been replaced by a half-literary, half-vernacular style of writing, rather than a style based solely on the spoken language.
For a long time, there has not been any efficient way to change this half-literary style of writing. Almost all of those who have promoted alphabetization for Chinese have predicted that using some sort of phonetically-based writing system such as Hanyu Pinyin is the only way to “make the spoken and written language the same” in China, and the only way to achieve the standardization of the modern Chinese language. However, this idea has remained only an hypothesis. With the development of the Pinyin-to-Chinese character conversion systems, however, the objective principle that the standardization of modern Chinese must evolve simultaneously with digraphia has become clear. One of the most important tasks for the perfection of Pinyin-to-Chinese character conversion systems is the establishment of an exhaustive, authoritative data base of commonly used vocabulary. This data base may not be perfect at the outset, but it will gradually become perfected along with the development of a digraphic writing situation in China. This process of perfecting the vocabulary data base will incorporate the series of stages in the standardization of the contemporary Chinese language; the standardization of contemporary Chinese will be “fixed” as its word-bank data base is perfected, rather than remaining a mere aspiration. We may say that using Pinyin-to-Chinese character conversion systems has a kind of internal “force” which “compels” people gradually to abandon the above-mentioned “half-literary, half-vernacular” style of writing and instead go in the direction of “uniting the spoken and written language” styles. When using Hanyu Pinyin as input for computers, people will gradually come to find it inconvenient to type in such semi-literary phrasings as “wo yi di jing” (我已抵京), which are difficult for the computer to disambiguate, and will soon become more comfortable typing as they speak: “wo yijing daoda Beijing” (我已經到達北京). 
Thus we see that Pinyin-to-Chinese conversion systems have not only revealed the close theoretical relationship between the development of digraphia and the standardization of the modern Chinese language, but also that the development of this type of conversion system may facilitate the standardization of the contemporary Chinese language at the same time that the digraphic situation is evolving. (See YIN Binyong, “Pinyin Diannao Poshi Renmen Zhuanbian Wenfeng”, Xin Tang Di-9 qi, 1988)
III. Practical Contributions of Pinyin-to-Chinese Character Conversion Systems to Digraphia in China
In order to realize the possibility of digraphia in China, it is necessary to construct and perfect an adequate alphabetized (pinyin) Chinese writing system. In the previous section we have already noted the theoretical contributions of Pinyin-to-Chinese character conversion systems towards the realization of digraphia in China by both scientifically demonstrating its feasibility and also by showing its interdependent relation to the problem of language standardization in China. In addition, however, there are also many practical and technical problems concerning the construction of an alphabetized writing system for Chinese which have also been endlessly debated without any real resolution. One major reason for this lack of resolution is a lack of broad practical experience in this area. The successful development of Pinyin-to-Chinese character computer conversion systems now provides such practical experience and thus points the way towards some concrete answers to many of these practical problems. In this section I shall discuss some of these problems and the solutions which these systems suggest.
1. Can alphabetized Chinese take the road of “pinyin pictophonetic characters”?
This question has been debated for decades. Schemes for Pinyin pictophonetic characters constitute about two-thirds of the thousands of plans for alphabetized Chinese which the State Language Commission has received. The designers of these schemes basically take standard pictophonetic Chinese characters (consisting of a phonetic component differentiated by a distinguishing semantic “radical”) as their model for alphabetized (Pinyin) Chinese, hoping in this way to make Chinese characters and individual pinyin syllables correspond to each other on a one-to-one basis. For example, the syllable “feng” in Hanyu Pinyin may represent any of the morphemes which are represented by the Chinese characters: 風, 封, 豐, 峰, 烽, 蜂, 楓, 瘋, 逢, 縫 ... etc. A Pinyin pictophonetic writing system correspondingly adds semantically significant “silent” letters to the basic phonological representation “feng” in order to differentiate these different morphemes on a one-to-one basis, with Chinese characters, as follows:
Spelling | Character | Meaning | Meaning of the silent letter |
---|---|---|---|
feng | 風 | wind | |
fengd | 封 | to seal | d represents dongci (verb) |
fengx | 豐 | abundant | x represents xingrongci (adjective) |
fengs | 峰 | summit | s represents shan (mountain) |
fengh | 烽 | beacon | h represents huo (fire) |
fengc | 蜂 | bee | c represents chong (insect) |
fengm | 楓 | maple | m represents mu (wood) |
fengb | 瘋 | insane | b represents bing (sickness) |
fengz | 逢 | to meet | z represents zou (going) |
fengss | 縫 | to sew | ss represents si (silk) |
... etc. |
The designers of these schemes are all quite proud of their systems, and each designer feels that his plan is the best system to resolve the problem of homophony in Chinese characters and morphemes, and that it should be widely promulgated and used. During the first stage of the development of conversion systems (the “syllable-based stage”), many such schemes for Pinyin pictophonetic conversion systems appeared, often referred to as yinxingma (音形碼; sound-form codes), each one hoping to head the list. Unfortunately, most of the designers, and more importantly, most of the users of such software, quickly began to abandon these meticulously worked-out but cumbersome systems, preferring instead the method of having all of the homophonous characters appear together on the screen and mechanically choosing the character desired, in spite of the inefficiency of that method. By the time of the development of the second stage of Pinyin-to-character conversion systems, users discovered that converting Pinyin to Chinese characters with whole words as input units could achieve ninety-five to ninety-seven percent accuracy. After that, no one wanted to go to all the trouble of memorizing and adding extra silent letters after every syllable they typed, just to deal with five to ten percent of homophonous cases. The few remaining homophonous characters can easily be dealt with by a back-up display-and-choose method. The development of whole word and phrase-based Pinyin-to-Chinese character conversion systems, which are based on the linguistic fact that the majority of ambiguous symbols can be easily disambiguated by their linguistic context, has thus clearly exposed the shortcomings of schemes for Pinyin pictophonetic writing systems.
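To make the memory burden of such a scheme concrete, the “feng” table above can be expressed as a lookup keyed by the silent letters. The sketch below is illustrative only; `FENG_SCHEME` and `decode` are hypothetical names, and the data are limited to the ten entries listed in the table.

```python
# Silent-letter suffixes for the syllable "feng", copied from the table above.
FENG_SCHEME = {
    "": "風", "d": "封", "x": "豐", "s": "峰", "h": "烽",
    "c": "蜂", "m": "楓", "b": "瘋", "z": "逢", "ss": "縫",
}

def decode(spelling, base="feng"):
    """Split a pictophonetic spelling into base syllable + silent letters, then look it up."""
    if not spelling.startswith(base):
        raise ValueError(f"{spelling!r} is not built on the syllable {base!r}")
    suffix = spelling[len(base):]
    return FENG_SCHEME[suffix]

print(decode("fengc"))   # 蜂
print(decode("fengss"))  # 縫
```

The lookup itself is trivial for a machine; the difficulty lies with the user, who must remember one such suffix table for every ambiguous syllable.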
2. What is an appropriate way to handle the representation of tones in a Pinyin-based writing system?
The development of Pinyin-to-Chinese character conversion systems has clearly shown in practice that a Pinyin-based writing system should be used to express entire words in the spoken language rather than individual Chinese characters. The next most important problem is to find an appropriate way to handle the representation of tones in a Pinyin-based writing system.
This is another endlessly debated topic. From the point of view of language teaching, if all of the words in a Pinyin text include tone markings (as Hanyu Pinyin texts do at present), there are some advantages. However, from the practical point of view of using a Pinyin-based writing system as an everyday tool, it seems that tone marks need be added only to a small percentage of the words. Although the debate over “full tone marking” versus “partial tone marking” has raged for years, no principled answer has resulted until now.
The practical development of Pinyin-to-Chinese character conversion systems has now led us to a reasonable answer to this long-debated question. It now appears that the Pinyin-to-Chinese character conversion systems developed throughout the world all share the feature that they do not require full marking of tones. This fact is not simply a coincidence, but rather the practical result of the fact that if tone marks are added only when necessary, an accuracy rate of more than ninety-five percent can be obtained in conversion. Furthermore, if we add tone markings to all of the Pinyin words, the accuracy of conversion does not improve significantly beyond that ninety-five percent. In other words, in most cases the remaining homophonous words (less than five percent) are not disambiguated in conversion by adding tone markings. Based on these facts, then, both from the point of view of information processing and from the point of view of using alphabetized (pinyin) Chinese as a practical tool of communication, full marking of tones is inefficient and is not worth the extra effort.
If we adopt the position of partial marking of tones, two questions must be answered, which also will serve as a reply to advocates of the full tone marking position.
The first question is whether indicating tone markings only where necessary rather than on every syllable influences the reader’s accurate pronunciation or understanding of the text. The facts of experience are that if the reader already knows the spoken language, that is to say, if the reader already has an oral mastery of standard Mandarin Chinese, then when he sees the Pinyin forms of the words, especially in context, even though there are no tone markings on the Pinyin words, he can still read and understand them correctly. For example, when one sees the Pinyin “renmin”, one can pronounce it correctly including the correct tones, even though they are not represented. On the other hand, if a reader has not mastered the spoken language (e.g., foreign students), at the beginning he will find some difficulties. These difficulties, however, should be resolved by improving the reader’s command of the standard spoken language, rather than by altering the writing system in general use by native speakers of standard Mandarin. Textbooks and dictionaries for beginning students, of course, may have two parallel Pinyin texts, one with and one mostly without tone markings, until the student’s command of the spoken language reaches the level where tone markings are no longer necessary, as in the case of “renmin” just cited, and they can read essays written in Hanyu Pinyin Chinese mostly without tone markings.
The second question is: if we decide to mark only some of the tones when writing in Hanyu Pinyin, which syllables should be marked? Obviously, each individual writer cannot simply indicate the tones of whichever syllable he pleases arbitrarily. On the other hand, if we set up some principles or rules for tone marking, will this be an extra burden for people to learn? This is a crucial question for partial marking of tones in Pinyinized Chinese.
In general, our experience with Pinyin-to-Chinese character conversion systems tells us that first we must solve the problems of marking tones on monosyllabic words, and then the problems of marking tones on certain bisyllabic words. Polysyllabic words do not have to have tone markings.
Because of their lack of other context within the phonological word, the most important place to add tone markings is on monosyllabic words. For each group of monosyllabic words which share the same segmental Hanyu Pinyin spelling (without tone marking), it is most efficient if the word which occurs most frequently is left unmarked for tone, while the tones of all of the other monosyllabic words which share that spelling are marked for tone. Although, for reasons of efficiency, this principle seems to have been adopted by most Pinyin-to-Chinese character conversion systems to date, no general spelling rule for Hanyu Pinyin has yet been set up. Some examples are:
Pinyin | No Tones | Tone to be Indicated | ||
---|---|---|---|---|
wo | 我 (16,790) | 窩 (96) | 握 (91) | 臥 (30) |
kan | 看 (4,682) | 刊 (5) | 砍 (81) | 坎 (7) |
shang | 上 (10,602) | 傷 (132) | 賞 (12) | 尚 (42) |
The numbers which appear after each character refer to their relative numbers of occurrence as noted in the Xiandai Hanyu Pinlü Cidian (Beijing Language Institute Press, 1986). Based on these figures, only 1.3% of occurrences of Pinyin monosyllabic words spelled wo need to have their tones marked to distinguish them from the word wo which refers to first person singular (the unmarked case); only 2% of cases of Pinyin monosyllabic kan require tone markings to distinguish them from the most commonly occurring word meaning “to see”; only 1.8% of occurrences of shang require tone markings to differentiate them from the word meaning “on, above, mount”, and so on. Obviously, beginning students need only learn the unmarked forms which constitute the majority of cases as these are the easiest to remember.
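The percentages just quoted can be checked from the counts in the table. The short calculation below is a sketch using hypothetical names (`COUNTS`, `marked_share`); it reports the marked words both as a share of all occurrences of the spelling and as a ratio to the unmarked word alone. The figures quoted in the text agree, to one decimal place, with the second ratio (about 1.3%, 2.0% and 1.8%); measured against all occurrences the shares are slightly lower (about 1.3%, 1.9% and 1.7%).

```python
# Occurrence counts copied from the table above.
COUNTS = {
    "wo": {"我": 16790, "窩": 96, "握": 91, "臥": 30},
    "kan": {"看": 4682, "刊": 5, "砍": 81, "坎": 7},
    "shang": {"上": 10602, "傷": 132, "賞": 12, "尚": 42},
}

def marked_share(counts):
    """Return (marked / all occurrences, marked / unmarked occurrences)."""
    total = sum(counts.values())
    unmarked = max(counts.values())   # the most frequent word is left without a tone mark
    marked = total - unmarked
    return marked / total, marked / unmarked

for spelling, counts in COUNTS.items():
    of_total, of_unmarked = marked_share(counts)
    print(f"{spelling}: {100 * of_total:.1f}% of all, {100 * of_unmarked:.1f}% of unmarked")
```

Either way, the words that would need a tone mark account for about two percent or less of the occurrences of each spelling.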
The same method will be used in dealing with bisyllabic words. In any group of words which have the same segmental Pinyin spelling (that is, the same pronunciation except for tone), the tone markings will be omitted from the most common (unmarked) case, and only added to the remaining cases as distinguishers. For example:
zhidao | 知道 | (1,603) |
zhǐdǎo | 指導 | (189) |
zhídào | 直到 | (110) |
If there is one morpheme which is the same in a group of bisyllabic words which have the same segmental pronunciation and spelling except for tone, then that morpheme need not be marked for tone; for example:
zhongxin | 中心 | 193 |
zhōngxin | 忠心 | 6 |
zhòngxin | 重心 | 7 |
quanli | 權利 | 21 |
quanlì | 權力 | 16 |
It may be more difficult to remember bisyllabic words distinguished by tone markings, but fortunately there are very few such cases. When typing on a computer, if one is unsure one can always be helped by calling up the group of homonyms on the screen. Another point is that the method of differentiating near homonyms which differ only in tone just described is limited to those words which are included in the “List of Words in Common Use” (Tongyongci Biao). Words which fall outside this list can only be located by calling the complete list of Pinyin homographs up on the screen; such rare words will of course have to have full tone marking, and in some cases even the corresponding Chinese characters will have to be written down.
Using this approach, in any given text only around five percent of Pinyin words and expressions need be marked for tone. In this way the scope of those words which will require partial marking of tones is reduced to a fairly small number.
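The basic rule of this section, leaving the most frequent word of a homograph group unmarked and marking the rest, can be sketched as follows. The function names `strip_tones` and `assign_spellings` are hypothetical, and the fully toned form given for 知道 is an assumption made for the example. The sketch does not implement the refinement in which a morpheme shared by the whole group is left unmarked.

```python
import unicodedata

# The four Pinyin tone diacritics as combining characters:
# macron, acute, caron, grave. The diaeresis of "ü" is deliberately kept.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tones(pinyin):
    """Remove tone diacritics from a Pinyin string."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

def assign_spellings(group):
    """group: list of (fully toned pinyin, characters, frequency) sharing one toneless spelling."""
    most_frequent = max(group, key=lambda entry: entry[2])
    spellings = {}
    for entry in group:
        toned, chars, _freq = entry
        # Only the most frequent word loses its tone marks; the others keep them.
        spellings[chars] = strip_tones(toned) if entry is most_frequent else toned
    return spellings

group = [("zhīdào", "知道", 1603), ("zhǐdǎo", "指導", 189), ("zhídào", "直到", 110)]
for chars, spelling in assign_spellings(group).items():
    print(chars, spelling)
```

With the frequencies cited above, 知道 is written zhidao without tone marks, while 指導 and 直到 keep theirs.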
3. How to solve the problem of homonyms in alphabetized (Pinyin) Chinese writing?
The problem of homophones has always been regarded as the “fatal flaw” of any phonetic or alphabetized writing system for Chinese. But, as we have seen, the actual experience of Pinyin-to-Chinese character conversion has yielded some unexpected results! Using words and expressions as units of spelling/input, and adding tone markings to only a small number of cases, the accuracy of Pinyin-to-character conversion has reached ninety-five percent! Thus the problem of homophonous words has been reduced to five percent. In fact, this five percent error rate in conversion is a result of the computer’s mechanical conversion. If a Pinyin text is “converted” into Chinese characters by a human mind, the rate of error is definitely less than five percent, and in fact, most of the time can be reduced to zero. Thus our experience with Pinyin-to-Chinese character conversion systems has objectively proved that the problem of homophonous words for alphabetized Chinese is definitely not as serious as formerly believed; in fact, the problem is very small.
In terms of computerized Pinyin-to-Chinese character conversion systems, there are two ways to solve the problem of homophones. One way is to distinguish the forms of the homophones in spelling, as in quanli for 權利 (rights) vs. quanlì or quanlih for 權力 (power). The second way for the computer to solve the problem is to apply “artificial intelligence” to figure out the correct form from the context in which the word is being used. As we have seen, both of these approaches to solving the technical problem of computer conversion of Pinyin-to-Chinese characters have their analogs in the area of the human mind distinguishing homophonous forms when using Hanyu Pinyin to write Chinese. Applying the analog of the first approach, we must distinguish the forms of homophones in the writing system so that readers (as well as computers!) can tell them apart. This will of course increase the number of forms for students to have to remember, so the number of such forms should ideally be kept to a minimum.
As noted above, the most important way to distinguish the forms of homophonous words is to use tone markings. However, for some cases, the addition of tone markings is not sufficient to distinguish between common homophonous monosyllabic words. Among the monosyllabic words, there are a small number of commonly occurring homophones which share the same tone, and thus cannot be adequately distinguished by tone marking alone. In order to deal with these cases, it is also necessary to establish some standardized “variant spellings” for some of these words. The number of words with such “variant spellings” would be small, perhaps only several dozen, but because they occur quite frequently, the effect of these special spellings will be very great. In fact, our Pinyin-to-Chinese character conversion system has already begun to use such special spelling forms and they have proved very effective in distinguishing these common homophonous words. Here are a few examples:
Basic Form | Character | Meaning | Variant Spelling Form | Character | Meaning |
---|---|---|---|---|---|
bei | 北 | north | bey | 被 | (coverb) |
de | 得 | (verb suffix) | d | 的 | (possessive); di 地 (-ly) |
guo | 國 | country | -go | 過 | (experiential suffix) |
mai | 買 | buy | may | 賣 | “sell” |
mei | 沒 | not | moi | 每 | “each” |
men | 門 | door | -mn | 們 | (plural suffix) |
shi | 十 | ten | sh | 是 | “to be” |
ta | 他 | he | taa | 她 | “she”; to 它 “it” |
xiang | 想 | think | xang | 向 | “towards”; xanq 像 “like” |
yi | 以 | take | i | 一 | “one” |
you | 有 | have | yeu | 由 | “from”; iu 又 “again” |
zai | 在 | at | zay | 再 | “again” |
zhe | 這 | this | -zh | 著 | “-ing” |
zi | 字 | character | -z | 子 | (noun suffix) |
In the above variant spelling forms one may see the historical influence of other earlier romanization systems, such as G.R. (Gwoyeu Romatzyh) and the old New Latinized Alphabet (Ladinghua Sin Wenzi). However, in order to make the forms as short as possible, these older models were not imitated indiscriminately. Thus, for example, for the character 每 (meaning “each”), we use moi rather than the old G.R. form meei, and for the character 是 (“to be”) we use sh rather than shy, etc. The forms of such “variant spellings” should be decided on through careful discussion under the auspices of some type of authoritative conference or institution, and then formally promulgated for general use.
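The effect of such variant spellings on conversion can be sketched as a simple lookup. The sketch below is only an illustration built from the examples in the table, not the author's actual system: the names `BASIC`, `VARIANT` and `convert` are hypothetical, suffix forms are stored without their leading hyphen, and each toneless basic form is treated as if it had only the one reading listed, which is an assumption made for brevity.

```python
# Basic forms from the table (toneless; assumed here to have a single reading).
BASIC = {
    "bei": "北", "de": "得", "guo": "國", "mai": "買", "mei": "沒",
    "men": "門", "shi": "十", "ta": "他", "xiang": "想", "yi": "以",
    "you": "有", "zai": "在", "zhe": "這", "zi": "字",
}

# Variant spellings from the table; suffixes are stored without the hyphen.
VARIANT = {
    "bey": "被", "d": "的", "di": "地", "go": "過", "may": "賣",
    "moi": "每", "mn": "們", "sh": "是", "taa": "她", "to": "它",
    "xang": "向", "xanq": "像", "i": "一", "yeu": "由", "iu": "又",
    "zay": "再", "zh": "著", "z": "子",
}


def convert(tokens):
    """Convert a list of spelled words to characters by direct lookup.

    A leading hyphen (marking a suffix) is ignored for the lookup;
    tokens found in neither table are passed through unchanged.
    """
    table = {**BASIC, **VARIANT}
    return "".join(table.get(token.lstrip("-"), token) for token in tokens)


print(convert(["ta", "mai"]))
print(convert(["taa", "may"]))
```

Because no spelling appears in both tables, each of these frequent words converts without any choice having to be made by the user.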
As for using artificial intelligence in computers to solve Pinyin-to-Chinese character conversion in a manner analogous to human intelligence, the future looks promising. Because this is a very complex question, however, it cannot be discussed here.
IV. Directions for the Future
As we have seen, the development of Pinyin-to-Chinese character computer conversion systems has a great impact, both theoretical and practical, on the potential for digraphia in China. Therefore, the task of perfecting such a system is an urgent one. In this writer’s opinion, the following three tasks are the most urgent:
- Standardization: At present, both in China and abroad, many different parties are developing diverse Pinyin-to-Chinese character conversion systems, all of them differing from one another to various degrees. This situation is not only very inconvenient for users; it also benefits neither the development of a unified system nor the development of digraphia in China. Because of this, standardization is an urgent task. For example, we need a standardized database of words to replace all existing dictionaries and vocabulary lists. A standard for ordering such lists of monosyllabic, bisyllabic, etc. words (according to frequency of use) is also needed. The questions of tone markings, differentiation of homophones, variant spellings, and specialized terminology all need standardization work as well.
- Accuracy: In order to realize “sentence and paragraph-based conversion” and “section and chapter-based conversion”, we must improve the accuracy of conversion. This task may be pursued along two tracks. The first is to invest more time in the written forms of Pinyinized words. Although this may increase the memory load for learners, it will increase the accuracy of conversion and is necessary for the realization of alphabetized Chinese; the number of such cases will naturally be quite small. The second is to utilize artificial intelligence to differentiate among homophonous words. Work in this area is just beginning, but its potential is extremely great. Pinyin-to-Chinese character conversion must consistently achieve an accuracy of 99 percent before digraphia can be realized in China.
- Cooperation: In order to realize these goals, it will be necessary for applied linguists and computer specialists to cooperate. Applied linguists tend to look at these problems from the point of view of linguistics and writing systems, while computer specialists tend to think about them in terms of information processing. If these two groups do not begin to cooperate very soon, we will easily find ourselves in a situation where each has “gone their own way” and communication is no longer possible. Such a situation would do nothing to help the realization of digraphia in China. In order to promote this type of cooperation, we need an organization invested with authority, or a specialized conference, to carry out overall planning. This is an extremely important and urgent matter which this writer hopes can be addressed as soon as possible.
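The 99 percent target named under Accuracy can be made concrete with a small measurement sketch. The text does not say how accuracy is to be counted; the sketch assumes a character-by-character comparison between the converted output and a reference text of the same length, and the function names `conversion_accuracy` and `meets_target` are hypothetical.

```python
def conversion_accuracy(converted, reference):
    """Return the fraction of characters that match the reference text.

    Assumes (for this sketch) that both texts have the same, non-zero length,
    so that characters can be compared position by position.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    if len(converted) != len(reference):
        raise ValueError("texts must have the same number of characters")
    matches = sum(c == r for c, r in zip(converted, reference))
    return matches / len(reference)


def meets_target(converted, reference, target=0.99):
    """Check the converted text against an accuracy target (default 99 percent)."""
    return conversion_accuracy(converted, reference) >= target


# One wrong character out of three (賣 for 買) falls well short of the target.
print(conversion_accuracy("他賣字", "他買字"))
print(meets_target("他賣字", "他買字"))
```

Measured this way, a single wrong homophone in a short sentence is enough to miss the target, which is why frequent words such as those given variant spellings matter so much.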