Google Translate and romaji revisited

OK, Google has improved its Pinyin converter some, though it still fails in important areas. So that’s the present situation for Google and Mandarin.

How about for Google and Japanese?

Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures generously agreed to reexamine Google’s performance in conversions to rōmaji (Japanese written in romanization).

Below is his latest evaluation.

For his initial analysis (in December 2009), see Google Translate and rōmaji.

I ran the test passage through Google Translate again. There’s some improvement, but it’s still pretty mediocre.

Original: [Japanese text not recoverable from the source encoding]

Google Translate:

6-Nichi gogo 4-ji 35-fun-goro, Tōkyō-to Chiyoda-ku Kōkyogaien no todō (uchibori-dōri) no Nijūbashi zen kōsaten de, Chūgoku kara no kankō kyaku no 40-dai no dansei ga jōyōsha ni hane rare, zenshin o tsuyoku Utte mamonaku shibō shita. Kuruma wa hodō ni noriagete aruite ita dansei (69) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten shite ita Tōkyō-to Minato-ku hakkin 3-chōme, kaisha yakuin Takahashi nobe Tsubuse yōgi-sha (24) o jidōsha unten kashitsu shōgai no utagai de genkō-han taiho shi, yōgi o dō chishi ni kirikaete shirabete iru.

Dōsho ni yoru to, shibō shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin shite kita kuruma ni hane rareta. Kuruma wa hidari ni kyū handoru o kiri, shadō to hodō no sakai ni oka reta kasetsu no saku o haneage, hodō ni noriageta toyuu. Saku wa hodō de ran’ningu o shite ita dansei (34) niatari, dansei wa ryōashi ni karui kega.

Dōsho wa, shibō shita dansei no mimoto kakunin o susumeru totomoni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Genba shūhen wa Tōkyō kankō no supotto no hitotsudaga, saikin wa jogingu o tanoshimu hito mo fuete iru.

Notes:

  • The use of numerals dodges a plethora of errors, but “6-Nichi” is still wrong for Muika.
  • Lots of correct capitalizations have been added, but “uchibori” was missed and “Utte” capitalized by mistake.
  • Some false spaces or lack of spaces persist: “hane rare”, “oka reta”; “hitotsudaga” and “niatari” were correctly hitotsu da ga and ni atari in the original test.
  • Names still get butchered (“hakkin” for Shirogane, “nobe Tsubuse” for Nobuhiro).
  • The needless apostrophe in “ran’ningu” is still there.
  • Interestingly, “toyuu” is a new error: it should be to iu.
  • There’s evidence of some attempt to use hyphens, but why not in “kankō kyaku” or “Nijūbashi zen”?

So, to update: Google gets kudos for conscientiousness, but I stick by my original comments.

For more by Prof. Unger, see Pinyin.info’s recommended readings, which include selections from The Fifth Generation Fallacy: Why Japan Is Betting Its Future on Artificial Intelligence, Literacy and Script Reform in Occupation Japan: Reading Between the Lines, and Ideogram: Chinese Characters and the Myth of Disembodied Meaning.

Google Translate and rōmaji

The following is a guest post by Professor J. Marshall Unger of the Ohio State University’s Department of East Asian Languages and Literatures.

The challenge

On 18 November 2009, Mark Swofford posted an item on his website pinyin.info criticizing the way Google Translate produces Hanyu Pinyin from standard Chinese text. He concluded by saying, “Google Translate will also romanize Japanese texts written in kanji and kana, Russian texts written in Cyrillic, etc. But I’ll leave those to others to analyze.” So I decided to take up Swofford’s challenge as it pertains to Japanese. Using Google Translate, I romanized a news item from the Asahi of 6 December 2009:

Original: [Japanese text not recoverable from the source encoding]

Google Translate:

roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jōtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .

dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega .

dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .

genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .

Google’s romanization algorithm does a thoroughly mediocre job compared with what a human transcriber would do. To see this, compare the following:

Google Translate: roku nichi gogo yon ji san go fun goro , tōkyō to chiyoda ku kōkyogaien no todō ( uchibori dōri ) no nijūbashi zen kōsaten de , chūgoku kara no kankō kyaku no yon zero dai no dansei ga jōyōsha ni hane rare , zenshin wo tsuyoku u~tsu te mamonaku shibō shi ta . kuruma wa hodō ni noriage te arui te i ta dansei ( roku kyū ) mo hane , dansei wa atama wo tsuyoku u~tsu te ishiki fumei no jōtai . marunouchi sho wa , unten shi te i ta tōkyō to minato ku hakkin san chōme , kaisha yakuin takahashi nobe tsubuse yōgi sha ( ni yon ) wo jidōsha unten kashitsu shōgai no utagai de genkō han taiho shi , yōgi wo dō chishi ni kirikae te shirabe te iru .
Human transcriber: Muika gogo yo-ji sanjūgo-fun goro, Tōkyō-to Chiyoda-ku Kōkyo Gaien no todō (Uchibori dōri) no Nijūbashi-zen kōsaten de, Chūgoku kara no kankō-kyaku no yonjū-dai no dansei ga jōyōsha ni hanerare, zenshin o tsuyoku utte mamonaku shibō-shita. Kuruma wa hodō ni noriagete aruite ita dansei (rokujūkyū) mo hane, dansei wa atama o tsuyoku utte ishiki fumei no jōtai. Marunouchi-sho wa, unten-shite ita Tōkyō-to Minato-ku Shirogane san-chōme, kaisha yakuin Takahashi Nobuhiro yōgisha (nijūyon) o jidōsha unten kashitsu shōgai no utagai de genkōhan taiho-shi, yōgi o dō-chishi ni kirikaete shirabete iru.

Google Translate: dōsho ni yoru to , shibō shi ta dansei wa ōdan hodō wo arui te wata~tsu te i ta tokoro wo chokushin shi te ki ta kuruma ni hane rare ta . kuruma wa hidari ni kyū handoru wo kiri , shadō to hodō no sakai ni oka re ta kasetsu no saku wo haneage , hodō ni noriage ta to iu . saku wa hodō de ran’ningu wo shi te i ta dansei ( san yon ) ni atari , dansei wa ryōashi ni karui kega .
Human transcriber: Dō-sho ni yoru to, shibō-shita dansei wa ōdan hodō o aruite watatte ita tokoro o chokushin-shite kita kuruma ni hanerareta. Kuruma wa hidari ni kyū-handoru o kiri, shadō to hodō no sakai ni okareta kasetsu no saku o haneage, hodō ni noriageta to iu. Saku wa hodō de ranningu o shite ita dansei (sanjūyon) ni atari, dansei wa ryōashi ni karui kega.

Google Translate: dōsho wa , shibō shi ta dansei no mimoto kakunin wo susumeru totomoni , tōji no kōsaten no shingō no jōkyō wo shirabe te iru .
Human transcriber: Dō-sho wa, shibō-shita dansei no mimoto kakunin o susumeru to tomo ni, tōji no kōsaten no shingō no jōkyō o shirabete iru.

Google Translate: genba shūhen wa tōkyō kankō no supotto no hitotsu da ga , saikin wa jogingu wo tanoshimu hito mo fue te iru .
Human transcriber: Genba shūhen wa Tōkyō kankō no supotto no hitotsu da ga, saikin wa jogingu o tanoshimu hito mo fuete iru.

For the sake of comparison, I have retained Google’s Hepburn-style romanization. The following changes have been made in the text in the righthand column:

  1. Misread words have been rewritten. Many involve numerals; e.g. muika for “roku nichi”, yo-ji for “yon ji”, sanjūgo-fun for “san go fun”. The personal name Nobuhiro is an educated guess, but “Nobetsubuse” is certainly wrong. Shirogane for “hakkin” is a place-name (N.B. Google did not produce *hakukin, indicating that the algorithm does more than just character-by-character on-yomi).
  2. False spaces and consequent misreadings have been eliminated. E.g. hanerare for “hane rare”, watatte ita for “wata~tsu te i ta”.
  3. Run-together phrases have been parsed correctly. E.g. to tomo ni for “totomoni”.
  4. Capitalization of proper nouns and the first words in sentences has been introduced.
  5. Hyphens are used conservatively for prefixes and suffixes, and for compound verbs with suru.
  6. Obsolete “wo” for the particle o has been eliminated. (N.B. Google did not produce *ha for the particle wa, so “wo” for o is just the result of laziness.)
  7. Apostrophes after n to indicate mora nasals in positions where they are not needed have been eliminated.
  8. Punctuation has been normalized to suit the romanized format, and paragraph indentations have been restored.
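
Some of the changes in the list above are purely mechanical and could be scripted, while others (joining false spaces, reading numerals and names) need lexical knowledge. As a minimal sketch, and not a description of anything Google actually does, the hypothetical function `tidy_romaji` below applies only the mechanical ones, items 4, 6, 7, and 8, to Google-style output:

```python
import re


def tidy_romaji(text: str) -> str:
    """Apply a few mechanical clean-ups to space-delimited Hepburn output.

    Illustrative only: the function name and the rules are assumptions
    made for this example, not part of any real romanization tool.
    """
    # Item 6: the particle written "wo" becomes "o" (standalone token only).
    text = re.sub(r"\bwo\b", "o", text)
    # Item 7: drop an apostrophe after n unless a vowel or y follows,
    # the only position where it disambiguates the mora nasal (e.g. kon'ya).
    text = re.sub(r"n[’'](?![aiueoyāīūēō])", "n", text)
    # Item 8: remove the space left before commas and periods.
    text = re.sub(r"\s+([,.])", r"\1", text)
    # Item 4 (in part): capitalize the first letter of each sentence.
    text = re.sub(
        r"(^|[.]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


sample = (
    "saku wa hodō de ran’ningu wo shi te i ta dansei ni atari , "
    "dansei wa ryōashi ni karui kega ."
)
print(tidy_romaji(sample))
```

Note what the sketch leaves alone: false spaces such as “shi te i ta” survive, and proper nouns inside a sentence stay in lowercase, because fixing either requires knowing the words involved.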

One could make the romanized text more easily readable by restoring arabic numerals, italicizing gairaigo, and so on. Of course, if the reporter knew that his/her copy would be reported orally or in romanization, s/he might have chosen different wording to avoid homophonic ambiguities. E.g., Marunouchi-sho could be Marunouchi Keisatsu-sho, though perhaps in the context of a traffic accident story, it is obvious that the suffix sho denotes ‘police station’. Furthermore, in a digraphic Japan, homophones might not be such a great problem. If, for instance, readers were accustomed to seeing dōsho for 同所 ‘same place’, then dō-sho would immediately signal that something different was meant, which, given context, might be entirely sufficient to eliminate misunderstanding.

But having said all that, my guess is that the romanization function of Google Translate was programmed with some care. Rather than criticize the quality of Google’s algorithm, I suggest pursuing the logical consequences of assuming that it deserves about a B+ by current standards.

Analysis

Clearly, there is a vast amount of knowledge an editor needs if s/he wants to bring Google’s result up to an acceptable level of romanization for human consumption. That minimal level, in turn, is probably a far cry from what a committee of linguists might decide would be an ideal romanization for daily use in 21st-century Japan. It is quite obvious why Google’s algorithm blunders — the reasons were well understood and described long ago (e.g. in Unger 1987) — and though the algorithm can be improved, it can never produce perfect results. Computers cannot read minds, and mindreading is ultimately what it would take to produce a flawless romanization.1

Furthermore, imagine the representation of the words of the text that presumably takes shape in some form or other in the mind of the skilled reader of the original text. Given that Google’s programmers are doing their best to get their computers to identify words and their forms from Japanese textual data, it is clear that readers, who achieve excellent comprehension with little or no conscious effort, must be doing vastly more. The sequence of stages — from (1) the original text to (2) the Google transcription, (3) the better edited version, (4) some future “ideal” romanization scheme, and onward to (5) whatever the brain of the skilled reader ultimately distills and comprehends — concretely illustrates how, at each stage, different kinds of information — from the easily programmable to genuine expert knowledge — must be brought to bear on the raw data.

Of course, something similar can be said of English texts as well: like Chinese characters, orthographic words of English, even though written with letters of the roman alphabet, typically function both logographically and phonographically. The English reader has to do some work too. But how much? Think of the sequence of stages just described in reverse order. The step from the mind of an expert reader (5) to an ideal romanization (4) is short compared with the distance down to the crude level of romanization produced by Google Translate (2). Yet Google does quite a bit relative to the original text (1). It does not totally fail, but rather makes mistakes, which, as just demonstrated, a human editor can identify and correct. It manages to find many word boundaries and no doubt could do better if the company’s programmers consulted some linguists and exerted themselves more. The point is that Japanese readers must cover the whole distance from the text to genuine comprehension, a distance that must be much greater than that traversed by the practiced reader of English, for all its quaint anachronistic spellings. With a decent, standardized roman orthography, the Japanese reader would have a considerably shorter distance to negotiate.

Note

  1. Indeed, starting in the 1980s, Asahi pioneered in the use of an IBM-designed system called NELSON (New Editing and Layout System of Newspapers) that uses large-array keyboards (descriptive input) rather than the sort of kanji henkan methods (transcriptive input) common on personal computers and dedicated word-processing systems. Consequently, the expedient of storing the underlying roman or kana input stream alongside the selected characters is not available for Asahi stories. Of course, such information is routinely thrown away by many other input systems too.

Journal issue focuses on romanization

The most recent issue of the Journal of the Royal Asiatic Society of Great Britain and Ireland (third series, volume 20, part 1, January 2010) features the following articles on romanization movements and script reforms.

  • Editorial Introduction: Romanisation in Comparative Perspective, by İlker Aytürk
  • The Literati and the Letters: A Few Words on the Turkish Alphabet Reform, by Laurent Mignon
  • Alphabet Reform in the Six Independent ex-Soviet Muslim Republics, by Jacob M. Landau
  • Politics of Romanisation in Azerbaijan (1921–1992), by Ayça Ergun
  • Romanisation in Uzbekistan Past and Present, by Mehmet Uzman
  • Romanisation of Bengali and Other Indian Scripts, by Dennis Kurzon
  • The Rōmaji movement in Japan, by Nanette Gottlieb
  • Postscript from the JRAS Editor, Sarah Ansari

Unfortunately, none of these cover any Sinitic languages or the case of Vietnam. And Gottlieb’s take on rōmaji is certainly more conservative than Unger’s. But I expect this will all make for interesting reading.

I am able to view all of the articles on my system. But perhaps others will run up against a subscription wall.

I thank Victor H. Mair for drawing this publication to my attention.

kanji scandal

The Kyoto-based Japan Kanji Aptitude Testing Foundation (the group behind the Kanji of the Year announcement, which also runs Japan’s well-attended kanji aptitude tests) is registered as a public-interest corporation, which means that it is not supposed to generate profits greater than it needs to operate (much like a non-profit organization in the United States). On March 10, however, Japan’s Ministry of Education stepped in, saying that the foundation was making too much money and needed to overhaul its operations.

How much money are we talking about?

The foundation racked up profits of ¥880 million [US$8.8 million] in fiscal 2006 and ¥660 million in fiscal 2007. The value of its assets increased from ¥5 billion at the end of fiscal 2004 to ¥7.35 billion at the end of fiscal 2007. It would not be far-fetched to say that the foundation has created a kanji business. Kanken became a registered trademark. In fiscal 2007 alone, the foundation sold some 1.5 million copies of books. It is also providing kanji-related questions to TV shows.

But the problems go beyond how much money the foundation makes. It has been funneling money into companies controlled by the foundation’s director and his son, the deputy director. “In fiscal 2007, commissions to these companies amounted to 2.48 billion yen [US$24.9 million], accounting for about 40 percent of the foundation’s annual expenditures,” the Asahi Shimbun reported.

Moreover, it appears the companies did little work for the large amount of money they received.

The Ministry of Education has warned the foundation before, with not much in the way of results. The foundation is to report back to the ministry by April 15. Given how entrenched the foundation is within Japan, I don’t expect much to change.


early Chinese tattoos

As my friend Tian of Hanzi Smatter continues to document, some people, Westerners especially, remain keen on having themselves tattooed with Chinese characters — even if they can’t read them. I doubt, though, that many are aware of China’s historical traditions in tattooing. As Carrie E. Reed notes in Early Chinese Tattoo (2.9 MB PDF), which is the latest reissue from Sino-Platonic Papers, “it appears that the practice of tattoo (other than the penal use) never achieved any level of general acceptance or widespread use among most parts of ancient Chinese society of any era.”

Yes, penal use: In early China tattooing was a common way of branding criminals. Often such tattoos were standard designs, such as circles. But sometimes they contained text.

Here’s something from Reed’s discussion of the Yuan dynasty’s legal code:

In the section on illicit sexual relationships we read that, in general, on the first offense the adulterous couple will be separated, but if they are “caught in the act” a second time, the man (it is not clear if the woman is tattooed as well) will be tattooed on the face with the words “committed licentious acts two times” (????) and banished. Numerous examples are given to illustrate this type of punishment.

Reed examines and translates many texts describing tattoos.

Some of the terms encountered in these early texts are (with a literal translation given in parentheses) qing ? (to brand, tattoo), mo ? (to ink), ci qing ?? (to pierce [and make] blue-green), wen shen ?? (to pattern the body), diao qing ?? (to carve and [make] blue-green), ju yan ?? (to injure the countenance), wen mian ?? (to pattern the face), li mian ?? (to cut the face), hua mian ?? (to mark the face), lou shen ?? (to engrave the body), lou ti ?? (same), xiu mian ?? (to embroider [or ornament] the face), ke nie ?? (to cut [and] blacken), nie zi ?? (to blacken characters), ci zi ?? (to pierce characters), and so on. These terms are sometimes used together, and there are numerous further variations. In general, if the tattooing of characters (?) appears in the term, it refers to punishment, but this is certainly not true in every case. Likewise, if a term literally meaning “to ornament” or “decorate” is used, it does not necessarily mean that the tattoo was done voluntarily or for decorative purposes.

All of the types of tattoo, except perhaps for the figurative and textual, are usually described as inherently opprobrious; people bearing them are stigmatized as impure, defiled, shameful or uncivilized. There does not ever seem to have been a widespread acceptance of tattoo of any type by the “mainstream” society; this was inevitable, partly due to the early and long-lasting association of body marking with peoples perceived as barbaric, or with punishment and the inevitable subsequent ostracism from the society of law-abiding people. Another reason, of course, is the Confucian belief that the body of a filial person is meant to be maintained as it was given to one by one’s parents.

This was first published in June 2000 as issue no. 103 of Sino-Platonic Papers. Although the work contains no illustrations, it does feature copious translations of texts describing tattoos or relating tales about them.

Compensation for kanji-input basic technology subject of lawsuit

A Japanese man who says he invented the technology behind the context-based conversion of a sentence written solely in kana into one in both kanji and kana, as well as another related technology, filed suit against Toshiba on December 7, seeking some US$2.3 million in compensation from his former employer.

Shinya Amano, a professor at Shonan Institute of Technology, said in a written complaint that although the firm received patents for the technologies in conjunction with him and three others and paid him tens of thousands of yen annually in remuneration, he actually developed the technologies alone.

Amano is claiming 10 percent of an estimated ¥2.6 billion in profit Toshiba made in 1996 and 1997 — much higher than the roughly ¥230,000 he was actually awarded for the work over the two-year span.

His claim is believed valid, taking into account the statute of limitations and the terms of the patents.

“This is not about the sum of the money — I filed the suit for my honor,” Amano said in a press conference after bringing the case to the Tokyo District Court.

“Japan is a technology-oriented country, but engineers are treated too lightly here,” he said.

Toshiba said through its public relations office that it believes it paid Amano fair compensation in line with company policy. The company declined to comment on the lawsuit before receiving the complaint in writing.

Amano claims that he invented the technology that converts a sentence composed of kana alone into a sentence composed of both kanji and kana by assessing its context, and another technology needed to prioritize kanji previously used in such conversions.

Using theories of artificial intelligence, the two technologies developed in 1977 and 1978 are still used today in most Japanese word-processing software, he said.

source: Word-processor inventor sues Toshiba over redress, Kyodo News, via Japan Times, December 9, 2007

stroke counts: Taiwan vs. China

One of the myths about Chinese characters is that for each character there is One True Way and One True Way Only for it to be written, with a specific number of specific strokes in a certain specific and invariable order. Generally speaking, characters are indeed taught with standard stroke orders with certain numbers of strokes (the patterns help make it less difficult to remember how characters are written) — but these can vary from place to place, though the characters may look the same. Moreover, people often write characters in their own fashion, though they may not always be aware of this.

Michael Kaplan of Microsoft recently examined the stroke data from standards bodies in China for all 70,195 “ideographs” [sic] in Unicode 5.0 and compared it against “the 54,195 ideographs for which stroke count data was provided by Taiwan standards bodies” to see how much of a difference there was in the stroke counts for the characters that both sides provided data for.

(I’m a bit surprised the two sides have compiled such extensive lists, and I’d love to see them. But that’s another matter.)

He found that 9,768 of these characters (18 percent) have different stroke counts between the two standards, with 9,045 characters differing by 1 stroke, 675 characters by 2 strokes, 44 characters by 3 strokes, 2 characters by 4 strokes, 1 character by 5 strokes, and 1 character by 6 strokes.
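
These figures are easy to cross-check. The short Python sketch below uses only the numbers quoted above and assumes that 54,195 (the count of characters with Taiwan stroke data) is the right denominator for the 18 percent figure:

```python
# Characters whose stroke counts differ, keyed by the size of the difference
# (figures as quoted from Kaplan's comparison).
differing = {1: 9045, 2: 675, 3: 44, 4: 2, 5: 1, 6: 1}

# Assumption: the comparison covers the 54,195 characters with Taiwan data.
compared = 54195

total_differing = sum(differing.values())
share = total_differing / compared

print(total_differing)   # 9768, matching the total quoted above
print(f"{share:.0%}")    # 18%
```

The breakdown sums to the quoted total, and more than nine in ten of the differing characters are off by only a single stroke.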

Note: This is about stroke counts of matching characters, not about differing stroke counts for traditional and “simplified” characters — e.g., not ? (11 strokes) vs ? (8 strokes).

So, is this a case of chabuduoism, or of truly differing standards? The answer is not yet fully clear; but be sure to read Kaplan’s post and the comments there.


variant Chinese characters and Unicode

A submission to the Unicode Consortium’s Ideographic [sic] Variation Database for the “Combined registration of the Adobe-Japan1 collection and of sequences in that collection” is available for review through November 25. This submission, PRI 108, is a revision of PRI 98.

This set “enumerates 23,058 glyphs” and contains 14,664 tetragraphs (Chinese characters / kanji). About three quarters of Unicode pertains to Chinese characters.

Two sets of charts are available: the complete one (4.4 MB PDF), which shows all the submitted sequences, and the partial one (776 KB PDF), which shows “only the characters for which multiple sequences are submitted.”

Below is a more or less random sample of some of the tetragraphs.

Initially I was going to combine this announcement with a rant against Unicode’s continued misuse of the term “ideographic.” But I’ve decided to save that for a separate post.

sample image of some of the kanji variants in the proposal