UTF-8 Unicode vs. other encodings over time

Some eight years ago UTF-8 (Unicode) became the most used encoding on Web pages. At the time, though, it was used on only about 26% of Web pages, so it had a plurality but not an absolute majority.

Graph showing growth of the UTF-8 encoding

By the beginning of 2010 Unicode was rapidly approaching use on half of Web pages.
graph showing a steep rise in the use of UTF-8 and a steep decline in other major encodings

In 2012 the trends were holding up.
graph of UTF-8 website use, 2001-2012

Note that the 2008 crossover point appears different in the latter two Google graphs, which is why I’m showing all three graphs rather than just the third.

A different source (with slightly different figures) provides us with a look at the situation up to the present, with UTF-8 now on 85% of Web pages. Expansion of UTF-8 is slowing somewhat. But that may be due largely to the continuing presence of older websites in non-Unicode encodings rather than lots of new sites going up in encodings other than UTF-8.
growth in Unicode UTF-8 encoding on Web pages, 2010-2015

Here’s the same chart, but focusing on encodings (other than UTF-8) that use Chinese characters, so the percentages are relatively low.
graph of Asian-language encodings on Web pages, 2010-2015

And here’s the same as the above, but with the results for individual languages combined.
graph of Asian-language encodings on Web pages, 2010-2015, combined by language

By the way, Pinyin.info has been in UTF-8 since the site began way back in 2001. The reason that Chinese characters and Pinyin with tone marks appear scrambled within Pinyin News is that a hack caused the WordPress database to be set to Swedish (latin1_swedish_ci), of all things. And I haven’t been able to get it fixed; so just for the time being I’ve given up trying. One of these days….
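For anyone curious about what that kind of scrambling looks like mechanically, here is a minimal sketch of the general effect: text stored as UTF-8 bytes gets read back as if each byte were one character in a single-byte Latin encoding. It is not a diagnosis of this site's database. The function names are hypothetical, and Python's `latin-1` codec is only assumed to be a reasonable stand-in for the database's latin1 character set.

```python
def scramble(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then misread those bytes in a single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def unscramble(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo the misreading; this only works if no bytes were lost along the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")


sample = "Hànyǔ Pīnyīn 漢語拼音"
garbled = scramble(sample)

# Each non-ASCII character turns into two or more junk characters.
print(len(sample), len(garbled))
# The damage is reversible as long as the bytes themselves are intact.
print(unscramble(garbled) == sample)
```

The round trip works here because `latin-1` maps every possible byte to some character. A real repair is harder if the data has been re-saved or truncated after the misreading.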

Popularity of Chinese character country code TLDs

Yesterday we looked at the popularity of the Chinese character TLD for Singapore Internet domains. Today we’re going to examine the Chinese character ccTLDs (country code top-level domains) for those places that use Chinese characters and compare the figures with those for the respective Roman alphabet TLDs.

In other words, how, for example, does the use of .台灣 domains compare with the use of .tw domains?

Since, unlike the case with Singapore, I don’t have the registration figures, I’m having to make do with Google hits, which is a different measure. For this purpose, Google is unfortunately a bit of a blunt instrument. But at least it should be a fairly evenhanded blunt instrument and will be useful in establishing baselines for later comparisons.

A few notes before we get started:

  • Japan has yet to bother with completing the process for its own name in kanji (日本), so it is omitted here.
  • Macau only recently asked for .澳门 (simplified) and .澳門 (traditional), so those figures are still at zero.
  • Oddly enough, there’s no .臺灣 ccTLD, even though the Ma administration, which was in power when Taiwan’s ccTLDs went into effect, officially prefers the more complex form of .臺灣 to .台灣, not to mention preferring it to the simplified .台湾.
                                          Google Hits   Percent of Total
MACAU
  .mo                                      18,400,000             100.00
  .澳门 (simplified)                                0               0.00
  .澳門 (traditional)                               0               0.00
TAIWAN
  .tw                                     206,000,000              99.86
  .台湾 (simplified)                           67,600               0.03
  .臺灣 (traditional, more complex form)            0               0.00
  .台灣 (traditional)                         230,000               0.11
HONG KONG
  .hk                                     193,000,000              99.94
  .香港                                       118,000               0.06
SINGAPORE
  .sg                                      97,800,000             100.00
  .新加坡                                           2               0.00
CHINA
  .cn                                     315,000,000              99.61
  .中国 (simplified)                          973,000               0.31
  .中國 (traditional)                         251,000               0.08

So in no instance does the Chinese character ccTLD reach even one half of one percent of the total for any given place.
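The percentages in the table above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the Google hit counts exactly as listed. The dictionary layout and the `shares` helper are my own naming for illustration.

```python
# Google hit counts as given in the table, grouped by place.
hits = {
    "Macau": {".mo": 18_400_000, ".澳门": 0, ".澳門": 0},
    "Taiwan": {".tw": 206_000_000, ".台湾": 67_600, ".臺灣": 0, ".台灣": 230_000},
    "Hong Kong": {".hk": 193_000_000, ".香港": 118_000},
    "Singapore": {".sg": 97_800_000, ".新加坡": 2},
    "China": {".cn": 315_000_000, ".中国": 973_000, ".中國": 251_000},
}


def shares(counts):
    """Return each TLD's percentage of the total hits for one place."""
    total = sum(counts.values())
    return {tld: 100 * n / total for tld, n in counts.items()}


for place, counts in hits.items():
    for tld, pct in shares(counts).items():
        print(f"{place:10} {tld:6} {pct:6.2f}")
```

Because each place's total is just the sum of its own rows, the shares are only comparable within a place, not across places.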

Here are the results in a chart.

Graph showing that although China leads in domains in Chinese characters, they do not reach even one half of one percent of the total for China

Note that the ratios of simplified to traditional forms in China and Taiwan are roughly mirror images of each other, as is perhaps to be expected.

See also Platform on Tai, Pinyin News, December 30, 2011

Popularity of the Chinese character TLD for Singapore Internet domains

For quite a few years Singapore has had several choices for those wishing to register Singapore-specific domain names, including .com.sg, .net.sg, .org.sg, .edu.sg, .gov.sg, .per.sg, and just .sg.

Of those, .sg is a top-level domain (TLD), whereas .com.sg, .net.sg, .org.sg, .edu.sg, .gov.sg, and .per.sg are second-level domains. This post is mainly concerned with TLDs; but when I’m giving totals I also include .com.sg, .net.sg, .org.sg, .edu.sg, .gov.sg, and .per.sg but exclude specific domains such as groupon.sg. OK, now back to the post.

Although English is the dominant language of Singapore, it is but one of four official languages there, along with Mandarin, Malay, and Tamil, with Mandarin (along with other Sinitic languages) being the most common of the latter three. Some three-quarters of the city-state’s population is ethnic Chinese, and around half of that group speak Mandarin as the main language in their homes. In addition, for decades Singapore has promoted its campaign to Speak Mandarin (or, as it might more candidly be called, the campaign to Strike Hard Against Hoklo, Cantonese, and Other Languages that Your Government Says Are Puny and Insignificant Because They Have Only Tens of Millions of Speakers Apiece).

So you might think that four years ago, when Singapore introduced Singapore’s name in Chinese characters ('Singapore' (Xinjiapo) in Chinese characters) as a top-level Internet domain (TLD), many in that multilingual society might jump at the chance to pick up some domain names ending with “Singapore” in Chinese characters. (Oh, it hurts me to use images instead of real text there; but until I get the hack fixed, that’s what I’m stuck with.)

Let’s take a look at what happened when the gates opened.

In September 2011, the first month that dot-Xinjiapo (.'Singapore' (Xinjiapo) in Chinese characters) domains became available, a total of 86 were registered. That’s not much of a land rush. The next month and the month after that saw no new registrations. But, OK, maybe they had a sunrise period limiting things. What happened later?

In December 2011 the number jumped to 218. This figure grew over the year 2012 to an all-time high that October of … 247 domains using the .'Singapore' (Xinjiapo) in Chinese characters TLD. Just 247. During the same month, Singapore had 143,887 registered domains, meaning that at the high point those with the Chinese character TLD were less than one fifth of one percent of the total. Since then, the number has fallen to a mere 210, with the percentage dropping to less than one eighth of one percent of the total.

Let’s look at this over time:
graph of dot-Xinjiapo Singapore domain registrations over time

A Google search for the .'Singapore' (Xinjiapo) in Chinese characters domains reveals that those domains are even less used than the already astonishingly low registration numbers might indicate.

results of a Google search for .新加坡 domains

So that’s a total of two active dot-Xinjiapo domains, one of which is for sale. In other words, basically there’s just one being used. Ouch. That’s about as close to utter insignificance as a Singapore TLD can get.

Indeed, the only sort of Singapore-related domain that is of even less interest to the netizens of Singapore is one within the dot-Cinkappur TLD, with Singapore written in the Tamil script: 'Singapore' as written in Tamil

Dot-Cinkappur (.'Singapore' as written in Tamil) domains have been available since December 2011, which is just a few months after the introduction of dot-Xinjiapo domains. The middle of 2015 saw the all-time record high in dot-Cinkappur domain registrations: sixteen. Since then the number has dropped to just fifteen.

A search on Google for dot-Cinkappur domains reveals zero active sites.

source: Registration Statistics, Singapore Network Information Centre (SGNIC), accessed October 27, 2015

See also: sg domain names in Chinese characters lag, Pinyin News, June 23, 2010.

Emoji, language, and translation

A couple of days ago the New York Times ran a small piece, “How Emojis Find Their Way to Phones.” It contains the sort of nonsense about Chinese characters and language that often sets me off.

Fortunately, Victor Mair quickly posted something on this. J. Marshall Unger (Ideogram: Chinese Characters and the Myth of Disembodied Meaning, The Fifth Generation Fallacy, and Literacy and Script Reform in Occupation Japan) and S. Robert Ramsey (The Languages of China) quickly followed. But since those are in the comments to a Language Log post and thus may not be seen as much as they should be, I thought I’d link to them here.

The Language Log post itself is on Emoji Dick, which is billed as a translation of Moby Dick into emoji. As long as I’m writing, I might as well offer up a sample for you. See if you can determine the original English.

image of a passage from Emoji Dick

Did you try “Call me Ishmael”? Sorry. That’s not it. But if you guessed that I would choose the passage from Moby Dick that mentions Taiwan, give yourself bonus points.

Here’s what the above emoji supposedly translate:

Hereby the casks are sought to be kept damply tight; while by the changed character of the withdrawn water, the mariners readily detect any serious leakage in the precious cargo.

Now, from the South and West the Pequod was drawing nigh to Formosa and the Bashee Isles, between which lies one of the tropical outlets from the China waters into the Pacific.

Ah, of course. It’s all so clear now.

The next time you hear someone use “pictorial language,” “ideographs,” or the like in all seriousness, perhaps ask them for their own English translation of the above string of images.

Actually, Emoji Dick screwed this up some, as part belongs to the main text and part to a footnote.

cropped image of a page from Moby Dick

Milk Shop

Here’s another in my series of photos of English with Chinese character(istic)s, that is, Chinese characters being used to write English (sort of). I want to stress that these aren’t loan words, just approximate phonetic renderings of the English.

Today’s entry — which was taken a few weeks ago in Xinzhu (usually spelled “Hsinchu”), Taiwan — is Mi2ke4 Xia4 (lit. “lost guest summer”).

sign for a drinks store, labeled 'milk shop' in English and 'mi ke xia' in Chinese characters

Crunchy

I tend to think of Hanzi being used to write English words as “Singlish,” after John DeFrancis’s classic spoof, “The Singlish Affair,” which is the opening chapter of his essential book The Chinese Language: Fact and Fantasy. But these days the word is mainly used for Singaporean English. So now I usually go with something like “English with Chinese character(istic)s.”

For a few earlier examples, see my photos of the dog and the butterfly businesses.

Today’s example is “Crunchy,” written as ke3 lang3 qi2 (can bright strange). Kelangqi, however, isn’t how to say “crunchy” in Mandarin (cui4 de is); it’s just an attempt to render the English word using Chinese characters, probably in an attempt to look different and cool.

Sign advertising a store named 'Crunchy' in English and 'ke lang qi' (in Chinese characters) in Mandarin

Crunchy, which is now out of business, was just a block away from the Dog (dou4 ge2) store, which is still around.

Remembering Hu Shih: 1891-1962

black and white photo of the face of Hu Shih (胡適)

Hú Shì
17 December 1891 — 24 February 1962

Today, on the fiftieth anniversary of the death of Hu Shih (Hú Shì/胡適/胡适), I’d like to say a few things in his memory. This is, after all, someone I regard as a hero in many ways. I even keep a photo of him in my office.

The opening of the preface to a splendid new biography of Hu Shih covers the basics:

Hu Shi (1891–1962), “the Father of the Chinese Renaissance,” towered over China’s intellectual landscape in the first half of the twentieth century. Among other achievements, he is credited with having made everyday speech respectable as a medium of written communication. Groomed as a traditional scholar-bureaucrat in his father’s footsteps, he had already turned into an iconoclastic renegade by the time he left Shanghai at the age of eighteen to study in the United States. In John Dewey, whose approach to philosophy was to treat all doctrines as working hypotheses, Hu felt he found “the proper way to think.” He and his associates who studied with Dewey at Columbia University established the framework of China’s modern educational system. A dedicated humanist, social reformer and promoter of women rights, he was, at different periods of his life, president of Peking University, president of the Academia Sinica, and ambassador to Washington.

To return to the most important point, at least in terms of the focus of this site, it was he, more than anyone else, who helped break the stranglehold of Literary Sinitic (a.k.a. classical Chinese). The vernacular movement he spearheaded is of far greater significance and has had a much greater impact on Chinese culture and people’s lives than so-called character simplification. Yet it receives relatively little attention, perhaps because many do not understand — or do not want to admit — how very different Literary Sinitic is from modern standard Mandarin.

Hu Shih is also the one who, more than anyone else, popularized the use of modern punctuation in Chinese texts, such as through his book Zhōngguó Zhéxuéshǐ Dàgāng and his editions of earlier works. That alone should be enough to earn him the eternal gratitude of all who read texts written in Chinese characters.

There’s so much more to the man than this, though most of it falls outside the bounds of this site. So rather than go into it here I will just encourage people to read more by and about him.

Shortly after Hu Shih’s death his son wrote:

father passed away during a cocktail party in honor of the members of the Academia Sinica after the completion of the members’ meeting. He passed away without any pain, and from every one present at the party, I gathered that he died happy, for the last words he said was, “Let’s have some drinks!”

I lift my glass.

dàdǎn jiǎshè

xiǎoxīn qiúzhèng
Nǐ bùnéng zuò wǒ de shī,
zhèngrú wǒ bùnéng zuò nǐ de mèng.

—Hú Shì
from “Mèng yǔ Shī” (夢與詩)

New database of cross-strait differences in Mandarin goes online

Last week, on the same day President Ma Ying-jeou accepted the resignation of a minister who made some drunken lewd remarks at a wěiyá (year-end office party), Ma was joking to the media about blow jobs.

Classy.

screenshot from a video of a news story on this

But it was all for a good cause, of course. You see, the Mandarin expression chuī lǎba, when not referring to the literal playing of a trumpet, is usually taken in Taiwan to refer to a blow job. But in China, Ma explained, chuī lǎba means the same thing as the idiom pāi mǎpì (pat/kiss the horse’s ass — i.e., flatter). And now that we have the handy-dandy Zhōnghuá Yǔwén Zhīshikù (Chinese Language Database), which Ma was announcing, we can look up how Mandarin differs in Taiwan and China, and thus not get tripped up by such misunderstandings. Or at least that’s supposed to be the idea.

The database, which is the result of cross-strait cooperation, can be accessed via two sites: one in Taiwan, the other in China.

It’s clear that a lot of money has been spent on this. For example, many entries are accompanied by well-documented, precise explanations by distinguished lexicographers. Ha! Just kidding! Many entries are really accompanied by videos (some two hundred of them) of cutesy puppets gabbing about cross-strait differences in Mandarin expressions. But if there’s a video in there of the panda in the skirt explaining to the sheep in the vest that a useful skill for getting ahead in Chinese society is chuī lǎba, I haven’t found it yet. Will NMA take up the challenge?

Much of the site emphasizes not so much language as Chinese characters. For example, another expensively produced video feeds the ideographic myth by showing off obscure Hanzi, such as the one for chěng.

WARNING: The screenshot below links to a video that contains scenes with intense wawa-ing and thus may not be suitable for anyone who thinks it’s not really cute for grown women to try to sound like they’re only thwee-and-a-half years old.

screenshot from the video about the character for chěng

In a welcome bit of synchronicity, Victor Mair posted on Language Log earlier the same week on the unpredictability of Chinese character formation and pronunciation, briefly discussing just such patterns of duplication, triplication, etc.

Mair notes:

Most of these characters are of relatively low frequency and, except for a few of them, neither their meanings nor their pronunciations are known by persons of average literacy.

Many more such characters consisting of two, three, or four repetitions of the same character exist, and their sounds and meanings are in most cases equally or more opaque.

The Hanzi for chěng (which looks like 馬馬馬 run together as one character) in the video above is sufficiently obscure that it likely won’t be shown correctly in many browsers on most systems when written in real text: 𩧢. But never fear: It’s already in Unicode and so should be appearing one of these years in a massively bloated system font.
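One likely reason for the spotty support is that this character sits outside Unicode's Basic Multilingual Plane, where the everyday Hanzi live. Here is a small sketch that checks this directly from the character itself, copied from the paragraph above, rather than from any code chart. The variable names are just for illustration.

```python
ch = "𩧢"  # the character for chěng, copied from the text above

code_point = ord(ch)

# Print the code point in the usual U+XXXX notation.
print(f"U+{code_point:04X}")

# Anything above U+FFFF is outside the Basic Multilingual Plane.
outside_bmp = code_point > 0xFFFF
print("outside the BMP:", outside_bmp)

# Such characters need four bytes in UTF-8 and a surrogate pair
# (two 16-bit code units) in UTF-16.
utf8_bytes = len(ch.encode("utf-8"))
utf16_units = len(ch.encode("utf-16-le")) // 2
print("UTF-8 bytes:", utf8_bytes)
print("UTF-16 code units:", utf16_units)
```

Software that quietly assumes one 16-bit unit per character, or fonts that stop at the common ranges, will tend to stumble on exactly this sort of Hanzi.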

Further reinforcing the impression that the focus is on Chinese characters, Liú Zhàoxuán, who is the head of the association in charge of the project on the Taiwan side, equated traditional Chinese characters with Chinese culture itself and declared that getting the masses in China to recognize them is an important mission. (Liu really needs to read Lü Shuxiang’s “Comparing Chinese Characters and a Chinese Spelling Script — an evening conversation on the reform of Chinese characters.”)

Then he went on about how Chinese characters are a great system because, supposedly, they have a one-to-one correspondence with language that other scripts cannot match and people can know what they mean by looking at them (!) and that they therefore have a high degree of artistic quality (gāodù de yìshùxìng). Basically, the person in charge of this project seems to have a bad case of the Like Wow syndrome, which is not a reassuring trait for someone in charge of producing a dictionary.

The same cooperation that built the Web sites led to a new book, Liǎng’àn Měirì Yī Cí (《兩岸每日一詞》 / Roughly: Cross-Strait Term-a-Day Book), which was also touted at the press conference.

The book contains Hanyu Pinyin, as well as zhuyin fuhao. But, alas, the book makes the Pinyin look ugly and fails completely at the first rule of Pinyin: use word parsing. (In the online images from the book, such as the one below, all of the words are se pa ra ted in to syl la bles.)

The Web site also has ugly Pinyin, with the CSS file for the Taiwan site calling for Pinyin to be shown in SimSun, which is one of the fonts it’s better not to use for Pinyin. But the word parsing on the Web site is at least not always wrong. Here are a few examples.

  • “跑神兒” is given as pǎoshénr (good).
  • And apostrophes appear to be used correctly: e.g., fàn’ān (販安), chūn’ān (春安), and fēi’ān (飛安).
  • But “第二春” is run together as “dìèrchūn” (no hyphen) rather than being shown correctly as dì-èr chūn.
  • And “一個頭兩個大” is given as yíɡe tóu liǎnɡɡe dà (for Taiwan) and yīɡe tóu liǎnɡɡe dà (for China). But ge is supposed to be written separately. (The variation of tone for yi is in this case useful.)
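The apostrophe examples above follow the standard Hanyu Pinyin rule that a syllable beginning with a, o, or e takes an apostrophe when it follows another syllable within the same word. Here is a rough sketch of just that one rule. It assumes the syllables have already been grouped into words, which is the hard part, and it does not handle hyphens, as in dì-èr, or the separation of measure words. The function names are hypothetical.

```python
import unicodedata

APOSTROPHE = "\u2019"  # the typographic apostrophe used in the examples above


def base_letter(ch: str) -> str:
    """Strip any tone mark so that 'ā' compares as 'a'."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def join_syllables(syllables):
    """Join the syllables of one word, adding apostrophes where the rule requires."""
    word = syllables[0]
    for syl in syllables[1:]:
        # A non-initial syllable starting with a, o, or e gets an apostrophe.
        if base_letter(syl[0]) in "aoe":
            word += APOSTROPHE
        word += syl
    return word


print(join_syllables(["fàn", "ān"]))
print(join_syllables(["chūn", "ān"]))
print(join_syllables(["pǎo", "shénr"]))
```

The apostrophe is added whether or not the word would otherwise be ambiguous, which is how the rule is generally stated.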

Still, my general impression from this is that we should not expect the forthcoming cross-strait dictionary to be very good.

Further reading: