separating Pinyin syllables: PHP code

A few weeks ago someone wrote to ask whether I had a script that could divide Pinyin texts into their individual syllables. It so happens that I do. Since I sent out that bit of code, I might as well make it available to everyone (GNU GPL; links back to Pinyin.Info are always appreciated).

It has lots of regular expressions, to make the code nice and compact. I’ve added comments for clarity.

##############################
### SEPARATE THE SYLLABLES
##############################
// In the patterns below, \s matches any whitespace character
// This program assumes that ü is written as v
// The i after each pattern's closing delimiter makes it case insensitive
// \W is a single non-word character (e.g., punctuation)

$search = array ("'([aeiouv])([^aeiounr\W\s])'i", // This line does most of the work
"'(\w)([csz]h)'i", // double-consonant initials
"'(n)([^aeiouvg\W\s])'i", // cleans up most n compounds
"'([aeiuov])([^aeiou\W\s])([aeiuov])'i", // assumes correct Pinyin (i.e., no missing apostrophes)
"'([aeiouv])(n)(g)([aeiouv])'i", // assumes correct Pinyin, i.e. changan = chan + gan
"'([gr])([^aeiou\W\s])'i", // fixes -ng and -r finals not followed by vowels
"'([^e\W\s])(r)'i", // r an initial, except in er
);

$replace = array ("\\1 \\2",
"\\1 \\2",
"\\1 \\2",
"\\1 \\2\\3",
"\\1\\2 \\3\\4",
"\\1 \\2",
"\\1 \\2",
);

// $document holds the Pinyin text to be split; $usertext receives the result
$usertext = preg_replace($search, $replace, $document);

##############################
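As a rough usage sketch, the patterns above can be wrapped in a function and tried on a few words. The function name `separate_syllables` is a placeholder introduced here, not part of the original script, and the inputs assume that ü is written as v and that any needed apostrophes are already in place.

```php
<?php
// Hypothetical wrapper around the search/replace arrays shown above.
function separate_syllables(string $document): string
{
    $search = array(
        "'([aeiouv])([^aeiounr\W\s])'i",          // does most of the work
        "'(\w)([csz]h)'i",                         // double-consonant initials
        "'(n)([^aeiouvg\W\s])'i",                  // most n compounds
        "'([aeiuov])([^aeiou\W\s])([aeiuov])'i",   // consonant between vowels
        "'([aeiouv])(n)(g)([aeiouv])'i",           // changan = chan + gan
        "'([gr])([^aeiou\W\s])'i",                 // -ng and -r finals
        "'([^e\W\s])(r)'i",                        // r as an initial, except in er
    );
    $replace = array(
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2\\3",
        "\\1\\2 \\3\\4",
        "\\1 \\2",
        "\\1 \\2",
    );
    // With array arguments, preg_replace applies each pattern in order
    // to the result of the previous one.
    return preg_replace($search, $replace, $document);
}

echo separate_syllables("Hanyu Pinyin"), "\n"; // Han yu Pin yin
echo separate_syllables("changan"), "\n";      // chan gan
```

Because the patterns run in sequence, their order matters: later patterns clean up cases the earlier ones leave joined.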

Since I’m always going on about the need for word parsing and not separating Pinyin into single syllables, some of you are probably wondering just why I of all people would have ever written such code. The answer is that it’s part of my Pinyin spell-checker, which is only a very basic utility in that it functions by checking for theoretically correct groups of syllables rather than real words (i.e., anything composed of correctly spelled groups of syllables, minus tone marks, will pass even if that word isn’t found in a dictionary).
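The syllable-level check described above might look something like the following. This is only a sketch under stated assumptions: the helper name `find_invalid_syllables` and the tiny syllable list are made up for illustration, and a real checker would need the full table of legal toneless Mandarin syllables.

```php
<?php
// Illustrative only: a deliberately tiny, hypothetical table of legal syllables.
$validSyllables = array('han', 'yu', 'pin', 'yin', 'zhong', 'wen');

// Hypothetical helper: returns the syllables that are not in the table.
// Input is text that has already been separated into syllables.
function find_invalid_syllables(string $separatedText, array $validSyllables): array
{
    $lookup = array_flip($validSyllables); // syllable => index, for fast lookups
    // Split on anything that is not a letter (spaces, punctuation, digits).
    $syllables = preg_split('/[^a-z]+/', strtolower($separatedText), -1, PREG_SPLIT_NO_EMPTY);
    $invalid = array();
    foreach ($syllables as $syllable) {
        if (!isset($lookup[$syllable])) {
            $invalid[] = $syllable;
        }
    }
    return $invalid;
}

print_r(find_invalid_syllables("Han yu Pin yin", $validSyllables)); // nothing flagged
print_r(find_invalid_syllables("Han yu Pim yin", $validSyllables)); // flags "pim"
```

A check like this accepts any run of legal syllables, which is exactly the limitation noted above: it cannot tell a real word from a plausible-looking non-word.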

Suggestions for improvements are always welcome.

Ovid Tzeng reiterates backing for Hanyu Pinyin

Earlier this week Ovid Tzeng, a former minister of education and current minister without portfolio, reaffirmed his support for Taiwan adopting Hanyu Pinyin and said that this is an important issue the government will need to deal with sooner or later.

Zēng Zhìlǎng Jiàoyùbù zhǎng rènnèi, jiānchí cǎiyòng Hànyǔ Pīnyīn, shì tā bèi huàn xiàlái de zhǔyīn zhīyī. Tā zuótiān réng bù gǎi qí zhì, qiángdiào guówài bùguǎn Zhōngwén jiàoxué huò xuéshù qīkān, hěn duō yǐjing gǎiyòng Hànyǔ Pīnyīn, Táiwān bùnéng shìruòwúdǔ, zhè suī fēi xīn zhèngfǔ zuì yōuxiān shīzhèng xiàngmù, dànshì yǐ lièwéi wèilái zhòngdà jiǎntǎo shìxiàng.

Most of the source article for this discusses poet and academic Zheng Chouyu’s backing for Hanyu Pinyin. He stresses his view that this is a practical matter, not a political one.

source: Zhèng Chóuyǔ jiànyì: Zhōngwén yìyīn kěyǐ cǎi Hànyǔ Pīnyīn, United Daily News, June 9, 2009

further reading: Hanyu Pinyin backer to return to Taiwan’s Cabinet, Pinyin News, April 29, 2008

Burger King, romanization, and a Taiwanese morpheme

Last week I was at my neighborhood shopping mall and saw an interesting ad in the Burger King there. Toward the bottom we find the following:
ad text reading 買套餐, A大獎

買套餐, A大獎
(Mǎi tàocān, A dà jiǎng)
(Buy a set meal, score a big prize.)

Here, the Roman letter “A” is used to represent a Taiwanese verb that means something like “get in an easy manner” or “make off with” — though the fine print says that customers just have a chance to get a prize, not that they necessarily will win one.

A is often used in A-qián (“A錢”: to A money), a mixed Taiwanese and Mandarin term that means embezzle/embezzlement.

Perhaps the Ministry of Education has issued an official Chinese character for this morpheme. But even if it has, most people would have no idea how to read it, and it probably would be of spurious origin to boot, just like most of the other characters the ministry has issued. Where a Taiwanese morpheme sounds like the English name of a Roman letter, the romanized form is likely to prevail over the Chinese character.

There are other interesting things about this ad. But I’ll get to those in another post.

Mandarin-language papers in Indonesia

The Jakarta Post recently ran an article titled “Chinese papers attracting younger readers.” That would indeed be encouraging news, given the considerable damage in Indonesia to Sinitic languages and Chinese culture during their suppression there for most of the second half of the twentieth century.

But the specifics of the article are a little less encouraging.

Since controls against newspapers in Mandarin were eased in 2000, Jakarta has had up to six so-called Chinese newspapers. The number is now down to three, at least one of which is operating in the red.

Bambang Suryono (a.k.a. Lie Zuo Hui), editor in chief of the Yìnní Guójì Rìbào (Indonesian International Daily News) said that Chinese newspapers in general did not get many young readers or contributors who could write well.

After the reformation era in 2001, Sino-Mandarin courses started to sprout in the city. However, Bambang said new Chinese language students were not yet apt to read Chinese newspapers.

“They can write simple articles about their daily activities, but not on complex subjects.”

For that matter, not many people on his own staff can write in Hanzi.

Out of his editorial staff of 20, there are only four journalists that can write in Chinese, and the youngest is 58 years old. “It’s so hard to find young people that can write in Chinese,” Bambang said.

Sunardi Mulia, the editor-in-chief of a business daily Indonesia Shang Bao, said most of his paper’s readers were above 55 years old.

“They are those who once went to local Chinese schools,” he said. “More and more of those aging readers are passing away.”

Sunardi’s paper has few reporters. “I have to really roll up my sleeves to report large events,” he said. And about 90 percent of the paper’s editorial team are over 50….

“It’s true more young people have begun learning the language. However, it will take time for them to improve because their cultural surrounding is completely different than previous generations.”

“A lot of their parents cannot speak Mandarin or read Chinese anymore,” said Sunardi.

“So their opportunities to practice and communicate with the language are very limited. They probably need 15 years or so before they can read fluently,” said Sunardi, 59.

Sunardi said his paper’s present marketing target was mostly foreigners, specifically businesspeople from China and Taiwan. But he was optimistic about the future for his paper — again, though, not because of people born and bred in Indonesia but because of new readers arriving as immigrants from China.

No. of copies printed per day, according to the editors:

  • Guoji Ribao: 60,000 copies per day, about one-third of those distributed in Jakarta
  • Shang Bao: 10,000 copies per day, most distributed in Jakarta
  • Harian Indonesia Yinni Sin Chew Ribao (formerly government controlled): 55,000 per day

source: Chinese papers attracting younger readers, Jakarta Post, May 15, 2008 (alternate source)

online texts in Hanyu Pinyin

Chris recently wrote and asked for a list of texts in Pinyin. This site, of course, has at least a few things in Pinyin. Unfortunately, however, they can be a bit difficult to find. So having a list is indeed a good idea.

Here are some readings in Hanyu Pinyin:

Some song lyrics

I should probably figure out a way to incorporate this into the recommended readings section. One of the problems with this site is that it has grown much, much larger than I ever expected, which has resulted in some pages not fitting well within the structure I initially established for Pinyin Info. Over the years various additional readings have been added to the site, a few of which are even in Hanyu Pinyin. But since Pinyin Info’s recommended readings section is set up for books rather than essays, songs, etc., this will involve a rethinking of that page.

I very much hope people can help expand the list by providing links to readings elsewhere in Pinyin. But before listing something in the comments, please make sure it is in real Hanyu Pinyin (e.g., with word parsing instead of bro ken syl la bles, with tone marks instead of tone numbers, and with proper capitalization and punctuation). Alas, most texts that are supposedly in Pinyin do not follow those rules.

Unicode tops other encodings on Web pages: Google

Google is reporting that in December 2007 Unicode became the most frequently used encoding on Web pages.

Just last December there was an interesting milestone on the web. For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings—and by coincidence, within 10 days of one another. What’s more impressive than simply overtaking them is the speed with which this happened.

Here’s Google’s graph:
graph showing percentage of pages in various encodings, 2001-2008, with ASCII starting about 56% in 2001 and declining to about 25% now, which is also about where iso-8859-1 and utf-8 are now

I wish Big5, the encoding most used for Web pages in traditional Chinese characters, had been included in the graph. And I suspect that it’s only within the past ten years — perhaps even within the timeframe of the graph — that more Web pages have been encoded in GB (used for so-called simplified Chinese characters) than Big5. (GB is shown on the graph in green.)

Of course, many (most?) Web pages don’t declare any character encoding. This is especially bad when they contain characters beyond the bounds of ASCII, since those characters will often end up rendered as garbage on systems different than that of the creator of the Web page.
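To make the "garbage" concrete, here is a small sketch of what happens when UTF-8 bytes are read as if they were ISO-8859-1. The helper name `misread_as_latin1` is hypothetical, and the byte-by-byte conversion is written out by hand so that the example does not depend on any optional PHP extension.

```php
<?php
// Hypothetical helper: treat every byte as an ISO-8859-1 character and
// re-encode it as UTF-8, which is what a wrong guess at the encoding amounts to.
function misread_as_latin1(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $char) {
        $code = ord($char);
        $out .= $code < 0x80
            ? $char                                                   // plain ASCII passes through
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F)); // U+0080..U+00FF as two bytes
    }
    return $out;
}

$utf8 = "P\xC4\xABny\xC4\xABn";          // "Pīnyīn" stored as UTF-8 bytes
echo misread_as_latin1($utf8), "\n";     // prints PÄ«nyÄ«n on a UTF-8 terminal
```

Declaring the encoding, for example with a `Content-Type: text/html; charset=UTF-8` header or a matching meta tag, removes the need for the browser to guess.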

So … should I have a post focusing on Unicode without again berating the Unicode Consortium for its continuing unscientific, egregious, and unforgivable use of ideographic? I don’t think so.

source: Moving to Unicode 5.1, Official Google Blog, May 5, 2008

Crazy English in the New Yorker

The latest issue (April 28, 2008) of the New Yorker has an article on China’s Crazy English (Fēngkuáng Yīngyǔ) method: Crazy English: The national scramble to learn a new language before the Olympics, by Evan Osnos.

Li Yang Crazy English (as it is properly known, after Li Yang, the company’s founder, chief spokesman, and head cheerleader) uses untraditional and emphatic but not always proven methods, including shouting and vowel-associated gesticulations, to help students overcome their fear of using English and remember the sounds of their vocabulary words.

Chinese nationalism is also a big part of its approach.

From the article:

A long red-carpeted catwalk sliced through the center of the crowd. After a series of preppy warmup teachers, firecrackers rent the air and Li bounded onstage. He carried a cordless microphone, and paced back and forth on the catwalk, shoulder height to the seated crowd staring up at him.

“One-sixth of the world’s population speaks Chinese. Why are we studying English?” he asked. He turned and gestured to a row of foreign teachers seated behind him and said, “Because we pity them for not being able to speak Chinese!” The crowd roared.

Li professes little love for the West. His populist image benefits from the fact that he didn’t learn his skills as a rich student overseas; this makes him a more plausible model for ordinary citizens. In his writings and his speeches, Li often invokes the West as a cautionary tale of a superpower gone awry. “America, England, Japan—they don’t want China to be big and powerful!” a passage on the Crazy English home page declares. “What they want most is for China’s youth to have long hair, wear bizarre clothes, drink soda, listen to Western music, have no fighting spirit, love pleasure and comfort! The more China’s youth degenerates, the happier they are!” Recently, he used a language lesson on his blog to describe American eating habits and highlighted a new vocabulary term: “morbid obesity.”

Li’s real power, though, derives from a genuinely inspiring axiom, one that he embodies: the gap between the English-speaking world and the non-English-speaking world is so profound that any act of hard work or sacrifice is worth the effort. He pleads with students “to love losing face.” In a video for middle- and high-school students, he said, “You have to make a lot of mistakes. You have to be laughed at by a lot of people. But that doesn’t matter, because your future is totally different from other people’s futures.”

Very soon Sino-Platonic Papers will be issuing a long, critical study of Crazy English. Look for the announcement of that here in Pinyin News.

further reading: