software for Shanghainese

Professor Qián Nǎiróng (Qian Nairong / 錢乃榮) of Shanghai University has just issued free software to help with the writing of Shanghainese (上海话). People may now download the 1.3 MB zip file of the program.

Some examples:

shanghe 上海
shanghehhehho 上海闲/言话(上海话)
whangpugang 黄浦江
suzouhhu 苏州河
shyti 事体(事情)
makshy 物事(东西)
bhakxiang 白相(玩)
dangbhang 打朋(开玩笑)
ghakbhangyhou 轧朋友(交朋友)
cakyhangxiang 出洋相(闹笑话,出丑)
linfhakqin 拎勿清(不能领会)
dhaojiangwhu 淘浆糊(混)
aoshaoxhin 拗造型(有意塑造姿态形象)
ghe 隑(靠)
kang 囥(藏)
yin 瀴(凉、冷)
dia 嗲
whakji 滑稽

The program offers two flavors of romanization. Here are some examples of the differences between the two styles:

New Folk          Old Timers
makshy            mekshy            物事(东西)
bhakxiang         bhekxian          白相(玩)
dangbhang         danbhan           打朋(开玩笑)
ghakbhangyhou     ghakbhanyhou      轧朋友(交朋友)
cakyhangxiang     cekyhanxian       出洋相(闹笑话,出丑)
linfhakqin        linfhekqin        拎勿清(不能领会)

Here’s a brief story on this:

Xiànzài, wǒmen zài wǎngluò zhōng liáotiān de shíhou yuèláiyuè duō de péngyou dōu kāishǐ xǐhuan yòng Shànghǎihuà. Dànshì yǒushíhou shìbushì juéde xiǎng biǎodá dehuà bùzhīdào zěnme dǎ, nòng de yǒudiǎn bùlúnbùlèi ne? Xiànzài, yī ge kěyǐ qīngsōng dǎchū Shànghǎihuà de chéngxù chūlai le.

Jīngguò liǎng nián nǔlì, Shànghǎi dàxué Zhōngwénxì Qián Nǎiróng jiàoshòu jí tā de yánjiūshēng hé dādàng zhōngyú yú běnyuè wánchéng le Shànghǎihuà shūrùfǎ de zhìzuò. Zhíde guānzhù de shì, zhè tào shūrùfǎ hái bāokuò xīn-lǎo liǎng ge bǎnběn, 45 suì yǐshàng de lǎo Shànghǎirén hé niánqīng yī dài de Shànghǎirén dōu kěyǐ zhǎodào zìjǐ de “dǎfǎ.”

Háishi tóngyàng 26 ge zìmǔ de jiànpán, 8 yuè 1 rì qǐ xiàzài le Shànghǎihuà shūrùfǎ zhīhòu, nín jiù kěyǐ tōngguò shūrù “linfhakqin” dǎchū “līn wù qīng,” shūrù “dhaojiangwhu” dǎchū “táo jiànghu” děng yuánzhī yuán wèi de Shànghǎihuà le. Zuótiān, jìzhě tíqián xiàzài dào gāi ruǎnjiàn. Ànzhào shǐyòng shuōmíng, yòng quánpīn de fāngshì chángshì shūrù “laoselaosy” zhèxiē zìmǔ, píngmù shàng, lìjí chūxiàn le “lǎo sānlǎo sì” (Shànghǎihuà, yìsi shì “màilǎo, chōng lǎochéng de yàngzi”).

Jùxī, yóuyú Shànghǎihuà yǔ Pǔtōnghuà de dúfǎ yǒusuǒbùtóng, suǒyǐ zài pīnyīn pīnxiě fāngshì shàng háishi xūyào shǐyòng shuōmíng de bāngzhù. Bǐrú jìzhě fāxiàn, fánshì yǔ Pǔtōnghuà shēngmǔ, yùnmǔ xiāngtóng de zì, zài Shànghǎihuà shūrùfǎ zhōng zuìzhōng yòng de háishi Pǔtōnghuà pīnyīn, bùtóng de zé cǎiyòng Shànghǎihuà shūrùfǎ de pīnxiě fāngshì. Rú “chénguāng” de “chén,” “huātou” de “tóu” dōu fāchéng zhuóyīn, Shànghǎihuà pīnyīn shūrùfǎ zhōng yào zài shēngmǔ zhōng jiā yī ge zìmǔ h, pīnchéng “shen,” “dhou”; fánshì rùshēng zì, zé zài pīnyīn hòu jiā zìmǔ k, rú “báixiāng” de “bái” jiù pīnchéng “bhek.”

Bùguò, dàjiā bùyào juéde tài nán. Jìzhě fāxiàn, Shànghǎihuà shūrùfǎ yǔ Pǔtōnghuà de shūrùfǎ zuìdà xiāngtóng zhī chù zàiyú, zhǐyào liánxù shūrù shēngmǔ hé yùnmǔ jiù kěyǐ, bùxū shūrù shēngdiào. Cǐwài, Shànghǎihuà pīnyīn shūrù xìtǒng háiyǒu lèisì “zhìnéng” yōudiǎn, kěyòng suōlüè fāngshì bǎ cíyǔ pīnxiě chūlai.

Zhǔchí Shànghǎihuà shūrùfǎ kāifā de Shànghǎi dàxué Zhōngwénxì Qián Nǎiróng jiàoshòu gàosu jìzhě, zhè tào shūrùfǎ bùjǐn néng dǎchū Shànghǎihuà dà cídiǎn zhōng 15,000 duō ge cítiáo, érqiě hái néng yòng Shànghǎihuà pīnyīn dǎchū Shànghǎihuà zhōng shǐyòng zhe de, yǔ Pǔtōnghuà cíyì xiāngtóng dàn yǔyīn bùtóng de chángyòng cíyǔ. Rú “Huángpǔ Jiāng” shūrù “whangpugang,” “lǐxiǎng” zéshì “lixiang” děng, gòngjì 10,000 duō ge cítiáo.

separating Pinyin syllables: PHP code

A few weeks ago I had someone write to ask if I had a script that can divide Pinyin texts into their individual syllables. It so happens that I do have something that does just that. Since I sent out that bit of code, I might as well make it available to everyone (GNU GPL, and links back to Pinyin.Info are always appreciated).

It has lots of regular expressions, to make the code nice and compact. I’ve added comments for clarity.

##############################
### SEPARATE THE SYLLABLES
##############################
// In the lines below, \s means any whitespace character
// This program assumes that ü is written as v
// The i at the end of each pattern means case insensitive
// \W is a single, non-word character (e.g., punctuation)
// $document is assumed to already hold the Pinyin text to be split

$search = array(
    "'([aeiouv])([^aeiounr\W\s])'i", // This line does most of the work
    "'(\w)([csz]h)'i", // double-consonant initials
    "'(n)([^aeiouvg\W\s])'i", // cleans up most n compounds
    "'([aeiuov])([^aeiou\W\s])([aeiuov])'i", // assumes correct Pinyin (i.e., no missing apostrophes)
    "'([aeiouv])(n)(g)([aeiouv])'i", // assumes correct Pinyin, i.e. changan = chan + gan
    "'([gr])([^aeiou\W\s])'i", // fixes -ng and -r finals not followed by vowels
    "'([^e\W\s])(r)'i", // r an initial, except in er
);

// Each replacement inserts a space at the syllable boundary found by the matching pattern
$replace = array(
    "\\1 \\2",
    "\\1 \\2",
    "\\1 \\2",
    "\\1 \\2\\3",
    "\\1\\2 \\3\\4",
    "\\1 \\2",
    "\\1 \\2",
);

// The patterns are applied in order, each one to the result of the previous one
$usertext = preg_replace($search, $replace, $document);

##############################

Since I’m always going on about the need for word parsing and not separating Pinyin into single syllables, some of you are probably wondering just why I of all people would have ever written such code. The answer is that it’s part of my Pinyin spell-checker, which is only a very basic utility in that it functions by checking for theoretically correct groups of syllables rather than real words (i.e., anything composed of correctly spelled groups of syllables, minus tone marks, will pass even if that word isn’t found in a dictionary).
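
To make the idea of checking for "theoretically correct" syllables concrete, here is a minimal sketch of how such a check could work on text that has already been split. It is not the spell-checker's actual code: the function name find_invalid_syllables and the tiny syllable list are hypothetical, chosen only for illustration, and a real checker would need the full inventory of toneless Pinyin syllables (with ü written as v).

```php
<?php
// Illustrative subset only (an assumption for this sketch); a real checker
// would need the complete list of toneless Pinyin syllables.
$validSyllables = array('han', 'yu', 'pin', 'yin', 'zhong', 'guo', 'shang', 'hai');

// Hypothetical helper: returns every syllable that is not in the list.
// Expects text already separated into syllables, without tone marks.
function find_invalid_syllables($separated, $valid)
{
    // Flip the list so that membership can be tested with isset()
    $lookup = array_flip($valid);

    // Break the lowercased text on anything that is not an ASCII letter
    $tokens = preg_split('/[^a-z]+/', strtolower($separated), -1, PREG_SPLIT_NO_EMPTY);

    $invalid = array();
    foreach ($tokens as $token) {
        if (!isset($lookup[$token])) {
            $invalid[] = $token;
        }
    }
    return $invalid;
}

// "hia" is a deliberate misspelling of "hai"
print_r(find_invalid_syllables('Han yu Pin yin, Shang hia', $validSyllables));
```

With this input only "hia" is reported. A misspelling that happens to form another well-formed syllable would still pass, which is the limitation described above.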

Suggestions for improvements are always welcome.

Ovid Tzeng reiterates backing for Hanyu Pinyin

Earlier this week Ovid Tzeng, a former minister of education and current minister without portfolio, reaffirmed his support for Taiwan adopting Hanyu Pinyin and said that this is an important issue the government will need to deal with sooner or later.

Zēng Zhìlǎng Jiàoyùbù zhǎng rènnèi, jiānchí cǎiyòng Hànyǔ Pīnyīn, shì tā bèi huàn xiàlái de zhǔyīn zhīyī. Tā zuótiān réng bù gǎi qí zhì, qiángdiào guówài bùguǎn Zhōngwén jiàoxué huò xuéshù qīkān, hěn duō yǐjing gǎiyòng Hànyǔ Pīnyīn, Táiwān bùnéng shìruòwúdǔ, zhè suī fēi xīn zhèngfǔ zuì yōuxiān shīzhèng xiàngmù, dànshì yě lièwéi wèilái zhòngdà jiǎntǎo shìxiàng.

Most of the source article for this discusses poet and academic Zheng Chouyu’s backing for Hanyu Pinyin. He stresses his view that this is a practical matter, not a political one.

source: Zhèng Chóuyǔ jiànyì: Zhōngwén yìyīn kěyǐ cǎi Hànyǔ Pīnyīn (鄭愁予建議:中文譯音 可採漢語拼音), United Daily News, June 9, 2009

further reading: Hanyu Pinyin backer to return to Taiwan’s Cabinet, Pinyin News, April 29, 2008

Burger King, romanization, and a Taiwanese morpheme

Last week I was at my neighborhood shopping mall and saw an interesting ad in the Burger King there. Toward the bottom we find the following:
[Image: ad text reading “買套餐, A大獎”]

買套餐, A大獎
(Mǎi tàocān, A dà jiǎng)
(Buy a set meal, score a big prize.)

Here, the Roman letter “A” is used to represent a Taiwanese verb that means something like “get in an easy manner” or “make off with” — though the fine print says that customers just have a chance to get a prize, not that they necessarily will win one.

A is often used in A-qián (“A錢”: to A money), a mixed Taiwanese and Mandarin term that means embezzle/embezzlement.

Perhaps the Ministry of Education has issued an official Chinese character for this morpheme. But even if it has, most people would have no idea how to read it, and it would probably be of spurious origin to boot, just like most of the other characters the ministry has issued. Where a Taiwanese morpheme sounds like the English name of a Roman letter, the romanized form is likely to prevail over the Chinese character.

There are other interesting things about this ad. But I’ll get to those in another post.

Mandarin-language papers in Indonesia

The Jakarta Post recently ran an article titled “Chinese papers attracting younger readers.” That would indeed be encouraging news, given the considerable damage in Indonesia to Sinitic languages and Chinese culture during their suppression there for most of the second half of the twentieth century.

But the specifics of the article are a little less encouraging.

Since controls against newspapers in Mandarin were eased in 2000, Jakarta has had up to six so-called Chinese newspapers. The number is now down to three, at least one of which is operating in the red.

Bambang Suryono (a.k.a. Lie Zuo Hui), editor in chief of the Yìnní Guójì Rìbào (Indonesian International Daily News, 印尼國際日報) said that Chinese newspapers in general did not get many young readers or contributors who could write well.

After the reformation era in 2001, Sino-Mandarin courses started to sprout in the city. However, Bambang said new Chinese language students were not yet apt to read Chinese newspapers.

“They can write simple articles about their daily activities, but not on complex subjects.”

For that matter, not many people on his own staff can write in Hanzi.

Out of his editorial staff of 20, there are only four journalists that can write in Chinese, and the youngest is 58 years old. “It’s so hard to find young people that can write in Chinese,” Bambang said.

Sunardi Mulia, the editor-in-chief of a business daily Indonesia Shang Bao, said most of his paper’s readers were above 55 years old.

“They are those who once went to local Chinese schools,” he said. “More and more of those aging readers are passing away.”

Sunardi’s paper, has few reporters. “I have to really roll up my sleeves to report large events,” he said. And about 90 percent of the paper’s editorial team are over 50….

“It’s true more young people have begun learning the language. However, it will take time for them to improve because their cultural surrounding is completely different than previous generations.”

“A lot of their parents cannot speak Mandarin or read Chinese anymore,” said Sunardi.

“So their opportunities to practice and communicate with the language are very limited. They probably need 15 years or so before they can read fluently,” said Sunardi, 59.

Sunardi said his paper’s present marketing target was mostly foreigners, specifically businesspeople from China and Taiwan. But he was optimistic about the future for his paper — again, though, not because of people born and bred in Indonesia but because of new readers arriving as immigrants from China.

No. of copies printed per day, according to the editors:

  • Guoji Ribao: 60,000 copies per day, about one-third of those distributed in Jakarta
  • Shang Bao: 10,000 copies per day, most distributed in Jakarta
  • Harian Indonesia Yinni Sin Chew Ribao (formerly government controlled): 55,000 per day

source: Chinese papers attracting younger readers, Jakarta Post, May 15, 2008 (alternate source)

online texts in Hanyu Pinyin

Chris recently wrote and asked for a list of texts in Pinyin. This site, of course, has at least a few things in Pinyin. Unfortunately, however, they can be a bit difficult to find. So having a list is indeed a good idea.

Here are some readings in Hanyu Pinyin:

Some song lyrics

I should probably figure out a way to incorporate this into the recommended readings section. One of the problems with this site is that it has grown much, much larger than I ever expected, which has resulted in some pages not fitting well within the structure I initially established for Pinyin.Info. Over the years various additional readings have been added to the site, a few of which are even in Hanyu Pinyin. But since Pinyin.Info’s recommended readings section is set up for books rather than essays, songs, etc., this will involve a rethinking of that page.

I very much hope people can help expand the list by providing links to readings elsewhere in Pinyin. But before listing something in the comments, please make sure it is in real Hanyu Pinyin (e.g., with word parsing instead of bro ken syl la bles, with tone marks instead of tone numbers, and with proper capitalization and punctuation). Alas, most texts that are supposedly in Pinyin do not follow those rules.

Unicode tops other encodings on Web pages: Google

Google is reporting that in December 2007 Unicode became the most frequently used encoding on Web pages.

Just last December there was an interesting milestone on the web. For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings—and by coincidence, within 10 days of one another. What’s more impressive than simply overtaking them is the speed with which this happened.

Here’s Google’s graph:
[Graph: percentage of Web pages in various encodings, 2001-2008. ASCII starts at about 56% in 2001 and declines to about 25% in 2008, roughly level with ISO-8859-1 and UTF-8.]

I wish Big5, the encoding most used for Web pages in traditional Chinese characters, had been included in the graph. And I suspect that it’s only within the past ten years — perhaps even within the timeframe of the graph — that more Web pages have been encoded in GB (used for so-called simplified Chinese characters) than Big5. (GB is shown on the graph in green.)

Of course, many (most?) Web pages don’t declare any character encoding. This is especially bad when they contain characters beyond the bounds of ASCII, since those characters will often end up rendered as garbage on systems different than that of the creator of the Web page.
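
A small sketch shows how this kind of garbage arises. It takes a UTF-8 string and deliberately misreads its bytes as ISO-8859-1, which is roughly what a browser does when it guesses the wrong encoding for an undeclared page. The variable names are hypothetical, and the example assumes PHP 7 or later (for the \u{...} escapes) with the iconv extension available.

```php
<?php
// "Pīnyīn", written with escapes so the example does not depend on
// the encoding in which this file itself is saved
$utf8 = "P\u{012B}ny\u{012B}n";

// Treat the UTF-8 bytes as if they were ISO-8859-1, then convert the
// result to UTF-8 so that the misreading can be displayed
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo $utf8, "\n";    // Pīnyīn
echo $garbled, "\n"; // PÄ«nyÄ«n
```

Each ī is stored as two bytes in UTF-8, and each of those bytes is read as a separate Western European character. Declaring the encoding, for example by sending a Content-Type header that includes charset=utf-8, tells the browser which reading is intended.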

So … should I have a post focusing on Unicode without again berating the Unicode Consortium for its continuing unscientific, egregious, and unforgivable use of “ideographic”? I don’t think so.

source: Moving to Unicode 5.1, Official Google Blog, May 5, 2008