separating Pinyin syllables: PHP code

A few weeks ago someone wrote to ask whether I had a script that could divide Pinyin texts into their individual syllables. As it happens, I do. Since I’ve already sent out that bit of code, I might as well make it available to everyone (GNU GPL; links back to Pinyin.Info are always appreciated).

It relies heavily on regular expressions, which keeps the code nice and compact. I’ve added comments for clarity.

##############################
### SEPARATE THE SYLLABLES
##############################
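// In the patterns below, \s matches a single whitespace character
// This program assumes that ü is written as v
// The i after each pattern's closing delimiter makes the match case-insensitive
// \w is a single word character (letter, digit, or underscore)
// \W is a single non-word character (e.g., punctuation)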

$search = array ("'([aeiouv])([^aeiounr\W\s])'i", // This line does most of the work
"'(\w)([csz]h)'i", // double-consonant initials
"'(n)([^aeiouvg\W\s])'i", // cleans up most n compounds
"'([aeiuov])([^aeiou\W\s])([aeiuov])'i", // assumes correct Pinyin (i.e., no missing apostrophes)
"'([aeiouv])(n)(g)([aeiouv])'i", // assumes correct Pinyin, i.e. changan = chan + gan
"'([gr])([^aeiou\W\s])'i", // fixes -ng and -r finals not followed by vowels
"'([^e\W\s])(r)'i", // r an initial, except in er
);

$replace = array ("\\1 \\2",
"\\1 \\2",
"\\1 \\2",
"\\1 \\2\\3",
"\\1\\2 \\3\\4",
"\\1 \\2",
"\\1 \\2",
);

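// $document holds the Pinyin text to be split.
// The patterns are applied in order, each to the output of the one before it.
$usertext = preg_replace($search, $replace, $document);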

##############################
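
To try the patterns out, they can be wrapped in a function and run on a few words. The following is only a sketch: the function name `separate_syllables` is my own invention for illustration and is not part of the original script, but the patterns and replacements are copied unchanged from the code above.

```php
<?php
// Hypothetical wrapper (the name is an assumption, not part of the original script).
// The patterns and replacements are the same as in the listing above.
function separate_syllables($document) {
    $search = array(
        "'([aeiouv])([^aeiounr\W\s])'i",
        "'(\w)([csz]h)'i",
        "'(n)([^aeiouvg\W\s])'i",
        "'([aeiuov])([^aeiou\W\s])([aeiuov])'i",
        "'([aeiouv])(n)(g)([aeiouv])'i",
        "'([gr])([^aeiou\W\s])'i",
        "'([^e\W\s])(r)'i",
    );
    $replace = array(
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2\\3",
        "\\1\\2 \\3\\4",
        "\\1 \\2",
        "\\1 \\2",
    );
    // preg_replace runs the pairs in array order on the running result
    return preg_replace($search, $replace, $document);
}

echo separate_syllables("zhongguo"), "\n"; // zhong guo
echo separate_syllables("beijing"), "\n";  // bei jing
echo separate_syllables("pinyin"), "\n";   // pin yin
echo separate_syllables("changan"), "\n";  // chan gan (no apostrophe, so not chang + an)
```

The last example shows why the comments insist on correct Pinyin: without the apostrophe of Chang’an, the only reading the patterns can produce is chan + gan.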

Since I’m always going on about the need for word parsing and not separating Pinyin into single syllables, some of you are probably wondering why I, of all people, would ever have written such code. The answer is that it’s part of my Pinyin spell-checker. That is only a very basic utility: it checks for theoretically correct groups of syllables rather than real words, so anything composed of correctly spelled syllables (tone marks aside) will pass, even if the resulting word isn’t found in a dictionary.
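
For the curious, the kind of check described above might look something like the sketch below. This is not the spell-checker’s actual code. The function name `invalid_syllables` and the tiny `$valid_syllables` list are assumptions made up for illustration; a real checker would need the full inventory of valid toneless syllables.

```php
<?php
// Hypothetical, deliberately tiny whitelist for illustration only.
// A real checker would list every valid toneless Pinyin syllable.
$valid_syllables = array('zhong', 'guo', 'bei', 'jing', 'pin', 'yin');

// Hypothetical helper: returns the syllables in already-separated text
// that do not appear in the whitelist.
function invalid_syllables($separated, $valid) {
    // Split on any run of non-letters (spaces, punctuation, apostrophes)
    $syllables = preg_split('/[^a-z]+/i', $separated, -1, PREG_SPLIT_NO_EMPTY);
    $bad = array();
    foreach ($syllables as $syllable) {
        // Compare in lowercase so capitalized words are not flagged
        if (!in_array(strtolower($syllable), $valid, true)) {
            $bad[] = $syllable;
        }
    }
    return $bad;
}

print_r(invalid_syllables('Bei jing pinn yin', $valid_syllables)); // flags only "pinn"
```

Note that a check like this says nothing about whether the syllables form a real word, which is exactly the limitation mentioned above.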

Suggestions for improvements are always welcome.
