separating Pinyin syllables: PHP code

A few weeks ago someone wrote to ask whether I had a script that divides Pinyin texts into their individual syllables. It so happens that I do. Since I’ve already sent out that bit of code, I might as well make it available to everyone (GNU GPL; links back to Pinyin.Info are always appreciated).

It relies heavily on regular expressions, which keeps the code nice and compact. I’ve added comments for clarity. The text to be split goes in $document, and the result ends up in $usertext.

##############################
### SEPARATE THE SYLLABLES
##############################
// In the lines below, \s means space
// This program assumes that ü is written as v
// The i at the end of a line means case insensitive
// \W is a single, non-word character (e.g., punctuation)


$search = array(
    "'([aeiouv])([^aeiounr\W\s])'i",          // This line does most of the work
    "'(\w)([csz]h)'i",                        // double-consonant initials
    "'(n)([^aeiouvg\W\s])'i",                 // cleans up most n compounds
    "'([aeiuov])([^aeiou\W\s])([aeiuov])'i",  // assumes correct Pinyin (i.e., no missing apostrophes)
    "'([aeiouv])(n)(g)([aeiouv])'i",          // assumes correct Pinyin, i.e. changan = chan + gan
    "'([gr])([^aeiou\W\s])'i",                // fixes -ng and -r finals not followed by vowels
    "'([^e\W\s])(r)'i",                       // r an initial, except in er
);
$replace = array(
    "\\1 \\2",
    "\\1 \\2",
    "\\1 \\2",
    "\\1 \\2\\3",
    "\\1\\2 \\3\\4",
    "\\1 \\2",
    "\\1 \\2",
);
$usertext = preg_replace($search, $replace, $document);

##############################
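
For anyone who wants to try the snippet quickly, here is a minimal usage sketch. It wraps the same patterns and replacements, copied unchanged, in a function so they can be called on any string. The function name separate_syllables is my own placeholder and is not part of the original script, and the sample words are just simple toneless test cases.

```php
<?php
// Hypothetical wrapper: the name separate_syllables is an assumption,
// not something from the original script. The patterns and replacements
// are copied as-is; as before, ü is assumed to be written as v.
function separate_syllables(string $document): string
{
    $search = array(
        "'([aeiouv])([^aeiounr\W\s])'i",
        "'(\w)([csz]h)'i",
        "'(n)([^aeiouvg\W\s])'i",
        "'([aeiuov])([^aeiou\W\s])([aeiuov])'i",
        "'([aeiouv])(n)(g)([aeiouv])'i",
        "'([gr])([^aeiou\W\s])'i",
        "'([^e\W\s])(r)'i",
    );
    $replace = array(
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2",
        "\\1 \\2\\3",
        "\\1\\2 \\3\\4",
        "\\1 \\2",
        "\\1 \\2",
    );
    // preg_replace applies each pattern in turn to the result of the previous one
    return preg_replace($search, $replace, $document);
}

echo separate_syllables('pinyin'), "\n";    // pin yin
echo separate_syllables('zhongguo'), "\n";  // zhong guo
echo separate_syllables('beijing'), "\n";   // bei jing
```

Because the replacements only ever insert a space between captured groups, stripping the spaces from the output should always give back the original text.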

Since I’m always going on about the need for word parsing, as opposed to separating Pinyin into single syllables, some of you are probably wondering why I of all people would ever have written such code. The answer is that it’s part of my Pinyin spell-checker. That is only a very basic utility: it checks for theoretically correct groups of syllables rather than real words, so anything composed of correctly spelled syllables, minus tone marks, will pass even if that word isn’t found in a dictionary.
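
To illustrate the idea of checking syllables rather than words, here is a rough sketch. It is not the actual spell-checker. It assumes the text has already been split into space-separated, toneless syllables, and it uses a deliberately tiny sample list in place of the full inventory of valid Pinyin syllables. The names $validSyllables and find_invalid_syllables are hypothetical.

```php
<?php
// Hypothetical, very incomplete sample of valid toneless syllables.
// A real checker would need the complete Pinyin syllable inventory.
$validSyllables = array_flip(array('bei', 'jing', 'pin', 'yin', 'zhong', 'guo'));

// Return every syllable in $separated that is not a key of $valid.
// $separated is assumed to be already split, e.g. "pin yin".
function find_invalid_syllables(string $separated, array $valid): array
{
    $syllables = preg_split('/\s+/', strtolower(trim($separated)), -1, PREG_SPLIT_NO_EMPTY);
    $invalid = array();
    foreach ($syllables as $syllable) {
        if (!isset($valid[$syllable])) {
            $invalid[] = $syllable;
        }
    }
    return $invalid;
}

print_r(find_invalid_syllables('pin yin', $validSyllables));   // nothing flagged
print_r(find_invalid_syllables('pin yinq', $validSyllables));  // flags "yinq"
```

Note that a check like this would accept any sequence of valid syllables, such as "guo pin", whether or not it is a real word, which is exactly the limitation described above.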

Suggestions for improvements are always welcome.

5 thoughts on “separating Pinyin syllables: PHP code”

Pingback: Pinyin news » convert Chinese characters to Unicode character references: javascript
davis on Monday, October 26, 2015 at 6:54 pm said:

Thanks for sharing this. Seems it works only with pinyin with tones (fēicháng rè), if user is using pinyin with numbers (pin1yin1), this script won’t work
Piero on Saturday, June 4, 2016 at 3:00 am said:

I made a javascript version based on this code

https://gist.github.com/pierophp/bb84754e5de43b4f406aaf0bda8a007e

Thanks
Pinyin Info on Friday, June 10, 2016 at 1:51 pm said:

Nice!
gautam on Wednesday, November 10, 2021 at 5:53 pm said:

thnk man i m looking for this….

Pinyin News

news and discussions mainly related to Chinese characters and romanization
