How to add tone marks to Pinyin automatically, sort of

Pinyin text without and with tone marks

There are plenty of ways to type Hanyu Pinyin with tone marks. These usually involve typing the tone number after the vowel in question or entering a series of special keystrokes to produce the tone mark.

But some consider that too much mafan, or perhaps are unsure of which tones are correct. (Heads up, students learning Mandarin! This post will be useful.) So occasionally I’m asked this question:

Is there a way to type in Hanyu Pinyin and have the correct tone marks appear automatically — even without typing tone numbers or pressing additional keys? Oh, and for free too, please.

The answer is a qualified yes.

Google Translate’s Pinyin function has come a long way since its inauspicious beginning about eight years ago. For quite some time it has even offered a way to add tone marks automatically, though few people know of this function, which could still use a great deal of improvement.

To get Google Translate to produce Pinyin with tone marks as you enter text in toneless Pinyin, first you need to set the system to translate from “Chinese” to “Chinese (Traditional)” or from “Chinese” to “Chinese (Simplified)”.

Enter your text in the box and Pinyin with tone marks will appear below the box on the right.

Alas, there are some problems with the system.

A lot of perfectly normal things that are essential to proper writing in Hanyu Pinyin will cause Google Translate to break. So when adding your text, do not use any of the following:

  • capital letters
  • the letter ü (use “v” instead)
  • more than 160 characters (including spaces and punctuation) at a time

Up to 160 characters is fine

Image showing how Google Translate will produce Hanyu Pinyin with tone marks for texts of up to 160 characters

But more than 160 characters will break the function that adds tone marks to Pinyin

The following are optional in terms of getting Google Translate to give you good results, though they are not optional in properly written Pinyin:

  • apostrophes
  • spaces
  • punctuation
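If you'd rather not tidy up your text by hand before pasting it in, the restrictions above can be sketched in a few lines of Python. This is only a rough helper of my own, not anything Google provides: the function name `prepare_toneless_pinyin` is made up, and the 160-character limit is simply the one observed above.

```python
import textwrap
import unicodedata

MAX_CHARS = 160  # limit observed above; spaces and punctuation count


def prepare_toneless_pinyin(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Lowercase the text, swap ü for v, and split it into chunks of at most `limit` characters."""
    # NFC first, so a "u" followed by a combining diaeresis becomes a single "ü".
    cleaned = unicodedata.normalize("NFC", text).lower().replace("ü", "v")
    # Break only at whitespace where possible, so words stay intact.
    return textwrap.wrap(cleaned, width=limit, break_on_hyphens=False)


for chunk in prepare_toneless_pinyin("Lüshi zai Beijing gongzuo."):
    print(chunk)
```

Each chunk can then be pasted into the box separately. Since the capital letters are thrown away here, they have to be restored by hand in the output.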

A second significant problem is that the system doesn’t deal well with proper nouns: it fails at both word parsing and capitalization, though it at least seems to recognize that proper nouns are units.

Sample showing how Google Translate fails to capitalize and parse Tian'anmen and Mao Zedong, producing tian'anmen and maozedong instead.

So although Google Translate won’t handle everything for you, it can nevertheless be a useful tool for including tone marks in Hanyu Pinyin.

Languages, scripts, and signs: a walk around Taipei’s Shixin University

Recently I took some trails through the mountains in Taipei and ended up at Shih Hsin University (Shìxīn Dàxué / 世新大學). Near the school are some interesting signs. Rather than giving individual posts for each of these, I’m keeping the signs together in this one, as this is better testimony to the increasing and often playful diversity of languages and scripts in Taiwan.

Cǎo Chuàn

Here’s a restaurant whose name is given in Pinyin with tone marks! That’s quite a rarity here, though I suspect we’ll be seeing more of this in the future. The name in Chinese characters (草串) can be found, much smaller, on a separate sign below.

二哥の牛肉麵

Right by Cao Chuan is Èrgē de Niúròumiàn (Second Brother’s Beef Noodle Soup). Note the use of the Japanese の rather than Mandarin’s 的; this is quite common in Taiwan.

芭樂ㄟ店

This store has an ㄟ, which serves as a marker of the Taiwanese language. Here, ㄟ is the equivalent of 的 — and of の.

Bālè ei diàn

A’Woo Tea Bar

I couldn’t find a name in Chinese characters for this place. The name is probably onomatopoeia, as in “Werewolves of London — awoo!”

Xin Tang 10

I’ve just added to Pinyin.info the tenth and final issue (December 1989) of the seminal journal Xin Tang. I strongly encourage everyone to take a look at it and some of the other issues. Copies of this journal are extremely rare, but their importance is such that I’ll be putting all of them online here over the years.

cover of Xin Tang no. 10

Although I’m giving the table of contents in English, the articles themselves are in Mandarin and written in Pinyin.

  • FEATURE ARTICLES
    • ZHOU YOUGUANG: The Next Step of Language Modernization
    • CHEN ENQUAN: Experiments Should Be Carried Out on the Phoneticization of Chinese Characters
    • LI YUAN: Romanized Chinese Must Be Finalized
    • LI PING: To Be a Promoter of Script Reform
    • ZHENG LINXI: Wu Yuzhang and Chinese Phonetic Spelling
    • ZHANG LIQING: How Should the Tones of Chinese Spelling Be Indicated?
  • LITERATURE
    • LIQING: Elephants
    • CHEN XUANYOU (Tang Period): The Wandering Soul
    • WU JINGZI (Qing Period): Third Daughter Wang
    • LU XUN: On the Collapse of Thunder Peak Pagoda
    • RUI LUOBIN: The Adventures of Chunmei and Mimi
    • COMIC DIALOGUES: Toad Drums
    • WEI YIJIN: Dreams at Twenty
    • DIAO KE: In Praise of the Spirit of Bees
    • GE XIAOLING: A Song to the Disabled Children
    • YBY: The Story of the Magic Square
  • SHORT SKETCHES
    • DIAN EWEN: Interesting Tidbits about Script Reform Abroad
    • LI YUAN: A Few Statistics on Tone Notations in Romanized Chinese
  • LEARNING MANDARIN
    • Asking the Way
  • FROM THE EDITORS
    • Farewell to Our Readers

Pinyin sort order

The standard for alphabetically sorting Hanyu Pinyin is given in the ABC dictionary series edited by John DeFrancis and issued by the University of Hawaii Press.

Here’s the basic idea:

The ordering is primarily simply alphabetical. Diacritical marks, punctuation, juncture and capitalization are only taken into account when the strings being compared are otherwise identical. For example, píng’ān sorts before pīnyīn, because pingan sorts before pinyin, because g precedes y alphabetically.

Only when two strings are alphabetically identical is non-alphabetical information taken into account.
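To make that first level concrete, here is a small Python sketch of the "alphabetical only" comparison. It is my own illustration rather than anything taken from the ABC dictionaries, and the name `primary_key` is made up: it drops tone marks, apostrophes, spaces, and capitalization, leaving just the letters.

```python
import unicodedata


def primary_key(pinyin: str) -> str:
    """Return only the base letters, lowercased: the first-level sort key."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # isalpha() is False for combining marks, apostrophes, spaces, and hyphens.
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


print(primary_key("píng’ān"))  # pingan
print(sorted(["pīnyīn", "píng’ān"], key=primary_key))  # ['píng’ān', 'pīnyīn']
```

Note that this key also folds ü into u, so it settles the order only when the letters differ. Everything else is a tie-breaker.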

The series’ Reader’s Guide presents the specifics of the sort order. Since I don’t have to worry about how much space this takes up on my site, I have reformatted the information slightly to give the examples as numbered lists.

Head entry transcriptions with the same sequence of letters are ordered first strictly by letter sequence regardless of tones, then by initial syllable tone in the sequence 0 1 2 3 4. For entries with the same initial tone, arrangement is by the tone of the second syllable, again in the order 0 1 2 3 4. For example:

  1. shīshi
  2. shīshī
  3. shīshí
  4. shīshǐ
  5. shīshì
  6. shíshī
  7. shíshì
  8. shǐshī
  9. shìshī

Irrespective of tones, entries with the vowel u precede those with ü. For example:

Entries without apostrophe precede those with apostrophe. For example:

  1. biàn (argue)
  2. bǐ’àn (the other shore)

Lower-case entries precede upper-case entries. For example:

  1. hòujìn (aftereffect)
  2. Hòu Jìn (Later Jin dynasty)

For entries with identical spelling, including tones, arrangement is by order of frequency….

For most users, the most important thing to note is that the neutral tone is regarded as 0, not as 5. Thus, the order is not “ā á ǎ à a,” but “a ā á ǎ à.” And, because lowercase comes before uppercase, not “A a Ā ā Á á Ǎ ǎ À à” but “a A ā Ā á Á ǎ Ǎ à À.”
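For anyone who wants to sort a word list this way, the rules above can be approximated with a multi-level sort key. What follows is a minimal Python sketch under my own assumptions, not the ABC editors' actual procedure. The name `sort_key` is made up. Placing the u/ü test ahead of the apostrophe test is a guess, since the rules don't say which of the two wins. Spaces and hyphens are simply ignored, which doesn't capture the finer points of how ABC treats them. Ordering by frequency is left out entirely.

```python
import unicodedata

# Combining marks for tones 1-4 (macron, acute, caron, grave); no mark means tone 0.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030C": 3, "\u0300": 4}
DIAERESIS = "\u0308"
APOSTROPHES = {"'", "\u2019"}


def sort_key(entry: str):
    """Build a key: letters, then u/ü, then apostrophes, then tones, then case."""
    letters, umlauts, apostrophes, tones, cases = [], [], [], [], []
    pending_apostrophe = 0
    for ch in unicodedata.normalize("NFD", entry):
        if ch.isalpha():
            letters.append(ch.lower())
            umlauts.append(0)
            apostrophes.append(pending_apostrophe)  # 1 if an apostrophe precedes this letter
            tones.append(0)
            cases.append(1 if ch.isupper() else 0)
            pending_apostrophe = 0
        elif ch in APOSTROPHES:
            pending_apostrophe = 1
        elif letters and ch in TONE_MARKS:
            tones[-1] = TONE_MARKS[ch]  # the mark belongs to the letter just seen
        elif letters and ch == DIAERESIS:
            umlauts[-1] = 1
        # Spaces, hyphens, and other punctuation are ignored in this sketch.
    return ("".join(letters), umlauts, apostrophes, tones, cases)


print(sorted(["à", "Ā", "a", "ā", "A"], key=sort_key))  # ['a', 'A', 'ā', 'Ā', 'à']
```

Recording the tone per letter rather than per syllable avoids having to split words into syllables. Since the position of a tone mark within a syllable is fixed by the spelling, the comparison should come out the same for entries with identical letters.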

One can see this in action in the A entries for the ABC English-Chinese, Chinese-English Dictionary. And here are some sample pages from an earlier ABC dictionary.

The ABC series follows the example of the Hanyu Pinyin Cihui (汉语拼音词汇 / 漢語拼音詞彙 / Hànyǔ Pīnyīn Cíhuì) (example), with only one minor difference, as noted by Tom Bishop:

    HPC [Hanyu Pinyin Cihui] gave hyphens and spaces the same priority as apostrophes, so that lìgōng sorted before lǐ-gōng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting lǐ-gōng before lìgōng. In ABC, hyphens and spaces don’t affect the sort order unless they change the pronunciation in the same way that apostrophe would; for example, ¹míng-àn 明暗 and ²míng’àn 冥暗 are treated as homophones, and they sort after mǐngǎn 敏感.

Pinyin font: the Brill

Some of the Pinyin-friendly font families I provide examples of on this blog are fun but not exactly the sort of thing you’d want to use in a book or other serious project. Others, though, are solid examples of the subtle and exacting art of type design. Today’s entry belongs in the latter group.

Brill, a Leiden-based publisher of work in the humanities, social sciences, law, and science, has released “the Brill,” a new font family designed to support the Latin and Greek scripts “to the fullest extent possible.” IPA and the Slavic parts of the Cyrillic range are also covered. This can handle the needs of just about any romanized script, including Hanyu Pinyin.

As someone with Brill explained to me:

    Instead of limiting the fonts’ character set to known characters and character-plus-diacritic combinations, we chose a dynamic model in which, using OpenType GPOS features, any base character can carry any diacritic above or below it, and in which diacritics can be stacked as well—not forgetting all the precomposed characters that are already present in the Unicode Standard, of course. Finally, a huge assortment of punctuation marks, editorial marks, and other symbols known to occur in Brill publications were added to the spec.

In total, the Brill contains more than 5,100 characters. And that already immense range can be extended through combining diacritics, as noted above.
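Why would combining diacritics matter for Pinyin? A quick Python check, which is my own illustration and concerns Unicode text rather than the Brill’s internals, shows that some tone-marked letters exist as single precomposed characters while others can only be written as a base letter plus a combining mark. The helper names `compose` and `is_precomposed` are made up for this sketch.

```python
import unicodedata

MACRON, ACUTE, CARON, GRAVE = "\u0304", "\u0301", "\u030C", "\u0300"


def compose(base: str, mark: str) -> str:
    """Attach a combining mark to a base letter and apply NFC normalization."""
    return unicodedata.normalize("NFC", base + mark)


def is_precomposed(base: str, mark: str) -> bool:
    """True if Unicode encodes the combination as a single code point."""
    return len(compose(base, mark)) == 1


for base, mark in [("a", CARON), ("ê", ACUTE), ("ê", MACRON), ("ê", CARON)]:
    print(compose(base, mark), is_precomposed(base, mark))
```

The first two come out as single characters. The last two (ê with first and third tone) do not, so a font has to position the mark on the fly to display them properly.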

Even better, the Brill is free for non-commercial use. You can download it after agreeing to the End User License Agreement. (See the bottom of that page and then the bottom of the page that follows.)

The Brill is available now in roman and italic styles. Bold and bold italic versions will be released later this year, probably before July.

The Brill is considerably different from Brill Online, which has been available for some time and was aimed at helping users of Brill’s online reference works. Brill Online is based on v. 1.00 of the Gentium family of fonts. The glyph set was extended to support some very rare characters, such as Aegean numbers. “In essence it became a hybrid Latin-Greek-Cyrillic-IPA and ‘pi’ font family.”

Thanks to Lin Ai of Zhongweb.net for the heads up that this had been released, and to Dominique de Roo and Pim Rietbroek of Brill for patiently helping me with my questions.