convert Chinese characters to Unicode character references: javascript

I’ve had a spate of requests recently for the code behind Pinyin.info’s tool that converts Chinese characters to Unicode numeric character references (i.e., something that converts, say, “漢語拼音” into “&#28450;&#35486;&#25340;&#38899;”). Since I’m a believer in open-source work, and since people could find the code anyway if they looked carefully enough in the Web page’s source, I might as well publish it.

This tool can be very handy when making Web pages that use a variety of scripts. (It works on Cyrillic, etc., as well.) I often employ it myself.

Here’s the heart of the code:

function convertToEntities() {
  var tstr = document.form.unicode.value;
  var bstr = '';
  for (var i = 0; i < tstr.length; i++) {
    if (tstr.charCodeAt(i) > 127) {
      // non-ASCII: write a numeric character reference
      bstr += '&#' + tstr.charCodeAt(i) + ';';
    } else {
      bstr += tstr.charAt(i);
    }
  }
  document.form.entity.value = bstr;
}

This sleek little bit of JavaScript is originally by Steve Minutillo and is used here on Pinyin.info with his permission. I may have tweaked the code a little myself, but that was so long ago that I don’t remember well. (I’ve had the converter here for about five years.) Anyway, if you use this, please acknowledge Steve’s authorship; and of course I always greatly appreciate links back to Pinyin.info.
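
One caveat worth noting: charCodeAt() reports UTF-16 code units, so a character outside the Basic Multilingual Plane (some of the rarer Chinese characters, for instance) would come out as two surrogate-half references rather than one valid reference. Below is a minimal sketch of a variant that works on whole code points instead. It is not Steve's code; the name toNumericReferences is made up for illustration, and it assumes a JavaScript engine with ES2015 support (for...of and codePointAt).

```javascript
// Hypothetical variant of the converter above (not the original code).
// Iterating with for...of walks the string by code point, so characters
// above U+FFFF are emitted as a single reference.
function toNumericReferences(text) {
  var out = '';
  for (var ch of text) {
    var cp = ch.codePointAt(0);
    // Leave plain ASCII alone; everything else becomes &#NNNN;
    out += cp > 127 ? '&#' + cp + ';' : ch;
  }
  return out;
}

console.log(toNumericReferences('Hanyu Pinyin: 漢語拼音'));
// Hanyu Pinyin: &#28450;&#35486;&#25340;&#38899;
```

Like the comment version further down, this takes a string and returns a string, so it can be used without the form fields.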

If anyone knows how to do the same thing in PHP, preferably with no more code than is used above, please let me know.

7 thoughts on “convert Chinese characters to Unicode character references: javascript”

Nice, but you have a typo — the “for” line got truncated somehow. I don’t know what HTML I’m allowed to use here in this comments section, so, apologies if this gets mangled, but here’s a version that takes a string argument and returns a string:
function convertToEntities(tstr) {
  var bstr = '';
  for (var i = 0; i < tstr.length; i++) {
    if (tstr.charCodeAt(i) > 127)
    {
      bstr += '&#' + tstr.charCodeAt(i) + ';';
    }
    else
    {
      bstr += tstr.charAt(i);
    }
  }
  return bstr;
}

Oh, I see why it gets truncated: your form handler is interpreting the less-than sign as the start of a tag. Also, my <pre> tag didn’t work. Sigh, now I’ve irreparably messed up your comments section! Try this:

Yikes! Thanks, Klortho.

I think I’ve got all of WordPress’s helpful emendations fixed now. But just in case, people can grab the same code source that my converter calls.
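
For anyone else pasting code into a comment form like this one, the usual workaround is to escape the characters that HTML treats as markup before posting. Here is a minimal sketch, assuming the form passes character references through untouched (which may not hold for every blog setup); escapeForHtml is a name invented here.

```javascript
// Hypothetical helper: escape the characters that HTML treats as markup
// so that code survives being pasted into an HTML context.
function escapeForHtml(code) {
  return code
    .replace(/&/g, '&amp;')   // must come first, or later escapes get re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeForHtml('for (i = 0; i < tstr.length; i++)'));
// for (i = 0; i &lt; tstr.length; i++)
```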

Hello!
Sorry to be so late to add my 2 cents. I’ll summarize my experience of dealing with Unicode/HTML entities, UTF-8, MySQL, and PHP.

First, you should all know that I know little about website programming and that it’s more a hobby for me than anything else.

I wanted to write two PHP files to help me study two languages: Japanese and Russian. The first PHP file would have a form for entering data into a database (Russian-Spanish vocabulary, for example), and the second would show the contents of the database, filtered by lesson. As I said, it’s a study tool for me.

So I wrote this piece of code:
”
$russian=$HTTP_POST_VARS['russian'];
$lesson=$HTTP_POST_VARS['lesson'];
$spanish=$HTTP_POST_VARS['spanish'];

$query="insert into vocabrussiano (russian,lesson) values ('".$russian."','".$lesson."')";
[….]
”
It worked, but although I had declared the MySQL table as Unicode, when I used the second PHP file to retrieve the data I got question marks (???) instead of the Russian characters, even though I was using UTF-8 as the character encoding.

Because I don’t need to run searches on the database, I thought about storing the HTML entity equivalents of all the Japanese and Russian characters, and that is when I thought of Mark’s website and JavaScript.

But copying and pasting every word I wanted to write was not practical, so I thought about integrating the JavaScript into my form. But since I don’t like JavaScript, I kept looking for a solution in PHP.

And I came across a couple of interesting functions: ord() and htmlentities(). ord() gives you the ASCII code of a character, but some people have coded their own Unicode versions, and the most useful seems to be this one: http://hsivonen.iki.fi/php-utf8/ I tried that one, but I couldn’t make it work; I was getting weird codes and messages. So I tried something different: htmlentities(). This function should transform special characters into HTML entities, and it shouldn’t work with Japanese or Russian characters, but I decided to give it a try anyway… And it worked:

$russian=$HTTP_POST_VARS['russian'];
$russian2=htmlentities($russian);

It converted the Russian characters, typed with the Russian keyboard layout in Windows, to their HTML entities. The problem I then came across is that when I tried to insert this second variable into MySQL, it would turn the Ӓ to Ӓ. Thus, when I retrieved the data from the database, it displayed incorrectly in the browser.

At this point I changed computers (I’m carrying the application on a portable USB drive, with Uniform Server), and guess what happened? Somehow, with the same configuration on both computers, the same browser, and the same PHP files, the value I was retrieving from the POST variables had already been transformed into HTML entities. It was not Russian text anymore. I am puzzled. But my simple code works now: I can insert the variable directly into the database and everything works.

So, in summary, I am still not sure how any of this works. I will have to dedicate more time (which I don’t have) to reading about encodings and PHP and charsets and blah blah blah. My next step, though, would be to try to use the JavaScript code you sent me somehow, maybe by triggering the JavaScript after the word is written in Russian so that it’s transformed into HTML entities before the form is even sent… Hmm, gotta investigate that.
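
The idea at the end of the paragraph above, converting the text before the form is even sent, could look roughly like the sketch below. Everything named here is an assumption for illustration: the function names, the field names, and the form id 'vocab-form' in the commented-out wiring do not come from JMA's pages.

```javascript
// Same conversion rule as the converter in the post: non-ASCII -> &#NNNN;
function toEntities(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    out += code > 127 ? '&#' + code + ';' : text.charAt(i);
  }
  return out;
}

// Hypothetical helper: return a copy of the field values with every
// value converted, leaving the original object untouched.
function encodeFields(fields) {
  var encoded = {};
  Object.keys(fields).forEach(function (name) {
    encoded[name] = toEntities(String(fields[name]));
  });
  return encoded;
}

console.log(encodeFields({ russian: 'да', lesson: '3' }));
// { russian: '&#1076;&#1072;', lesson: '3' }

// In a browser, the wiring might look like this (the form id is made up):
// document.getElementById('vocab-form').addEventListener('submit', function () {
//   this.russian.value = toEntities(this.russian.value);
// });
```

Doing the conversion in the page means the server only ever receives ASCII, which sidesteps the database encoding question rather than answering it.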

Well, I hope my odyssey is somehow useful to you.

Regards,
JMA

This script has been a great help for my project, and I also need the reverse: converting Unicode character references back into Chinese characters. Do you have a script for that? If so, please forward it to my email address.

Regards,
Hapi

Hey Pattanaik

Try to use this function to do the reverse process:

html_entity_decode($yourString, ENT_NOQUOTES, 'UTF-8');

It worked for me [;)] !

Hi,

I tried using html_entity_decode($yourString, ENT_NOQUOTES, 'UTF-8'); as posted by @Thomas, but it is not working for me. Maybe there is some function written like html_entity_decode() that I am missing, which is why it gives the error “html_entity_decode is not a function”.

I have written some code for the reverse of this conversion. There may be room for enhancements in the code. Please let me know if you find any.

function convertToEntities(unicode) {
  var tstr = unicode;
  var finallyy = "";
  var array = tstr.split(";");
  for (var i = 0; i < array.length; i++) {
    var mydata = array[i].replace("&#", "");
    if (mydata.match(/\s/g)) {
      // the chunk contained whitespace, so put a space before the character
      finallyy = finallyy + " " + String.fromCharCode(mydata);
    } else {
      finallyy = finallyy + String.fromCharCode(mydata);
    }
  }
  return finallyy;
}
It also handles spaces.
But this code doesn’t work for mixed text, such as English mixed with the character references: it drops the English words and returns only the Chinese characters.

Please let me know if any of you finds a way around this.
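
One possible way around the mixed-text problem is to replace only the substrings that look like numeric character references and leave everything else alone. Here is a minimal sketch, with the function name decodeNumericReferences invented for illustration; it assumes ES2015's String.fromCodePoint is available and handles only numeric references (decimal and lowercase-x hexadecimal), not named entities.

```javascript
// Hypothetical decoder: turns &#NNNN; and &#xHHHH; back into characters
// while leaving ordinary text (English words, spaces) untouched.
function decodeNumericReferences(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, function (match, ref) {
    var isHex = ref.charAt(0) === 'x';
    var cp = parseInt(isHex ? ref.slice(1) : ref, isHex ? 16 : 10);
    // Out-of-range values are left as written rather than throwing.
    return cp <= 0x10FFFF ? String.fromCodePoint(cp) : match;
  });
}

console.log(decodeNumericReferences('Hanyu Pinyin: &#28450;&#35486;&#25340;&#38899;'));
// Hanyu Pinyin: 漢語拼音
```

Because the regular expression matches only the references themselves, the English words and spaces between them pass through as they are.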

Deepika
deepikaviet09@gmail.com

Pinyin News

news and discussions mainly related to Chinese characters and romanization
