UTF-8 Unicode vs. other encodings over time

Some eight years ago UTF-8 (Unicode) became the most used encoding on Web pages. At the time, though, it was used on only about 26% of Web pages, so it had a plurality but not an absolute majority.

Graph showing growth of the UTF-8 encoding

By the beginning of 2010 Unicode was rapidly approaching use on half of Web pages.
Graph showing a steep rise in the use of UTF-8 and a steep decline in other major encodings

In 2012 the trends were holding up.
Graph showing UTF-8 use on websites, 2001-2012

Note that the 2008 crossover point appears different in the latter two Google graphs, which is why I’m showing all three graphs rather than just the third.

A different source (with slightly different figures) provides us with a look at the situation up to the present, with UTF-8 now on 85% of Web pages. Expansion of UTF-8 is slowing somewhat. But that may be due largely to the continuing presence of older websites in non-Unicode encodings rather than lots of new sites going up in encodings other than UTF-8.
Graph showing growth in Unicode UTF-8 encoding on Web pages, 2010-2015

Here’s the same chart, but focusing on encodings (other than UTF-8) that use Chinese characters, so the percentages are relatively low.
Graph showing Asian-language encodings on Web pages, 2010-2015

And here’s the same as the above, but with the results for individual languages combined.
Graph showing Asian-language encodings on Web pages, 2010-2015, by language

By the way, Pinyin.info has been in UTF-8 since the site began way back in 2001. The reason that Chinese characters and Pinyin with tone marks appear scrambled within Pinyin News is that a hack caused the WordPress database to be set to Swedish (latin1_swedish_ci), of all things. And I haven’t been able to get it fixed; so just for the time being I’ve given up trying. One of these days….
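
For what it’s worth, here is roughly how one could check where the mismatch lives in MySQL. This is only a sketch; the database name (‘wordpress’) and table name (wp_posts) below are standard WordPress placeholders, not necessarily what this site actually uses.

    -- Show the collation MySQL has recorded for each table in the WordPress database.
    -- 'wordpress' is a placeholder for the real database name.
    SELECT table_name, table_collation
      FROM information_schema.tables
     WHERE table_schema = 'wordpress';

    -- Collations are also recorded per column, so a table can look converted
    -- while a text column is still latin1_swedish_ci.
    SHOW FULL COLUMNS FROM wp_posts;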


4 thoughts on “UTF-8 Unicode vs. other encodings over time”

  1. Ah, the wrong DB character encoding… Yes, I made that mistake too when I started out with these things. It leads to “funny” effects, so I’m not sure you will ever be able to fix this, having Unicode characters in a latin1 DB (see also the note after these comments). Perhaps you already know this, but just in case you don’t:

    ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

    I also found this one-liner (fill in “dbname”), if you can access MySQL through the command line:
    DB="dbname"; ( echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8 COLLATE utf8_unicode_ci;'; mysql "$DB" -e "SHOW TABLES" --batch --skip-column-names | xargs -I{} echo 'ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;' ) | mysql "$DB"

  2. I always tell people to prefer PostgreSQL over MySQL, and the most important reason is the inane encoding settings of MySQL. Years ago when I last worked with that RDBMS, I remember I had to set encoding to UTF-8 in no less than *four* places: the database, the table, the connection, and somewhere else. Crazy.
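
A caution on the CONVERT TO CHARACTER SET approach in the first comment: that statement transcodes the stored bytes, so running it on latin1 columns that already hold UTF-8 bytes (which is the situation described above) double-encodes the text. The workaround described in MySQL’s documentation is to pass each affected text column through a binary type, which changes the declared character set without reinterpreting the bytes. The sketch below is untested here and uses the default WordPress wp_posts.post_content column purely as an example.

    -- Relabel a latin1 column that already holds UTF-8 bytes, without transcoding.
    -- wp_posts / post_content are WordPress defaults, used only as an example.
    ALTER TABLE wp_posts MODIFY post_content LONGBLOB;
    ALTER TABLE wp_posts MODIFY post_content LONGTEXT
      CHARACTER SET utf8 COLLATE utf8_unicode_ci;

    -- And, per the second comment, the client connection is one of the other
    -- places the encoding has to be set:
    SET NAMES utf8;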
