Saturday 11 October 2003

There ain’t no such thing as plain text

I wish Joel Spolsky had published his excellent introduction to Unicode and character encoding a week earlier, because then I wouldn’t have wasted a couple of hours trying to write a snippet of PHP code to convert Japanese characters to Unicode character entities. In the fourth paragraph of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), Joel Spolsky reveals what finally provoked him into writing his essay:

When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

That statement knocked me for a six. Historically—as Joel Spolsky implies—American programmers have been indifferent to dealing with languages other than English. But PHP started out in 1995 as a series of Perl scripts written by Rasmus Lerdorf, who was born in Greenland, lived in Denmark for much of his childhood, then spent a number of years in Canada before moving to the United States. In 1997, Zeev Suraski and Andi Gutmans—both Israelis, who between them speak Hebrew, English, and German—completely rewrote the core PHP code, turning it into what became known as the Zend engine. If anyone would be sensitive to language and character set issues, you’d surely expect it to be these guys and their colleagues.

Yet a Google search on “unicode support in php” turned up an interesting, and ultimately dispiriting, series of threads. Firstly, this reply by Andi Gutmans to an October 2001 question on the PHP Internationalization Mailing List about “the current status of multi-byte character handling in PHP, and also some kind of forecast of when it is expected to work in a stable manner”:

No one seems to be working seriously on full Unicode support except for the mainly Japanese work Rui [Hirokawa] has done. I thought that the Email from Carl Brown was quite promising but adding good i18n support to PHP will require much more interest and volunteers. It seemed that not many people were very interested.

More recently, l0t3k replied to an August 2003 question about Unicode support:

i certainly am not an official voice of PHP, but some movement is happening (albeit slow and scattered) to provide some form of Unicode support. the Japanese i18N group have recently created a patch to allow the engine to process scripts in various encodings, Unicode included [1].

[1] refers to another thread in which Masaki Fujimoto reported on progress with the i18n (internationalization) features of the Zend Engine 2, adding:

yes, I know most of you (== non-multibyte encoding users) do not care about this kind of i18n features (and somehow feel ‘more than enough’) as the comments in http://bugs.php.net/bug.php?id=22108 shows, so I paid close attention not to do any harm with original codes: everything is done in #ifdef ZEND_MULTIBYTE.

What’s really dispiriting is the conversation at PHP Bugs to which Masaki Fujimoto refers, where the issues of Unicode and internationalization are met with either indifference, hostility, or—as in this question—both:

And why on earth would you save PHP files in any other format than ascii?

Color me flabbergasted. If you tried to imagine the target audience for Joel Spolsky’s essay, this guy is standing right on the bullseye. As Joel explains:

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

But why did I want to use PHP to convert Japanese characters to Unicode entities anyway? Procrastination mainly (anything to avoid my essay about the George W Bush aircraft carrier stunt). Curiosity too. While working on another essay, about Ozu Yasujiro, I wanted to make a table listing his films: their Japanese titles, translations of those titles, the actual English titles, and the year of release.

Since I have a PC exclusively devoted to Japanese (so that I can use some native Japanese applications), I wound up creating the table in Word 2000 and saving the document as HTML. When I examined the HTML in Dreamweaver on my main (English) PC, I noticed that Word had transformed the Japanese characters into the equivalent Unicode entities. When I type Japanese into Dreamweaver, on the other hand, the Japanese characters simply appear within the HTML.

As an example, the next two lines of text both read Ochazuke no aji (The Flavor of Green Tea over Rice, the title of one of Ozu’s films) and, if you have Japanese support enabled, should look the same:

お茶漬けの味

お茶漬けの味

But, if you check the source code in your browser, you’ll see Japanese characters in the first line and Unicode entities in the second, like this:

お茶漬けの味
&#12362;&#33590;&#28460;&#12369;&#12398;&#21619;

I might be utterly mistaken, but I can’t help thinking that using the Unicode entities might be preferable (ie more reliable) to using the actual Japanese characters. Though, as long as the character encoding is set to utf-8, it may not make any difference. I’d be interested in what anyone else thinks about this. Since I thought it would be useful to get some advice from the experts, I’ve emailed Joel Spolsky and Masaki Fujimoto. (I didn’t think there was any point in bothering Mr ASCII.)
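In case it helps anyone attempting the same thing, the kind of conversion I had in mind can be sketched in a few lines of PHP, provided the mbstring extension is available. This is only an illustration of the general approach (and certainly not the snippet I failed to produce), so treat it as a rough sketch rather than tested code:

    <?php
    // Convert each non-ASCII character in a UTF-8 string into a decimal
    // numeric character reference (e.g. お becomes &#12362;).
    // Assumes the mbstring extension is available.
    function utf8_to_ncr($text) {
        $out = '';
        for ($i = 0, $n = mb_strlen($text, 'UTF-8'); $i < $n; $i++) {
            $char = mb_substr($text, $i, 1, 'UTF-8');
            if (strlen($char) > 1) {
                // UCS-4BE yields the code point as a 4-byte big-endian integer.
                $cp = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
                $out .= '&#' . $cp[1] . ';';
            } else {
                $out .= $char; // plain ASCII passes through unchanged
            }
        }
        return $out;
    }

    echo utf8_to_ncr('お茶漬けの味');
    // prints: &#12362;&#33590;&#28460;&#12369;&#12398;&#21619;
    ?>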

Update

Masaki Fujimoto and Joel Spolsky graciously replied to my email, basically confirming the points that Michael Glaesemann made in his comment. Joel Spolsky wrote that he has been using UTF-8 for all the translations of Joel on Software (currently translated into 28 languages) and “has not had a single person complain about not being able to read it”.

Whilst favoring the use of characters rather than Unicode entities, Masaki Fujimoto pointed out that entities offer two additional advantages:

  • avoiding implicit encoding translation (some software, including PHP, can implicitly convert one encoding to another, and using entities allows you to skip this);
  • avoiding null-byte problems (UTF-16 and UTF-32 strings can contain null bytes, which can cause various kinds of problems with Unicode-unaware software; see the sketch below).
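To make the null-byte point concrete, here is a small illustration of my own (not Fujimoto-san's). PHP's ordinary string functions count bytes, so the null bytes inside a UTF-16 string are visible to them:

    <?php
    // The letter 'A' encoded as UTF-16LE is the two bytes 0x41 0x00,
    // so byte-oriented functions see an embedded null byte.
    $utf16 = mb_convert_encoding('A', 'UTF-16LE', 'UTF-8');
    var_dump(strlen($utf16));       // int(2)
    var_dump(strpos($utf16, "\0")); // int(1): the embedded null byte
    ?>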

Fujimoto-san also explained that the Japanese think of Unicode entities as a “kind of work around for the japanese-unavailable-environment” and so would never normally use entity references. He also noted that it is not really all that difficult to make PHP completely Unicode-aware, with the main roadblocks being that:

  • because PHP does not distinguish binary data from strings, it is not possible to change the “string type” to a “unicode-aware string type” without breaking binary content;
  • most of the core PHP developers live in Europe, so they are not so interested in the Unicode issue.

I have a feeling that the second roadblock will be easier to dismantle than the first: the interest of European PHP developers in Unicode will increase in proportion to the economic influence of China.


Comments

i don't understand how you can repeat joel's suggestion that international applications with PHP are impossible, while regularly using PHP to maintain a weblog with text in multiple languages. isn't your weblog an "impossible" international application with PHP?

Posted by scott reynen on 12 October 2003 (Comment Permalink)

Hi, Jonathon. I've just started reading your site regularly (thanks for including an RSS feed!) and have been impressed with the depth in which you cover your topics.

As for using Unicode entities rather than signaling the encoding through other means (such as the HTTP content header, a meta tag, or the XML declaration for XHTML and other XML applications), I can only see this as practical in generated document situations, i.e., something is processing the page rather than you creating it by hand. I wouldn't enjoy pushing every document through Word to get the encodings, or heaven forbid, keying the entities outright. *shudder* If the text is processed in some way, for example through a content management system, it wouldn't be so difficult as it's all done behind the scenes. And looking at the code wouldn't be pretty. Also, the content would be really hard to edit if you didn't have access to something that would interpret the entities for you.

And you've just increased the file size. You're now writing an 8-character entity for each character: 64 bits—8 bytes!—for each character in ISO Latin 1. In UTF-8, some characters are 1 byte—the 'lower' code points—while others are 2, 3, or 4 bytes—the 'higher' code points. A lot of Japanese characters have 3-byte representations in UTF-8, I believe.

As an experiment, I copied your phrase and pasted it 1,000 times into BBEdit. I saved the Unicode entities version as ISO Latin 1 and the straight text version as UTF-8 (including the byte order mark). The ISO Latin 1 file weighs in at 48,382 bytes, which corresponds nicely with 1,000 copies of 6 characters represented as an 8-character string at 1 byte (8 bits) per character: 48,000 bytes. The UTF-8 version is a lean 19,385 bytes. That comes out to just over 3 bytes per character, not taking into account the BOM and other cruft BBEdit might deem necessary.

(Just out of curiosity, I checked out UTF-8 (no BOM) and UTF-16 (no BOM) versions: 19,382 bytes and 14,382 bytes respectively. Looks like the BOM is 3 bytes in UTF-8. In UTF-16, all characters are 2-byte encoded, giving you huge savings for Japanese text.)

Granted, there are browsers and other software agents that don't handle encoding properly, so I can see your desire to use Unicode entities for reliability. But luckily they're becoming more scarce as time goes on and more people are paying attention to internationalization issues.

The options for declaring the encoding are not perfect at this point, but for the most part you can get the job done. Content header encoding declarations can be made at the server level or using .htaccess files. Using meta tags is pretty straightforward. XML declarations (also called prologs) can cause problems with some browsers—MSIE for one. Although the prolog is technically required for XHTML, leaving it out doesn't seem to hurt anything in current browsers.
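For instance, declaring UTF-8 from PHP itself or in the document head looks something like this (illustrative only):

    <?php
    // Send the charset in the HTTP Content-Type header
    // (must be called before any output is sent).
    header('Content-Type: text/html; charset=utf-8');
    ?>
    <!-- ...or declare it in the document itself with a meta tag: -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">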

So, from an ease of use perspective, I think using a proper encoding declaration is better. With the proper declaration, you only need to use entities for & and http://www.webstandards.org/learn/askw3c/dec2002.html

I also found Zeldman's Designing with Web Standards really useful—for a lot more than just encoding guidance! http://www.zeldman.com/dwws/

Longer than I expected. Hope you find it worthwhile!

Posted by Michael Glaesemann on 12 October 2003 (Comment Permalink)

Scott, perhaps I should have been more specific about my own use of PHP -- I'm using it to display a random image (linked to a previous entry) each time the main index page loads and also to create my blogroll dynamically (via blogrolling.com). In neither of those cases am I using PHP to process Japanese text, which was my objective when I tried (and failed) to write the PHP snippet to convert Japanese text to Unicode entities.

Michael, thanks for your lucid, comprehensive explanation. Any one of the reasons you suggest -- the difficulty of creating the entities, the near impossibility of editing the text once it had been converted to entities, and the significantly increased file size -- would be reason enough to think twice about converting CJK characters to Unicode entities. These three factors taken together suggest that such an approach would be lunacy. Thanks also for taking the time to do the file size experiments -- and for the pointer to Zeldman's book, which I was aware of but hadn't got around to buying.

Posted by Jonathon on 12 October 2003 (Comment Permalink)

My apologies, Scott, for confusing my own inability to code in PHP with PHP's lack of Unicode support. Your "snippet of PHP code to convert Japanese characters to Unicode character entities" does exactly what I was attempting to do:

http://www.randomchaos.com/language/japanese-unicode.php?japanese=%E3%81%8A%E8%8C%B6%E6%BC%AC%E3%81%91%E3%81%AE%E5%91%B3

As I say in my follow-up post:

http://weblog.delacour.net/archives/2003/10/im_not_giving_up_my_day_job_to_become_a_php_programmer.php

"Color me embarrassed."

Posted by Jonathon on 12 October 2003 (Comment Permalink)

I'm glad my comments were helpful. Working with Japanese and English in layout software such as QuarkXPress and InDesign (and now my forays into XHTML) has made me keenly aware of 文字化け and pushed me to learn how to prevent it, both with font choice and encoding issues in moving text from place to place.

Looks like my comment got bit by the entity bug as well! I thought I had properly encoded the 'ampersand' and 'less than' symbol, but some text is missing from the original comment. For completeness, here are the affected paragraphs in their entirety:

----
So, from an ease of use perspective, I think using a proper encoding declaration is better. With the proper declaration, you only need to use entities for 'ampersand' and 'less than'. From a file size perspective, a proper encoding declaration saves you lots of space. You'll never get 100% reliability, even using Unicode entities. Something's always going to screw up in some browser on some platform, even if all of the characters are properly represented.

For a little more on encoding declarations, take a look at this page from the Web Standards Project: http://www.webstandards.org/learn/askw3c/dec2002.html
----

I look forward to reading more from you!

Posted by Michael Glaesemann on 12 October 2003 (Comment Permalink)

A somewhat belated comment. In a recent post on my own weblog, at http://www.brockerhoff.net/bb/viewtopic.php?p=603#603, I tried to write a short phrase in Tibetan. I first tried Unicode entities, but all browsers I tested failed to properly render Tibetan composite characters. Subsequently I wrote some short HTML test files using both entities and UTF-8 encoding, and neither worked properly. I'd be interested in learning whether there's any way around that.

Posted by Rainer Brockerhoff on 20 October 2003 (Comment Permalink)

Great to see bloggers talk about Unicode support, and see that UTF-8 is the standard encoding for top blogs.
I fully agree with Michael. Numeric Character References (NCRs, above called 'Unicode entities') were never designed to be used when you could use the characters directly, only for characters you couldn't otherwise encode, e.g. Japanese characters in a US-ASCII or iso-8859-1 (Latin-1) encoded file.
Of course nobody in Japan wants to use them, you wouldn't want to use them for English, would you? Everything would be totally unreadable. Even if it's only in the source, debugging is just much harder when you can only see the tags, but cannot read the actual text. The same thing of course applies to other languages.
An additional reason for not using NCRs is that they only work in HTML or XML. But try stuffing them into a DB and then searching the character, or using regular expressions, or anything like this. Of course you could keep your whole DB with NCRs, but it would blow up, and many operations wouldn't work anymore. Keeping everything directly encoded, ideally everything in UTF-8, is much more straightforward.
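For example, here is a quick PHP illustration of the searching problem:

    <?php
    // Searching for 味 succeeds against directly encoded UTF-8 text,
    // but fails against the same text stored as numeric character references.
    $direct = 'お茶漬けの味';
    $ncrs   = '&#12362;&#33590;&#28460;&#12369;&#12398;&#21619;';
    var_dump(mb_strpos($direct, '味', 0, 'UTF-8')); // int(5)
    var_dump(mb_strpos($ncrs, '味', 0, 'UTF-8'));   // bool(false)
    ?>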
Some people above seem to suggest that NCRs would work better than (straight) UTF-8. This is not really the case; browser support for NCRs and for UTF-8 always went tightly together. Both fail for Tibetan because the browser doesn't have the right font(s) and rendering logic. Internally, it knows what characters these are.

Posted by Martin Dürst on 21 October 2003 (Comment Permalink)

Martin, thanks for your comment -- particularly the point about the negative consequences of storing Numeric Character References in a database. Given that a number of weblog tools store the content in a DB -- this Movable Type weblog uses MySQL -- it's clear to me now that keeping everything directly encoded and using UTF-8 is absolutely the best procedure to adopt.

Posted by Jonathon on 21 October 2003 (Comment Permalink)

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour