Sunday 12 October 2003

I’m not giving up my day job (to become a PHP programmer)

Having colored myself “flabbergasted”, I now need to color myself “embarrassed”, since Scott Reynen has comprehensively demonstrated that PHP does have limited Unicode support, which he uses to create his Daily Japanese Lessons. Even more impressively, Scott followed up by doing what I couldn’t manage—writing a snippet of PHP code to convert Japanese characters to Unicode character entities. As I admitted in Scott’s comments, “I should leave PHP coding to those who actually know what they’re doing”.
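For readers curious what such a conversion actually involves, here is a minimal sketch of the idea in Python rather than PHP (the function name is my own, not Scott's): every character outside the ASCII range is replaced with its decimal numeric character reference, and plain ASCII is left untouched.

```python
def to_entities(text: str) -> str:
    """Replace non-ASCII characters with &#NNNN; numeric references."""
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )

print(to_entities("再見!"))  # &#20877;&#35211;!
```

Python's standard library can also do this in one step via the `xmlcharrefreplace` error handler: `text.encode("ascii", "xmlcharrefreplace").decode("ascii")` produces the same result.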

Regarding the issue of which is better—CJK characters or Unicode entities—Michael Glaesemann’s comment has convinced me beyond any doubt that it’s best to stick with the characters.

Comments

I have created a bookmarklet ( http://minutillo.com/steve/weblog/2003/5/15/character-to-entity-encoder-bookmarklet ) that will convert all non-ASCII characters in all input fields on the current page to their Unicode entity equivalents. Just put it in your browser's toolbar and you can convert characters to entities on any page. I use it on my own site to post weblog entries with Chinese characters. I also sometimes use it to post comments on other people's sites, when I'm not sure their systems can handle the characters directly.

再見!

Posted by steve on 12 October 2003 (Comment Permalink)

Steve, thanks for this. I tried it and it works perfectly. I've left a comment on your May 15 post asking what you think of Michael's argument that one is better off sticking with characters rather than converting to entities.

Posted by Jonathon on 12 October 2003 (Comment Permalink)

Well, at least you admitted it. Some might not even do that. ;)

Posted by Jonathan Weed on 12 October 2003 (Comment Permalink)

Responding here so there can be more discussion:

Yes, the "real" characters are better, and should be used whenever possible. They take up less space when stored, and are easier to process (converting into new formats, searching, saving into non-HTML and non-XML formats, and so on). They're just the right way to do things.

But lots of software out there still hasn't caught up with Unicode: it treats characters as bytes and can only deal with ASCII. When forced to deal with software like that, translating your characters into entities gives them a much better chance of getting through the publishing process unscathed.

When posting a comment to a foreign weblog, there are many places where characters can get munged:

- When the browser submits the comment (or any other type of data) to the server, there can be an encoding mismatch: the browser might send the comment in UTF-8, for example, while the server expects Shift_JIS.

- When the server receives the comment and stores it in a database, the characters could again be munged if the DB tries to re-encode them in some internal format.

- When another user comes back to the site and views the comment, if the server doesn't carefully set the Content-Type header, the browser might not know how to interpret the characters.

Browsers nowadays have most of these problems licked, as long as the server very carefully sets and minds the Content-Type headers. But server-side tools don't always do that, if the developers haven't kept these issues in mind. Entities, being just ASCII characters, avoid all of these issues and work in almost every case.
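To make the first failure mode concrete, here is a small Python sketch (the editor's own illustration, not from the original comment) of what happens when UTF-8 bytes are decoded under the wrong encoding, and of why ASCII-only entities survive the same trip:

```python
comment = "再見"  # "Goodbye" in Chinese

# Failure mode: the browser sends UTF-8 bytes, but the server decodes
# them under a different single-byte encoding. Latin-1 accepts any byte
# sequence, so the decode "succeeds" and silently produces garbage.
utf8_bytes = comment.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
assert mojibake != comment  # two characters have become six garbled ones

# The same text converted to numeric character references is pure ASCII
# and passes through any byte-oriented pipeline unchanged.
entities = comment.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(entities)  # &#20877;&#35211;
```

The key asymmetry: a wrong decode of multi-byte UTF-8 raises no error, it just corrupts the text, whereas an entity string contains nothing that any legacy encoding can misinterpret.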

Posted by steve minutillo on 14 October 2003 (Comment Permalink)

Steve, thanks for following up -- it seems that, as is all too frequently the case, there's no simple solution.

An interesting discussion is taking place on Simon Willison's weblog, in response to his post, Practical Unicode, Please!

http://simon.incutio.com/archive/2003/10/13/practicalUnicode

Doug Sauder points out that:

"First, because of Han unification, some Chinese characters require an identification of the language (or language/locale combination) in order to be rendered correctly. Note that the simplified glyphs used in PRC share code points with the traditional glyphs used in Taiwan. When the character encoding is GB-2312 (PRC) or BIG5 (Taiwan), the language is implicitly identified. Not so with UTF-8. Also, Japanese Kanji use the same code points for the Chinese characters but also use different glyphs for certain characters."

This is exactly the issue that I wrote about here:

http://weblog.delacour.net/archives/2003/09/mojikyo_fixes_a_bug_in_me.php

Posted by Jonathon on 14 October 2003 (Comment Permalink)

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour