Saturday 10 May 2003

Enabling CJK language support

Following the lead of Trevor Hill at glome.org, Stavros has posted two entries that include Korean characters: This Is a Test of Korean and Seeing Asian Characters. The screenshot below shows how the Korean characters appear as question marks without Korean Language Support enabled (in Windows 2000):

Korean characters appear as question marks without Korean language support

With Korean Language Support enabled (instructions here), the Korean is rendered properly:

Korean characters appear correctly with Korean language support installed

Having enabled Korean, I also installed UniPad, a Unicode text editor that I expected would allow me to enter Korean text—I was hoping to eventually impress Stavros by displaying a Korean sentence in a weblog entry. No such luck. I just couldn’t figure out how to get the individual Hangul components (Jamo?) into a single syllable. (The UniPad Help indicates this isn’t supported so I tried Word 2000 and failed. Yet Stavros says he can do it with Microsoft’s execrable Notepad, which I also used without success.) So my Korean text-entry career is stalled. (Well, to be honest, it was stalled before I turned the key in the ignition, since the sum total of my Korean knowledge is what I’ve managed to glean from the introduction to the Berlitz Korean for Travellers phrasebook.)
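
From what I can gather (and I’m piecing this together from the Unicode documentation, so treat it as an assumption on my part), the Unicode machinery itself is perfectly capable of combining conjoining Jamo into a precomposed syllable; it’s the text editors that are letting me down. A quick Perl sketch, assuming Perl 5.8 and its bundled Unicode::Normalize module:

    use Unicode::Normalize qw(NFC);
    binmode STDOUT, ':utf8';   # send UTF-8 to the terminal

    # Three conjoining Jamo -- HIEUH (U+1112), A (U+1161), NIEUN (U+11AB)...
    my $jamo = "\x{1112}\x{1161}\x{11AB}";

    # ...which NFC normalization composes into the single precomposed
    # syllable HAN (U+D55C).
    my $syllable = NFC($jamo);
    printf "%s is U+%04X\n", $syllable, ord($syllable);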

In any case, enabling Korean (or Japanese) in the OS only gets me halfway there. If I’m to display CJK in my weblog posts, I also have to change my character encoding from charset=iso-8859-1 to charset=UTF-8. It seems there are three ways to do this (I’ve sketched the first and third below):

  1. Hard code the character encoding as charset=UTF-8 in the meta http-equiv tag of each Movable Type template.
  2. Change the character encoding by modifying the “Preferred Language” in my MT user profile (not an option because only “US English” is available).
  3. Set the PublishCharset flag in mt.cfg to UTF-8.
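
For reference, here’s roughly what the first and third options would involve. I haven’t tried either yet, and the exact mt.cfg syntax below is my guess, modelled on the other name/value directives in that file:

    Option 1 -- hard-coded in the <head> of each Movable Type template:

        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

    Option 3 -- a single line added to mt.cfg:

        PublishCharset utf-8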

Since Stavros didn’t specify which procedure he adopted, any suggestions will be gratefully received (and adopted).

Something else that puzzled me is that although I couldn’t originally read the Korean characters in Trevor Hill’s post (this was when I had Japanese enabled in Windows 2000 but not Korean), the Chinese and Japanese characters appeared to render correctly. I say “appeared” because even though the Japanese characters are correct, I’m only guessing that the Chinese characters are also correct (the same three characters appear in both the Chinese and Japanese words for SARS).

Explanation of how diseases such as diabetes and SARS are rendered using Chinese characters in both the Chinese and Japanese languages

This would seem to indicate that identical Chinese and Japanese characters share the same Unicode code points, which is not what I’d have expected. It could be that if I enable Chinese in Windows 2000, the Chinese characters for SARS will be different. Which leads me to another question: do I need to enable both Traditional and Simplified Chinese? I assume I do—since Traditional Chinese is used in Taiwan and Hong Kong while Simplified Chinese is used in mainland China—though naturally I’m curious as to which Trevor Hill has used.
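
One way to check, I suppose, would be to print the code points directly. Here’s a quick Perl sketch, assuming Perl 5.8 and a script saved as UTF-8 (I’ve used a well-known traditional/simplified pair rather than the SARS characters themselves):

    use utf8;                  # string literals in this script are UTF-8
    binmode STDOUT, ':utf8';   # print UTF-8 to the terminal as well

    # The traditional character is a single code point, whichever language
    # it appears in; the simplified variant has a code point of its own.
    printf "%s = U+%04X\n", "漢", ord("漢");   # U+6F22 (traditional, shared)
    printf "%s = U+%04X\n", "汉", ord("汉");   # U+6C49 (simplified Chinese)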

Finally, I should confess to having had some misgivings about this entire CJK enterprise. Even though it’s a pain to have to create images of Japanese text in Photoshop every time I want to include Japanese characters in a weblog entry, at least everyone can see the characters whether or not they have Japanese enabled in their OS. If I start to encode Japanese text in my entries, only visitors with Japanese support enabled will be able to see the Japanese characters. Everyone else would see the Japanese text as question marks or hollow boxes. This struck me as a significant problem, since I didn’t want to force visitors to enable Japanese in their browser or OS.

On the other hand, never in a million years would I render English text as an image since this causes major accessibility problems:

  • Text in an image can’t be resized.
  • The ALT text would have to replicate the text in the image.
  • Text in an image can’t be indexed by Google.

Therefore, if I wouldn’t use “image text” in English, why should I use it for Japanese (or any other language)?

Accordingly, I’ve come to the conclusion that it’s actually preferable to render Japanese text properly, using Unicode/UTF-8 encoding. Anyone who is sufficiently interested in seeing the Japanese characters can enable Japanese support in their OS (Windows, Macintosh, or Linux). Everyone else can tolerate the question marks or hollow boxes or skip that entry.

It occurs to me—and I’m sure that I’m not the first to come to this conclusion—that the best way to popularize Unicode and to celebrate the intrinsic beauty of language is to write our weblog posts not just in English but in any other language we understand and love. So thanks to Trevor and Stavros (apologies to anyone else I’ve missed) for leading the way.

Comments

So, first of all, here's what I think about what you should do with Movable Type...

Number 1 above needs to be done to tell the browsers what charset they're seeing...

Number 2 above only changes the language that Movable Type uses for menus, etc... I don't think you need to or should change this if you want to use MT with an English interface...

Number 3 above is what will cause MT to specify utf-8 in the headers when it returns a page to the browser. Actually, I think you can reference the value of the PublishCharset statement in your templates too, with one of Movable Type's special tags, obviating the need for Number 1 as well...

Now, if you have problems after all that, such as having to change your encoding when posting (more than once), or something like that, you may have to enter the code and simply hack MT to always use utf-8. That's what I did, but I don't think it's necessary anymore since the PublishCharset statement was introduced... :)

Now, as for the character codes in my post you imaged, I may have used the Japanese IME to enter the Chinese, just because it was easier and I knew the characters already... The thing with Unicode is that sometimes the same characters share one code across all three languages, because they're all exactly the same character. So if you're typing in Chinese with a Chinese IME, some of the characters will be different codes than Japanese characters, some will be the same, and some may look the same but still have different codes...

This process of unifying identical characters into one code point was called "Han Unification", a hotly debated process a few years ago. There was a lot of argument about which characters were the same and what differences were significant... But so many of the characters are exactly the same across Chinese, Japanese, and Korean, it would have been a huge waste of encoding space not to go through this process, not to mention that there would be sooo many identical copies of characters in the standard... (There still are some now...)

Posted by Trevor Hill on 11 May 2003 (Comment Permalink)

Oh, BTW the Chinese characters I used referring to SARS are all the same in Japanese, Traditional Chinese, Simplified Chinese, and Korean, because no one has simplified these characters at all -- everyone still uses the 'traditional' characters for them.

In a later post, I did use 'simplified' characters though... so for those you'd have to have the ability to read simplified Chinese... :)

Posted by Trevor Hill on 11 May 2003 (Comment Permalink)

I shudder at the effort involved in doing all that to see characters, though I'll probably give it the old college try. But for those who can't or won't enable them, it would be good policy to always accompany them with transliterations.

Posted by language hat on 11 May 2003 (Comment Permalink)

languagehat - All that messing around is needed to publish posts using those characters, but it's a considerably easier process to patch up your OS (and for many not necessary) to *read* them...(hopefully I didn't miss your point, there).

Jonathon, FWIW I whacked in the charset=UTF-8 to the meta tags in my templates, and

1) added the 'NoHTMLEntities 1' statement to mt.cfg
2) hacked lib/MT/App.pm and changed the send_http_header to

sub send_http_header {
    my $app = shift;
    my($type) = @_;
    $type ||= 'text/html; charset=utf-8';
    # if (my $charset = $app->{charset}) {
    #     $type .= "; charset=$charset"
    #         if $type =~ m!^text/! && $type !~ /\bcharset\b/;
    # }
    if ($ENV{MOD_PERL}) {
        $app->{apache}->send_http_header($type);
    } else {
        $app->{cgi_headers}{-type} = $type;
        print $app->{query}->header(%{ $app->{cgi_headers} });
    }
}

(hopefully that will look ok when posted).

Both changes were suggested by Trevor in the MT Forums...

Posted by wonderchicken on 11 May 2003 (Comment Permalink)

This discussion is now closed. My thanks to everyone who contributed.
