Sunday 05 October 2003

Not a bug in Mojikyo, but rather a feature of Unicode

It wasn’t a bug in Mojikyo, nor the fact that Windows is a sorry excuse for an operating system—rather it turned out to be the inherent design of Unicode that limits my ability to display (on a Web page) both variants of the Chinese character mentioned in my previous post.

Variants of the boku character used in the title of Kafu's Bokuto kidan

A comprehensive explanation came via email from Mr Tanimoto of the Mojikyo Institute, confirming what Brian Hunziker and gaemon had suggested in their comments: that the two variants of the character “boku” (shown at the left) have the same Unicode number (or, in Unicode-speak, “share a single codepoint”). In his comment, Brian linked to a screenshot of the Macintosh Character Palette showing how Mac OS X allows one to choose between the two variants; in response to an email request, he graciously made new screenshots and gave me permission to reproduce them. In the illustration below, the green triangles to the right of certain characters indicate that alternatives exist to the character being currently displayed. At the bottom of the Character Palette, a button provides access to the character variants which share the same codepoint.

Macintosh Japanese character palette

In a BYTE article titled Unicode Evolves, Ken Fowles explains how codepoints work:

The Unicode/ISO10646 standard provides one uniform 16-bit encoding that can store information from all the world’s commonly used scripts. The key word here is “standard.” Unicode itself is a standard, not a technology. Where technology gets involved is how the software makes use of the standard.

The Unicode concept of parking characters into a 64-KB space sounds simple enough — until you realize there are three or four times that many characters in the world’s written languages. So a key part of Unicode’s design is to handle that 64-KB space as valuable real estate since it has to support a large number of scripts in one consistent encoding.

Several parts of Unicode’s design help it maximize this use of what’s called a codepoint, the permanent Unicode address of each character. For example, diacritic marks in most other character sets are not stored as unique characters, but in Unicode each diacritic can be separately tracked and shared among several characters. Codepoints are conserved through Han Unification, sort of like a highway carpool lane where two or three characters with similar appearance share the same space. To Unicode, small differences in appearance should be handled as a font issue, not by inventing another character encoding. Also, Unicode does not guarantee a particular sort order, since software should handle that separately.

Thus, the two variants of “Kafū’s boku character” share a single codepoint (濹, U+6FF9). The crucial concept, the one that led me to wonder if a bug in Mojikyo caused it to produce the same Unicode character entity for each variant, is that, as Mr Tanimoto explained in his email, Unicode does not differentiate between design differences within the same character: each character is assigned a codepoint and “the judgement of which design is adopted is left to the font maker”.

The Mojikyo system, on the other hand, takes an entirely different approach by separately registering all the different designs of a particular character and assigning to each variant a separate Mojikyo number. Mr Tanimoto illustrated the relationship between Unicode and Mojikyo—as it applies to the boku character—with an ASCII diagram in his email, which I’ve recreated here:

Relationship between Mojikyo and Unicode numbers for boku character
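
For readers who prefer code to diagrams, the same many-to-one relationship can be sketched in a few lines of Python. The lookup table is purely illustrative: the two Mojikyo numbers and the decimal entity value 28665 come from this post, but the dictionary and the `describe` helper are my own hypothetical scaffolding, not Mojikyo's actual data format.

```python
import html

# Hypothetical lookup table. The Mojikyo numbers and the entity value
# (28665) are quoted in this post; the dict itself is only an
# illustration, not Mojikyo's real data format.
MOJIKYO_TO_UNICODE = {
    "050021": 28665,
    "079131": 28665,
}


def describe(mojikyo_number: str) -> str:
    """Return the U+XXXX notation for the codepoint of a Mojikyo number."""
    codepoint = MOJIKYO_TO_UNICODE[mojikyo_number]
    return f"U+{codepoint:04X}"


# Two distinct Mojikyo numbers collapse to one Unicode codepoint...
assert describe("050021") == describe("079131") == "U+6FF9"

# ...so the numeric character entity alone cannot say which design is
# wanted; that choice is left to whatever font renders the character.
char = html.unescape("&#28665;")
assert ord(char) == 0x6FF9
print(describe("050021"), char)
```

Going from Mojikyo to Unicode loses information, which is why the reverse trip (from the entity back to a specific variant) is not possible without extra help from the font or the application.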

As Brian Hunziker’s screenshot shows, the Hiragino Mincho Pro font includes both variants:

Detail of Macintosh Character Palette showing font variant selection

Unfortunately, as one might expect, the IME Pad (the Windows XP “equivalent” of the Macintosh Character Palette) and MS Mincho font combo leave a lot to be desired:

Windows XP IME pad

I wrapped the word “equivalent” in quotation marks because there is really no way that the butt-ugly Windows IME Pad can compete with the design, functionality, and appearance of the Macintosh Character Palette. Nor do any of the Windows Japanese fonts (MS Mincho, MS Gothic, and Arial Unicode MS) include the range of character variants included in Apple’s beautiful Hiragino Mincho Pro font.

“I hope I’ll not derail this into a Mac vs. PC discussion as that certainly is not my intention”, wrote Brian Hunziker in his comment. That’s OK, I’m sufficiently irritated to derail it myself:

<rant>The relentless mediocrity of Japanese support under Windows absolutely typifies Microsoft’s “near enough is good enough” approach to functionality and interface design. In fact, Windows Japanese support seems about as good as that offered by the Japanese Language Kit I was using on the Macintosh in the late eighties.

I get so tired of hearing about all the super-smart people who work for Microsoft when it’s abundantly clear that either they don’t have a clue about how to do things properly or else they don’t give a rat’s arse about anything but gouging money out of users and causing us grief.

Using the Windows operating system—as distinct from using Windows applications, many of which are superb—is like having to take photographs with a Soviet Zorki or Kiev camera when you could be using a Leica or a Hasselblad. Sure, you can take great pictures with a shitty camera but, since you’re constantly fighting the deficiencies in the equipment, there’s hardly any joy in the process. Elegance is one word that’s conspicuously absent from the Microsoft vocabulary.</rant>

Why don’t I switch? Primarily because I have thousands of dollars invested in Windows applications. Though, as I said to Brian Hunziker in an email, his screenshots “may have gently nudged me onto the slippery slope towards buying a Macintosh.”

Until then, I’ll rely on the Mojikyo Character Map to make up for the deficiencies in Windows, using Mojikyo’s RTF output to copy the character variant I need to Photoshop via Word (for some reason, the RTF output won’t paste directly into Photoshop for Windows). I only have access to all the character variants, of course, because I’ve installed the Mojikyo fonts. And, regardless of which operating system you use, the chances are you’ll see Mojikyo font 050021 instead of Mojikyo font 079131 when I include the &#28665; Unicode entity, like this: 濹.

That’s the reason that I’m using images to illustrate the characters—and to do that I’m taking advantage of another service offered by the Mojikyo Institute: links to 24x24 and 96x96 pixel GIF images of all the characters included in the Mojikyo character set. I’ve linked to the 24x24 pixel GIFs (in the previous paragraph), using these IMG tags:

<img src="http://www.mojikyo.gr.jp/gif/050/050021.gif" alt="Mojikyo font 050021" name="mojikyo_font_050021" width="24" height="24" />

<img src="http://www.mojikyo.gr.jp/gif/079/079131.gif" alt="Mojikyo font 079131" name="mojikyo_font_079131" width="24" height="24" />

The 96x96 pixel versions look like this:

Mojikyo font 050021    Mojikyo font 079131

and require the following links:

<img src="http://www.mojikyo.gr.jp/gif96/050/050021.gif" alt="Mojikyo font 050021" name="mojikyo_font_050021" width="96" height="96" />

<img src="http://www.mojikyo.gr.jp/gif96/079/079131.gif" alt="Mojikyo font 079131" name="mojikyo_font_079131" width="96" height="96" />

This means you can embed any of the Mojikyo characters in a Web page, without requiring that visitors have the Mojikyo fonts installed. (Note that the user license does not allow the GIF images to be downloaded, redistributed, or loaded onto another server.)
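
If you need more than a couple of characters, the IMG tags can be generated rather than typed. The following is a minimal sketch that assumes the URL pattern visible in the four examples above holds generally: a `gif` or `gif96` folder, then a subfolder named after the first three digits of the six-digit Mojikyo number. That pattern is inferred from these examples only, and the function names are hypothetical. The code just builds links to the Institute's server; it does not download anything.

```python
def mojikyo_gif_url(number: str, size: int = 24) -> str:
    """Build the GIF URL for a six-digit Mojikyo number.

    Assumption: the subfolder is the first three digits of the number,
    and 'gif' / 'gif96' select the 24- and 96-pixel images. This is
    inferred from the example URLs in this post, nothing more.
    """
    if len(number) != 6 or not number.isdigit():
        raise ValueError("expected a six-digit Mojikyo number")
    folders = {24: "gif", 96: "gif96"}
    if size not in folders:
        raise ValueError("size must be 24 or 96")
    return f"http://www.mojikyo.gr.jp/{folders[size]}/{number[:3]}/{number}.gif"


def mojikyo_img_tag(number: str, size: int = 24) -> str:
    """Return an IMG tag in the same shape as the hand-written examples."""
    return (
        f'<img src="{mojikyo_gif_url(number, size)}" '
        f'alt="Mojikyo font {number}" name="mojikyo_font_{number}" '
        f'width="{size}" height="{size}" />'
    )


print(mojikyo_img_tag("079131", 96))
```

Keeping the number as a string preserves the leading zero in values such as 050021, which an integer would drop.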

And, if you discover a Chinese character that is not currently contained in the Mojikyo character set, you can ask the Institute to create a new character (providing you tell them where you discovered the character).

So, to sum up, I couldn’t access the boku character that Kafū used because none of the default Windows Japanese fonts includes that particular variant. And I couldn’t display Kafū’s boku in a weblog entry because Unicode needs to preserve codepoints so that they don’t run out of permanent addresses. And I was able to find Kafū’s boku on my Windows PC with the aid of the Mojikyo Character Map because the Mojikyo Institute regards Chinese characters as “a very important cultural asset of the human race” and, like Apple, is committed to making that wonderful variety of characters widely available. The fly in the ointment is, as one might expect, Microsoft. (I’m sure Dave Rogers would agree: I was amused, though hardly surprised, when I followed his pointer to these Dan Bricklin photographs of the BloggerCon audience.)


Comments

Chinese character obscurities always prove an interesting read!

Just wanted to let you know that reading the page with Safari on Mac OSX renders 濹 the way Kafū intended it.

Posted by Kim A on 7 October 2003

That really surprises me, Ken. I'd have assumed that the other variant would be the default character. I think I'll skip down to the local Apple Store tomorrow and check out my weblog on one of their machines.

Posted by Jonathon on 7 October 2003

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour