Saturday 08 February 2003

LANG? Enough already!

A couple of posts (“lang="ja" and the attribute selector” and “Can I CITE you on that?”) and lots of comments later, we’re still no closer to resolving how to properly mark up Japanese words written in Romaji (Japanese transliterated using Roman characters).

I started marking up Japanese words—pretty much the only foreign words I include with any regularity—while implementing Mark Pilgrim’s Dive Into Accessibility tips:

Day 7: Identifying your language

You know what language you’re writing in, so tell your readers… and their software.

Who benefits?

  1. Jackie benefits. Her screen reader software (JAWS) needs to know what language your pages are written in, so it can pronounce your words properly when it reads them aloud. If you don’t identify your language, JAWS will try to guess what language you’re using, and it can guess incorrectly, especially if you quote source code or include other non-language content in your pages.
  2. Google benefits, even if you are writing in English, but especially if you are writing in some other language. According to the Google Zeitgeist, 50% of Google users search in languages other than English, and many of these users specify in their Google preferences to only search for pages in specific languages. Google’s language auto-detection algorithms are better than most, but why make Google’s job more difficult?

Except that, as the JAWS information page explains:

JAWS installs with an enhanced, multi-lingual software speech synthesizer, “Eloquence for JAWS”. Languages include: American English, British English, Castilian Spanish, Latin American Spanish, French, French Canadian, German, Italian, Brazilian Portuguese, and Finnish.

No Japanese. Similarly, my copy of IBM Home Page Reader supports German, Spanish, French, Italian, Brazilian Portuguese, Suomi (Finnish), British English, and American English, but not Japanese.

Even if Japanese were included, it’s doubtful how useful the “correct” pronunciation would be, since Japanese is frequently transliterated without the macrons that indicate long vowel sounds: for example, taiheiyo senso instead of taiheiyō sensō (or taiheiyou sensou, which is even worse).

If I set my Google preferences to search only for pages written in Japanese and do a search for “taiheiyo senso”, then apart from pages in which the phrase appears as a file or directory name, the result list only includes pages with taiheiyo senso written in Romaji.

But if I search using the Japanese script for taiheiyo senso (Pacific War), the result list only includes pages in which taiheiyō sensō is written in Japanese script.

Thus, since anyone searching for a Romanized Japanese word will almost certainly want to see results in any language other than Japanese, it’s difficult to see how Google benefits from the inclusion of either lang="ja" or xml:lang="ja".

Bertil Wennergren provided further confirmation by taking the matter to “the high court” (comp.infosystems.www.authoring.html) but, as he noted in his comment, “no solution to the actual problem emerged.” Jukka Korpela summed it up:

This is depressing, but thanks for pointing this out. I think many of [us] have not met this problem yet, either because we have Japanese support installed or because we haven’t visited pages where Romanized Japanese has language markup. The observation reminds us that we should not write language markup in too much detail, until the definitions and implementations have matured. (For an entire document, or for a block quotation, and for a book title, for example, language markup is surely recommendable, and not much work. But even for them, maybe it’s better to suppress the lang markup, if the text is transliterated or transcribed.)

So that’s it for me. From now on I’ll wrap Romanized Japanese words in a span element, use CSS to italicize them, and, where the meaning isn’t immediately clear from the context, add a title attribute to provide it. As in taiheiyō sensō:

<span class="romaji" title="Pacific War">taiheiy&#333; sens&#333;</span>
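
The CSS half of that plan isn’t shown above, so here is a minimal sketch of what it might look like. Only the romaji class name and the span itself come from this post; the style rule and the surrounding sentence are assumptions made up for illustration, not necessarily what this site’s stylesheet contains.

```html
<style>
  /* Assumed rule: italicize anything marked with the post's "romaji" class.
     No lang or xml:lang attribute is involved. */
  span.romaji { font-style: italic; }
</style>

<!-- Hypothetical sentence; the span is the one from the post. The title
     attribute carries the English meaning, which most graphical browsers
     show as a tooltip. -->
<p>The <span class="romaji" title="Pacific War">taiheiy&#333; sens&#333;</span> ended in 1945.</p>
```

The numeric character reference &#333; stands for ō (o with macron), so the macrons survive even if the page is saved in an encoding that has no such character.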


Well, I was going to ask this the first time around, but I thought it would be answered. I didn't see this show up, but if it did, I apologize. I'm nowhere near as good with the standards as most of the people here are, but looking at the W3C spec for lang, it would seem to me that it shouldn't be "ja". Shouldn't it be something like "ja-romaji", because it's not specifically the Japanese word but a romanized version of it? They have en-cockney, so I would deduce the code for yours would be ja-romaji. It seems you have to register the code, though, if it doesn't already exist. I'm curious what others think.

Posted by Eby on 9 February 2003

Japanese written as usual, and Japanese written in "romaji" (there should really be a "lang" attribute on that last word...) are one and the same language.

Cockney, however, is not exactly the same thing as English. It's a _kind_ of English. Thus you can mark it up as 'lang="en-cockney"'.

Japanese in "romaji" can be compared to English in Morse code. It's still exactly the same language.

The fact that a word like "manga" is written in Latin script and not in Japanese characters does not need to be marked up at all. It's evident from the text itself. Those are Latin characters. No need to tell anyone about it. What could be needed is an attribute that says which _system_ of romanization has been used. There are several of them. There are, however, no codes for that.

Posted by Bertilo on 9 February 2003

My first reaction was that the onus here is on browser manufacturers to figure out whether the language-tagged span actually NEEDS a language pack to display it correctly. Clearly, in these romaji examples, none is needed. If all the text is low-ASCII, and has no escape sequences as used in New JIS, clearly no language pack is needed.

It's not that simple, though. What if you are writing a web page in Russian, which uses high-ASCII, and want to cite a Japanese word expressed in Cyrillic (what's the Russian word for romaji?)? You'd have no way of knowing what to do there.

To really cover your bases, it would probably be safest to have a CHARSET declaration for the text chunk. I don't think that's orthodox HTML, and I'd be amazed if a browser actually recognized it, but that would be the logical way to discriminate between the language and the orthography.

Posted by Adam Rice on 9 February 2003

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour