LANG? Enough already!
A couple of posts (“lang="ja" and the attribute selector” and “Can I CITE you on that?”) and lots of comments later, we’re still no closer to resolving how to properly mark up Japanese words written in Romaji (Japanese transliterated using Roman characters).
I started marking up Japanese words—pretty much the only foreign words I include with any regularity—while implementing Mark Pilgrim’s Dive Into Accessibility tips:
Day 7: Identifying your language
You know what language you’re writing in, so tell your readers… and their software.
Who benefits?
- Jackie benefits. Her screen reader software (JAWS) needs to know what language your pages are written in, so it can pronounce your words properly when it reads them aloud. If you don’t identify your language, JAWS will try to guess what language you’re using, and it can guess incorrectly, especially if you quote source code or include other non-language content in your pages.
- Google benefits, even if you are writing in English, but especially if you are writing in some other language. According to the Google Zeitgeist, 50% of Google users search in languages other than English, and many of these users specify in their Google preferences to only search for pages in specific languages. Google’s language auto-detection algorithms are better than most, but why make Google’s job more difficult?
Except that, as the JAWS information page explains:
JAWS installs with an enhanced, multi-lingual software speech synthesizer, “Eloquence for JAWS”. Languages include: American English, British English, Castilian Spanish, Latin American Spanish, French, French Canadian, German, Italian, Brazilian Portuguese, and Finnish.
No Japanese. Similarly, my copy of IBM Home Page Reader supports German, Spanish, French, Italian, Brazilian Portuguese, Suomi (Finnish), British English, and American English, but not Japanese.
Even if Japanese were included, it’s doubtful how useful the “correct” pronunciation would be, since Japanese is frequently transliterated without the macrons that indicate long vowel sounds: for example, taiheiyo senso instead of taiheiyō sensō (or taiheiyou sensou, which is even worse).
If I set my Google preferences to search only for pages written in Japanese and do a search for “taiheiyo senso”, then apart from pages in which the phrase appears as a file or directory name, the result list only includes pages with taiheiyo senso written in Romaji.
But if I search for the same phrase in Japanese script, the result list only includes pages in which taiheiyō sensō is written in Japanese script.
Thus, since anyone searching for a Romanized Japanese word will almost certainly want to see results in any language other than Japanese, it’s difficult to see how Google benefits from the inclusion of either lang="ja" or xml:lang="ja".
Bertil Wennergren provided further confirmation by taking the matter to “the high court” (comp.infosystems.www.authoring.html) but, as he noted in his comment, “no solution to the actual problem emerged.” Jukka Korpela summed it up:
This is depressing, but thanks for pointing this out. I think many of us have not met this problem yet, either because we have Japanese support installed or because we haven’t visited pages where Romanized Japanese has language markup. The observation reminds us that we should not write language markup in too much detail, until the definitions and implementations have matured. (For an entire document, or for a block quotation, and for a book title, for example, language markup is surely recommendable, and not much work. But even for them, maybe it’s better to suppress the lang markup, if the text is transliterated or transcribed.)
So that’s it for me. From now on I’ll wrap Romanized Japanese words in a span element, use CSS to italicize them, and (where the meaning isn’t immediately clear from the context) add a title attribute to provide it. As in, taiheiyō sensō:
<span class="romaji" title="Pacific War">taiheiyō sensō</span>
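For anyone who wants to see the whole thing in one place, here is a minimal sketch of how the span and the CSS rule might fit together. The class name romaji and the title value come from the markup above; the page skeleton, the doctype, the sample sentence, and the exact style rule are my assumptions for illustration, not a record of the stylesheet actually in use.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Romaji markup sketch</title>
<style>
  /* Assumed rule: italicize any span carrying the "romaji" class. */
  span.romaji { font-style: italic; }
</style>
</head>
<body>
<!-- The page language is declared once on the html element; the Romaji span
     deliberately carries no lang attribute, only a class and a title. -->
<p>The <span class="romaji" title="Pacific War">taiheiyō sensō</span> ended in 1945.</p>
</body>
</html>
```

The document as a whole still identifies its language, as Day 7 recommends, while the transliterated phrase is left without language markup of its own, in line with Jukka Korpela’s advice.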

Well, I was going to ask this the first time around, but I thought it would be answered. I didn't see this show up, but if it did, I apologize. I'm nowhere near as good with the standards as most of the people here are, but looking at the W3C spec for lang, it would seem to me that it shouldn't be "ja". Shouldn't it be something like "ja-romaji", because it's not specifically the Japanese word but a romanized version of it? They have en-cockney, so I would deduce the code for yours would be ja-romaji. It seems you have to register the code, though, if it doesn't already exist. I'm curious what others think.
Posted by Eby on 9 February 2003