Wednesday 05 February 2003

Can I CITE you on that?

I didn’t realize, when I asked in my previous post about how to correctly cite Japanese words written in romaji, that I was setting myself up for a thorough dousing in the complexities of markup and the semantic Web.

It has been my custom to wrap Japanese words such as nejimakidori in an <i> tag with xml:lang="ja" and lang="ja" attributes. I noted Sean Conner’s use of a <span> tag with lang="ja" plus a title="wind-up bird" to provide the meaning. That, however, prompted various browsers to ask to install Japanese language support, so he fudged it with lang="x-ja".
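
As a rough sketch, the two approaches look something like this. The sample sentence is invented for illustration, and Sean's markup is reconstructed from the description above rather than copied from his page.

```html
<!-- Italics plus both language attributes (XHTML served as HTML) -->
<p>The <i xml:lang="ja" lang="ja">nejimakidori</i> sang in the garden.</p>

<!-- A span with a language attribute and a title that gives the meaning -->
<p>The <span lang="ja" title="wind-up bird">nejimakidori</span> sang in the garden.</p>
```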

Bertilo pointed out that writing lang="x-jp" is no solution, since "x-jp" does not mean “Japanese”, and suggested using only xml:lang="ja" (if your pages are XHTML). Joe Clark said we should be using <cite lang="ja">. Adam Rice disagreed, saying that’s not a canonical use of <cite>. Language Hat asked: “What’s wrong with italics?” Spoutnik suggested using <dfn>. Kris agreed with Joe on the use of <cite> and also emailed me, suggesting that it might be worth trying UTF-8 character encoding instead of ISO-8859-1.

In the unlikely event that I’m ever asked to join a Web standards committee, I already have my answer formulated. You can easily guess what it is.

The entry for <cite> in the O’Reilly HTML Reference included with Dreamweaver MX reads:

The CITE element is one of a large group of elements that the HTML 4.0 recommendation calls phrase elements. Such elements assign structural meaning to a designated portion of the document. A CITE element is one that contains a citation or reference to some other source material. This is not an active link but simply notation indicating what the element content is. Search engines and other HTML document parsers may use this information for other purposes (assembling a bibliography of a document, for example).

Browsers have free rein to determine how (or whether) to distinguish CITE element content from the rest of the BODY element. Both Navigator and Internet Explorer elect to italicize the text. This can be overridden with a style sheet as you see fit.

<P>Trouthe is the hyest thing that man may kepe.<BR>
(Chaucer, <CITE>The Franklin's Tale</CITE>)</P>

It seems to me that this bears out Adam Rice’s assertion that <cite> is not the appropriate element to use with foreign words, particularly since I use Japanese words for quite a different reason than that suggested by both the O’Reilly and Adam’s examples—not to quote from a book or a speech but because there is no accurate English equivalent for the Japanese word or because I wish to include fragments of Japanese as a stylistic device.

For now I’m sticking to my original method but I’d love to get both these issues sorted out:

  • Which element should I use to quote Japanese words in romaji?
  • How can I stop the browsers asking to install Japanese language support for such words?

Further suggestions enthusiastically welcomed but, please, no invitations to join either a CITE or romaji subcommittee of any W3C standards body.


"Which element should I use to quote Japanese words in romaji?"

[span lang="ja" xml:lang="ja"]...[/span]

and css:
span[lang="ja"] { font-style: italic; }
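
Put together in one page, that might look like the sketch below. It selects on the lang attribute, assumes a browser that supports CSS2 attribute selectors, and uses an invented sample sentence.

```html
<style type="text/css">
  /* Italicize any span whose lang attribute is exactly "ja" */
  span[lang="ja"] { font-style: italic; }
</style>

<p>The <span lang="ja" xml:lang="ja">nejimakidori</span> sang in the garden.</p>
```

Where it is supported, the CSS2 selector span:lang(ja) would be an alternative to the attribute selector.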

"How can I stop the browsers asking to install Japanese language support for such words?" (4K)

My MSIE5.1/Mac OS 9, with Japanese fonts installed, does a strange thing. It appears to display the text in a Japanese font's roman characters, but it keeps doing so until it encounters a new opening tag, not just between the opening and closing tags.

Posted by Kris on 6 February 2003 (Comment Permalink)

I have an idea: let's all spend the next 10 years arguing about semantics.

Posted by Mark on 6 February 2003 (Comment Permalink)

Does the use of a transliteration qualify as a use of the word itself in the original language? If not, the language attribute should not be used at all, and another form of markup should be chosen.

I'm asking because a screen reader, for example, would be expecting Japanese characters in that context, but would find Western characters instead. How would a screen reader act in this situation?

Posted by Ronaldo on 6 February 2003 (Comment Permalink)

"I have an idea: let's all spend the next 10 years arguing about semantics."

Sorry I stepped on your CITE element :P

Posted by Kris on 6 February 2003 (Comment Permalink)

OK, Kris, I'll use the SPAN tag from now on. And that MSIE5.1/Mac OS 9 behavior is certainly weird.

"Does the use of a transliteration qualify as a use of the word itself in the original language? If not, the language attribute should not be used at all, and another form of markup should be chosen."

That makes sense, Ronaldo. I could mark up the first case with LANG and XML:LANG attributes and the second with italics. But I'm not sure how it helps solve the problem of the browser asking to install Japanese language support.

Mark, in the world of semantics and markup, I think I'm entitled to say that I'm totally your creation.

Posted by Jonathon on 6 February 2003 (Comment Permalink)

Ronaldo asks an interesting question. It comes down to this: at what point does a foreign word get absorbed into a language? If it's still clearly a foreign word, then on a philosophical level, it doesn't really matter how it's represented on the page: it's still foreign. Language is not orthography: language is an oral phenomenon. Transliteration is just remapping the orthography, it is not changing the language. The following sentence is in Japanese:

mamonaku, densha ga mairimasu. abunai desu kara, hakusen no uchigawa he o-sagari kudasai.

We still treat a lot of French words as loan words in English, denoted by italics, even though they've been used in English for a very long time. It's not out of place to see "hors d'oeuvre" in italics in English.

But think of how silly "sushi" looks in italics. That's been a part of English for less time, but has been enthusiastically adopted. Other pop-culture references from Japan are less so: anime and manga probably still take italics. Karaoke, maybe not.

Note that, in going the other direction, Japanese absorbs a huge amount of English and repurposes it in completely novel ways, sometimes actually creating new English-like words or phrases. Nobody unacquainted with Japanese culture would guess what a "pink salon" is. In this case, "pink salon" is Japanese, even though it looks suspiciously like English.

In short, I don't have a good answer for Ronaldo.

Posted by Adam Rice on 6 February 2003 (Comment Permalink)

Hello from Brussel,
To the following question

"How can I stop the browsers asking to install Japanese language support for such words?"

I will give this simple answer:
"Don't declare the Japanese language in IE/WIN"

Because support for the Japanese language is incomplete in IE/WIN, the browser tries to "auto-select" or "AutoComplete" it when it sees the language declaration.

The Charset recognition :

And some more about :

-Character Autocomplete:
for East Asian characters, the recogniser's ability to suggest recognition based on incomplete characters.
-When AutoComplete is enabled, suggestions are provided for the value of ....
-GetDefaultRecognizer: method of the Recognizers collection called to retrieve the default recognizer for the Japanese language.

Too much for me (don't really understand that), just have a look at this (Character AutoComplete) for more info:

Loading the Default Japanese Recognizer (extract from MSDN)
The GetDefaultRecognizer method of the Recognizers collection is called to retrieve the default recognizer for the Japanese language. Next, the IInkRecognizer object's Languages property is checked to determine if the recognizer supports the Japanese language. If it does, then the recognizer's CreateRecognizerContext method is used to generate a recognizer context for the form.

Some useful links to :
The GetDefaultRecognizer method :
The Recognizers collection :

I hope this helps you in some way.
My English still needs improving.

Posted by spoutnik on 7 February 2003 (Comment Permalink)

With respect to transliteration, I believe it qualifies as a proper use of the word in the original language, even if the word has been absorbed into the host language. After all, it's just another representation of the word. But I think we have hit a limitation[1] in the HTML/XHTML support for language identification. The specification lacks a mechanism to indicate such transformations. A new attribute, or an extension of the lang and xml:lang attributes, would be required to avoid breaking document semantics and keep the language support for screen readers and applications that rely on language information.

So, my answer to your question about the correct way of marking up secondary language sections in documents is: since no attribute for marking transliteration exists, use whatever tag is applicable in the context and keep the xml:lang value. Also, the HTML specification seems to suggest[2] that the <span> tag is one of the proper ways to indicate such sections, especially in inline contexts.

To your other question I don't have an answer :) It's unlikely that browsers displaying the behavior observed implement any kind of document-level instructions to prevent the display of such dialogs. I'm not bothered about it if the site provides good content.


Posted by Ronaldo on 7 February 2003 (Comment Permalink)

I am not sure this answers anything at all. It probably just adds more questions, but are the ruby tag sets supposed to be used for words that reference other words? So if you have Japanese text that refers to English text, apparently the ruby set should be used. My minimal experimentation with these tags, however, was less than gratifying, since they are block elements. I would rather there was an inline version.

Posted by David on 7 February 2003 (Comment Permalink)

I have not played with the Ruby tags, but now I'm curious.

The Ruby tag is a clear descendant of the Japanese typographic technique "furigana" (literally, flapping writing or waving writing), where tiny characters appear above other characters (or for vertical text, to the right). Furigana are also sometimes known as ruby. Furigana are mostly used to give phonetic readings to obscure kanji that the average reader might not know how to pronounce. Occasionally they're used in more cutesy ways.

Why "ruby", you ask? It goes back to archaic terms for type sizes. Back when "pica" meant 12 pt type, "ruby" meant 5.5 pt type.
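
For the curious, here is a minimal sketch of ruby markup for the furigana technique described above, along the lines of the W3C Ruby Annotation specification. The word chosen (kanji, with its hiragana reading) is only an illustration, and the rp elements supply fallback parentheses for browsers that don't render ruby.

```html
<!-- Base text goes in rb, the reading in rt; rp holds fallback parentheses -->
<ruby>
  <rb>漢字</rb>
  <rp>(</rp><rt>かんじ</rt><rp>)</rp>
</ruby>
```

A browser without ruby support should simply show the base text followed by the reading in parentheses.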

Posted by Adam Rice on 8 February 2003 (Comment Permalink)

I'm now not even sure that using the lang or xml:lang attributes is appropriate for this purpose and I doubt that the Ruby tags are applicable to this problem either. I think I'll write another post on this.

Posted by Jonathon on 8 February 2003 (Comment Permalink)

I took this matter to the high court (c.i.w.a.) The thread starts here:

But no solution to the actual problem emerged :(

Posted by Bertilo on 8 February 2003 (Comment Permalink)

Heh. If you think this is fun, just wait for the Semantic Web.

"No, no, you can't mark a veterinarian with the <doctor> tag -- he'll have heart patients dying in his waiting room! Use the <services> tag with the attributes [field="veterinary medicine", qualification="DVM", specialty="canus canus, felinus domesticus, equus equus, etc."]."

Posted by Prentiss Riddle on 9 February 2003 (Comment Permalink)

Geez, with all that markup, how are any of us going to get anything written in our blogs?

Posted by Adam Rice on 9 February 2003 (Comment Permalink)

[Removed (spam)]

Posted by nony on 16 March 2003 (Comment Permalink)

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour