i18n: HTML Character set issues beyond HTML3.2

"i18n"? - the word "internationalisation" (or "internationalization" as some prefer to spell it) starts with the letter "i", ends with the letter "n" and has eighteen (in numbers: 18) letters in between. In the interests of brevity and to avoid arguments about "s" or "z", the term "i18n" (i-eighteen-n) has come into widespread use.

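The counting rule just described can be written down as a tiny sketch. The function name `numeronym` is my own label for illustration and is not taken from any standard.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 3:
        # Nothing sensible to abbreviate; return the word unchanged.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


# Both spellings have 20 letters, hence 18 letters in between.
for spelling in ("internationalisation", "internationalization"):
    print(spelling, "->", numeronym(spelling))
```

Because only the first letter, the last letter and the length matter, the "s" and "z" spellings collapse to the same abbreviation.
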
Road map to this area

Quick start to i18n
Notes on internationali[sz]ation
Supplementary articles:
- Checklist for HTML character encoding
- Browsers and fonts
- Text Direction
- Some selected techniques in relation to Unicode
- The charset playground
- FONT FACE techniques discussed and criticised
- The Netscape charset Burp!
- Some notes on Baltic codings
Nearby are the rough-and-ready Unicode test tables which I generated programmatically.

There are also some tutorial-ish notes on FORM submission and i18n.

There is also an MSIE-specific technique for offering embedded fonts, which can be applied without harming web-compatible browsers.

Details of resources here and elsewhere

Using HTML in a single non-Latin-1 locale had been working for a considerable time already, and you can find appropriate resources on the WWW that cover one or other of those locales. In various places I am including pointers to some of the resources that happen to be known to me, but here I am concentrating on the use of an extended character repertoire: using, at the very least, one non-Latin-1 repertoire together with the full repertoire of Latin-1.

Formal stuff

At the W3C there's a really solid document being developed as a working draft: "Character Model for the World Wide Web". Although heavy-going in parts for a beginner, it contains an excellent presentation of the terminology and character representation model as used in HTML, and should be on the reading-list of anyone interested in this area. In my experience, many discussions of i18n are blighted by participants who do not understand the terminology, and who persist in applying mistaken concepts in the WWW context, concepts learned perhaps from old word-processing experience or from some earlier and less capable character representation model.
The basis for the HTML i18n character model was originally set out by RFC2070: Internationalization of the Hypertext Markup Language.
The authoritative source of information about Internet usage of character codings is at IANA. The terminology and usage of character representation have developed quite a bit since the MIME specifications were originally laid down, and this causes quite a lot of confusion: the attribute which the MIME specifications call charset would nowadays more properly be called "character encoding", and not (as might be assumed) "character set". There is more discussion of this issue on other pages here such as the quick-start and the notes on i18n.
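
To make the "charset really means character encoding" point concrete, here is a minimal sketch in Python (my choice of language, with an example string of my own). The same four characters are turned into different byte sequences depending on the charset label, and reading the bytes back under the wrong label produces the familiar garbage.

```python
text = "café"  # every character here is in the Latin-1 repertoire

# The same characters, serialised under two different "charset" labels.
for charset in ("iso-8859-1", "utf-8"):
    data = text.encode(charset)
    print(f"{charset:12} {len(data)} bytes: {data.hex(' ')}")

# Interpreting utf-8 bytes as if they were iso-8859-1 gives mojibake.
misread = text.encode("utf-8").decode("iso-8859-1")
print(misread)
```

The characters are the same in both cases; only the mapping from characters to bytes differs, which is why "encoding" is the better word.
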

Materials offered here

Quick Start to i18n - comments are still very welcome.
also available in Greek translation (iso-8859-7) by Panos Stokas
or in windows-1253 coding or via language preference negotiation.
(I don't understand Greek myself, so please contact Panos about the translation.)
Notes on Internationalization of HTML (again, comments are welcome).
The Netscape charset Burp!
Old Report on browser tests of 8-bit character codes beyond iso-8859-1. (These tests are old now, and each test only contains one non-Latin-1 repertoire and coding, in conjunction with Latin-1 represented as entities etc.)

Background and supplementary materials

A brief mention here of iso-8859-15, a worked-over version of Latin-1 that is officially called Latin-9 but had been nicknamed Latin-0. One of its proponents, Alain LaBonté, who enjoys word games, mentioned that it can be pronounced "Latin Zeuro", but he preferred "Latin 9, c'est un latin tout neuf". Misha Wolf's comment on it was:

We have just the place for ISO 8859-15 here in London. It is called the Science Museum and is full of charming historical relics, like Babbage's difference engine, used by Ada Lovelace (I think that was her family name).

What a relief that we now have Unicode and won't have to implement this amusing piece of history.

Browser support was originally better for utf-8 than it was for iso-8859-15, and I think it's fair to say that there is no point in using iso-8859-15 for encoding HTML documents, although it has found fairly wide user acceptance, in the European area, for use in Usenet (8-bit plain text) postings.
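
For readers curious how far iso-8859-15 actually departs from Latin-1, the following rough sketch compares the two codings byte by byte. It relies on the codec tables that ship with Python, which I am assuming reflect the published code charts faithfully; the helper name `differing_bytes` is hypothetical.

```python
def differing_bytes(first: str, second: str) -> dict:
    """Map each byte value to the pair of characters it denotes, where the two codings disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        a, b = raw.decode(first), raw.decode(second)
        if a != b:
            diffs[value] = (a, b)
    return diffs


diffs = differing_bytes("iso-8859-1", "iso-8859-15")
for value, (latin1, latin9) in sorted(diffs.items()):
    print(f"0x{value:02X}: {latin1!r} in iso-8859-1, {latin9!r} in iso-8859-15")
```

Only a handful of positions change, the best known being 0xA4, where the general currency sign gives way to the euro sign.
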

For proper support of Celtic languages such as Welsh, a different 8-bit coding would be needed. Refer to Michael Everson's page on Celtic fonts.

Markus Kuhn offers a UTF-8 and Unicode FAQ for Unix/Linux which includes an explanation of the utf-8 transformation format. An official place to read about it is RFC2279, available from your usual source of RFCs or a copy at faqs.org.
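
As a small illustration of the transformation format itself, here is a sketch of a hand-written encoder. It is an illustration only, under two stated assumptions: it stops at U+10FFFF and four bytes (RFC2279 itself describes longer sequences), and it does not reject surrogate code points. The name `utf8_encode` is mine.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as utf-8 (simplified: no surrogate check)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range for this sketch")
    if code_point < 0x80:
        # 7 bits: a single byte, identical to US-ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

The lead byte announces how many bytes the sequence has, and every continuation byte starts with the bits 10, which is what lets a decoder resynchronise in the middle of a stream.
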

The W3C's own i18n wizards presented a Tutorial: Weaving the multilingual Web at the 1999 International Unicode Conference.

There's an excellent and comprehensive article on character sets and MIME by Jukka Korpela, which covers the present topic and more. Also a page that focusses on the HTML issues.

Jukka Korpela called attention to a fascinating resource page which accesses a Database on letters, languages, character sets etc.

Alan Wood's Unicode Resources - Multilingual support on the Web are very good.

A.Prilop has a Multilingual Macintosh Resources page.

Some notes on Baltic codings, which summarise a discussion with A.Prilop and others in Sept. 2000.

I made some rough-and-ready Unicode test tables based on the Unicode database.

FONT FACE techniques discussed - and criticised.

I created code mapping tables to enable an earlier version of the rtftohtml program (which later became LOGICTRAN) to generate HTML4 by using &#bignumber; representations. The author incorporated this work into rtftohtml in version 3.8 and later. The original rtftohtml extras were made available here and include a test table for this part of the repertoire.
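
The &#bignumber; idea can be sketched in a few lines of Python. This is not the rtftohtml mapping tables, merely an illustration of the representation; the function name `to_numeric_references` is hypothetical, and the sketch deliberately leaves `&`, `<` and `>` alone, so it is not a complete HTML escaper.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#number; reference."""
    # The standard "xmlcharrefreplace" error handler does the substitution.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_references("price: 10 €"))
print(to_numeric_references("Ω"))
```

The result is pure US-ASCII, so it survives whatever 8-bit coding the document is eventually labelled with.
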

My original iso-8859 materials are there.

Nir Dagan's notes on Hebrew.

MSIE can unilaterally decide that your page is in utf-7 if you don't specify charset explicitly. [cited page is in German]
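
One modest safeguard is to check that a Content-Type value really does carry an explicit charset parameter. The sketch below uses Python's standard email package to parse the header value; the function name `declared_charset` and the sample header strings are my own assumptions for illustration.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value (lower-cased), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


for header in ("text/html; charset=UTF-8", "text/html"):
    charset = declared_charset(header)
    print(f"{header!r}: {charset or 'no charset specified, the browser will guess'}")
```
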

Perl

Perl, Unicode and i18N FAQ looks to be a treasure house of information and links.

FRAMEs warning

It has been observed that some versions/platforms of the popular browsers will misbehave if different charset codes are used in different frames and/or framesets of a given page. I haven't investigated this in detail myself, but I thought I should mention it.

I make no secret of my dislike for frames; but the problem is there, whether you like them or not.