I18n Quickstart

Intro

Understanding character code internationalisation (i18n) in HTML is really quite straightforward, once the principles and terminology are clear. But frequent practical experience of misunderstandings leads me to sound a warning. The problem, in many cases, is that people come to it from other authoring situations (e.g wordprocessors) with preconceptions that make it hard to grasp how things are meant to work in HTML. Often they feel they have a confident grasp of principles, if only they could get a straight answer to some question that is troubling them: then they discover that the answer they get doesn't seem to make sense, and they conclude that the respondent is being deliberately obtuse. What it really means is that they need to review their preconceptions, sorry.

But if you just want a plain "do this" recipe without the explanations, you could skip to the "Conservative recommendation".

(This document itself contains only Latin-1 characters and is designed to be readable on any browser).

Scope and history

This page was composed, early in the life of HTML4, with the intention of explaining the principles of internationalisation of HTML ("i18n") as they relate to the data that is passed over the network. There is also some practical discussion of browser support. We don't bother with the simpler case of documents that use only one non-Latin 8-bit repertoire: see the explanation on the covering page.

There has been some revision since, to take account of developments, but you may still find comments which relate to the late 1990s and are not really relevant to the situation at the time of writing of this comment (2003).

We concentrate on the transmission that is sent over the network from the server (httpd) to the client (browser). That means that we say little about the practical issues of actually authoring such pages and getting them from the keyboard to the server. It also means that we don't deal here with the quite different issues of forms submission from the client to the server, which is covered separately.

Review

The reader is invited to review these questions, to see how their own answers accord with the ones given here. No need to understand specialist jargon at this point; these are practical questions about what a browser should display in response to certain HTML constructs, according to HTML4/RFC2070 specifications (but the answers make better sense if you have a mental model in harmony with the basis of this HTML specification).

Be warned that you might not get the correct results with the browser that you are using, for various reasons. Even if the browser is a version that's technically capable of supporting the relevant WWW specifications, you might also need some additional fonts, operating system configuration etc. before it can do the job that it's designed to do. These issues can be discussed later: here we talk about what the specifications require.

Questions:

The first section is based on the koi8-r character code. I had specific reasons[1] for this choice, if you're interested.

koi8-r questions:

In an HTML4 document that has been sent with charset=koi8-r, what is the meaning of the following, in HTML terms?

Q: an 8-bit character with the value 241 decimal?

A: It's the character at position 241 in the koi8-r code, that is, the YA character (looks like an R backwards).

Q: the HTML character reference &#241; ?

A: it's the character at position 241 in the HTML4 document character set, which is always iso-10646/unicode, so it's ñ (n-tilde). (Character codes 0-255 are the same in iso-10646 as they are in iso-8859-1, the traditional HTML2.0 character code).

Q: the HTML entity &ntilde; ?

A: obviously, that's n-tilde too (yes!).

Q: an 8-bit character with the value 153 decimal?

A: it's the character at position 153 in the koi8-r code table, i.e the "greater-than or equal to" sign.

Q: what about &#153; ?

A: in SGML terms, that is undefined. It appears to be a numerical character reference, but it refers to a position in the document character set which the SGML Declaration for HTML says is UNUSED (this region, 128-159 decimal, of the document character set is reserved for control functions, and not used by HTML).

Be aware that this construct may not be rejected by a formal SGML validator, since technically it is "undefined" rather than "illegal". The details are subtle, but, whatever SGML may say, the use of such constructs is improper in HTML: you should not use this construct - not even if it appears to produce some effect that you wanted, on the browser that you use. (There has been discussion about non-SGML Char Refs, on the W3C Validator mailing list around July 2001: such a construct is considered invalid in XHTML, and is now being flagged by this validator also in HTML documents.)

Q: the HTML character reference &#961; ?

A: reference to the Unicode charts, or perhaps more conveniently to the HTML4 specification, shows that this is a Greek "rho". Yes, according to HTML4 rules you can use unicode numeric character references, even when you are only using an 8-bit "charset" (character coding). In fact, this is precisely the time when you do need numeric character references. (Pity about Netscape 4.* versions!)
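
To pin these answers down, here is a small demonstration using Python's standard codec machinery. The codecs and mappings are real; the framing of the example is mine, and is of course no part of any HTML specification.

  import unicodedata

  # Octet 241, interpreted per the koi8-r charset, is Cyrillic YA:
  print(ascii(bytes([241]).decode('koi8-r')))  # '\u042f' CYRILLIC CAPITAL LETTER YA
  # ...whereas the reference &#241; indexes the document character set:
  print(ascii(chr(241)))                       # '\xf1' LATIN SMALL LETTER N WITH TILDE
  # Octet 153 per koi8-r is a printable character:
  print(ascii(bytes([153]).decode('koi8-r')))  # '\u2265' GREATER-THAN OR EQUAL TO
  # ...but position 153 of the document character set is a control
  # function, which is why the reference &#153; is not proper HTML:
  print(unicodedata.category(chr(153)))        # 'Cc', i.e a control code
  # And &#961; is position 961 of unicode, whatever the charset:
  print(ascii(chr(961)))                       # '\u03c1' GREEK SMALL LETTER RHO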

iso-8859-7 (Greek)

Now we can ask a similar series of questions but with the charset being iso-8859-7 (Greek) instead:

Q: An 8-bit character with the value 241 decimal?

A: this is the character at position 241 in the iso-8859-7 code, which is "rho".

Q: &#241; ?

A: This is still the character at position 241 in the HTML document character set, that is to say, ñ (n-tilde).

Q: an 8-bit character with the value 153 decimal?

A: In the iso-8859-7 (Greek) code, as in the other iso-8859-n codes, the range 128-159 decimal is reserved for control functions, not used for printable characters. This 8-bit character code is illegal in an HTML document with this charset. (And would be rejected by a formal validator.)

Q: &#153;

A: As before, this notation should not be used: it is technically "undefined" in HTML, and illegal in XHTML. That is true irrespective of the "charset".

Q: &#961;

A: this is the correct numerical character reference for the Greek "rho", as already noted.
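
The Greek series makes the same point from another angle: one and the same octet changes meaning with the advertised charset, while a &#number; reference stays pinned to the document character set. A brief sketch, again with Python's standard codecs (the framing is mine):

  octet = bytes([241])
  print(ascii(octet.decode('iso-8859-7')))  # '\u03c1' GREEK SMALL LETTER RHO
  print(ascii(octet.decode('koi8-r')))      # '\u042f' CYRILLIC CAPITAL LETTER YA
  print(ascii(chr(241)))                    # '\xf1' n-tilde, under either charset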

Western European "Windows"

Now we consider a similar series of questions in relation to the Western European "MS-Windows" character set, windows-1252. This is a special cause of confusion: all of the displayable character code values of iso-8859-1 coincide with the same codes in this Windows code - but additionally, the Windows code assigns displayable characters in the area which the iso-8859-n codes reserve for control functions. In unicode, those characters have code values above 255 (the respective values are also noted in the HTML4 specification).

Q: an 8-bit character with the value 153 decimal?

A: in the Windows coding, this is the "trademark" character (unicode 8482 decimal). Technically speaking, an HTML document is well-defined if it contains this 8-bit character, provided it is transmitted with a correct charset value. However, there is no requirement on HTML client agents to support every possible coding, and the ISO codings are generally preferred over a particular proprietary coding, even Microsoft's.

Q: &#153;

A: As before, this notation should not be used. The correct representation of the trademark sign as a numeric character reference in HTML4 is &#8482;, irrespective of the "charset" setting.
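
Again in miniature, with Python's real codecs (cp1252 is Python's name for windows-1252; the framing is mine). Note how the octet that was illegal under iso-8859-7 becomes meaningful here, and how the portable numeric reference is derived:

  tm = bytes([153]).decode('cp1252')
  print(ascii(tm))   # '\u2122' TRADE MARK SIGN
  print(ord(tm))     # 8482, hence the portable notation &#8482;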

Unicode utf-8

We close this informal quiz by asking one similar question about an HTML document sent with charset=utf-8:

Q: an 8-bit character with the value 241 decimal?

A: that was a trick question. In utf-8 coding, only the octets (bytes) with values below 128 decimal have any meaning in isolation (they represent the corresponding 7-bit us-ascii characters). Octets with values 128 and above occur only as part of a sequence of two or more octets, with each complete sequence representing one unicode character.
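
A short Python illustration of that variable-length behaviour (standard library; the framing is mine):

  # A lone octet 241 is not a complete utf-8 sequence:
  try:
      bytes([241]).decode('utf-8')
  except UnicodeDecodeError as err:
      print(err)                     # decoding fails: the sequence is incomplete
  # Characters outside us-ascii occupy two or more octets:
  print('\xf1'.encode('utf-8'))      # b'\xc3\xb1' (n-tilde: two octets)
  print('\u042f'.encode('utf-8'))    # b'\xd0\xaf' (Cyrillic YA: two octets)
  print('\u2122'.encode('utf-8'))    # b'\xe2\x84\xa2' (trade mark sign: three octets)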

FONT?

But what about fonts? Suppose one were to enclose any of the preceding constructs in e.g FONT FACE=Symbol?

The answer, in HTML terms, is that the characters mean what the HTML specification says they mean, irrespective of their visual appearance on the page! If you think about the accessibility of HTML documents to robot indexers, speaking browsers, etc., this should be rather obvious. The purpose of the (now deprecated) FONT FACE, as well as of the font suggestions in CSS style sheets, is to select fonts with different cosmetic appearance, and not as a trick method for extending the character repertoire. We shall come back to this topic later: for now, please assume that you have some font that correctly renders all of the mentioned unicode characters.

Language? (HTML LANG attribute)

It's important to note that HTML maintains the distinction between "language" and writing system, as two separate concepts. For example, Greek or Japanese are still Greek or Japanese (language) even when they have been transcribed into Roman characters, and should be marked up as such in HTML; while English is still English (language) even when written in, say, Japanese transcription (i.e writing system). And the writing system is determined by which characters are used in the document, not by language attributes in the HTML markup.

[It's not ruled-out that a browser could use HTML language attributes to select a font for presentation, and some indeed do so: but the results are supposed to be cosmetic, rather than changing the content in any way. Authors concerned with Chinese, Japanese, Korean (CJK) authoring would want to read-up about the "Han unification" Unicode issue, which I've skipped in this presentation.]

Mental Models

I can't stress too strongly that the key to understanding a complex situation is to break it down into manageable parts, and understand each of the parts separately, before fitting them back together again across simple, well-specified interfaces. All too often, people have no clear separation of the concepts in their own minds, and then it's easy to become confused, or to propose techniques that may appear to give the desired visual result while being quite foreign to the HTML specifications.

So, I think it's important to look at these aspects. Skip this if you insist; but an inappropriate mental model is sure to cause confusion later on.

Client/Server

The WWW is based on a client-server model: the interworking specifications concentrate on what the data means when passed between the server and the client. The details of what happens internally at the client, and at the server, are no business of the interworking specification: all that matters is that the sender can put the correct flow of data onto the network, and the recipient can correctly interpret the meaning of the data that is received. The term "correctly" means "in accordance with the open, published, interworking specifications" (it doesn't necessarily mean "what the author intended", since there are certainly some authors whose intentions are incompatible with the specifications!). The interworking specification does not need to concern itself with the fact that Macs have their own proprietary code for internal working, nor indeed that IBM mainframes work in EBCDIC: the datastream that's passed across the network is to be interpreted in accordance with the interworking specifications, regardless of these local details.

So, we need to concentrate on the meaning of the data (streams of bytes - "octet stream" in the jargon) that the server sends over the network to the client, and on the meaning of the HTML which those data represent, and we can - and should - distance ourselves from the details of just how this or that client agent might store the data, parse the data, and render it onto a display. And remember that some "clients" (e.g indexing robots) do not interpret the information visually, and have no idea what a particular font looks like: they interpret only the meaning of the data, as laid down in the relevant specifications.

It can be instructive to think in terms of an abstract client, which stores data in unicode, and which is capable of rendering any requested character in any requested font. Real-world clients may behave quite differently: they might store data in some different fashion, and their rendering might be organised differently. But so long as the client software behaves to the outside observer "as if" it was working like the abstract client, then its internal details are of no concern ("black box").

Layer-cake model

A key technique for understanding network applications is the so-called "layer-cake model".

Regarded as a layered protocol, the delivery of an HTML document via HTTP is fairly simple: HTTP delivers an octet stream along with a MIME content-type which specifies its coding ("charset"). What happens underneath that (in terms of Internet TCP/IP protocols, Ethernet accesses and so forth) is at lower layers of the cake, which can remain hidden for our purposes.

At the lowest level which we need to discuss here, we apply the MIME content-type information (specifically, the "charset" that indicates how to interpret the octet stream as coded characters) in order to derive from the octet stream a stream of characters. Although in the simplest cases (e.g iso-8859-n codes) each octet corresponds to one coded character, this is by no means always so. In some codes (e.g iso-2022-based codes) certain octet sequences do not represent a coded character at all, but shift from one character "page" to another, in order to extend the repertoire for subsequent characters. In ucs-2, by contrast, every character is represented by a pair of octets, while in utf-8 characters are represented by variable numbers of octets, from one to (in principle) six, according to an algorithm.

It is therefore convenient to think in terms of our "abstract client" which converts every coded byte stream, no matter which charset it was sent with, into unicode characters for storage and further processing.

At the next higher level, the resulting character stream needs to be parsed in HTML terms. While doing that, we encounter &entity; representations and &#number; character references, which need to be interpreted. As has been stressed repeatedly, these references (according to HTML specifications) always relate to unicode (i.e the HTML "Document Character Set"), and not to the transmitted character coding ("charset"). Indeed, in our "abstract client" the details of the transmission coding were removed at the input stage when we converted the incoming octet stream into unicode for storage, and that issue can then be entirely forgotten.

(And, of course, the converse considerations apply when a client is expected to send some character data back to the server, e.g as the result of submitting a form. But that goes beyond the scope of this present article.)

Well, on the "black box principle" the client's inside workings may be different, but, for standards conformance, their behaviour must be externally the same as that of the "abstract client".

Finally, having interpreted the character stream as HTML, the browser needs to render it onto a display. For this purpose it must make use of appropriate font(s) - this is a local issue, and the details depend on the client platform's font management system, whatever it may be (don't confuse this with the selection of a font with appropriate cosmetics, in accordance with suggestions in CSS or -legacy- HTML3.2).

Terminology

The terminology of character representation and encoding nowadays is about as straightforward as it can be, given the nature of the problem; but it can be rather confusing, especially for those who learned an over-simplified model in earlier times - and now need to un-learn some of that.

Valuable reading is available at the W3C: Character Model for the World Wide Web, and at the Unicode Consortium: [PDF] Chapter 2 of the Unicode Standard.

Anyone developing an interest in this field now ought to make themselves familiar with the Unicode design principles and with the meaning of the terms "Character", "Code Point", "Character Set", "Encoded Character", "Character Encoding Form" and "Character Encoding Scheme" (see the just-cited Chapter 2 for the Unicode Consortium's presentation).

In particular, it needs to be clearly understood that what the earlier MIME specification calls the charset= attribute (a terminology which was not unreasonable at the time it was defined) refers not to a "character set" as the name might suggest, but to what current Unicode terminology calls the Character Encoding Scheme.
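
The distinction is easy to demonstrate: one and the same encoded character is serialised as quite different octets under different encoding schemes. A Python sketch (real codecs; the framing is mine):

  rho = chr(961)                   # GREEK SMALL LETTER RHO, one encoded character
  print(rho.encode('iso-8859-7'))  # b'\xf1'     : one octet under this scheme
  print(rho.encode('utf-8'))       # b'\xcf\x81' : two octets under that one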

Fonts decoupled from character coding

In earlier implementations, fonts were a simple mapping of the 256 available 8-bit codes to displayable characters. Consequently, for each character coding you'd want a different font: for the hypothetical font style "Foo" you'd need, say, Foo Latin 1 for iso-8859-1, Foo Greek for iso-8859-7, Foo Russian for koi8-r, and so forth. Each new coding that you wanted to support, you acquired a new font and the job was done, or so it seemed. And vast arrays of such fonts can be acquired from download sites, each differing from another sometimes in just a few places, where the normal characters have been swapped out for exotic characters that were needed for some specialised purpose.

However, to get genuine internationalisation, this will not do. The client platform needs one of two things: a much larger font, or the ability for a browser to access several (cosmetically compatible) fonts in order to render an international document. Remember, in HTML terms the whole document is in the one character coding (that misleadingly-named MIME charset parameter): you cannot switch between codings in mid-document.

Word processing, by contrast, typically uses a quite different model - but one which does not lend itself well to cross-platform document transfers, something which is absolutely essential to the success of the WWW. Rather than the author "selecting a font" in the traditional sense (e.g switching from Foo Latin 1 to Foo Greek when some Greek characters are wanted), and passing that information to the client in order to get the desired exotic character displayed, HTML4 works in terms of passing the unicode character reference for the desired character, and leaving it to the client to work out how, in terms of the fonts etc. at its disposal, to render that exotic character on the display.

Summary: what originally seemed to be a simple action (character code in, font position out) has now become, in the abstract process model, three separate steps, each simple in itself, but decoupling the separate issues:

  1. The incoming octet-stream is decoded according to the "charset" attribute (=character coding!) into a stream of unicode characters,
  2. The stream of unicode characters is parsed as HTML,
  3. The result is rendered onto the display by making use of appropriate font resources.

That is the abstract model, which all correct implementations have to mimic, no matter how they are implemented internally.
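
Those steps can be mimicked in a few lines. The following toy "abstract client" covers steps 1 and 2 for the constructs discussed above; it uses only Python's standard library, but the function names and the sample input are my own illustrative choices, not anything laid down by a specification:

  import re
  from html.entities import html5

  def to_characters(octets, charset):
      # Step 1: the MIME charset (from e.g "Content-Type: text/html;
      # charset=koi8-r") turns the octet stream into unicode characters.
      return octets.decode(charset)

  def resolve_references(text):
      # Step 2: while parsing, &#number; and &entity; always index the
      # document character set (unicode), never the transmission charset.
      def sub(match):
          ref = match.group(1)
          if ref.startswith('#'):
              return chr(int(ref[1:]))
          return html5.get(ref + ';', match.group(0))
      return re.sub(r'&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);', sub, text)

  # A koi8-r octet stream mixing a native Cyrillic octet with &-notations:
  octets = bytes([241]) + b' ... &#241; ... &ntilde;'
  print(ascii(resolve_references(to_characters(octets, 'koi8-r'))))
  # '\u042f ... \xf1 ... \xf1' : YA from the charset, n-tilde (twice) from
  # the document character set. Step 3, rendering, is the client's local affair.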

Stop babbling and tell me how I can use this stuff

Yeah, OK, fair comment.

In theory, the various parts that have been described already are valid according to the HTML4 (or RFC2070) specification. Unfortunately, although later browser versions (Mozilla/ Netscape 6+, Opera 6+, MSIE4+, etc.) cover this rather well, there are still browsers in use (notoriously NN4.* versions, as well as browsers for some minority platforms) whose handling is incomplete or buggy; further, there are browsers which apparently make no attempt to cover this part of the HTML4 specification (e.g some TV web appliances). I can't resist an honourable mention for Lynx, which (if used in a utf-8-capable terminal emulation) is, in this regard, miles ahead of some browsers I could think of, whatever your opinion of it might be in other respects.

Even if you use the most conservative approaches to exploiting i18n, which I will describe next, you are still running the risk that some proportion of readers will not be able to read your pages, either because their browsers are not capable, or because they omitted to install the necessary optional components.

In 1998 I commented here how depressing it was that authors were still resorting to tricks with non-standard &#number; usage, or FONT FACE=Symbol, and getting the visual results they wanted on the mass-market browsers, instead of writing correct HTML4. But, as expected, browsers are progressively implementing the published specifications, and some of these tricks cease to produce the desired results. And be warned again that font tricks will not be indexed correctly by robots, and probably won't be handled correctly by screen readers and speaking browsers: in other words, they only give a visual impression of working, while being completely unsound at the more fundamental level.

A conservative recommendation

This recommendation is addressed specifically to authors in a Latin locale who wish to include moderate amounts of other scripts into their documents. This approach would surely be impractical in a non-Latin locale, e.g a Cyrillic document that wished to include small amounts of Greek, Hebrew, French etc. material. For a summary of these issues, and some practical solutions, see my checklist.

For documents that are chiefly in Latin letters, and that occasionally need a few non-Latin characters, or those mathematical signs that are covered in the available browsers, this is quite viable, although it needs a properly configured browser.

  1. Compose documents using only the characters of us-ascii (7-bit),
  2. Represent characters that are not in iso-8859-1 by using their &#bignumber; references ("bignumber" > 255),
  3. Represent "8-bit" iso-8859-1 characters by using their &entity; (preferred) or &#number; representations,
  4. Send out the document with its charset advertised as utf-8 (see discussion).

Explanation: Netscape browser versions 4.xx fail to render most of the unicode characters represented by &#bignumber; at most settings of the charset attribute: this is a bug in terms of the published HTML specifications. They can however work when the charset is set to utf-8. It's a very useful fact that us-ascii (7-bit) is a subset of utf-8. However, this forces you to represent your "8-bit" Latin-1 characters by means of &-constructions (unless you know how to generate correctly-coded utf-8 data streams - no great challenge if you have suitable software, but outside the scope of this note).
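
As it happens, Python's encoder can produce exactly this kind of output through its 'xmlcharrefreplace' error handler (a real facility, though it emits only the &#number; form; the preferred &entity; form for Latin-1 would need a lookup table, and the sample text here is merely illustrative):

  text = 'Caf' + chr(233) + ' ' + chr(1071) + chr(961)  # e-acute, Cyrillic YA, rho
  body = text.encode('us-ascii', errors='xmlcharrefreplace')
  print(body)  # b'Caf&#233; &#1071;&#961;' : pure 7-bit data, which may be
               # advertised as us-ascii, iso-8859-1 or (as recommended) utf-8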

The result can validly be advertised with any charset that includes us-ascii, such as us-ascii itself, iso-8859-1, or utf-8. It's a fact that by advertising it as utf-8, Netscape versions 4.xx are capable (provided that some other requirements are fulfilled) of displaying it correctly.

Whichever you do, the HTML itself is fine; however, it should be noted that HTML4 introduced a rule that there is no charset default, and CERT advisory CA-2000-02 describes some actual security reasons for not omitting an explicit charset specification. So on that basis, we'd now recommend specifying utf-8 rather than leaving it open.

Caveats: putting your document on the web in this form, including any Japanese, Cyrillic etc. characters that you use, and telling the reader that they need to use one of the browser versions that are technically capable of viewing it, is unfortunately not the whole story. Although recent versions of the "Big Two" browsers are capable (within the limitations discussed here) of displaying these characters, they can't do it without access to appropriate fonts, and appropriate browser configuration for them to use those fonts. So these techniques are still only appropriate for the kind of audience that could be expected to take the trouble to configure their browsers properly, beyond what they would need to do for viewing only Latin-character pages. You might want to include a little browser test (e.g a string of relevant characters alongside an image of the same string) somewhere in your pages, if correct rendering of specific characters is vital to your content. Further discussion is offered in a page about browsers and fonts.

A further caveat is that this approach only facilitates NN4.* versions displaying a wide character repertoire. User input (i.e forms) in utf-8 is broken in NN4.* versions, as explored in a separate page.

"Layers": an email correspondent tells me that NN4.* versions appear to be broken when utf-8 is used in conjunction with "layers", and that he "therefore was forced" not to use utf-8. Well, I'm sorry, but "layers" in the Netscape sense never made it into any open interworking specification, and I can't find a great deal of sympathy for trapping oneself into the use of proprietary extensions. But at least I'm mentioning it here, in case anyone else is affected.

Using one non-Latin-1 repertoire together with Latin-1

Authors who are already using one "native" writing system e.g Cyrillic, Greek etc. in an appropriate 8-bit coding are presumably already aware of browser/versions which cannot successfully render characters from other scripts at the same time. If their documents contain only their own writing system, then this isn't a problem, and there are surely millions of documents on the WWW already where this is the case and where no change is needed, as long as those documents conform with the HTML specification.

Note however that some older WWW documents relied on the fact that Latin &entity; notations, and/or &#number; references in the range 128-255, appeared to give the desired non-Latin characters (perhaps by choosing a particular font, or by configuring a character coding in the browser) on many older browser/versions that pre-dated the proper support of the i18n specifications; but such documents are not proper HTML: browser implementations are moving more and more to conformance with the specifications, and so will not display what the author (misguidedly) intended. If those defective documents were to be converted to 8-bit character coding, instead of misusing &-notations, they could be accessible to both kinds of browser.

The suggestions here are specifically of interest if you need to make documents that call for an extended repertoire. Such documents will only be accessible to recent browser versions, that have been properly set up.

I don't myself work in a non-Latin-1 environment, so what you're getting here is rather second-hand. It's pretty obvious that if you are in a Cyrillic, Greek, etc. locale (or Japanese, Chinese, Korean... with which I have little contact or expertise), you wouldn't be at all happy with writing &#bignumber; references for your own native language; you'd want to author with the normal facilities for your writing system.

In theory you can use one non-Latin-1 8-bit repertoire, such as Cyrillic, or such as Greek, along with Latin-1:

  1. 7-bit characters (those below 128 decimal) are us-ascii in all of the character codings under consideration,
  2. 8-bit characters represent the chosen repertoire: Cyrillic, Greek or whatever your "charset" calls for,
  3. &entity; representations should represent what they say,
  4. &#number; representations (160-255) should represent the corresponding range of the HTML document character set, i.e iso-10646, which in this range is identical to iso-8859-1.

Does this work? Yes and no. Charsets such as iso-8859-7, koi8-r etc. have been supported by browsers for much longer and more widely than unicode has, and these codes include us-ascii (specifically, non-accented Latin letters), so there's not much of a problem there, at least as far as rendering body text in HTML documents is concerned. (There are glitches in other areas, such as FORM submission, non-Latin characters in ALT attributes, TITLEs etc.)

It gets more interesting when we try to utilise additionally the "upper half" of the Latin-1 repertoire. What follows was based on some old observations of browser behaviour reported separately. But this is now several years old; modern WWW browsers don't suffer from these shortcomings, so the issues now only affect readers who are using either older browsers such as NN4.* versions, or limited-functionality browsers such as some TV web appliances not supporting this part of HTML4.

According to the HTML4 specification, while you are using a charset that represents a non-Latin repertoire you may include the accented Latin-1 characters etc. by means of their &entity; or &#number; representations; this was working in protocol-conforming browsers such as Alis Tango, and in Win MSIE in all but the earliest releases of IE3. So the best chance of success comes from using the &entity; form in this situation - but neither form works properly in Netscape, up to and including versions 4.xx, in general (Lynx got this sorted out a considerable time ago!).

The following approaches can be considered, depending on what you want to achieve.

  1. Following the procedure described in my conservative recommendation isn't likely to appeal much to authors who use non-Latin alphabets, as the use of &-notations for everything that isn't a 7-bit ASCII character is going to cause intolerable document bloat. At most, it could be interesting for authors in Latin locales, who might consider programmatically converting the accented letters etc. into &-notations.

  2. Beyond that, though, if you need a wide character repertoire, outside of the restrictions of the 8-bit coding appropriate to your locale, it looks as if the document that's actually sent to the 'net needs to be in utf-8 format if it's to be accessible to Netscape 4.* versions, until that thing finally dies out.

    It may well be that you would like to author your documents in the 8-bit charset that is appropriate to your locale, with the addition of &-notations for characters that are outside of that repertoire. Such a document is valid HTML (if sent with the appropriate charset attribute), even though it isn't supported by Netscape. So, because it is valid you can perfectly well process it with available software tools: a couple are mentioned on the techniques page, but maybe you already have access to appropriate character code conversion software.
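
    Such a conversion amounts to one decode and one encode. A minimal Python sketch (the filenames are hypothetical): the &-notations pass through untouched, being plain ascii, and remain valid under the new charset.

      with open('page-koi8r.html', 'rb') as src:
          text = src.read().decode('koi8-r')
      with open('page-utf8.html', 'wb') as dst:
          dst.write(text.encode('utf-8'))
      # Remember to adjust the advertised charset (HTTP header and/or META)
      # to utf-8 to match the recoded file.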

See note[3] on older browsers.

Other approaches

Let's be quite clear about this. Some users have asked me about the "conservative recommendation" above, demanding to know why I'm telling them that they have to code in us-ascii and use &-notations. No! This is just one possible approach, out of several; I'm recommending it only because of its simplicity and relative robustness in the face of various hazards, together with the breadth of browser coverage which it offers.

For those who feel able to produce and handle alternative character representations, my checklist presents several alternatives and discusses their applicability.

Response to some reader comments

"Win3.x and Win95 don't use Unicode, so I can't use these techniques. I must be limited to an 8-bit font"

Well, first I should mention that Win95 does have support for unicode (fonts etc.) for use by applications, even though the operating system does not "work in" unicode itself. But this isn't quite the point.

The Alis Tango browser (so old that it's now effectively obsolete) had been supporting RFC2070 under Windows 3.1 or later, on a 386 PC with 8MB memory, for quite some time. In any case, a properly implemented browser can achieve an extended repertoire by using several 8-bit fonts; it doesn't necessarily have to use a "unicode" font.

MSIE3 (except for the earliest releases), even in the 16-bit version, happily supports one of several 8-bit codings used in conjunction with Latin-1, at least. For example when I view the koi8-r test table, the browser uses the "ER Bukinist 1251" font as a source of Cyrillic characters, and some other font for the accented Latin letters etc.

When Win95 is installed, there is an option for using international fonts: you'd want to select this in setup. 32-bit MSIE4 under Win95 merrily displays characters such as &#952; (theta), &#8240; (permille) and so on, in a document that's sent in a non-Latin-1 coding (I tried koi8-r as an example).

IE5 goes further, in that it can install language[4] support selectively, and when it encounters a document with a writing system that you haven't installed yet, it warns you, and offers to download and install the component that you need.

The big problem, I'm afraid, is still Netscape 4.*, that can only be coaxed into a semblance of correct operation if it thinks the charset is a unicode coding such as utf-8.

So, there's no dispute that there are still some real practical problems in using this sort of thing on the WWW, except for a specialised audience who could be expected to have the motivation to get a suitable browser. But, it's a mistake to believe that the absence of unicode support in Win95 or in Win3.x is or ever was an insurmountable part of the problem.

How can I put [Cherokee, Old Gaelic, Hieroglyphics, Klingon[5]] on my web page?

As has been explained, HTML4 has chosen to represent characters by means of the Unicode standard. In short, if the characters you want exist in Unicode, then that is how you represent them; if they don't, then you can't represent them at all, in standard HTML. Even if you can represent them, there is no guarantee that your readers will have browsers that support that particular part of the unicode repertoire.

Sure, you will say "I have a font for this writing system, I want to use that". And indeed in older browsers this "worked" without problems: you fed your document with some Latin-1 characters, specified your special font, and the characters which you had intended, appeared on the display. It all seemed to be working "as desired"; but I'm afraid that it wasn't working "as designed". It's a fact that you can continue to play these games with fonts to some extent, although they aren't guaranteed to work, there may be problems cross-platform (which was precisely what the HTML4 approach was intended to avoid), and they will work less and less as browsers are developed to conform with the published specifications. This may seem harsh for someone who has a valid requirement for a specialised writing system that hasn't been adopted into Unicode, or that isn't supported by the available browsers. There's a separate page about FONT FACE that may help shed some light, and, if all else fails, mentions a possible diplomatic compromise.


Notes

1. koi8-r is an example of an actually-used 8-bit code which, unlike the iso-8859-n series, assigns printable characters in the range 128-159 decimal inclusive. Its upper half is quite unlike iso-8859-1, which makes it a richer illustrative choice than Windows-1252 (we cover that later). For those reasons, it makes a good example for our review. What's more, it contains some of the Latin-1 characters (copyright, superscript-2, no-break space etc.) at completely different code positions than in the iso-8859-x series, so there's every prospect of mayhem when browser designers haven't done their homework properly.

2. Learning how the utf-8 code works is straightforward enough, but rather fiddly. I don't propose to cover it here, as I'll be taking the view that if your authoring tools don't support it transparently for you, then you probably don't want to create it yourself. Of course, you can tackle it if you want to, don't let me discourage you. I just don't think it's helpful to get involved in it here, beyond noting this very useful feature that us-ascii (i.e the 7-bit code) is a subset of utf-8.

3. Using one non-Latin repertoire in conjunction with only us-ascii has been working in browsers for a considerable time already, since (on a dumb enough browser) it requires nothing more than a font suitable for the chosen 8-bit repertoire. Indeed, early browsers that did not even support the charset attribute could be used for reading Cyrillic or Greek or other 8-bit codings by the mere stratagem of configuring the browser to use an appropriate font (example: ER Bukinist KOI8-R for the koi8-r coding). However, these old browsers made no distinction between the transmission coding and the HTML document character set, and if asked to display e.g &ntilde; or &#241; (see the examples above) would display YA (koi8-r) or rho (iso-8859-7) etc. instead of the n-tilde which the i18n standard requires. Unfortunately you may still find old HTML documents which assume that this is correct behaviour, and so include these &-representations with the intention of displaying those non-Latin characters, which is quite wrong, and will not produce the intended results on specification-conforming browsers.

4. Microsoft refer to their support for non-Latin writing systems as "language" support. As I noted earlier: in HTML terminology this isn't really correct, the two issues are independent of each other as far as HTML is concerned.

5. I am informed by a fan that there is an official Klingon font from the Klingon Language Institute.

