Checklist for HTML character encoding

This page presents a number of character repertoire scenarios and makes recommendations for optimising accessibility on older browsers and browser versions. The emphasis here is on clear recommendations - which can be rendered on an appropriate range of browsers if they have been properly configured - rather than on explaining too many exceptions and special cases: other supporting material in this area should be helpful in better understanding the choices. Handling of forms input isn't covered here (some incomplete notes on forms submission are available separately).

Important: this web page concentrates on the form of the document as it will be sent out as text/html from a web (HTTP) server, and does not address how to author it in the first place nor how to get it onto the server. Those details are too many and varied (and OS-dependent) to deal with adequately here, whereas what is sent from the server to the client is clearly defined and platform-independent (if it were not, then it would be a failure in WWW terms!), and that is what we are concentrating on here.

Compatibility for older browsers is only really relevant for the content-type of text/html: that is to say, either "HTML proper", or compatibility-mode XHTML/1.0 ("appendix C"). Those who want to use overtly XML-based content types will inevitably be incompatible with older browsers (and quite a few current ones, indeed), and so, if utf-8 is appropriate, then just go ahead and use it. There's a note about writing xhtml/1.0 for compatibility.

Terminology

As I've commented elsewhere in this area, the terminology used in relation to the representation of characters in HTML and the WWW often causes confusion to those who gained experience in character handling in a different field, e.g. word-processing. This is not the place for a full tutorial: I've tried to keep the checklist understandable even without a deep knowledge of the topic, but I do encourage readers to develop a familiarity with the HTML character model to avoid unnecessary confusion.

"8-bit Character Repertoire" refers here to a repertoire of no more than 256 characters that is supported by one of the various 8-bit character codes. Examples would be the 8-bit codes defined by ISO (iso-8859-n for various values of n) or by others (localised encodings such as Thai encoding TIS-620, or vendor-defined encodings such as Windows-1250, macRoman...).

An 8-bit repertoire represents the largest repertoire that many early browsers could support at any one time, at least by means of specification-conforming techniques (and remember, an HTML document has to be entirely in one encoding: it's not possible to change coding partway through a single document). Many of these older browsers could support different "8-bit character repertoires", with the browser automatically switching in response to the incoming character encoding (the MIME charset= parameter) of each individual document.
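To illustrate why the advertised encoding matters for an 8-bit repertoire, here is a minimal Python sketch that decodes one and the same octet under several 8-bit codings. The codec names are those of Python's standard library; the helper name `describe_octet` and the byte value 0xE1 are arbitrary choices made for this illustration.

```python
import unicodedata

# Arbitrary example octet, chosen only for illustration.
OCTET = b"\xe1"


def describe_octet(octet: bytes, charset: str) -> str:
    """Decode a single octet under an 8-bit charset and describe the result."""
    char = octet.decode(charset)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# The same octet stands for a different character in each repertoire.
for charset in ("iso-8859-1", "iso-8859-7", "iso-8859-5", "tis-620"):
    print(f"{charset:<12} {describe_octet(OCTET, charset)}")
```

The octet itself carries no indication of which repertoire it belongs to, which is why the document has to be advertised with the right encoding.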

Some more-recent browser versions, even if not offering full support for Unicode, could nevertheless deal with a wider document repertoire than just one 8-bit encoding, for example as in scenario 5.

"advertise as" refers to the specified "character encoding" (HTML terminology) with which the document is to be sent out from the HTTP server. This should be defined by the "charset=" parameter (correct MIME terminology, but now rather misleading in an HTML context) specified on the HTTP Content-type: header. The specifications also allow for this to be specified via meta http-equiv within the HTML document, but this is less satisfactory both on theoretical grounds, and on some practical considerations, as is discussed in more detail elsewhere.

"Coded character" refers to the character itself, expressed in the advertised character encoding: i.e a single octet (byte) if this is an 8-bit coding, or an appropriate sequence of octets if this is a multibyte coding. This is in distinction to a character expressed by one of HTML's &-representations: character entity of the form &entity;, or numerical character reference of the form &#number; decimal (widely supported) or &#xhhhh; hexadecimal (somewhat less widely supported), remembering that these numbers in &#-notation refer to the character's position in iso-10646/Unicode, irrespective of the character encoding (charset) used.

Theoretically, the three different kinds of character representation - the coded character, the numerical character reference, and (where available) the named character entity - are fully equivalent as far as HTML is concerned; the point of this checklist is the practical issues that favour the choice of one representation rather than another in various actual situations (the "scenarios") presented below.
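The equivalence of the three representations can be checked with a short Python sketch. It uses only the standard library's `html.unescape`; the choice of é (U+00E9) as the example character is mine.

```python
import html

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Coded character: the octets depend on the advertised encoding.
coded_latin1 = E_ACUTE.encode("iso-8859-1")  # one octet
coded_utf8 = E_ACUTE.encode("utf-8")         # two octets

# &-representations: pure us-ascii, independent of the encoding, because
# the numbers refer to the character's position in iso-10646/Unicode.
representations = ["&eacute;", "&#233;", "&#xE9;"]
decoded = [html.unescape(r) for r in representations]

print(coded_latin1, coded_utf8)
print(all(ch == E_ACUTE for ch in decoded))
```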

I'm not attempting to cover characters of the Chinese, Japanese, Korean (CJK) kinds, as these are outside my area of expertise.

Scenario	Character Repertoire	Recommendation	Notes
1	Latin-1	Use `&entityname;` notation (widely supported and more mnemonic), or `&#number;` "Numerical Character References". Advertise as `iso-8859-1`	Particularly recommended for those working cross-platform (e.g Macs) without the relevant expertise in handling 8-bit coded text. See also the WDG's advice
2	Latin-1	Alternative to scenario 1: Use 8-bit coded characters, advertised as `iso-8859-1`	Contrary to rather widespread superstition, 8-bit coded characters are entirely legal on the WWW (see Note A).
3	Latin-1 with Windows typographical extras (matched quotes, em-dash etc.)	Not Recommended, see Note B.	If extended character coverage is being used anyway, then use the methods of scenario 6 (or 7).
3a	Windows Latin-1 repertoire (including euro)	Proprietary and not really recommended, but admittedly rather widely supported, even by relatively old browser versions which cannot handle scenario 6: use 8-bit characters, coded with `charset=windows-1252`. Alternatives: Note B.	The more forward-looking approach is to follow the methods of scenario 6 or 7. A composite of Latin-1 with one other 8-bit repertoire (e.g. Latin-2) could be done as in scenario 5. See also discussion in Note B. `&euro;` is now rather widely supported, and might be used in scenario 1 or 2 if desired.
4	One 8-bit Repertoire	Choose an 8-bit encoding appropriate to the desired repertoire (preferably an ISO code, e.g iso-8859-7 for Greek, or one that is widely used in its native habitat, e.g TIS-620 for Thai). Use 8-bit characters, advertised accordingly.	This form of document is accessible to a very wide range of browser versions, even old ones, although it might require additional setup or fonts to take advantage of the browser's capability. Some issues are explored by J.Korpela. For HTML use, avoid iso-8859-15.
5	One 8-bit non-Latin-1 Repertoire, together with Latin-1	Code the non-Latin-1 characters like scenario 4, as 8-bit coded characters, advertising the document with the appropriate encoding ("`charset`"). Express the Latin-1 characters like scenario 1, as `&entity;` references.	This form of document is entirely valid and accessible to any client which conforms to RFC2070, but many characters fail on Netscape 4.* versions. See Note F. See scenario 8 for possible workaround.
6	More than one 8-bit repertoire, but predominantly Latin text	Code everything using only us-ascii characters (i.e 7-bit), expressing all other characters, even Latin-1, by means of &-notation. For Latin-1 characters, `&entityname;` is recommended, whereas for non-Latin-1 characters, `&#bignumber;` (unicode values) is preferred, even where an HTML4 entity is defined, since browser support is more widespread. Advertise the document as `utf-8`	This of course needs a browser version which supports enough of HTML4/RFC2070 to understand what's needed. It therefore shuts out some old browser versions which could have coped with scenario 4 or 5. The browser might need some extra setup to enable this capability, e.g extra font(s) and settings. Refer also to scenario 8. See Note C.
7	More than one 8-bit repertoire, not limited to predominantly Latin text	Use actual `utf-8` coded characters, advertised as such.	Just like scenario 6, this is an entirely valid form to send out documents, and is acceptable to any RFC2070-conforming browser as well as to Netscape 4.* versions. Browser coverage for the two forms seems rather similar. The expected difficulties are not in the browsers, but in authors (mis)handling this unfamiliar data format.
8	Compromise solution for scenario 5, using techniques of scenario 6 or 7 for browsers which support it.	Make the document in two different forms (or generate them as required "on the fly"). Use server negotiation (based on the client's `Accept-charset:` header) to send the utf-8 version to those clients which indicate ability to accept utf-8 (this includes Netscape 4.* versions, which are otherwise defective in this regard), while sending the "scenario 5" version to any other clients.	See Note D.
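The negotiation in scenario 8 might be sketched as follows in Python. This is only an illustration, not a recipe for any particular server: the function names and the variant labels `"utf-8"` and `"scenario-5"` are hypothetical, and I have assumed that only an explicit `utf-8` entry counts as "indicating ability", so a bare `*` wildcard or a missing header falls back to the scenario 5 version.

```python
from typing import Optional


def accepts_utf8(accept_charset: str) -> bool:
    """True if the header value lists utf-8 explicitly with a q-value above 0."""
    for item in accept_charset.split(","):
        parts = [p.strip() for p in item.split(";")]
        name = parts[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        if name == "utf-8" and q > 0:
            return True
    return False


def choose_variant(accept_charset: Optional[str]) -> str:
    """Pick which of the two document forms to send (labels are hypothetical)."""
    if accept_charset and accepts_utf8(accept_charset):
        return "utf-8"
    return "scenario-5"


print(choose_variant("iso-8859-1,*,utf-8"))
print(choose_variant("iso-8859-7"))
```

In practice a server's own content-negotiation facilities would normally do this job; the sketch only shows the decision being made.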

Commentary

Versatility:

All of the schemes recommended here utilise valid techniques according to published specifications and can (subject of course to the limitations of each scheme) be programmatically converted from one form to another. Thus, it isn't essential that your authoring tool produces the precise form recommended. There are numerous ways of doing such a conversion in an HTML-aware fashion, including simple command-line utilities and pipelines, depending on your preferences, some of which could be used for on-the-fly conversion in the server, if you wish. For those who prefer a point-and-click solution prior to uploading to the server, your HTML could be loaded into Mozilla Composer (or one of its derived authoring packages such as Nvu), and then saved with encoding conversion: characters in the content will be converted between coded characters and &-notation as appropriate for the newly-specified character encoding. To be specific, the Composer/Nvu menu item for this is File > Save And Change Character Encoding.
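As a rough illustration of such a programmatic conversion, the following Python sketch turns already-decoded text into the us-ascii form of scenario 6: named entities for Latin-1 characters and decimal numerical character references for everything else. The function name is hypothetical, and the sketch is deliberately not HTML-aware: it assumes it is handed text content only, and it leaves every us-ascii character (including `<` and `&`) untouched.

```python
from html.entities import codepoint2name


def to_ascii_with_references(text: str) -> str:
    """Express all non-ascii characters in &-notation (scenario 6 style)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # us-ascii: keep as a coded character.
            out.append(ch)
        elif 160 <= cp <= 255 and cp in codepoint2name:
            # Latin-1 repertoire: the named entity is the more mnemonic form.
            out.append(f"&{codepoint2name[cp]};")
        else:
            # Everything else: decimal reference to the Unicode position.
            out.append(f"&#{cp};")
    return "".join(out)


converted = to_ascii_with_references("caf\u00e9 \u03a9")
print(converted)
```

The result contains only 7-bit characters, so it can be sent out under any of the ascii-compatible encodings discussed here.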

A word of warning: even though the browser versions discussed here are technically competent to do what is being asked of them, it's not certain that a particular browser installation will do it properly: the user might need to supply some fonts supporting a specialised repertoire (e.g. Thai, Armenian...) or install optional rendering features (e.g. right-to-left text, Indic script support...). It might be helpful to supply a little test-case page, with a screen-shot image for them to compare, and some notes on any special actions they'd need to take to set up their browser for this situation.

Search engines and other non-rendering HTML clients

Points of interest are not only the accessibility of your documents to users' browsers, but also to search engines. A.Prilop cautioned that search engines had been slow to support indexing of utf-8-encoded content - some earlier problems with AltaVista search seem to have been fixed, but for best results across search engines it might still be advisable to offer appropriate 8-bit encoding(s) as alternatives to a utf-8 version, along the lines shown in scenario 8. The relevance of this is fading with time, however (2005).

Note that even those search engines which support utf-8 may have no support yet for utf-16 or utf-32 encodings: in WWW situations where a Unicode character encoding is desired, we definitely recommend the use of utf-8. As for utf-7, it is now considered obsolete by the Unicode Consortium, and there seems to be no justification for using it in a WWW context (HTTP is a guaranteed 8-bit protocol, after all), quite apart from dubious search-engine support.

Use of Latin-1 character entities (i.e. in the form &name;), in preference to other representations of these characters, can help search engines to locate Latin-1 strings whatever the document's encoding, but of course this doesn't help when the characters to be located are not in the Latin-1 repertoire.

When we come to the non-Latin-1 character entities of HTML4, on the other hand, there's a dilemma. There seems no doubt that the &#bignumber; format is more widely implemented than the &entityname; form, if only because of Netscape 4.* versions. On the other hand, a browser which does not implement &entityname; is likely to display something reasonably intuitive (i.e the uninterpreted source code), whereas one that doesn't implement &#bignumber; is likely to display incomprehensible garbage. So it's hard to give general advice about which form to prefer: it depends on the context, and on your priorities for the fallback behaviour in old browsers (recent ones are not a problem).

Combining marks

Observations indicate that "combining marks" (the Unicode General Category values Mn, Me and Mc) are not as well supported by browsers as are precomposed letters. Also, support in search engines for combining marks seems to be poor: support is demonstrably better for precomposed letters.

The advice therefore is to use precomposed accented letters wherever they exist, in preference to base letters plus combining mark(s), because they work better with current browsers and fonts, and with search engines. This is certainly true for Latin, Greek and Cyrillic, at least.
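One way to apply this advice mechanically is Unicode normalisation form NFC, which replaces a base letter plus combining mark(s) with the precomposed letter wherever one exists. The text above doesn't name NFC; it is my choice of technique for this sketch, using Python's standard `unicodedata` module, and the helper names are hypothetical.

```python
import unicodedata

COMBINING_CATEGORIES = ("Mn", "Me", "Mc")


def precompose(text: str) -> str:
    """Use precomposed letters wherever Unicode defines them (NFC)."""
    return unicodedata.normalize("NFC", text)


def has_combining_marks(text: str) -> bool:
    """True if any character has General Category Mn, Me or Mc."""
    return any(unicodedata.category(ch) in COMBINING_CATEGORIES for ch in text)


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = precompose(decomposed)
print(has_combining_marks(decomposed), has_combining_marks(composed))
```

Where no precomposed letter exists, NFC leaves the combining marks in place, so `has_combining_marks` can be used to flag what remains.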

Footnotes

Postlude about CSS

CSS is also a text format (Content-type: text/css) and should be delivered with a proper specification of character encoding (charset). The principles are much the same as for HTML, but many of the details are different.

In CSS, for most purposes, us-ascii is entirely sufficient to represent the operative parts of the stylesheets. On the few occasions where it is necessary to refer to characters which cannot be represented in us-ascii - for example a character string to be inserted by :before or :after pseudo-elements - they can be represented by the backslash notation shown in the immediately following section of the specification.

However, this isn't the whole story: in addition to the operative parts of the CSS, there are likely to be comments, and users will want to write these comments in their own language and writing system.

There is a good reason for explicitly specifying the encoding. To take just one example: a browser had somehow got the idea that a CSS stylesheet was encoded in utf-8, whereas it was in fact in iso-8859-1. In the entire stylesheet there was just one non-ascii character: an umlaut, in a comment written in German. This resulted in the entire stylesheet being ruled out as invalid utf-8 encoding, and the browser ignored the whole thing. Which, according to the specifications for utf-8 encoding, it is quite entitled to do. It came as a bit of a surprise that something included in a comment could render the entire stylesheet invalid. So, we should understand how to communicate the character encoding from the server to the browser.
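The failure is easy to reproduce outside a browser. In this Python sketch, the stylesheet text and the helper name `decodes_as` are invented for the illustration; the point is only that a single iso-8859-1 umlaut makes the whole byte sequence invalid as utf-8.

```python
# A tiny stylesheet with a German comment, stored as iso-8859-1.
# The content is made up for this illustration.
css_bytes = "/* \u00dcberschriften */\nh1 { color: navy }\n".encode("iso-8859-1")


def decodes_as(data: bytes, charset: str) -> bool:
    """True if the octets form a valid sequence in the given encoding."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as(css_bytes, "iso-8859-1"))
print(decodes_as(css_bytes, "utf-8"))
```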

How to inform the client of the character encoding

Just as in HTML, we can specify the encoding by means of the MIME parameter charset= on the real HTTP header. This follows the principles of the W3C note, Setting the HTTP charset parameter, even though the note doesn't actually mention CSS. This setting takes priority if it is present.
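For reference, such a header reads `Content-Type: text/css; charset=iso-8859-1`. A minimal Python sketch of how a client might extract the parameter is shown here; it borrows the standard library's `email.message.Message` for MIME parameter parsing, and the function name `advertised_charset` is hypothetical.

```python
from email.message import Message
from typing import Optional


def advertised_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns None when no charset parameter is present.
    return msg.get_content_charset()


print(advertised_charset("text/css; charset=ISO-8859-1"))
print(advertised_charset("text/css"))
```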

Analogously to HTML, there also exists the possibility of defining the character encoding in the stylesheet itself, although the syntax (the @charset-rule) is completely different, and very tightly defined.

Under some circumstances, the encoding (if it is one of the Unicode character encoding schemes) can be self-defining by means of the BOM (Byte Order Mark). Details are in the CSS specification. However, relying on the BOM can cause problems, e.g. the W3C CSS Validator does not support it.
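The order of precedence might be sketched in Python as below. This is a simplification under my own assumptions, not the full procedure of the CSS specification: the function name is hypothetical, the BOM is consulted before the @charset rule, and only an @charset rule written in an ascii-compatible encoding at the very start of the stylesheet is recognised.

```python
import codecs
import re
from typing import Optional

# utf-32 BOMs are tested before utf-16, since the utf-32-le BOM begins
# with the same two octets as the utf-16-le BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# The rule must be the very first thing in the stylesheet.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def stylesheet_encoding(http_charset: Optional[str], data: bytes) -> Optional[str]:
    """Guess a stylesheet's encoding: HTTP charset, then BOM, then @charset."""
    if http_charset:
        return http_charset.lower()  # the real HTTP header takes priority
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _CHARSET_RULE.match(data)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    return None  # nothing declared: the caller must fall back to a default


print(stylesheet_encoding(None, b'@charset "iso-8859-1";\nh1 { color: navy }'))
```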