User-defined character encoding (charset) to extend repertoire?

In a Usenet discussion, Andreas Prilop took a critical look, from both a practical and a theoretical point of view, at the options for dealing with back-level browser compatibility and with non-standard font repertoires, and described a diplomatic compromise. Before looking at the details, however, he does stress that this should not be seen as advocating or promoting this usage: what he actually advocates is deployment of the HTML4 technique, whereas this is just a diplomatic temporary fix. This, as you can tell from the other material in this area, is also my own view: by now (2005), for any realistic deployment in the WWW arena, the required characters should be taken from Unicode, and bona fide HTML4 techniques used.

The only exception would be in relation to writing systems which have not been codified in Unicode, where a community of consenting users might have recourse to a mutually agreed "user defined" coding, in the way that we set out below. (Note however that the Unicode Private Use Area offers an alternative solution to such a requirement.) It should be noted that most existing web pages which promote "user defined" coding techniques use ways of representing their characters (namely, using &-notation, either numerical character references or character entities) which are inadmissible under W3C HTML specifications, since those definitively represent the corresponding Unicode characters. Here, we allow only the other form of character representation, namely the character itself (represented in a user-defined character encoding), which, at least, can be interpreted as falling into an empty space in the W3C specifications, rather than being in contradiction to them.
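To make the distinction concrete, here is a minimal sketch; the superscript-two code position (178) is just an arbitrary example:

```html
<!-- Inadmissible under the W3C specifications: this numerical character
     reference definitively denotes U+00B2 SUPERSCRIPT TWO, no matter
     what glyph a non-standard font happens to place at position 178. -->
<p>&#178;</p>

<!-- The only form allowed here: the character itself, i.e. a single raw
     byte of value 178, whose interpretation is left to the user-defined
     encoding rather than being claimed by the document. -->
<p>²</p>
```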

So, in order to make use of this compromise, the relevant characters should be included as 8-bit characters, and the document not advertised as being in any particular charset. From a theoretical point of view, it would be preferable to advertise it as charset=x-user-defined, or maybe even an invented value following the charset=x-* pattern, but this has some negative consequences in practice: so it may be appropriate to send the document out with no charset specification, the users of this technique understanding that they are expected to set their browser to user-defined encoding. In effect, the author is making no particular claims as to the 8-bit coding that they are using. By using actual 8-bit characters, they are not misusing the Latin-1 entities (&name;) nor their numerical character references (&#number; for values below 256) to represent extraneous characters. Of course, the usual caveats about careful handling of 8-bit codes during cross-platform uploading, etc., apply here just as elsewhere.
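In terms of the META element (one of the places where a charset can be stated, the other being the real HTTP header sent by the server), the choice might be sketched like this:

```html
<!-- Theoretically preferable, but with the practical drawbacks noted
     in the text: -->
<meta http-equiv="Content-Type" content="text/html; charset=x-user-defined">

<!-- The diplomatic compromise: advertise no charset at all (and make
     sure the server's HTTP header doesn't impose one either), leaving
     readers to select "User-defined" in their browser themselves: -->
<meta http-equiv="Content-Type" content="text/html">
```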

This is simple enough to explain when the entire document is using one non-standard repertoire: the user could be invited to simply install the relevant non-standard font and to configure their browser to use that font for the "user-defined" encoding. So far, so good. The "user defined" encoding in such a case is whatever your non-standard font arrangement implements.

Less palatable to an "HTML purist" is the idea that some portions of a document might be surrounded with FONT FACE= markup that specifies a non-standard font, while the rest of the document is treated as normal coding, let's say iso-8859-1 (although the charset doesn't actually say so). At risk of sounding monotonous, let me remind you that this is not HTML's way of extending the character repertoire. From the point of view of character coding theory, we now have a document in which different codings are used in different portions of the document, but without any code-switching mechanism that works at the character code level; instead, the code-switching points are indicated by higher-level markup, namely the FONT tag in HTML. This is unsatisfactory from an architectural point of view. Nevertheless, it does give a visual impression of working on a reasonable range of available browser versions, so it's understandable that some authors wish to use it, at least as a temporary compromise.
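A sketch of the kind of markup meant here ("MyPhoneticFont" is a hypothetical non-standard font; the 8-bit characters inside the FONT element mean whatever that font's designer decided they mean):

```html
<p>Ordinary text, treated as iso-8859-1, then a run whose coding is
   switched by markup rather than by any character-level mechanism:
   <font face="MyPhoneticFont">¡¢£</font>,
   and then ordinary text again.</p>
```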

Interestingly, 8-bit characters are also used for this purpose in the tth (TeX-to-HTML) converter by Ian Hutchinson, which, in its traditional output mode, relies on the Symbol font trick: the author discusses the extent to which it is supported in various browsers/platforms and offers some browser workarounds for getting the desired effect in a few additional situations. It appears that Mac users stand a chance of getting the desired effect if they pretend that the character coding is macRoman (which gives us yet another reason not to advertise such documents as being in iso-8859-1). In some versions of Netscape, users can only select their own charset if the page provider has not imposed one, which is a practical reason for not advertising x-user-defined. He remarks that Latin-1 characters (represented by &-notation) may still be displayed correctly. The author cautions against some WWW authoring packages that will forcibly convert the 8-bit characters to &-notations against the author's intentions (and A. Prilop counsels users of Netscape Composer to prevent this by setting its coding option to User-defined).

Having cited that document by Ian Hutchinson, I might just comment on the way in which his document continues to plead for his method as correct HTML, while the fact that some recent browser versions and other platforms don't co-operate is presented as a fault in the browser. (However, his recent versions have gone a little way towards conceding that not everyone shares his view of the situation, and he does now offer support for generating Unicode instead.) I would take a different standpoint: that he has made an understandable compromise for the purposes of what he wanted to achieve in the short term, but that the documented problems are a classic demonstration of the way in which RFC2070 really did address the platform portability issue correctly, whereas his FONT FACE kludge leads to all kinds of difficulties in the longer term. I don't for a moment want to decry the amount of effort and ingenuity that has gone into his program, but that is quite a different topic.


I'm really not trying to promote the technique here, either, but, given that some people are determined to use it anyway, it's interesting to note that some of the objections of principle can arguably be defused by using 8-bit characters, and taking care not to advertise a misleading charset.
