The responsibility for the selection of topics and the statements made about them is purely my own, and all information is offered with the usual disclaimers. However, I do try to indicate sources of more-authoritative information for the various assertions that I make.
The selection of a font or style takes place quite separately from the character code mechanisms that we are considering here. Although a glyph, for example little-a-grave, might look cosmetically different in different fonts, in italics, etc., all of these are renderings of the same character, little-a-grave, and are represented by the same character code point.
Several different character codes feature in the discussion below. Except for EBCDIC, they are all extensions of the 7-bit US-ASCII code, and therefore they coincide with US-ASCII and with each other in the lower half, code points 0-127 (decimal). In the upper half they differ, both in the repertoire of glyphs which they represent, and in the assignment of glyphs to code points. The main body of the note does not consider national variants of 7-bit ASCII (as laid down in the old standard ISO 646), but there is a digression for those who would like to know more. Nor do we consider the use of the 8th bit as a parity bit; that is irrelevant to, and incompatible with, our discussion.
So, when people refer to "the Latin-1 code" or "the ISO Latin-1 code", it might be assumed that they are referring to the "ISO-8859-1 code"; however, there is the possibility that they are referring to CP850 (in the Microsoft manuals this is called the "Multilingual Latin-1 code"), or to some other code that represents the ISO Latin-1 repertoire of characters even though the code itself is not an ISO code.
The ISO-8859 FAQ said, somewhat vaguely, that the codepage CP819 is "supposedly fully ISO-8859-1 compliant"; Netscape release notes also mention treating CP819 as a MIME charset synonym for ISO-8859-1. See further discussion.
The HTTP specification mandates the use of the code ISO-8859-1 as the default character code that is passed over the network. The HTML specification is also formulated in terms of the ISO-8859-1 code, and an HTML document that is transmitted using the HTTP protocol is by default in the ISO-8859-1 code (at least, this was true prior to the HTML4.0 spec).
The MIME protocol, which is used by HTTP and by MIME mail, contains a clearly defined mechanism for explicitly declaring a character encoding; but, at the level of the HTTP1.0/HTML2.0 specifications, browsers are not actually required to support any code other than the default, i.e. ISO-8859-1. I am writing this briefing and the related materials (except where specifically stated) entirely in terms of ISO-8859-1. I do not mean this as any kind of insult to those for whom ISO Latin-1 is not the natural repertoire, I assure you. On the one hand the W3C had been working for a long time on an internationalization (i18n) draft, which lays out how browsers ought to support an extended range of characters; on the other hand, when they made a practical attempt to document the common features of popular browsers as at some point in 1996, they weren't able to include any such extended characters - sad, but realistic. Subsequent developments included RFC2070, and then the i18n part of the HTML4.0 specification, although coverage in the popular browsers still leaves something to be desired (as of 1999).
In the rest of this briefing I occasionally refer to "native" 8-bit character codes: by this I mean the character storage codes that are used on certain platforms, e.g. DOS codepages such as CP437 or CP850, the Mac proprietary storage code (see Inside Mac for documentation), the EBCDIC code used on IBM mainframes, and so forth. I am not referring to non-Latin encodings such as Korean, Japanese, Hebrew...
As far as authors of HTML are concerned, character coding is an issue for them in two contexts: (1) where authors create files that actually contain characters from the upper half of the 8-bit code table, and (2) where they refer to such characters by their &#number; representation. If authors confine their use of characters to the low half of the 8-bit table (i.e. the area defined by the US-ASCII 7-bit code), and represent any characters from the upper half by their &entity; or &#number; representation, then point (1) is not an issue, and furthermore, when transferring files between platforms by various means - Internet FTP, email, diskette etc. - there is no need to worry which particular 8-bit code is native to the sending and receiving platforms. For these reasons, this is an approach that is much to be recommended. Where a file has been composed in another form (for example, by typing in accented characters using a non-English-language keyboard), it might be wise to use one of the utility programs that convert to an &-representation of the characters in question.
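By way of illustration, here is a minimal sketch (in Python, with purely hypothetical file names - it is not any of the utilities alluded to above) of the kind of conversion such a utility performs: everything outside US-ASCII is turned into a &#nnn; reference, and the result can be moved between platforms without worrying about their native 8-bit codes.

    # Sketch only: rewrite every character above code point 127 of a
    # Latin-1 file as a &#nnn; numeric character reference. Characters
    # in the US-ASCII range (including the markup itself) pass through.
    def to_numeric_refs(text):
        return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)

    with open("page-latin1.html", encoding="iso-8859-1") as f:   # hypothetical input
        source = f.read()
    with open("page-ascii.html", "w", encoding="ascii") as f:    # hypothetical output
        f.write(to_numeric_refs(source))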
What happens in practice is that the & representations are not interpreted by the web server, but are passed as they are (i.e. as a string of US-ASCII characters) to the browser for interpretation by the browser.
The standard requires that the &#number; representation be interpreted by reference to the code points in the ISO-8859-1 table, and not according to the native storage code of the platform on which the browser is executing. Implementations will probably achieve this by mapping (translating) the character into the platform's native storage code and offering it to the normal display routines. Another approach that is possible in theory is to define ISO-8859-1 as a private code within the browser, and to use private font tables (this approach tends to lead to unpleasant consequences elsewhere, though). Caution: in practice some (mostly older) browser versions don't behave in the way that is intended by the standard.
As was remarked above, if any codes from the upper half of the code table are placed onto the network, the standard requires that they be expressed in the ISO-8859-1 code. If, therefore, we have a document that does contain such characters, on a platform whose native storage code is different from ISO-8859-1, then the platform's Web server will have to map (=translate) these characters into ISO-8859-1 in order to place the document onto the network using the HTTP protocol. Let me stress, though, that the server is certainly not expected to look inside the HTML document for &#number; representations and make any change to those: the standard requires those to be composed in terms of ISO-8859-1 code values irrespective of the character code that is being used for storing the HTML document.
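To make the distinction concrete, here is a small sketch (Python, assuming CP850 as the native storage code purely for illustration): the stored bytes are re-coded into ISO-8859-1 for the network, while a &#nnn; reference, being plain US-ASCII, passes through untouched.

    # The stored document contains a real e-acute *and* a numeric reference.
    stored = "caf\u00e9 &#233;".encode("cp850")           # bytes as held on the CP850 platform
    on_the_wire = stored.decode("cp850").encode("iso-8859-1")
    print(on_the_wire)   # b'caf\xe9 &#233;' - e-acute re-coded to 0xE9, the reference left alone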
The recode utility can perform conversions before or after a file is transferred between dissimilar platforms. I have seen several different DOS2UNIX/UNIX2DOS utilities, some of which merely adjust the newline convention whereas others also map between ISO-8859-1 on the unix side and CP850 (perhaps) on the DOS side: if you plan to use such a utility, make sure that your version does the right thing for you.
The MIME-Mail protocol does include facilities for announcing the specific encoding in use, but typical implementations of MIME mail agents (e.g. PINE) do not necessarily have any facilities for resolving such discrepancies; they merely alert the user to the fact that the incoming file uses an encoding that is different from the local one.
It is essential to bear in mind that, in addition to the range (decimal 0-31 and 127) that ASCII allocates to control characters, ISO-8859-1 does not assign displayable characters to code points in the range (decimal) 128-159. Some platforms (e.g. MS Windows) that are otherwise ISO-8859-1 conformant might use these code points for additional displayable characters of their own, but they cannot and should not be relied on for communicating information on the World Wide Web - they could display as anything, or nothing, on other platforms or browsers.
The IBM PC code called "Multilingual Latin-1", CP850, has already been mentioned. This code also includes the ISO Latin-1 repertoire of glyphs (as well as some additional glyphs that are not in the ISO Latin-1 repertoire). But the characters are not in the same places in the two codes, and, furthermore, CP850 assigns characters throughout the upper half of the code table whereas ISO-8859-1 leaves thirty-two code points undefined. It follows, therefore, that anything represented in ISO-8859-1 can be translated into CP850, but there are characters in CP850 that do not correspond to any defined character in ISO-8859-1. It is conventional to translate these characters into the "undefined" range, decimal 128-159 of ISO-8859-1, on the understanding that their meaning is undefined as far as the standard is concerned. An attempt to display such characters could result in the equipment displaying anything, or nothing, without being in violation of the standard (in principle it could even result in executing some spurious control function, although I'm not aware of this happening in practice to any extent).
The Macintosh uses a code that mostly covers the ISO Latin-1 repertoire, although some glyphs are missing, and includes some other glyphs. Again the assignment of glyphs to code points is not the same as in ISO-8859-1, and a code conversion is required.
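If you want to see the extent of the mismatch for yourself, the following sketch (Python, using its codec names "cp850" and "mac_roman" as stand-ins for the PC and Mac native codes) counts the upper-half characters of each code that have no ISO-8859-1 equivalent:

    # Every byte 128-255 means something in CP850 and in the Mac code,
    # but only some of the resulting characters exist in ISO-8859-1.
    for name in ("cp850", "mac_roman"):
        unmappable = []
        for byte in range(128, 256):
            ch = bytes([byte]).decode(name)
            try:
                ch.encode("iso-8859-1")
            except UnicodeEncodeError:
                unmappable.append((byte, ch))
        print(name, len(unmappable), "characters with no ISO-8859-1 equivalent")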
Previously, the EBCDIC code used by IBM mainframes was also an issue, and mention of this will be found in the materials referenced, but with the move away from mainframes this will concern us normal mortals less and less.
There is a paper by André Pirard of Univ. of Liège in Belgium, referred to in a usenet posting which I quote comprehensively below. Let me stress that the recommendations for translating characters that fall outside the ISO Latin-1 repertoire are not part of any formal standard: they are, however, a de facto "standard" that is followed by a lot of fine Internet software for the Mac, such as Fetch, some usenet newsreaders, most browsers, etc. He must translate the codes in the "undefined" region to preserve the integrity of the file (see his text for further explanation), but you are not entitled to use these characters in your Web documents - if you do, then users on different platforms are going to see different glyphs, or none.
In particular, I want to warn readers not to take seriously the misguided efforts of some web authors who have manufactured a document containing all possible character codes, without comment, and have invited readers to display them on their own browsers. Without a text describing what should be displayed at each code point, the confused reader might assume that what they are seeing is the same as would be displayed at the corresponding point on any other browser. It is not, and they are being misled. Only the code points that are assigned to displayable characters in the ISO-8859-1 code are expected to have this property (and even there, some browsers are in violation of the HTML2.0 standard, so without a text description alongside every relevant code point, such a table is worse than useless, no matter how well-intentioned its author might have been). An alternative way of presenting a character code unambiguously is to present it as an image; the only problem with that approach is to make sure that people do not confuse similar-looking glyphs with each other, e.g. mistaking an apostrophe for an acute accent, a German sharp-s for a Greek beta, or a degree sign for a superscript zero. Presenting both an image and a description would be the ideal.
In this section I have tried to cover concisely the principles that are involved in dealing with such non-ISO-8859-1 platforms, and have given brief notes on how they work out in practice. The Mac is of sufficient importance to justify a separate article.
The HTML2.0 specification (RFC1866) contains at the end (section 14) a section entitled "Proposed Entities": this list includes the already well-known ISO Latin-1 accented letters etc., but also introduces a proposal for additional entity names. These were certainly not in general use as of HTML2.0, and so, presumably, were intended for future implementation, and some browser developers have indeed progressively added support for them, while others seem to have made progress at a snail's pace if at all.
I am assured that the policy of the HTML developers is to use the SGML entity names, as laid down in the ISO* files in, for example, the directory ftp://ftp.ucc.ie/pub/sgml/ (Dan Connolly gave me an alternative pointer to a server in Norway - but the information content should be the same!). The names that are relevant to this discussion are contained in ISOnum and ISOdia (you will see that those entity sets define also many characters that are not included in the ISO-8859-1 code, and that therefore aren't properly usable in HTML according to current standards). HTML provides no mechanism for using floating accents, so that the glyphs for umlaut/diaeresis, cedilla, and macron are of rather little benefit: however, they do have entity names, and for completeness they will be kept in the discussion.
In the archives at the W3C may be found a draft called HTML+, that pre-dated the now-expired HTML3.0 draft. Both drafts are now considered obsolete (although they make very interesting reading!), but a few browsers support some of the entity names that are peculiar to HTML+, so it is still mentioned here. It's worth noting that the text of the (uncompleted and now expired) HTML3.0 draft used the entity names "endash" and "emdash", but the associated DTD contained "ndash" and "mdash" - presumably this discrepancy would have been resolved if the draft had ever been finished.
In this section, I am only discussing the names for characters of the ISO-8859-1 code definition. The Trade Mark (TM) glyph is not defined in this code, and I discuss it separately in another section of this briefing.
A further version of the HTML entity names list can be found in Martin Ramsch's table at http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html. However, this folds in some material relating to Hyper-G Text Format, which is not the same as HTML. The characters over which there seems to be disagreement are the following ones (see RFC1866 section 14).
HTML+      RFC1866   ISOnum/ISOdia   Description
-----      -------   -------------   -----------
die        uml       die, uml        diaeresis/umlaut
macron     macr      macr            macron, overbar
degree     deg       deg             degree
Cedilla    cedil     cedil           cedilla

Ramsch's table agrees with the ISO list, except for designating the macron as &hibar;, a name that I haven't seen elsewhere but was, I found, supported by X Mosaic, and by adding the alternative names brkbar for brvbar and Dstrok for ETH. I propose to ignore these last two as being Hyper-G specials, but in view of having found support for hibar I have retained it in my survey.
Test cases for these entity names can be found in the preface to my character code test tables so that they could be included in my tests for browser coverage.
It turned out that HTML3.2 did not include the quot (") entity in its DTD, in spite of the general intention that HTML3.2 would be compatible with HTML2.0.
On investigation, attention was drawn to an item in the www-html mail archive, in which Christopher R. Maden states that in the relevant SGML documents, the quot entity is identified with the apostrophe (ASCII 39, x27), not with the quotation mark (ASCII 34, x22). Dan Connolly, on the other hand, states that the omission from HTML3.2 was a mistake. I have no further details of the progress of discussions subsequent to that time, but I note that the quot entity has re-appeared in HTML4.0 just as it was in HTML2.0, as can be seen in the relevant section of the HTML4.0 recommendation.
It's no big deal to generate a table containing all possible code values, or all possible values of &#n;, and to display them on various browsers and platforms, and to compare the results. Don't just compare several different browsers on the same platform: that is no better than buying several different English newspapers in order to get a better idea of the news in Texas! Regular X-based fonts will display a blank, or nothing at all, in response to the unassigned codes. MS-Windows based browsers will normally display some well-defined glyphs in the relevant positions; Mac-based browsers will likely display something too, which might be different. However, Mac-based users should take care to compare the results from later versions of Netscape (2, 3 etc.) with earlier Netscapes such as Mac 1.12: it is evident that Netscape are redefining the rules as they go along.
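If you do produce such a table, at least do it in the way recommended above, with a description alongside each code point. A sketch (Python; the first 256 code points of Unicode coincide with ISO-8859-1, so unicodedata can supply the descriptions):

    # Emit an HTML fragment listing each displayable ISO-8859-1 upper-half
    # code point together with its numeric reference and character name.
    import unicodedata
    rows = []
    for n in range(160, 256):               # 128-159 deliberately skipped: undefined in ISO-8859-1
        name = unicodedata.name(chr(n), "(unnamed)")
        rows.append("<tr><td>%d</td><td>&#%d;</td><td>%s</td></tr>" % (n, n, name))
    print("<table>\n%s\n</table>" % "\n".join(rows))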
The only advice that I can possibly give you is to steer very clear of these undefined code points. The whole intention of HTML was to represent content in a portable, platform-independent fashion. There is no way that you can guarantee to get the correct results on the reader's screen. That's what standards are for: the evidence of Netscape stealthily redefining the rules (I saw no mention in their release notes that they were shifting the undefined Mac character codes around) just makes it that much more important to stick to the standards, and not get tempted by non-standard features of a commercial browser, if you want to get a message reliably out to WWW readers. If you cannot be confident of browser coverage of the construct that you'd like to use (e.g. &trade; for the trademark glyph) then you would be better advised to use a substitute that has good browser coverage.
When unzipped, this file produces some interesting information and some software, including a file ISOLATIN.CPI that can be used to support CP819 for output in MS-DOS (assuming you have EGA or VGA; the CGA display is stated to support only its own hardware code page). There is also some mention of keyboard support. Disclaimer: the above refers entirely to material that I found on the net. The nearest thing to an authoritative source is the file Doc\Isocp.txt contained in the above-cited ZIP archive (however, Mr. Kostis tells me in email that his 1993 address seen in that file is no longer valid).
I have not tried using DOS for any significant period with CP819 selected, so I have no personal experience of how it works out: in response to earlier versions of this page I received several emails that comment favourably on this method of working in DOS, and in Jan 1997 I got a more detailed email from Portugal about successful use of this method.
I don't believe there are any subtle differences from iso-8859-1: the IANA character set registrations list CP819 along with IBM819 as a synonym of iso-8859-1, so this seems to be just another codification of the same character coding.
For its internal text encoding, MS Windows works in terms of its own character code: this code is identical with ISO-8859-1 at the code points that are assigned to displayable characters by ISO-8859-1, but in addition it assigns displayable characters to some of the code points that ISO-8859-1 explicitly leaves undefined.
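The difference is easy to demonstrate (a sketch in Python, whose "cp1252" codec corresponds to the MS Windows code described here): the bytes 128-159 decode to real characters under the Windows code, but under ISO-8859-1 they are merely undefined/control positions.

    for byte in (0x85, 0x91, 0x93, 0x99):               # ellipsis, quotes, trade mark in the Windows code
        print(hex(byte),
              repr(bytes([byte]).decode("cp1252")),      # a displayable character
              repr(bytes([byte]).decode("iso-8859-1")))  # an undefined/control position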
The chief cause of confusion, I guess, is the notorious DOS convention of typing ALT/nnn to get the character whose code position is (decimal) nnn. In MS Windows, this convention has been extended (although I didn't find this explained anywhere in the normal user manuals or help information): ALT/nnn is still interpreted according to the DOS code page, but the character is stored in the Windows code, whereas ALT/0nnn (with a leading zero) refers directly to code point nnn of the Windows code.
You can easily verify that MS Windows code is incompatible with MS DOS code in this respect, if you type in some accented characters using an MS Windows application, say Notepad, and then view the resulting file in MS DOS; or conversely if you type them in using a DOS application such as (DOS) EDIT and then view the result in MS Windows.
Let us take one example: o-circumflex. In DOS CP 437 or 850, the code point for o-circumflex is 147 (decimal). So, in DOS you type this in as ALT/147, and that is what is stored. However, when you use ALT/147 in an MS Windows application, the MS Windows (ISO-8859-1) encoding of o-circumflex is actually stored into the file, i.e. the character code 244 decimal. Basically this is very simple in principle, but can lead to much confusion in practice, and you can play around a little if you want, typing ALT/nnn codes into a file one way and viewing the file the other way, and driving yourself demented trying to follow what is happening by use of a CP850 (or 437) DOS code table on the one hand, and an ISO-8859-1 (or MS Windows) code table on the other. Have fun!
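In terms of Python's codecs (offered only as a way of checking the arithmetic, not as anything the DOS or Windows software itself does), the o-circumflex example looks like this:

    print(bytes([147]).decode("cp850"))             # 'ô' - code 147 in the DOS code page
    print(bytes([244]).decode("iso-8859-1"))        # 'ô' - code 244 in ISO-8859-1 / MS Windows
    print(repr(bytes([147]).decode("iso-8859-1")))  # '\x93' - 147 is nothing displayable in ISO-8859-1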
Anyhow, the long and short of this is that you can perfectly well deal with 8-bit accented letters in MS Windows if you wish (subject, of course, to the caveats mentioned elsewhere in this briefing), as long as you only handle the file with MS Windows, and not mix it with DOS.
The above would no longer be a problem if everyone operated DOS in the code page 819 discussed above. Whether that would be practical, I can't say - see the discussion there.
Well-meaning informants sometimes recommend a value of n for which &#n; displays the TM glyph on that informant's browser: but readers who have taken on board my explanation so far will realise that this is of no use, since it will display something different, or nothing at all, on some other browsers. The value of n is (or was - see discussion above about Netscape quietly re-arranging their Mac version) different according to whether the informant uses a Mac-based or an MS Windows-based browser, and X-based browsers do not display this glyph at all in their normal fonts. Surely, for a glyph like this that is used principally for legal reasons, there can be no excuse for sloppy usage that has no guarantee of displaying anything to some readers.
There are a number of kludges that you might consider using at this time. Bearing in mind that by now a good range of browsers already honour the SUP tags, you might even enclose "(TM)" in those, like this: <SUP>(TM)</SUP>, and here is what your current browser does with that: (TM). A browser that did not understand the SUP tags would simply display the "(TM)" in its current font. Nest the (TM) also in SMALL markup if you wish.
The recommended entity name &trade; was supported by certain browsers for some time already, but others still lack that support. Here is what your present browser displays in response to this entity: ™. The standards-compliant way to code this as a numerical character reference is by its Unicode value, 8482 (&#8482;), and here is what your current browser does in response to that: ™. I cannot recommend that you use either of those representations at this time, especially for a mark like this that has legal significance: MSIE 3 supported only &trade; and not &#8482;, while NS 3 supported only &#8482; and not &trade;. Update Jan'98: both MSIE4 and NS4 support &#8482;, although NS4 still does not support &trade;.
It is often claimed on usenet that one or more no-break space(s) can be used as a space filler, and indeed this can be seen in many HTML documents on the WWW. Some advocates suggest the &nbsp; entity, others the numeric reference &#160;, while others advocate alternating one or other of those with ordinary spaces; it is indeed an observed fact that many browsers in use in 1996-7 were producing that effect, and this became pretty much universal. (The HTML4.0 specification subsequently codified the no-break space as not being a white space character, from which we may deduce that it would not be eligible for compression under the white space rules; but it explicitly does not go into any further detail about its treatment.)
In constructions like the much-touted &nbsp;&nbsp;foo (or the corresponding thing with &#160;), the no-break space is not joining two words together as envisaged by the spec.
Some authors say that they demand two spaces between sentences (e.g after a full stop - US "period" - question mark or exclamation point), because their style rules demand it. They therefore desperately seek some kludge for inserting such an additional space, even in some cases going so far as to imbed a small blank image. It would be my own personal contention that this is a browser issue, since any imposition of style rules ought to be according to the reader's locale, not the author's. Admittedly, there is no unambiguous definition of a sentence end in HTML: there are situations where a full stop, question mark or exclamation point, followed by a single space, might appear within a sentence, and an expansion to two spaces would not be desired by the author or the reader. I leave this point for others to debate.
To those authors who desire the first line of every paragraph to be indented, on the other hand, I say very definitely that this is a browser (or style sheet) issue. The start of a new paragraph is clearly defined in HTML, and the question of how to present a paragraph is a browser issue, purely and simply. Readers might also wish to have such presentation details under their control, rather than having them imposed by the author. Those authors who struggle to indent each new paragraph by means of &nbsp; tricks and/or blank images are, in my view, quite misguided. And a reader who cares enough to have configured their personal style sheet to indent paragraphs is not likely to be amused when they find they get a double dose of indenting thanks to the author's kludges with transparent GIFs or &nbsp;.
NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &#173;. Its semantics are different from the plain HYPHEN: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not dispalyed at all. In operations like searching and sorting, it must always be ignored. (Reproduced from the RFC complete with typo ;-)
The HTML4.0 specification says something similar, except that it implies that browsers are not mandated to support this character, which is a pity as there seems to be no viable alternative for achieving this useful result.
The meaning of this when the soft-hyphen occurs inside of a word is unambiguous; but hardly any browsers actually implement this yet (an honourable exception: Lynx). What to do with a soft-hyphen that occurs in isolation is unclear: a browser might still display a hyphen in this situation - or maybe only display it when it comes at the end of a line (recent versions of Lynx seem to be like that). Best to ignore what is displayed on this line of my test tables, since there is no formal specification of what a soft-hyphen should do in that kind of context.
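The "ignored in searching and sorting" requirement, at least, is easy to honour in any software that handles such text; a minimal sketch (Python):

    SOFT_HYPHEN = "\u00ad"
    def fold_soft_hyphens(text):
        # Drop soft hyphens before comparing, searching or sorting text.
        return text.replace(SOFT_HYPHEN, "")
    assert fold_soft_hyphens("ex\u00adample") == "example"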
(There used to be a usenet FAQ on ISO-8859-1, and a fascinating array of resources on character codes and internationalisation by its author; these are still referenced from other usenet FAQs, but sadly the document itself had disappeared by Dec.1999.)
The Kermit team has done much work on documenting the usage of character codes: a visit to their Web Pages is well worth while, and materials can be found in their archive at Columbia. The "texts about ISO-8859-1" by A.Pirard (following quoted article) set out the design criteria well, and should be read for a clearer understanding of the issues involved.
(End of quote.)

Q: Is there a standard site where one can find the latest versions of these tables?
A:
There was no site where one can find the latest versions of the Macintosh<->ISO 8859-1 translation tables. But now, after your mail, I put the tables on an ftp server as a Macintosh .sea archive. BTW, you can find there André Pirard's texts about ISO 8859-1, other codes and several computer "languages". These texts about communication programming for international characters can be found, too, in an ftp server at Columbia, USA or an ftp server at Univ of Liege in Belgium.
The software using these tables is not all using taBL resources. I.e. FTPd and Talk have the taBL, but not Anarchie. With resources, the developer can allow the user/manager to put an other one. Without resource, the translation is "hard-coded".
I prefer taBL resources. For software as Eudora and Telnet, it's a must because people may need several translations according to the environment, i.e. for Telnet, the code of the connected computer. For software able to transfer files/texts only, the choice is very small and Macintosh<->ISO 8859-1 is the best standard to use, IMHO.
Jean-Pierre
The code mapping that's documented in A.Pirard's materials has become the de-facto standard in Mac-based Internet software, such as Fetch, usenet newsreaders, most WWW browsers, etc.
The Kermit team had also documented the Mac problem, and indicated a solution based on the same principles, but they designed a different code mapping that has not, in the event, gained general acceptance.
An email from a correspondent recommends the GNU recode program for a very versatile range of character conversion options, including 8-bit Latin-1 to HTML entity encoding, as well as converting between different character codes. I have to confess I had not been previously aware of this program, but having looked at the manual for it, I have no hesitation in accepting his recommendation. Beware, though, that this program can, by default, replace your input file in-place, and with some combinations of options the change would be irreversible!
From the point of view of HTML (indeed of SGML), every document has a "document character set", which in the case of HTML2.0 happens to be the ISO-8859-1 (8-bit) part of the much bigger ISO-10646 code. Furthermore, when the document is transmitted over the network, it is transmitted by using an encoding which, in HTTP/1.0 etc., is ISO-8859-1. The result is that we tend to confuse the "document character set" with the "transmission encoding". However, it's not too difficult to see that these are not the same thing. Let's consider an HTML document that's stored on an EBCDIC-based mainframe, in order to make the illustration as obvious as possible. In a document that contains the HTML &#169; (i.e. the numbered character reference representing the copyright sign), the individual characters - ampersand, hash, one, six, nine, semicolon - each have to be translated into EBCDIC; however, the number (169 decimal) remains as 169 decimal, referring to the copyright-sign code point of the ISO-8859-1 document code: it does not get changed into a different number equal to whichever code point this sign occupies in EBCDIC. So, we now have a document whose "document character set" is still ISO-8859-1 but whose "storage encoding" is EBCDIC. Similar situations arise when HTML is stored on a platform whose native code is the Mac code, or say CP850 (DOS).
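The point can be checked mechanically (a sketch in Python, using its "cp037" codec as one EBCDIC variant, chosen purely for illustration):

    reference = "&#169;"
    print(reference.encode("iso-8859-1"))   # b'&#169;' - the familiar US-ASCII byte values
    print(reference.encode("cp037"))        # quite different byte values for the same six characters
    # The number 169 itself is not touched: it still names the copyright
    # sign's code point in the document character set, not in EBCDIC.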
When we move to a situation involving more than one transmission encoding, the issue becomes more complex. The same ISO-10646 data can be transmitted in several different transmission encodings (UCS-2, UCS-4, UTF-8). Then we have to understand how this relates to the other encodings that exist today (CP850, EBCDIC, KOI8-R etc.).
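For instance, one and the same piece of ISO-10646 character data takes quite different byte forms in the different transmission encodings (a sketch in Python, with UTF-16/UTF-32 standing in for UCS-2/UCS-4):

    text = "\u00a9 2006"                     # copyright sign plus some ASCII
    print(text.encode("utf-8"))              # b'\xc2\xa9 2006'
    print(text.encode("utf-16-be"))          # two bytes per character
    print(text.encode("utf-32-be"))          # four bytes per character
    print(text.encode("iso-8859-1"))         # b'\xa9 2006' - only possible for the Latin-1 repertoire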
Let me offer you a pointer to the relevant area at W3C, with particular reference to the sub-heading of Character Sets.
I think it is fair to add that a browser, such as Netscape, that offers the user the ability to configure it to a default character code other than ISO-8859-1 would then no longer be compliant with the HTML2.0 standard, since it would then no longer display a default (i.e iso-8859-1) document correctly. Such changes ought to be under control of the author/server and not under control of the user configuration (although, obviously, they cannot work properly unless the user has taken care to make the necessary resources available to the browser - browsers typically don't come with all of this set up as standard).
For the Latin-2 (Central European) situation there's an interesting resource by P.Peterlin.
If you want to find out about Unicode, try at http://www.unicode.org/. A search at one of the WWW search engines seemed quite productive, returning among other things some interesting papers and discussions from the WWW Working Groups.
Original materials © Copyright 1994 - 2006 A.J.Flavell