More about jargon

It's very difficult to talk about character codes, fonts, printed characters etc. without having a consistent way of naming the concepts. Unfortunately, the terminology of character codes has been used in various inconsistent ways in different fields (typography, computer design, written-language etc.) and at different times, and it is practically impossible to fix on one consistent set of jargon and to stick to it throughout.

When we see a printed character, e.g. a capital A, in a document that is in English, or any other language that uses the so-called "Latin" alphabet (itself a misnomer of course, since it contains letters that were not used in the Latin language), we recognize it as being an upper-case A irrespective of the fact that different fonts look subtly different. It would not usually occur to us that it might be the letter A from the Cyrillic or Greek alphabets, even though those alphabets have a letter that plays a similar role and looks similar. If we then consider the Latin character H, here we have a letter where similarly-shaped characters exist in both the Cyrillic and the Greek alphabets, but playing quite different roles in each. Viewed in isolation, without any knowledge of the alphabet they came from, these printed letters could be mistaken for each other, but it seems inappropriate to consider them as equivalent to each other, and our usage will preserve the distinction between them. And the same goes for the three "A"s, irrespective of claims about their underlying equivalence.

One might go further and claim some "underlying meaning" in which the A is "the same" in all three alphabets, whereas the "H"s are different. However, this line of reasoning is not particularly helpful in the present context, and we will not pursue it.
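As a rough illustration of keeping look-alike letters distinct, here is a minimal Python sketch. It uses Unicode purely as a convenient present-day example of a code that preserves the distinction; the table names LOOKALIKE_A and LOOKALIKE_H and the helper describe are invented for this sketch.

```python
import unicodedata

# Three capital letters that most fonts draw with the same shape.
LOOKALIKE_A = {
    "Latin": "\u0041",
    "Greek": "\u0391",
    "Cyrillic": "\u0410",
}

# Three similarly-shaped letters that play different roles in each alphabet.
LOOKALIKE_H = {
    "Latin": "\u0048",     # the consonant H
    "Greek": "\u0397",     # the vowel eta
    "Cyrillic": "\u041d",  # the consonant en
}


def describe(chars):
    """Map each script to the (code point, Unicode name) of its letter."""
    return {
        script: (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for script, ch in chars.items()
    }


for table in (LOOKALIKE_A, LOOKALIKE_H):
    for script, (code_point, name) in describe(table).items():
        print(f"{script:9} {code_point}  {name}")
```

Each of the six letters gets its own code point and its own name, however similar the glyphs may look on the page.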

Similar dilemmas occur with punctuation and diacritics. Just by looking at them in isolation, a matched pair of quotation marks could be mistaken for acute and grave accents, and a caret could be mistaken for a circumflex accent. However, they would be used in quite different ways.
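The difference in use can be sketched in Python, again leaning on Unicode only as an illustration. The list SIMILAR_MARKS is a selection made up for this sketch; the names printed are those reported by Python's unicodedata module.

```python
import unicodedata

# Marks that can look alike in isolation but are used differently.
SIMILAR_MARKS = ["\u0060", "\u00b4", "\u2018", "\u2019", "\u005e", "\u0302"]

for ch in SIMILAR_MARKS:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# A combining circumflex merges with the preceding letter under normalization;
# a free-standing caret stays a separate character.
with_accent = unicodedata.normalize("NFC", "e\u0302")
with_caret = unicodedata.normalize("NFC", "e^")
print(len(with_accent), len(with_caret))
```

The combining accent composes with the letter "e" into a single accented character, while the caret remains a character in its own right next to the "e".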

Should we encode characters solely according to their shape, irrespective of their meaning? It's pretty clear that it would be a mistake, in general. The ISO-8859-n series of character codes sets out to assign encodings for what were considered to be the important characters used in each language domain, and assigns codes for a character such as "upper case latin A" or "asterisk" or "oblique stroke", irrespective of the details of how these characters will be displayed in each font.

Nevertheless, there is some ambiguity in the ISO-8859-1 codes. For example, in the ISO-8859-1 code there is a code point assigned to "dieresis or umlaut". But although identical in appearance, the dieresis ("diaeresis" in British usage) plays a very different language role from the umlaut mark (and all of the umlauted vowels used in German have their own code points already, so there is no need for a free-standing umlaut mark). To take another example, the code contains an "ae ligature", but does not contain an "oe ligature" (nor indeed any other ligatures such as fl or ffl). The explanation that is offered for this is that in some Latin-1 languages, the "ae" is considered a letter in its own right (the letter known variously as aesc, aesh, ash, in Anglo-Saxon), rather than a ligature, and was admitted to the code for that reason, whereas mere typographic ligatures were excluded. But why then is the official name of this character "ae ligature", rather than something like "letter aesc"?
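These presences and absences can be checked with a short Python sketch. The helper latin1_byte is a name invented here, and the names printed are Unicode's names as reported by Python, not the names used in the ISO-8859-1 standard itself.

```python
import unicodedata


def latin1_byte(ch):
    """Return the ISO-8859-1 code of ch, or None if the code has no such character."""
    try:
        return ch.encode("latin-1")[0]
    except UnicodeEncodeError:
        return None


# Free-standing dieresis, a-umlaut, ae, oe, and the fl ligature.
for ch in ["\u00a8", "\u00e4", "\u00e6", "\u0153", "\ufb02"]:
    code = latin1_byte(ch)
    where = f"0x{code:02X}" if code is not None else "absent"
    print(f"{unicodedata.name(ch):36} {where}")
```

The free-standing mark, the umlauted vowel and the "ae" all have codes, while the "oe" and the "fl" ligature do not.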

Anyhow, to pull the threads of this rambling together, the terminology which I use in the main part of this briefing is (or is meant to be) as follows. I use the term "character" to refer to the abstract concept that lies behind, say, the Latin upper case A (capital A); and I use the term "glyph" to refer to an actual visible embodiment of that concept in matter that has been written, printed or displayed, e.g. any Latin upper case A, irrespective of its font or style. A "repertoire" consists of a collection of characters which are to be represented, but without reference to their particular representation. A "character code" refers, strictly speaking, to a complete system of representing these characters in a computer, consisting of an appropriate number of "coded characters". In practice, the term "character code" also gets used to refer to a "coded character": for example, one might say that "the character code for Latin upper case A is decimal 65", meaning in other words that decimal 65 is the coded character "Latin upper case A".
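To make these terms concrete, here is a toy Python sketch. The names REPERTOIRE, CHARACTER_CODE and coded_character are invented for this illustration, and the three-character repertoire is of course far smaller than any real one.

```python
import unicodedata

# A toy "repertoire": abstract characters, with no codes attached yet.
REPERTOIRE = ["LATIN CAPITAL LETTER A", "LATIN CAPITAL LETTER B", "ASTERISK"]

# A toy "character code": every character of the repertoire gets one number.
# The numbers chosen here agree with ISO-8859-1.
CHARACTER_CODE = {
    "LATIN CAPITAL LETTER A": 65,
    "LATIN CAPITAL LETTER B": 66,
    "ASTERISK": 42,
}


def coded_character(name):
    """Return the number that the toy code assigns to the named character."""
    return CHARACTER_CODE[name]


for name in REPERTOIRE:
    # Cross-check the toy table against Python's own character database.
    assert coded_character(name) == ord(unicodedata.lookup(name))
    print(f"{coded_character(name):3}  {name}")
```

Nothing in this sketch says anything about glyphs: how a given font draws the capital A is a separate matter from the number 65.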

I stress that this is just my chosen usage in this document, and you may (probably will) find different usages elsewhere. In particular, when dealing with Unicode, a subtly different terminology is used. Unfortunately.

Digression on copyright statements

This item was based on a useful discussion with Tom Neff in late Aug'95. First let it be said that I am no lawyer, and rarely need to worry about copyright issues from a legal point of view. I started the discussion solely from the point of view of HTML standards and the technicalities of ISO character codes. In view of the widespread misconceptions about copyright, it's obligatory that I call your attention to the Copyright Myths FAQ on news.announce.newusers at least. There used to be a full Copyright FAQ in the FAQ archive for misc.legal but I haven't seen it there recently.

If the author wants to take advantage of the additional protection that is conferred by a copyright notice, it is essential to present it to the user in a legally-valid form. Anything else would just be useless decoration. One or more of the word "Copyright", the abbreviation "Copr.", or the C-in-a-circle sign must appear, in addition to the date and the author's name (etc.). Although a properly HTML2.0-compliant browser/platform combination must display the C-in-a-circle correctly, there are, as Tom says, many ways in which a browser might fail to display it properly, due to inappropriate configuration or whatever, and the lawyers could argue that any decent HTML author would have known that; an author who tried to rely solely on the C-in-a-circle character could be held not to have made the proper effort to get a legally valid copyright statement onto the screen. In view of the potentially serious consequences of getting this wrong, I can only pass on this advice. The following form would seem to be appropriate:

Copyright 1995 A.N.Author

The addition of one of the HTML constructions for C-in-a-circle would be a harmless decoration, provided that there was no risk of it disrupting the copyright statement on broken browsers. In view of the only partial coverage for the named entity &copy; at that time (1995) I was reluctantly unable to recommend it, and I therefore recommended the numeric reference, which is &#169;. The following form would seem acceptable:

&#169; Copyright 1995 A.N.Author

However, if this is an important issue to you, I seriously recommend that you take legal advice that is appropriate to your own jurisdiction.
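For the technically curious, here is a small Python sketch of how the named reference &copy; and the numeric reference &#169; resolve. It uses the standard html module as a stand-in for a conforming browser; the variable names are invented for this sketch.

```python
import html

NOTICE_NAMED = "&copy; Copyright 1995 A.N.Author"
NOTICE_NUMERIC = "&#169; Copyright 1995 A.N.Author"

# A conforming parser resolves both references to the same character.
rendered_named = html.unescape(NOTICE_NAMED)
rendered_numeric = html.unescape(NOTICE_NUMERIC)
print(rendered_named == rendered_numeric)
print(ord(rendered_numeric[0]))

# Even if a reference is left unresolved, the word itself survives intact.
print("Copyright" in NOTICE_NUMERIC)
```

This says nothing about the legal position, only about the markup: the word "Copyright" does not depend on the reference being displayed correctly.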

Digression on naming the "paragraph sign"

(Minor updates in 2002; most of this material is from around 1995 however)

As far as I was able to make out, a shift in naming has occurred with time, at least in the computing field (I make no claims for usage in the field of typography, with which I am unfamiliar). However, subsequent discussions reveal also a locale dependence.

There is no doubt about which code point of ISO8859-1 denotes which actual glyph, but there has been some disagreement about what each of the two characters is called: (hex)A7 the intertwined-SS character §, and (hex)B6 the pilcrow character ¶. In earlier computer usage, e.g. from the late 1960s, I have clear evidence that the linked-SS sign was known in computer circles as "paragraph sign". I find a more recent document that makes reference to ISO6937 (which, I must admit, I have not consulted myself) and calls these signs "paragraph sign" and "pilchrow sign" respectively, so it is clear that this terminology persisted for a long time (although the spelling "pilcrow" seems more common than the one with the "h").

In a more recent table reviewing the ISO8859-1 code, they are described respectively as "section/paragraph symbol" and "paragraph symbol USA".

And in the Unicode Names List we also find mention of locale differences. Under the section sign §:

        * paragraph sign in some European usage

and under the pilcrow sign ¶:

        * section sign in some European usage

However, it is clear that the current standard usage is now to call the linked-SS sign § the "section sign", and the pilcrow sign ¶ the "paragraph sign". Indeed, in discussions, some participants have maintained energetically that it was always so, and that my references are defective. But in my submission, the Unicode evidence alone is sufficient to show that my account is not without foundation. See also a presentation by J.Korpela, which seemed reasonable enough to me, but has however been criticised by at least one commentator as being too "parochial".
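The two code points and their present-day Unicode names can be checked with a few lines of Python. The helper glyph_and_name is a name invented for this sketch, and the names are those reported by Python's unicodedata module.

```python
import unicodedata


def glyph_and_name(byte):
    """Decode one ISO-8859-1 byte and return (character, Unicode name)."""
    ch = bytes([byte]).decode("latin-1")
    return ch, unicodedata.name(ch)


for byte in (0xA7, 0xB6):
    ch, name = glyph_and_name(byte)
    print(f"0x{byte:02X}  {ch}  {name}")
```

This shows only the formal character names; the annotations about European usage live in the Names List and are not exposed by this module.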

Digression on National 7-bit codes

It isn't my intention to write a historical thesis on this, but those youngsters who are puzzled about the rules that told you to avoid using "National Characters" such as "tilde" in URLs should review what the term "National Characters" means, and compare the various National 7-bit codes with each other.

Details don't seem to be too readily available on the WWW (probably because ISO would prefer to make some money by selling you a paper copy), but I tracked down an interesting collection in Japan: the collection later moved to a new server.

I subsequently stumbled on INTERNATIONAL STANDARDIZATION OF 7-BIT CODES, a chapter of a report at Terena.

The Kermit site also has a few relevant references.

Keyboards: Some national keyboards do not have a tilde, etc., on them, so inexperienced users who are asked to type one in might not know what to do.

Teletext: this system, in widespread use on European TVs, uses what are in effect "national ISO-646 7-bit codes". Here are some technical details. For sample pages see e.g. NOS Teletekst. Interesting sideline: CERN even has a private teletext system, to distribute particle accelerator status information across their site.

Not surprisingly, teletext pages often want to mention URLs nowadays, and this is a problem with character codes whose repertoire does not include a tilde!
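To see what goes wrong, here is a minimal Python sketch using the German national 7-bit code (DIN 66003) as the example. The table lists the positions where that code differs from US-ASCII; the names GERMAN_7BIT and decode_7bit are invented for this sketch, and Python has no built-in codec for it.

```python
# Positions where the German national 7-bit code differs from US-ASCII.
GERMAN_7BIT = {
    0x40: "§",
    0x5B: "Ä",
    0x5C: "Ö",
    0x5D: "Ü",
    0x7B: "ä",
    0x7C: "ö",
    0x7D: "ü",
    0x7E: "ß",  # the position that US-ASCII uses for the tilde
}


def decode_7bit(data, national):
    """Interpret 7-bit bytes under a national code, falling back to ASCII."""
    return "".join(national.get(b, chr(b)) for b in data)


url_bytes = "http://www.some.dom/~thatuser/".encode("ascii")
print(decode_7bit(url_bytes, GERMAN_7BIT))
```

A device using this national code shows a sharp s where the URL's author intended a tilde, and a reader has no way to guess what was meant.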

As far as URLs are concerned, it's probably still a good idea to avoid "national" characters - the most notorious example being that tilde "~", which is more correctly represented in URLs as %7E. It's true that in many situations, especially when dealing with experienced web practitioners, no harm will come of presenting a URL with an actual tilde in it. But where these URLs will be presented to the general public, e.g. on teletext, or printed in magazines, newspapers etc., you are taking a distinct risk if you attempt to include a tilde. It's probably best if you can re-jig the server configuration so that it doesn't require the use of the tilde at all, e.g. by converting URLs of the form http://www.some.dom/~thatuser/ into http://www.some.dom/users/thatuser/ or similar.
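Both approaches can be sketched in a few lines of Python. The function names escape_tilde and rewrite_user_url are invented here, and the /users/ layout is just the example from the text, not any particular server's configuration. The escaping is done by hand because current versions of Python's urllib.parse.quote leave the tilde alone, treating it as an unreserved character.

```python
import re
from urllib.parse import unquote


def escape_tilde(url):
    """Replace each literal tilde by its percent-encoded form %7E."""
    return url.replace("~", "%7E")


def rewrite_user_url(url):
    """Turn /~name/ (or /%7Ename/) into /users/name/, as in the example above."""
    return re.sub(r"/(?:~|%7[Ee])([^/]+)", r"/users/\1", url, count=1)


original = "http://www.some.dom/~thatuser/"
escaped = escape_tilde(original)
print(escaped)
print(unquote(escaped) == original)
print(rewrite_user_url(original))
print(rewrite_user_url(escaped))
```

The escaped form decodes back to the original URL, and both spellings rewrite to the same tilde-free address.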

Some details about using CP819

In Jan 1997 I got an email from Joao Magalhaes in Portugal, which I'm summarising here.

The IsoCp101 files work: not just in viewing, but also with a Portuguese keyboard.

The installation is actually quite simple: just put isolatin.cpi and isokb850.com somewhere in the path, say, c:\dos.

In config.sys one might have e.g.

country=351,850,c:\dos\country.sys
devicehigh=c:\dos\display.sys con=(ega,850,2)

The real work happens in autoexec.bat. After any code page and keyboard configuration (this one mandatory) just force DOS to accept code page 819 and the keyboard driver:

REM    actually it may no longer work
lh nlsfunc
REM    setting up the monitor
mode con codepage prepare=((850) c:\dos\ega.cpi)
mode con codepage prepare=((819) c:\dos\isolatin.cpi)
mode con codepage select=819
REM    setting up the keyboard
lh keyb po,850,c:\dos\keyboard.sys
REM    install the keyboard driver
lh c:\dos\isokb850
REM    activate
c:\dos\isokb850 /A
REM    actually it seems not to work any more
chcp 819
REM    just to take a look
mode con cp /status

Problems: when deactivating the keyboard driver and the 819 code page, the console seems not to return to code page 850 but to code page 437 (default).

In Windows 3.11: When running DOS in a window the characters will be wrongly displayed, but will appear correctly if you use full screen. The problem might(?) be solved by editing (or setting up again) the Keyboard and Boot.Description in System.ini.
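For anyone wondering why the choice of code page matters, here is a small Python sketch of the difference between code pages 819 and 850. It relies on Python's standard codecs, which accept "cp819" as an alias for ISO-8859-1; the variable names are invented for this sketch, and the letter c-cedilla is simply a convenient Portuguese example.

```python
import codecs

# Python treats code page 819 as another name for ISO-8859-1.
same_codec = codecs.lookup("cp819").name == codecs.lookup("latin-1").name
print(same_codec)

text = "ç"
as_819 = text.encode("cp819")  # the byte used by ISO-8859-1
as_850 = text.encode("cp850")  # the byte used by the DOS multilingual code page
print(as_819.hex(), as_850.hex())

# ISO-8859-1 text shown on a console that is still in code page 850
# comes out as some other character.
misread = as_819.decode("cp850")
print(misread == text)
```

The same letter occupies different positions in the two code pages, which is why text prepared in one is displayed wrongly under the other.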