Using FONT FACE to extend repertoire?

The suggestion to use FONT FACE with customized fonts as a means to extend the character repertoire is an idea that keeps popping up. And indeed this trick gives a visual impression of working on a proportion of browser versions that are still in use. But this is not HTML's way of extending the character repertoire.

And the same applies ("with knobs on") when CSS is used to propose the use of such a non-standard font via the font-family property.

HTML character representation

The representation of characters in HTML is set out in the HTML 4 specification, in the chapter HTML Document Representation. This specifies that each character - whether it be a coded character, a named entity &entityname;, or a numerical character reference &#number; - represents a specific character as laid down in an international standard, namely Unicode. The relationship between the various character codings used for document transmission, and the underlying Unicode representation on which HTML4 is based, is tabulated in mappings that are stored at the Unicode site.

Thus, for example, the character "W" in an HTML file always represents the "Latin" upper-case letter "W". Either the browser displays this character correctly, or it does not, but the fundamental meaning of the character can never be changed merely by enclosing it in some markup - even less by applying some CSS styling property. I make this as a plain assertion, but I think you'd have to come to the same conclusion from reading the published specifications.
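The rule that every notation resolves to one specific Unicode character can be checked mechanically. The following is only a sketch, using Python's standard html and unicodedata modules (my choice of tooling, not anything the HTML specification prescribes), to show that a coded character, a named entity and both forms of numerical character reference all come out as the same character.

```python
import html
import unicodedata

# Four ways of writing the Greek capital Omega in an HTML source:
# coded character, named entity, decimal reference, hexadecimal reference.
forms = ["\u03a9", "&Omega;", "&#937;", "&#x3A9;"]

# Resolve each notation to the character it stands for.
resolved = [html.unescape(form) for form in forms]

for form, ch in zip(forms, resolved):
    print(f"{form!r:12} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

No font is involved at any point in this resolution, which is the essence of the argument here.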

The Problem

One often sees it advocated to produce unusual characters by enclosing some "normal" character in a FONT FACE specification. And indeed if, for example, the letter "W" is enclosed in FONT FACE="Symbol", then a proportion of browsers will in fact display the Greek letter Omega. Sample: 'W' (if your browser displayed a "W" at that point, then it was working correctly, in the sense of RFC2070 and HTML4; if it displayed an Omega, then it was not in conformance with interworking specifications). Yet a different character might be displayed if the font chosen was, say, Webdings: 'W' or Monotype Sorts: 'W' - if the reader has the font, and if the browser consents to this foolery.

What is happening here? Well, this can be described in two ways. The supporters of this technique claim that this is valid HTML (they point out that "it passes syntax validation, after all"), and that the fact that they see the character they desired proves that it's working (so they claim).

But it's not a matter of syntax, it's a matter of semantics. According to the HTML specification, we have here a Latin letter "W", not a Greek letter "Omega". Some browsers display a Greek letter "Omega", the reason being that the font happens to be populated at that position by a different sign (the "wrong" one, in HTML terms). Thus, in terms of the HTML specification we can conclude that the browser is misbehaving, and compensating for the author's error by displaying what they misguidedly wanted, rather than what the specification says they actually asked for.

However, some browsers and versions will handle this situation differently. Asked to display the character "W" in font "Symbol", a browser may detect that the Symbol font does not contain a Latin letter "W", and so use some visually compatible font that does contain the requested letter "W". And then of course there are browsers that, for their various reasons, do not implement the FONT markup at all: character-cell browsers, speaking machines and so on, not forgetting, of course, those indexing robots!

Unfortunately, the upper-case W is the international symbol for the watt, the unit of power, which would cause hopeless confusion if the author was attempting to produce something looking like the ohm sign, the unit of resistance. (By the way, in Unicode the ohm sign is the character U+2126, &#x2126;: example, Ω, but this is a compatibility character, useful for round-trip code integrity. The Unicode standard, in Chapter 14, section 14.2, recommends using for this purpose the Greek upper-case Omega, U+03A9, &#x3A9;: example, Ω. The glyphs may well be indistinguishable, depending on the font.)
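The relationship between the ohm sign and the Greek Omega can be inspected directly. This is a minimal sketch using Python's unicodedata module (an illustration of mine, not part of any specification): the two are distinct code points, but Unicode normalization form NFC replaces the ohm sign with the Greek letter.

```python
import unicodedata

ohm = "\u2126"    # OHM SIGN
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

print(unicodedata.name(ohm))
print(unicodedata.name(omega))

# As stored, the two characters are different code points...
print(ohm == omega)

# ...but normalizing to NFC turns the ohm sign into the Greek Omega.
normalised = unicodedata.normalize("NFC", ohm)
print(normalised == omega)
```

Neither of them, it should be noted, has any relationship to the Latin letter "W".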

In various places on the WWW, you'll find charts that have been intended to document the results that their author achieved with FONT FACE="Symbol". These charts may (or may not) be accompanied by various kinds of explanation attempting to show how this technique "works" on the WWW, and explaining (or not) the range of behaviours that readers may experience if this trick is used, and warning (or not) against using the trick in one's own authoring. I can only caution against taking these charts at their "face"(!) value. To repeat what was said at the start: this trick gives a visual impression of working on some proportion of the browser versions that are in use, but this is not HTML's way of extending the character repertoire.

Does this matter?

According to the HTML specification, client agents are not expected to interest themselves in the details of how a particular font is populated. The HTML3.2 FONT tag was introduced for getting cosmetically-different renderings of characters, and the same is true of the CSS font-family property. They are not intended to be used for extending the character repertoire of HTML, because HTML has a proper mechanism for that. By no stretch of the imagination could a Greek "Omega" be considered a cosmetic variation of the Latin "W", and so forth.

And so, character-mode browsers that make no use of font variations, also indexing robots, speaking browsers and so forth, can be expected to interpret these characters according to the HTML specification, and not according to the visual appearance that is found on certain browsers (no matter how popular those browsers may be).

On some browsers, the "Samples" below will appear to be Greek: but if you copy/paste the phrase out of the browser into some other application, you will not be surprised to see the "Latin" script.

And indeed, if you paste the phrase into the search term of various search services, you may find this page.
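To make the point concrete, here is a rough sketch of what a markup-ignoring client (an indexing robot, say) gets out of the trick. The TextOnly class and visible_text function are hypothetical names of my own, built on Python's standard html.parser; real robots are of course more elaborate.

```python
from html.parser import HTMLParser
import unicodedata


class TextOnly(HTMLParser):
    """Collects character data and ignores all tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the character content of an HTML fragment, markup removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


sample = '<font face="Symbol">W</font>'
text = visible_text(sample)

# The font request is gone; what remains is the character the author wrote.
print(text, "=", unicodedata.name(text))
```

What such a client indexes is the Latin letter, whatever a particular graphical browser may have painted on the screen.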

The direct binding between a character encoding and a font layout, as found in early browsers (and on which this trick essentially depends) was a dead-end, which RFC2070 aimed to resolve. Modern HTML4 browsers have been supporting this i18n model for some years now. Even some authors who continue to document the "Symbol" font trick will admit that it cannot be relied on to work for all readers.

Some HTML authors have claimed that the use of "Symbol"-type fonts on web pages used to be a respectable technique; some of them indeed cite the Symbol-type font called "Webdings" from MS in support of this view. They then grumble that the pedants in the Mozilla development team are trying to take away from them something that was previously "working" (as they supposed). I reckon that they are mistaken: the character model of HTML was rather clearly spelled out in the HTML/2.0 specification, and clear signposts set for future developments. The implications may not have been evident to many readers at the time, but they were set out with considerable clarity in RFC2070, after which there really can be little excuse for any serious misunderstanding. The technique which they supposed to be "working" never was really working in the specification-conforming sense: it only gave a visual impression of doing what the (misguided) author intended. In many browsers it did that rather too well, to the point of convincing those authors that what they were doing was not wrong. By the W3C specifications, however, they were mistaken, and had been mistaken from the outset. Mozilla is now rectifying that mistake: what it's "taking away" is something which, according to W3C specifications, they never really had. And, it seems, Opera behaves similarly (I reviewed version 7.01 as well as 8.5) in displaying the HTML-specified character, not the one faked by the symbol-type font.

As for the "Webdings" font which was cited in the above argument, it might be perfectly usable as a symbol font in a number of contexts (usually proprietary) where symbol fonts are used, for example MS Word. But an inspection of its font data in Unicode terms shows that its glyphs are assigned to the Private Use Area code points U+F020 to U+F0FD inclusive: if this font were ever to be used from HTML, this would be the proper way to refer to those glyphs (but the use of Private Use Area characters in an uncontrolled WWW situation is not very reliable, and I would advise against it). Such a font has no business being used in W3C-compatible web pages that are published to the WWW when applied to "normal" characters, i.e. in the Unicode range U+0020 to U+00FD. Warning: some MS applications, such as MS Word, which purport to export to HTML format, are found to do precisely that.

See also the Mozilla FAQs on downloadable fonts and symbol/dingbats for further discussion. As the FAQ says: "characters in HTML 4 and XML documents are Unicode characters — not font glyph indexes". Which is another way of saying what I said in the preceding paragraphs.

The use of the CSS font-family counterpart of the HTML/3.2 FONT FACE trick is even less defensible, since CSS should be considered as an optional rendering proposal, not as an inherent property of the content.

In detail...

When the relevant custom installable fonts are downloaded from sites where this technique is promoted, they fall, in detail, into three sub-types; however, the principles in all three are broadly the same, and can be categorised as "symbol font" techniques in a broad sense, even though only one of the three admits it in so many words.

Explicitly marked as Symbol fonts. This is, at least, honest, even though I categorise it as misguided in the HTML context. Mozilla-family browsers, which implement HTML with commendable attention to the specifications, refuse to go along with this trick in their "Standards" mode, although there are limited concessions in "Quirks" mode.
Explicitly marked as Latin-1 fonts. This is outright dishonest, in any context. Clearly, no browser can defend itself against such fraud, but that should not be interpreted as any excuse to use it.
Fonts for which the "Supported Unicode Ranges" and "Supported Code Pages" font data (as reported by the MS Font Properties Extension) are missing. (These could loosely be considered under a heading of "user-defined character encodings", but such usage is generally inappropriate in HTML.) Current tests with Mozilla suggest that it does not block attempts to use such fonts, not even in its "Standards" mode, although this could be rated as a bug and corrected at some future point.

It has to be said that in an HTML context, all of these approaches are bogus.

There might have been some excuse for using this trickery some 10 or more years ago, despite it being contrary to the published HTML interworking specifications, as web page authors could not at that time rely on an adequate level of Unicode support in browsers. But nowadays (2005) such browser versions must be considered long-since obsolete, and there is frankly no excuse for using this misbegotten trick in HTML pages now - and even less excuse for promoting the technique to naive web authors, as some web sites continue to do.

Samples

FONT FACE

Fontasmagorical Phantasie

STYLE font-family

Fontasmagorical Phantasie

Both

Fontasmagorical Phantasie

OK, so how do we use symbols in HTML?

Well, this ought to be clear from the HTML character model (as embodied in RFC2070 and HTML4). To get a desired symbol rendered, one specifies its Unicode code point (as tabulated in the Unicode specification), either as a coded character (in which case you probably want to use utf-8 encoding), or in terms of &-notation (hexadecimal for convenience, or decimal for slightly better compatibility with some older browsers).
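As a small worked example of those notations, the sketch below produces the hexadecimal reference, the decimal reference and the utf-8 coded form for the Greek capital Omega. The two helper functions are hypothetical names of my own, written in Python purely for illustration.

```python
def hex_reference(ch):
    """Hexadecimal numerical character reference for one character."""
    return f"&#x{ord(ch):X};"


def decimal_reference(ch):
    """Decimal numerical character reference for one character."""
    return f"&#{ord(ch)};"


omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

print(hex_reference(omega))      # &#x3A9;
print(decimal_reference(omega))  # &#937;
print(omega.encode("utf-8"))     # b'\xce\xa9'
```

Any of the three forms denotes the same character; which to use is a matter of authoring convenience and of the encoding in which the document is sent.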

Some information for this purpose was (1999) made available in cross-mapping tables at the Unicode.org site:

but see the following discussion of Private Use Area codes.

Where these symbols use standard Unicode code points, there is no need to specify any particular font face (via HTML3.2 font face=, or via CSS font-family specifications) for correct results. Indeed there are advantages, in terms of cross-platform compatibility (as I said before) in not trying to constrain the client agent's choice of fonts.

However, some of the characters documented at those cited URLs are defined in the so-called Private Use Area (PUA), which extends from U+E000 up to U+F8FF, or to be more specific, the "Corporate Use Subarea" (sometimes denoted CUS), located at the high end of the PUA, i.e. extending down from U+F8FF. The use of PUA characters is rather dubious in HTML, since there is no interworking standard for where these characters should be. For example, the Adobe reference, cited above, places some characters in Adobe's part of the Corporate Use Subarea; but if one inspects MS's "Symbol" font, then the relevant characters are found in another part of the CUS.

Over time, quite a few of these PUA characters got assigned new standard Unicode code points: for example the various fragments for constructing large brackets are now in U+23xx; Dingbats are in U+27xx.

Personally, I would recommend avoiding the use of character code points which are in the PUA, as being too unreliable to set loose on the WWW. A few tests showed that only a proportion of the PUA characters in those Adobe tables actually worked in Mozilla, and some of the others displayed as the wrong characters; while none of them displayed at all in MS IE6 (which just showed empty boxes). It's possible that some twiddling with font specifications might improve the performance, but frankly I don't consider the effort to be worth it. My advice would be to look for the character, or an effective replacement for it (e.g. copyright instead of copyright-serif or copyright-sans), in the standard Unicode tables, and avoid the PUA.
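An author who wants to follow that advice could screen text for PUA code points before publishing. The following is only a sketch: the function name is hypothetical, the range is the U+E000 to U+F8FF stated above, and the sample string (with U+F0E3 picked arbitrarily from the range quoted earlier for Webdings) is invented for the illustration.

```python
PUA_FIRST = 0xE000
PUA_LAST = 0xF8FF


def private_use_characters(text):
    """List the code points in text that lie in the Private Use Area."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if PUA_FIRST <= ord(ch) <= PUA_LAST
    ]


# A standard copyright sign followed by an arbitrary PUA character.
sample = "\u00a9 2005 \uf0e3"
print(private_use_characters(sample))  # ['U+F0E3']
```

Anything this reports is a candidate for replacement by a character from the standard Unicode tables.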

The sands of time...

Update (revised Jan 2006): it seems as if Mozilla is generally standing fast against the demand to "support" Symbol font(s). See Bug id 33127 for the at times acrimonious arguments behind the scenes, up to late 2004 (after which things seem to have calmed down, aside from a short outburst in Feb 2005). The developers made a concession, in quirks mode only, for honouring (misguided) requests to use Symbol font. Of course, this is meant for compatibility with existing documents over which they have no influence: neither they nor I would recommend using the technique for new documents.

A related digression - Hebrew

One of those curious coincidences that happen: later in the same day that I finished the first draft of this page, my attention was called to a site with Hebrew and German on it. For the sake of discussion I'll mention that this site was http://www.hagalil.com/, although the site has changed since.

Anyhow, apart from the places where Hebrew text had been included as images, there were several places where Win Netscape 4.5 showed Hebrew text, in spite of the fact that the document had been sent without any charset specification. Inspection of the source code revealed sections like this:

<font face="Web Hebrew AD">úùøá úéøáéò</font>

Here is a demonstration:

תשרב תירביע

Well, from an earlier experiment I did in fact have the font "Web Hebrew AD" installed, back then, on my Win95 system, and sure enough, Netscape 4.5 (still not yet HTML4/RFC2070 conformant except in some special cases) was persuaded by this to display some Hebrew.

MSIE5, however, knew better, and displayed the actual characters contained in the document, i.e. accented Latin-1 characters, u-acute, u-grave and so forth, based on the default of iso-8859-1. It was possible to persuade MSIE5 to display Hebrew by manually setting the encoding to "Hebrew ISO Visual" (the Hebrew part then looked the same as Netscape was showing), but of course then the German umlauts were wrong!
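The two behaviours can be reproduced outside any browser. In the sketch below, the octet values are my own reconstruction, chosen to correspond to the phrase quoted above (the page's actual bytes are not reproduced here), and Python's iso-8859-8 codec stands in for the browser's "Hebrew ISO Visual" setting.

```python
import unicodedata

# Ten letters and a space, as single octets (assumed values, see above).
raw = bytes([0xFA, 0xF9, 0xF8, 0xE1, 0x20,
             0xFA, 0xE9, 0xF8, 0xE1, 0xE9, 0xF2])

# With no charset specified, the default of iso-8859-1 applies.
as_latin1 = raw.decode("iso-8859-1")

# Manually overriding the encoding gives Hebrew letters instead.
as_hebrew = raw.decode("iso-8859-8")

print(as_latin1, "->", unicodedata.name(as_latin1[0]))
print(as_hebrew, "->", unicodedata.name(as_hebrew[0]))
```

The same octets thus denote quite different characters depending on the declared (or defaulted) encoding, and a font request plays no part in deciding which.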

The site was in fact offering detailed instructions on how to download, install and select the font. While it's not disputed that in practical terms this gives an impression of "working" with earlier browsers, it never has been recognised by the open published HTML specifications - quite the contrary (see RFC2070 and earlier discussions) - and it became increasingly problematic as browsers moved to support HTML4 "to specification".

Sure enough, pasting that "phrase" of accented Latin-1 characters, "úùøá úéøáéò", into Altavista resulted, at the time this was originally written, in it finding the precise "Hebrew" page cited. And after this present page had been on the WWW for some time, Altavista found this page too! And this remains true in later years with other search engines. But a search for the real Hebrew text produces quite a different selection of pages.

Careful readers may have noticed one difference here: although MSIE5 was still playing along with FONT FACE="Symbol" in the earlier example, it did not play along with the "Web Hebrew AD" font trick. It appears, on the other hand, that MS continue to make an exception for Symbol font(s), by analogy with usage in MS Word, even though it goes against the principles of HTML. However, it would seem unwise to take advantage of this on the World Wide Web.

Here are the first three Hebrew letters, represented in HTML by their numerical character references, ordered from left to right in the actual markup: אבג . Right-to-left text is discussed in my Text direction page.

Caveat

I can't read Hebrew, much less understand it. I'm only taking an interest in the technology of representing it in HTML. You might care to read Nir Dagan's better-informed pages on the topic.

Indic scripts

The situation with Indic scripts had been particularly anarchic. It seems that each Indian newspaper had defined a different font layout (i.e. in HTML terms, in effect, a different user-defined character encoding) for their use, so that readers of Indian newspapers were expected to download a different font for each newspaper which they wanted to read!

Fortunately this is changing, as proper Unicode support for the various scripts becomes available, and as authors slowly work out that the available "tutorials" for those earlier methods are inappropriate nowadays (if they involve 8-bit characters) or totally bogus (if, as is more often the case, they involve ampersand notations for the numeric character references or character entity names of Latin-1 characters).

A diplomatic side-swerve - user-defined character coding

See: User-defined character coding to extend repertoire?.

It's not just me

Don't take this from me: take it from the W3C's own i18n wizards. And of course the already-cited Mozilla FAQs.

Alan Wood's symbol font demo now rightly explains that the method is contrary to specification, "is not reliable, and should not be done", and tabulates Unicode equivalents.

Jukka Korpela's page on Math in HTML briefly reviews the method, and concludes that "the trick appears to work in many situations, but that's because of browser bugs".

Acknowledgements

The URL of this note is in honour of the classic "<FONT FACE> considered harmful".