Baltic notes

Baltic languages are not my field, and most of the items here are second-hand. In Sept. 2000 we had a discussion about Baltic characters, with particular reference to the Mac platform and Lithuanian, on a German-language usenet group. I was guided, as ever in Mac internationalization issues, by Andreas Prilop. I thought it useful to summarise some points from the discussion here. I'm doing this because of my interest in character codings and related technologies: I have no particular expertise in the languages themselves.

Pre-requisites: this note assumes an understanding of how character codings (are supposed to) work in HTML4. The note will likely make little sense to anyone who isn't reasonably up to speed on that. I've done my best to offer some suitable resources, and won't be repeating them here. Thanks.

Character coding

Earlier Baltic 8-bit codings were iso-8859-4 and iso-8859-10. According to the W3C page the appropriate ISO choice would now be iso-8859-13, as Andreas agreed. iso-8859-13 is close to, though not identical with, Windows-1257 (the differences are described below). Others confirm that in practice, Lithuanian web pages (if they claim any particular coding) predominantly claim to be in Windows-1257 coding. Note also the character database that was recommended by Jukka Korpela, where the required special characters can be researched, e.g. for Lithuanian.

Mappings for these codings can be found at the Unicode web site: iso-8859-13 and windows-1257. Here are my corresponding test pages, subject to the explanation given for the "playground" pages: iso-8859-13 and windows-1257. (These will be sent out from the server with explicit charset specification in the HTTP headers: see the comments on browser support in a moment.)

Aside from the Windows character coding having displayable characters in the range 128-159 decimal as usual, which the iso-8859-* codings reserve for control characters, there are also four differences between the two codings in the range 160-255. (This is analogous to the situation in Greek between iso-8859-7 and windows-1253, where there are also a few differences; but is in contrast to iso-8859-1 versus windows-1252, which are identical in this range.)
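
The differences in the upper range can be checked mechanically. What follows is only a minimal sketch in Python, assuming that Python's built-in codecs iso8859_13 and cp1257 follow the Unicode mapping tables mentioned above; the function names are hypothetical, chosen for illustration.

```python
def decode_byte(value, codec):
    """Return the character that `codec` assigns to one byte value,
    or None if the codec leaves that position unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_positions(codec_a="iso8859_13", codec_b="cp1257",
                        start=160, stop=255):
    """Compare two 8-bit codecs over the byte values start..stop (inclusive).

    Returns a dict mapping each byte value where the codecs disagree
    to the pair (character in codec_a, character in codec_b).
    """
    diffs = {}
    for value in range(start, stop + 1):
        a = decode_byte(value, codec_a)
        b = decode_byte(value, codec_b)
        if a != b:
            diffs[value] = (a, b)
    return diffs


if __name__ == "__main__":
    for value, (a, b) in differing_positions().items():
        print(f"{value} (0x{value:02X}): iso-8859-13={a!r} windows-1257={b!r}")
```

With a stock Python 3 this should list four positions, in line with the count given above. A position shown as None is one that the codec in question leaves unassigned.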

Netscape 4 does not support iso-8859-13 coding; however, iso-8859-13 documents can be usefully viewed in later releases of Netscape 4 if the reader uses the "View->Character Set" menu (which really selects the character encoding) to choose windows-1257. A. Prilop reports that unix versions of Netscape 4 may not offer windows-1257 as an option (I'm unsure about the version history of the Netscape 4.* versions, and just how closely the unix, Win, and Mac versions aligned in this regard).

Alternatively, Netscape 4 can handle utf-8 coding: this would make the document incompatible with earlier browsers, but by now (2005) that might not be considered critical. Despite the existence of support for Windows-1257 in some browser versions, it would be generally deprecated to advertise a proprietary coding on the WWW. If one confined one's usage to the common subset of the two codings, then one could advertise the resulting document as being in either coding, just as we did for the Greek version of the "Quickstart" document in regard to iso-8859-7 and Windows-1253.
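
Whether a given text stays within the common subset of the two codings can be tested by encoding it both ways and comparing the bytes. This is a sketch only, again assuming Python's iso8859_13 and cp1257 codecs; the function name is hypothetical.

```python
def common_subset_bytes(text, codec_a="iso8859_13", codec_b="cp1257"):
    """Return the encoded bytes if `text` encodes to identical bytes in
    both codecs, otherwise None.

    None means that some character is either missing from one of the
    codings or sits at a different position in each.
    """
    try:
        encoded_a = text.encode(codec_a)
        encoded_b = text.encode(codec_b)
    except UnicodeEncodeError:
        return None
    return encoded_a if encoded_a == encoded_b else None


if __name__ == "__main__":
    samples = ["Lietuvi\u0173 kalba", "\u201cLabas\u201d", "10 \u20ac"]
    for sample in samples:
        ok = common_subset_bytes(sample) is not None
        print(f"{sample!r}: {'common subset' if ok else 'not in common subset'}")
```

The Lithuanian letters themselves pass this test; it is typographic quotation marks and the euro sign that tend to fall outside the common subset.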

If you are offering a utf-8-coded version of your document (or using what I describe as the conservative option), then offering it also in a version using an 8-bit character encoding can be beneficial not only to older browsers but also to some search engines: but search engine support for utf-8 is getting steadily better and soon (as of 2005) shouldn't be a problem.
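
As an illustration of keeping parallel versions, the sketch below derives a utf-8 version, an 8-bit version and an ASCII-only version from the same text, writing any character that the target coding lacks as a numeric character reference. It assumes Python's cp1257 codec and the standard "xmlcharrefreplace" error handler; the function name is hypothetical, and the choice of windows-1257 as the 8-bit coding is only an example.

```python
def encode_for_web(text, codec):
    """Encode `text` in `codec`, replacing any character the codec cannot
    represent with a decimal numeric character reference (&#NNN;).

    Intended for text that is already HTML source: markup characters
    such as & and < are passed through untouched.
    """
    return text.encode(codec, errors="xmlcharrefreplace")


if __name__ == "__main__":
    source = "A\u010di\u016b"  # the Lithuanian word for "thanks"
    for codec in ("utf-8", "cp1257", "ascii"):
        print(codec, encode_for_web(source, codec))
```

Each version must of course be sent out with the charset that matches the bytes actually produced.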

Resources for the Mac

Andreas points out that whereas the ISO codings are different for the "Baltic" area than for the "Central European" area, in the case of the Macs the coding for "Central European" is also used for the Baltic area, and therefore his Mac Central European resources page applies.

Andreas recommends that Mac users prepare documents using the native MacCE coding, and then use his software "Convert Central European & Romanian HTML files" (see the subheading "Info-Mac HyperArchive" in that web page) to get a suitable WWW coding (be it 8-bit for older browser versions, or UTF-8 for more modern browsers). Again, keep in mind that although advertised under the "Central European" banner, these Mac resources are also applicable to Baltic.
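
For readers without access to that software, the general idea of such a conversion can be sketched in Python. This is not Andreas's program, merely an illustration; it assumes that Python's mac_latin2 codec corresponds to the MacCE coding, and the function name is hypothetical.

```python
def transcode(data, source="mac_latin2", target="utf-8"):
    """Convert bytes from the `source` coding to the `target` coding.

    Characters that the target coding cannot represent are written as
    decimal numeric character references, which suits HTML source.
    """
    text = data.decode(source)
    return text.encode(target, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # Stand-in for the bytes of a file saved on a Mac in MacCE coding.
    mac_bytes = "\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e".encode("mac_latin2")
    print(transcode(mac_bytes))                    # utf-8 version
    print(transcode(mac_bytes, target="cp1257"))   # 8-bit version
```

Any charset declaration inside the document, and the one sent by the server, would need to be brought into line with the target coding; the sketch does not attempt that.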

Specifying fonts?

Please review my page on this general topic first.

In this section, we're discussing the use of a regional font with a limited repertoire. Increasingly nowadays (2005), one uses comprehensive fonts (typically unicode-based) rather than the regional fonts that were usual in earlier times, and this part of the problem resolves itself. However, the discussion is kept here as a matter of interest.

Specifying an explicitly named regional font in this kind of situation is likely to be harmful. For example, to display Baltic characters in the Arial font family, the Windows-based user would likely be using "Arial Baltic", while the Mac user would need "Arial CE". Thus an author explicitly specifying the named font for the one platform would be harmful to the other: so the best they could hope for would be that the specification was unsuccessful!

Getting "i18n" right in a WWW situation is fraught enough as it is. I definitely cannot recommend exacerbating the situation by trying to impose a specific choice of font on your readers. In general I would advise authors to refrain from use of the legacy FONT FACE construction, and to refrain from, or at least be very cautious in, the use of font name specifications in CSS.

Web-authoring applications

These seem to be a particular problem, especially those developed predominantly in the USA, as some of them haven't the remotest clue how to deal with i18n issues, and produce the most preposterous drivel when used in non-Latin-1 situations. This is exacerbated by naive authors copy/pasting into composer windows out of word processors etc. whose method of handling internationalized content might be very different from what HTML needs.

I can't offer any specific practical advice here, sorry, but before starting a WWW project involving non-Latin-1 codings using this kind of tool, it would be advisable to ask some penetrating questions about your authoring application's ability to deal with such issues. There are certainly i18n-capable web authoring applications, but it's often hard for a beginner to the field to recognise them when selecting an authoring solution. And for word-processor documents it would probably be more productive to look for purpose-designed conversion software (again, after asking the penetrating questions). Anyone who tells you that the solution involves the author setting a particular custom font is an impostor! (as the other documents in this area should make clear).

Legacy documents

On the web will be found an unfortunately large number of legacy non-Latin-1 documents which were prepared for older browsers and which "take advantage" of bugs in those browsers to get the desired effect. A commonly seen abuse is for the document to pretend to be in Latin-1 coding (or in no particular coding at all), intended to be used with some specially-prepared non-standard font. The document then contains designations such as &egrave; or &#232;, with the intention of displaying the substitute characters which their font has in the corresponding positions. This is absolutely wrong and will not work on modern standards-conforming browsers.
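
To make the point concrete: a reference such as &egrave; or &#232; always denotes the Latin-1 letter e-grave (U+00E8), whatever font is in use. The sketch below shows what such a reference really means, and works out the reference the author should have written, on the assumption (purely for illustration) that the custom font was laid out like windows-1257. The function names are hypothetical.

```python
import html


def intended_character(reference, font_coding="cp1257"):
    """Given a character reference that a legacy page misused as a font
    position, return the character the author presumably wanted.

    Assumes the non-standard font placed its glyphs at the positions
    defined by `font_coding`.
    """
    actual = html.unescape(reference)     # what the reference really denotes
    position = actual.encode("latin-1")   # the byte position that was abused
    return position.decode(font_coding)   # what the custom font showed there


def correct_reference(reference, font_coding="cp1257"):
    """Return the numeric character reference for the intended character."""
    return f"&#{ord(intended_character(reference, font_coding))};"


if __name__ == "__main__":
    for ref in ("&egrave;", "&#232;"):
        print(ref, "means", repr(html.unescape(ref)),
              "but was probably meant as", correct_reference(ref))
```

A repair along these lines is only possible when one knows which layout the custom font used, which the document itself does not say.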

Links

Andreas suggested a couple of sample links, which seemed basically OK despite the occasional anomaly. However, one of them subsequently disappeared. Readers should also be aware that there are some other web sites out there which recommend inappropriate or even perverse techniques (such as the ones mentioned in the previous section), which give an impression of working on older browsers but are increasingly failing as browsers move to support the published standards. It may be hoped that readers who have taken on board the principles used in this part of the HTML specification, which I've tried to elucidate in the various pages in this area, will recognise these off-beam techniques when they see them.

