Some notes about the negotiation of natural language on the WWW.
This page makes no attempt to be a complete tutorial on the topic. It only discusses a selection of issues that seem to keep coming up in discussions.
Note on Apache versions: this page now cites references to Apache httpd version 2.0. There are corresponding references at the Apache site also for the older version, 1.3.
The HTTP protocol offers two different mechanisms for a client and server to negotiate alternative resources to fulfil a user request. A fine introduction to the concepts and terminology (as well as to Apache's support for them) can be found under the title Apache Content Negotiation at the Apache web site.
There are two fundamentally different mechanisms defined: server-driven negotiation which is well-established and on which the present notes will be based, and transparent negotiation as defined in RFC2295 and 2296, which is more experimental and won't be covered here.
This mechanism is based on a set of HTTP Accept headers which a client (such as a browser) may send to the server in any request. These include Accept for Content-type; Accept-Charset for what is now more properly called the "Character encoding" of text content-types; and so on.
The header which interests us here is the Accept-Language header, which is supposed to represent the user's preferences for different languages. Typically, when a browser is installed, this preference list might be initially empty, or might be set to the language in which the operating system has been configured to work. Browsers are supposed to offer intuitive configuration for users to indicate their language preferences, so that they may be used in requests to servers.
There are numerous reasons why server-driven language negotiation should not be the only selection mechanism available to users. Here are a few:
For this and other reasons, it's recommended that pages which are available to readers in more than one language variant should offer explicit links to the other languages. These links could be marked with the names of the relevant languages (each in their own language and writing system, as far as possible). Note that the frequently seen use of national flags for this purpose is illogical, and can be a cause of pointless annoyance to one's readers, as an article by J.Korpela has pointed out.
Authors are often seen arguing that it is pointless to apply language negotiation, because most people have no idea how to configure their browsers to use it properly.
They are then inclined to respond to this belief (which may very well be true) by designing their own weird and wonderful language selection mechanism, exclusively for their own site, based typically on some kind of user dialogue leading to the setting of a cookie, or even on making guesses based on the user's IP address or domain name.
I would suggest that this is perverse in several important respects.
Readers spend most of their time on other people's sites. You do them no benefit by wasting their time teaching them a mechanism that only works on your own site. You would be doing them much more of a favour, if you want to teach them anything, to teach them to find the language preferences selection that is already built in to their own browser, so that they can take advantage of the WWW's standard method of language negotiation on every site that supports it, rather than on just one.
These site-specific dialogues which the author is aiming to implement for language selection need to communicate with the reader in some language or other, thus creating a "chicken-and-egg" problem; whereas the user's browser has presumably already been configured to communicate with the user in a language that is acceptable to them.
Readers who are concerned about privacy are likely to be even more reluctant to accept and send cookies than they are to send the standard language-preferences selection.
Guesstimates of a reader's preferred language based upon the apparent calling IP address or DNS name are notoriously unreliable in a WWW context.
It has already been recommended that the author should provide explicit links to other language variants, so the language negotiation feature will do no more than to make an initial selection of language, which the reader can then overrule. So, even if they don't bother with the WWW's standard function for this purpose, the other language variants will still be accessible to them. So everybody wins.
For this purpose, language tags are as defined in RFC2616, consisting of a primary language tag, for example en for English, de for German, and so on, possibly followed by secondary tags indicating some subset of the generic language, e.g en-GB for British English.
User preferences are expressed in terms of language tags (whether generic or with secondary tags), possibly with q= values to express their relative preference. A willingness to receive any language that the server has available would be expressed by including "*" along with the list: the absence of "*" from the list means, strictly speaking, that the user rejects any language that has not been included in their preferences list (although RFC2616 makes the concession that it may sometimes be preferable for the server to send an available variant anyway, if it has nothing fitting the requestor's stated preferences).
The language variants that are available at the server can be categorised by their language and, optionally, by a qs= source quality parameter.
In conjunction with any Accept-Language header, the server is then in a position to determine which document variant to send back. If no Accept-Language header was included, then any available language is treated as equally acceptable to the user.
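To make the shape of the header concrete, here is a minimal sketch in Python of how a server-side program might split an Accept-Language value into language tags and q values. The function name and the sample header are my own illustration, not taken from any server's code, and the handling of a malformed q value is an assumption on my part.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into a list of (tag, q) pairs.

    Tags are lower-cased for comparison; a missing q parameter means q=1.0.
    """
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    # Assumption: an unparsable q is treated as a rejection.
                    q = 0.0
        preferences.append((tag.strip().lower(), q))
    return preferences


print(parse_accept_language("en-GB, en;q=0.8, de;q=0.5, *;q=0.1"))
# [('en-gb', 1.0), ('en', 0.8), ('de', 0.5), ('*', 0.1)]
```

Note that the result keeps the items in the order sent; whether that order means anything is a separate question.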
If the server finds no available variant which meets the user's requirements, then one of two procedures can be followed.
Return the status 406, which indicates "nothing acceptable" is available. The server should include a menu of the variants which are available, to give the reader an opportunity to choose one anyway.
Return one of the available variants (presumably the one with highest source quality) as a normal (status 200 OK) response, relying on other means (such as the explicit links to other language variants that were already recommended above) for the reader to make any other choice.
If the user's language preferences list is empty, then the browser should omit the Accept-Language header from the request. An Accept-Language header without parameters would be a protocol error (the specification requires at least one language token), and a header with only "*" as language token would be pointless.
In this situation, the server has no basis on which to decide the language to return: as far as the language dimension of negotiation is concerned, all applicable documents appear equally acceptable to the recipient. The author may be able to configure the server to send one particular language in preference to others: see e.g the LanguagePriority directive in Apache. Don't confuse this with the quite-different situation where the client has indicated a language preference, but none of the preferred languages is in fact available (this is the "Status 406 dilemma" discussed below).
*
If the user selects one or more explicit language tokens, then there is a difference in meaning depending on whether they include "*" in the list. Contrary to what some have suggested, it would be quite wrong for a browser to silently insert an unsolicited "*" into the list, although a browser might usefully propose it to a user when they are setting their language preferences.
Some browsers which encourage their user to configure an ordered list of preferred languages send the preferences to the server without any q-values, e.g Accept-language: en,de,fr
It should be noted that the specifications don't require a server to treat such an ordering as significant: all the languages are assumed to have q=1.0 and thus be equally acceptable from the user's side. As I understand it, the coding of the content negotiation in Apache is such that if several variants are available (and of equal source quality, expressed or implied), then the first one will be taken from the ordered sequence, which is likely what the reader intended. However, this result is not mandated by the specification: the user should really express their preference by means of q values.
Indeed, most of the browsers that have been tried recently, even if they don't prompt their user for a preference value, will compute values to assign to the ordered list. Mozilla (to take one example) addresses this issue by automatically assigning q values to the list, for example if one configures an ordered languages selection as (let's say) fy,en-GB,en,de,fr then what Mozilla actually sends is fy,en-gb;q=0.8,en;q=0.6,de;q=0.4,fr;q=0.2 - and the behaviour in recent versions of MSIE seems to be quite similar.
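The figures in that example are consistent with simply spreading the q values evenly down the list. The following Python sketch reproduces them on that assumption; it is my own reconstruction from the observed header, not code taken from Mozilla, and the function name is hypothetical.

```python
def assign_q_values(ordered_tags):
    """Build an Accept-Language value from an ordered list of language tags.

    Assumption: the q values fall in equal steps of 1/n, with the first
    entry left at the implied q=1.0.
    """
    n = len(ordered_tags)
    parts = []
    for position, tag in enumerate(ordered_tags):
        tag = tag.lower()
        if position == 0:
            parts.append(tag)
        else:
            q = (n - position) / n
            # HTTP allows at most three decimals; drop trailing zeros.
            q_text = f"{q:.3f}".rstrip("0").rstrip(".")
            parts.append(f"{tag};q={q_text}")
    return ",".join(parts)


print(assign_q_values(["fy", "en-GB", "en", "de", "fr"]))
# fy,en-gb;q=0.8,en;q=0.6,de;q=0.4,fr;q=0.2
```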
Of course, if distinct q values are supplied on a request, then the q values unambiguously override anything which might have been implied by their sequence. For example, in the somewhat perverse acceptance list de;q=0.2,fr;q=0.9,en the implied q value of 1.0 for the en gives it first preference, followed closely by fr, and de a poor third, irrespective of the order in which the terms appear.
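As a small illustration in Python, taking that acceptance list as three comma-separated items already split into (tag, q) pairs: ranking by q alone gives the order just described. The second half shows what happens to a list with no q values at all. Python's sort happens to be stable, so the original sequence survives as a tie-break, but that is a property of this sketch and not something the specification requires of a server.

```python
# The "perverse" list, with the implied q=1.0 for en written out.
preferences = [("de", 0.2), ("fr", 0.9), ("en", 1.0)]

# Rank purely by q value, highest first; position in the list plays no part.
ranked = sorted(preferences, key=lambda pair: pair[1], reverse=True)
print(ranked)
# [('en', 1.0), ('fr', 0.9), ('de', 0.2)]

# With no q values every language has q=1.0, so the sort has nothing to go on.
unweighted = [("en", 1.0), ("de", 1.0), ("fr", 1.0)]
print(sorted(unweighted, key=lambda pair: pair[1], reverse=True))
# [('en', 1.0), ('de', 1.0), ('fr', 1.0)]
```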
This is an area which seems particularly counter-intuitive. What RFC2616 says about its prefix matching rule is very definite.
If the user states a preference for a language subset, e.g en-GB for British English, this preference should only be used when a variant is available in that specific subset (British English, in this example). If they do not also include an unqualified en amongst their language preferences, then it means that they are rejecting the generic English version.
Similarly, if a user expresses preferences in the relative q-value order of, for example, en-GB, then de, then en, it means that if no specifically British English variant is available then they should be sent the German variant, in preference to generic English.
RFC2616 puts the responsibility on the browser designer to assist the user in understanding the consequences of this rule and to help them in including the generic language in their list if this is their intention (I know of no browser which actually offers this assistance!).
It could well be argued that RFC2616 has made an unfortunate choice in this regard, due to its counter-intuitive behaviour. It's at least plausible that if the server had no British English variant then the best way to respond to a request for en-GB would be to return a generic English variant, and that the user should be expected to take some additional action (such as including "en;q=0" amongst their preferences) if they want to indicate rejection of a generic variant. Whatever the attractions of this alternative approach might seem, RFC2616 still says what it says.
Note however that whereas a user acceptance of en-GB cannot match a document whose language is en, in the converse situation a user acceptance of en can match a document whose language is en-GB: this is the so-called "prefix" rule in RFC2616.
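The prefix rule and its consequences can be sketched in a few lines of Python. This is my own reading of RFC2616 (the quality of a variant is the q value of the longest matching language range, with "*" matching only what nothing else matches), not Apache's code; the function names are hypothetical, and the q values 1.0, 0.8 and 0.5 for the "en-GB, then de, then en" example are assumed purely for illustration.

```python
def range_matches(language_range, variant_language):
    """Prefix rule: a range matches an equal tag, or a tag that continues with '-'."""
    language_range = language_range.lower()
    variant_language = variant_language.lower()
    if language_range == "*":
        return True
    return (variant_language == language_range
            or variant_language.startswith(language_range + "-"))


def quality_for(variant_language, preferences):
    """Return the q of the longest matching range, or 0.0 if nothing matches."""
    best_length, best_q = -1, 0.0
    for language_range, q in preferences:
        if range_matches(language_range, variant_language):
            length = 0 if language_range == "*" else len(language_range)
            if length > best_length:
                best_length, best_q = length, q
    return best_q


def choose_variant(available, preferences):
    """Pick the available language with the highest q, or None (the 406 case)."""
    scored = [(quality_for(language, preferences), language) for language in available]
    scored = [pair for pair in scored if pair[0] > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


preferences = [("en-gb", 1.0), ("de", 0.8), ("en", 0.5)]
print(choose_variant(["en", "de"], preferences))             # de
print(choose_variant(["en-GB", "de"], preferences))          # en-GB
print(choose_variant(["en", "en-GB"], [("en-us", 1.0)]))     # None
print(choose_variant(["en-GB"], [("en", 1.0)]))              # en-GB
```

The third call is the uncomfortable one: a browser sending only en-us is, by the letter of the rule, refusing both generic and British English.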
Where a document contains several languages, the factor that determines the proper settings for the purpose of language negotiation is its intended audience. For example, as RFC2616 points out, an English-language elementary Latin primer would be intended for an English-speaking audience, and should not be negotiated as if it were a Latin document, even though much of its content is in Latin.
q=0
Don't forget that a q value of 0 means that the user is explicitly rejecting the associated language! Many kludged-up CGI scripts, PHP pages etc. get this hopelessly wrong, assuming that any mention of a given language in the Accept-Language header is good enough for that language to be sent. But no! An Accept-Language string of, let's say, "en;q=0,*" means that the user accepts any language just so long as it isn't English. You might consider such a request to be perverse, but, as far as the protocol is concerned, its meaning is clear, and a failure to honour it is a definite protocol error.
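The mistake is easy to demonstrate. In the Python sketch below, the first function is the kind of "any mention will do" test that such scripts tend to use, and the second honours q=0 and the wildcard. Both function names are hypothetical, and the preferences are given as already-parsed (tag, q) pairs for brevity.

```python
def naive_accepts(header, language):
    # The flawed test: any mention of the language counts as acceptance.
    return language in header


def accepts(preferences, language):
    """True if the language is acceptable, honouring q=0 and the '*' wildcard."""
    language = language.lower()
    matches = [(len(tag), q) for tag, q in preferences
               if tag != "*" and (language == tag or language.startswith(tag + "-"))]
    if matches:
        # The longest matching range decides, even if its verdict is q=0.
        return max(matches)[1] > 0
    return any(q > 0 for tag, q in preferences if tag == "*")


header = "en;q=0,*"
preferences = [("en", 0.0), ("*", 1.0)]

print(naive_accepts(header, "en"))    # True  (wrong: English was rejected)
print(accepts(preferences, "en"))     # False
print(accepts(preferences, "en-GB"))  # False (covered by the rejected en)
print(accepts(preferences, "fr"))     # True  (admitted by the wildcard)
```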
To get an idea of the typical spectrum of language preference settings likely to be encountered, I created a custom log file in our Apache server, by means of the configuration entry (all on one line):
CustomLog /var/log/httpd/lang_log "%h %l %u %t \"%r\" %s %b \"%{User-agent}i\" \"%{Accept-language}i\""
This, obviously, logs the Accept-language settings as well as the presented "User-agent" identification for every request.
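For anyone wanting to repeat the exercise, a log in that format can be tallied with a short Python script along the following lines. This is a sketch, not the procedure I actually used: the sample lines are invented, and the regular expression assumes that neither of the last two quoted fields contains an escaped quote character. Apache logs a "-" where the header was absent.

```python
import re
from collections import Counter

# The last two quoted fields of each line: User-agent, then Accept-language.
TAIL = re.compile(r'"(?P<agent>[^"]*)" "(?P<langs>[^"]*)"\s*$')


def tally_languages(lines):
    """Count how often each Accept-language value occurs in the log lines."""
    counts = Counter()
    for line in lines:
        match = TAIL.search(line)
        if match:
            counts[match.group("langs")] += 1
    return counts


# Invented sample lines in the log format shown above.
sample = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0100] "GET / HTTP/1.1" 200 2326 "Mozilla/4.0 (compatible; MSIE 6.0)" "en-us"',
    '192.0.2.2 - - [10/Oct/2005:13:56:01 +0100] "GET / HTTP/1.1" 200 2326 "Mozilla/4.0 (compatible; MSIE 6.0)" "en-us"',
    '192.0.2.3 - - [10/Oct/2005:13:57:12 +0100] "GET /robots.txt HTTP/1.0" 404 210 "ExampleBot/1.0" "-"',
    '192.0.2.4 - - [10/Oct/2005:13:58:40 +0100] "GET / HTTP/1.1" 200 2326 "Lynx/2.8.5" "de,ru;q=0.5"',
]

for value, count in tally_languages(sample).most_common():
    print(count, value)
```

In real use one would pass an open file object for the log instead of the sample list.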
The results were rather surprising - indeed, in some respects astonishing.
Relatively few requests were made without language preferences, and quite a proportion of those without a language preference were coming from search robots and other automatic processes.
A reasonable proportion of requests were made with entirely plausible language preference settings, such as "de,ru;q=0.5", "en, ja", "en, bg;q=0.50"; one remarkable setting that was observed was:
"vi, fr;q=0.8, fr-ca;q=0.7, fr-be;q=0.5, fr-ch;q=0.3, fr-lu;q=0.2"
However, the big surprise was the overwhelming proportion of requests whose language preferences contained only "en-us": the great majority of these requests were coming from MSIE versions, although some other browsers (Netscape6, even Lynx) were represented in small numbers.
One bunch of requests was made with only "de-at" (Austrian German) in its language preferences!
It would appear therefore that the message of RFC2616 simply isn't getting through, neither to the browser designers nor to the end users. It's hard to believe that such an overwhelming proportion of US-Americans really did intend to reject all regional versions of English other than their own, and even harder to believe that a webnaut would intentionally browse a British web site while demanding nothing but Austrian German.
A little experiment with MSIE showed that if all of the language preference entries were deleted, it correctly omitted the Accept-language header in its requests. The conclusion seems to be that on initial installation, MSIE pre-sets the Accept-language to the operating system regional setting - rather than to no preferences at all - but that it fails even to add the generic language as a fallback choice. And as I commented already, not only IE but also the other popular browsers make no attempt to assist users with their language choices in the way that is recommended in RFC2616.
The conclusion would be that when setting-up a language negotiated site, one should take care not to apply the "we have nothing for you" rules too rigidly, as the theory seems to be quite some way from the situation encountered in widespread practice.
In discussion on Usenet, a server admin proposing to set up a Spanish/English dual language site for a community of Central-American users came to the conclusion that the bulk of his readers, although unable to read English web site content, nevertheless had learned to use their IE browsers "as installed" in American English, and so would never be offered the Spanish-language content which they required - only the US English which they didn't want. He sadly concluded that language negotiation was unworkable for his requirement. Which I can only rate as a great pity.
Apache's content-negotiation, which includes language-negotiation, is done in module mod_negotiation. Version 2.0 contains some additional possibilities, over and above the earlier version 1.3, which are well documented in their Version 2.0 content negotiation notes.
mod_negotiation implements two different ways in which the information provider may define the properties of their documents: MultiViews, and negotiation type-maps. Note that Apache does not inherently use any of the internal clues that might be found in an HTML document, such as the <meta> tag or the lang attribute on the <html> tag: the server expects this information to be communicated to it in other ways, which have the advantage that they are not confined to HTML files but can be applied equally to other kinds of content, such as audio files, PDF or even plain text.
The simpler and easier-to-use of the configuration options is MultiViews. Although convenient and easy to use, this does not give access to the full range of possible negotiation features supported by the type-map method. Apache documentation of course gives details.
Well, there seem to be two parts to this dilemma.
The problem with status 406 is that it means that the reader has rejected all of the available variants: in the context of language negotiation, the question then is what language shall you use to tell them the bad news??? If you tried to negotiate the language of the error document, you might end up with yet another status 406...
Apache's response is to generate a very basic error document: apart from the menu of available document variants (the HTTP protocol specification tells us that we "SHOULD" generate this, and that's reasonable enough), the error page contains almost no explanation.
(By the way, in some versions of Apache this error document could be so small as to fall foul of MSIE's ridiculously-named "friendly error message" rule, resulting in the menu of available variants being hidden from the reader!)
Beyond the menu of available variants, Apache's error page merely displays the status (406) and the conventional (English-language) text, "Not Acceptable". Unfortunately, this seems to many authors and readers to be quite rude, especially as a proportion of readers will misunderstand it to be accusing the reader of doing something unacceptable! But what it really means is that the server is responding "Sorry, on the basis of what you already told me, I had to conclude that I have nothing which you would find acceptable" - a message which, however, would be hard to convey accurately and politely to readers whose command of English is unproven.
Many authors have, quite reasonably, concluded that they would do better to send the reader one of the available variants regardless (i.e the author's choice of language, on some basis). And indeed the HTTP/1.1 protocol spec allows this, with the words
Note: HTTP/1.1 servers are allowed to return responses which are not acceptable according to the accept headers sent in the request. In some cases, this may even be preferable to sending a 406 response.
And, so long as authors are taking the advice to offer readers some alternative, explicit, way of navigating the available variants, this seems an entirely acceptable procedure.
Some notes on how to configure Apache to achieve this via MultiViews will be given later.
Normally, Apache's provisions for authors to supply a custom error document are convenient and satisfactory. In the case of error 406, however, this seems not to be the case. The reason is that the built-in error document includes also the generation of a menu of available variants: but if the server is configured with an ErrorDocument directive, then Apache generates the custom error page which has been specified, but the menu of available variants is lost.
One might approach the problem by configuring a server script as the custom error document. But when this script is invoked, there is no provision for the script to be given the details of the available variants, so it would have to go and compute those all over again for itself.
The only alternative would seem to be to write a custom module based on the existing Apache code where this menu is generated. Thus losing the simplicity of a relatively free-standing ErrorDocument script.
(This section assumes that readers are at least superficially familiar with the relevant documentation.)
Content negotiation via MultiViews is based on the use of multiple filename extensions which are defined to the server as being characteristic of some feature of the content: html for text/html, txt for text/plain, pdf for PDF, and so on; fr for French, en-GB for British English, gz for gzip-encoding, etc.
MultiViews works best if the URLs themselves are devoid of all filename extensions. The module then evaluates the best match to the available variants on the basis of the client's Accept* headers and the filenames which it finds in the subdirectory, e.g a request for the URLpath /some/dir/foo might be resolved to the actual file foo.html.de.gz, the gzipped German-language HTML format, in a particular case.
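The way the extensions carry meaning can be illustrated with a little Python. This is only a model of the idea, not of Apache's implementation: in a real server the three tables would come from the configuration (directives such as AddType, AddLanguage and AddEncoding), whereas here they are hypothetical tables written out by hand.

```python
# Hypothetical extension tables, standing in for the server configuration.
CONTENT_TYPES = {"html": "text/html", "txt": "text/plain", "pdf": "application/pdf"}
LANGUAGES = {"en", "en-gb", "de", "fr"}
ENCODINGS = {"gz": "gzip"}


def describe_variant(filename):
    """Work out what each extension of a filename says about the content."""
    base, *extensions = filename.split(".")
    info = {"base": base, "type": None, "language": None, "encoding": None}
    for extension in extensions:
        key = extension.lower()
        if key in CONTENT_TYPES:
            info["type"] = CONTENT_TYPES[key]
        elif key in LANGUAGES:
            info["language"] = key
        elif key in ENCODINGS:
            info["encoding"] = ENCODINGS[key]
    return info


print(describe_variant("foo.html.de.gz"))
# {'base': 'foo', 'type': 'text/html', 'language': 'de', 'encoding': 'gzip'}
```

The order of the extensions makes no difference to this classification, which is why foo.de.html.gz would describe the same variant.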
If you decide that you want to send a default document instead of risking the error 406 dilemma, then you could provide a file (or in unix a symlink) under the name e.g foo.html. This will match the acceptability criteria for any client which accepts text/html, irrespective of its preferences for language or its capability for handling gzip compression.
For this purpose, it is important to be aware that a file called precisely foo would wreck the negotiation, since the MultiViews procedure is only invoked if there is no exact match of URL to filename. Conversely, if an href was made to foo.html and such a file exists then it will be returned, irrespective of any kind of negotiation. This allows you to give readers explicit links to variants which their browser/preferences have otherwise declared to be unacceptable to them.
There is no overt provision under MultiViews to express the idea that some variants are of better quality than others. If you need to support this, then the alternative procedure, i.e the map file, would seem appropriate. [I've seen it confidently stated that one can use MultiViews, by supplying qs= values via an AddType directive: and indeed, on some superficial tests it did indeed seem to be working - but this doesn't seem to be advertised in the Apache documentation, so I'm a bit uneasy about using it on that basis.]
It had been noticed, in relation to Google in particular, that documents with URLs of the form somename.en.html, somename.it.html etc. got successfully indexed, whereas ones of the form othername.html.en, othername.html.it were omitted from their index, despite both sets of documents being sent out with the correct Content-type and other relevant HTTP headers.
It would seem that Google had been only considering URLs which ended with specific filename extensions (html and htm in this case) for indexing as HTML documents, and was ignoring other documents merely on the basis of the "filename extension" in the URL, irrespective of the correct Content-type advertised from the server.
It does seem that they had subsequently re-jigged this arrangement, since HTML files with other filename extensions were seen to be getting indexed later too.
This observation referred specifically to Google, but there may well be other indexers etc. that are making similarly flawed assumptions. At least, I'm not aware of any for which contradictory advice should be offered.
No adverse reports have been received about AltaVista and AllTheWeb, who appear to handle this issue appropriately. However, I did hear some unsettling news about Google Images failing to index some images whose URLs used MultiViews i.e without explicit jpg, gif etc. name "extensions".
The potential snag with an existing site is that your href links will very likely be pointing to foo.html rather than just to foo, and these links will have gotten out into search engines, bookmarks etc.
In this situation, if you're not to suffer a "big bang" renaming of files and redirecting of obsolete URLs, all of your negotiated files need names which start with foo.html, for example foo.html.en, foo.html.de.gz etc.: you mustn't call the files foo.en.html, foo.de.html.gz, see the "Note on hyperlinks and naming conventions" in the Apache documentation if this isn't already obvious.
If you only have files whose names have additional extensions, such as foo.html.en, foo.html.de.gz etc. then all goes well with these foo.html links. But if you attempted to provide a default (fallback) variant by calling it foo.html then the MultiViews would stop working, because the exact match to the href would be found and the negotiation would never be performed, just as has already been explained.
There is however a neat workaround. Simply call your desired default file foo.html.html (repeating the file extension which characterises the content-type), and the negotiation will proceed, even when the link is specified as foo.html.
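The difference between the two naming choices can be modelled in a few lines of Python. I stress that this is a simplified picture of the behaviour described here (exact match first, otherwise negotiate among the files whose names extend the requested one), not Apache's actual lookup code; the function name and directory listings are invented.

```python
def resolve(requested, files):
    """Return ('exact', [file]) or ('negotiate', candidates) for a requested name."""
    if requested in files:
        # An exact filename match is served directly; no negotiation happens.
        return ("exact", [requested])
    candidates = sorted(name for name in files if name.startswith(requested + "."))
    return ("negotiate", candidates)


# Default variant named foo.html: links to foo.html never get negotiated.
print(resolve("foo.html", {"foo.html", "foo.html.en", "foo.html.de.gz"}))
# ('exact', ['foo.html'])

# Default variant named foo.html.html: negotiation proceeds, default included.
print(resolve("foo.html", {"foo.html.html", "foo.html.en", "foo.html.de.gz"}))
# ('negotiate', ['foo.html.de.gz', 'foo.html.en', 'foo.html.html'])
```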
In this situation, though, your other files will have names like those shown above: foo.html.en, foo.html.de.gz etc., which don't end with the filename extension which other folks might expect; well, that ought not to be a problem, since the real Content-type specifier is supposed to be the one that the server sends on its HTTP response header, and not some filename extension that just happens to be used within the server. But nevertheless, users do sometimes fall into the habit of expecting an appropriate filename extension at the end (and this might have consequences for search engines, as the previous subsection has commented). If the circumstances call for it, then feel free to hang another copy of the .html onto the end of the name, as in foo.html.en.utf-8.html etc., so that the "foo.html" part still expresses the original filename, but the new filename still has "html" on the end for anyone who's made the mistake of confusing a filename extension with a MIME content-type.
There are ways of using PHP, SSI, CGI etc. co-operatively with negotiation - of language, content-type or otherwise. Different approaches may be taken, depending on whether MultiViews or the type-map method is used. However, some of the options seem to be limited to Apache/2.
The general idea is that we let Apache take care of the negotiation machinery: compare this with the idea that comes up from time to time, of trying to implement the negotiation by brute force in the script itself - every such script which has been shown to me has had one or more serious defects, which could have been avoided by taking advantage of Apache's existing negotiation routines. It hardly seems worth mentioning what those defects usually are, since, as I say, I don't recommend doing it anyway.
In order to jibe nicely with MultiViews, it's better not to configure Apache by means of "magic content types" such as:
AddType application/x-httpd-php php
or the corresponding SSI, CGI etc. magic types. Instead, use AddHandler to define the appropriate handler (PHP, SSI, CGI etc. as the case may be) (or use XBitHack for SSI), and use AddType to tell Apache what the final content-type will be (i.e typically text/html), so that it can be used in content-type negotiation via MultiViews.
Taking PHP as an example, in Apache/2 this could be:
AddHandler php-script php
AddType text/html php
The Handler approach is also available to some extent in Apache 1.3, for example for CGI (handler cgi-script), SSI (handler server-parsed), although there doesn't appear to be a corresponding handler in this version for the PHP module. I suppose that the Action directive can be used to invoke PHP via the CGI interface, but PHP does not recommend this approach for various good reasons.
For discussion, see Mark Tranchant's multiviews notes.
If this approach isn't feasible (e.g if you want to work with the PHP module in Apache 1.3), then it looks to me as if you'd have to write a typemap file. I admit that I haven't actually tried this in a production situation, but I've done some tests, and it looks feasible.
In a situation where you know systematically what you'd want the contents of the typemap file to be, you could write a script which automatically creates the typemap file for each set of variants (for example, as part of a Makefile procedure).
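Such a generator need not be elaborate. Here is a minimal sketch in Python which writes out the text of a type-map in the record format described in the Apache documentation (URI, Content-type with optional qs, Content-language, records separated by blank lines). The function name, the variant filenames and the qs values are all invented for illustration, and I haven't run its output on a production server.

```python
def make_typemap(base, variants):
    """Return the text of a type-map for the given base name and variants.

    Each variant is a dict with 'uri' and 'content_type', and optionally
    'language' and 'qs' (source quality).
    """
    records = [f"URI: {base}"]
    for variant in variants:
        content_type = variant["content_type"]
        if "qs" in variant:
            content_type += f"; qs={variant['qs']}"
        lines = [f"URI: {variant['uri']}", f"Content-type: {content_type}"]
        if variant.get("language"):
            lines.append(f"Content-language: {variant['language']}")
        records.append("\n".join(lines))
    # Records are separated by blank lines.
    return "\n\n".join(records) + "\n"


# Hypothetical set of PHP variants which all produce text/html.
variants = [
    {"uri": "foo.en.php", "content_type": "text/html", "language": "en", "qs": 0.9},
    {"uri": "foo.de.php", "content_type": "text/html", "language": "de", "qs": 0.7},
]
print(make_typemap("foo", variants))
```

The resulting text would be saved under whatever name the server associates with the type-map handler, and a Makefile rule could regenerate it whenever the set of variants changes.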
(In the meantime, I commend J.Korpela's thoughts on the topic.)
See also W3C i18n FAQ: Apache language negotiation set up.
There's a related W3C tutorial: Using language information in (X)HTML and CSS.
Original materials © Copyright 1994 - 2006 by A.J.Flavell