FORM Submission (GET) by URL

How to compose the `HREF` to mimic a `FORM` submission?

There seems to be a lot of confusion about the correct preparation of the HREF attribute for this kind of usage. It perhaps originates from some very early browser versions which handled the procedure incorrectly: old lore based on that misbehaviour still seems to be circulating, even though the problem has long since been corrected in the browser versions in current use. Following those old instructions appears to "work" in many cases, but it has some nasty surprises in store for the unwary.

The correct procedure is as follows:

Construct the "form data set"
Encode the "form data set" to construct the URL
Apply HTML rules to represent the URL in HREF.

The first two steps are identical to the corresponding steps taken by a browser when preparing a normal form submission, as is documented in the HTML specification. For reference, we show these two steps first, although they are uncontentious as far as the present issue is concerned.

Construct the "form data set"

(See Processing form data in the HTML4.01 specification). The "form data set" consists of the set of "name=value" pairs that are to represent the "successful controls" (I'm using here the terminology of the HTML4 spec).

Note that the HTML4.01 spec emphasises that method GET is only defined for ASCII (i.e 7-bit) characters. Although one often sees form submission performed in this way using 8-bit characters representing some 8-bit encoding or other, such de-facto usage lies outside of what is covered by even the HTML4.01 specification. I'm drafting a separate page on FORM submission and i18n.

Encode the "form data set" to construct the URL

Encode the "form data set" according to the rules of the default encoding type, application/x-www-form-urlencoded, as described in the appropriate subsection of HTML4.01 Form content types. This means:

Represent each space character by "+".
Represent certain problematical characters by "%hh", where "hh" is the hexadecimal value of the ASCII character.

The text of the HTML4.01 recommendation states that all non-alphanumerical characters should be %-encoded, but client agents tend to encode only a smaller repertoire of problematical characters. It should be noted that the ampersand characters separating the parameters must not be %-encoded: only if the ampersand should represent data, is it to be encoded in the %hh format to distinguish it from a parameter separator. The character "+" itself, on the other hand, must be %-encoded, to distinguish it from the "+" sign used to represent a space.
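As a rough illustration of these encoding rules, Python's standard `urllib.parse.quote_plus` applies this style of encoding to a single name or value. This is only a sketch: the choice of Python and the sample values are assumptions made for the example, not something the HTML specification prescribes.

```python
from urllib.parse import quote_plus

# Hypothetical sample values, chosen to exercise each rule in turn.
samples = ["two words", "a+b", "Mom&Pop", "plain123"]

# Space becomes "+", while "+" and "&" used as data become %hh.
encoded = {raw: quote_plus(raw) for raw in samples}
for raw, enc in encoded.items():
    print(f"{raw!r} -> {enc}")
```

Like the HTML4.01 text, `quote_plus` errs on the side of %-encoding most non-alphanumeric characters, which is harmless even where a browser would have left a character alone.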

At this point, we have the correct data to append to the ACTION URL, prefixed by the question-mark, i.e something like

http://some.site.dom/path/to/action.cgi?name1=value1&name2=value2&...

It will be noted that this URL correctly contains one or more ampersand characters. This is where the confusion starts.
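The first two steps together might be sketched as follows in Python. The action URL is the placeholder used above, and the field names and values are invented for the illustration.

```python
from urllib.parse import urlencode

# Hypothetical action URL and form data set (name/value pairs, in order).
action = "http://some.site.dom/path/to/action.cgi"
form_data = [("name1", "value 1"), ("name2", "a+b")]

# urlencode() %-encodes each name and value, then joins the pairs with "&".
url = action + "?" + urlencode(form_data)
print(url)
```

The ampersand that `urlencode` places between the pairs is a bare separator: at this stage it belongs to the URL, and nothing has yet been done to make it safe for use inside HTML.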

Apply HTML rules to represent the URL in `HREF`

The value of the HREF attribute is declared in HTML to be CDATA: what that means is that it will not be parsed for HTML markup, but it may contain character entities.

There is a specific note about these ampersands in the HTML4.01 spec.

According to HTML rules, therefore, when the URL (which itself correctly contains ampersands) is to be used as an HREF attribute value, the ampersands need to be represented using ampersand-notation, i.e either as the character entity &amp; (which is what I would recommend) or as the numerical character reference, &#38;.
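A minimal sketch of this third step, assuming the page is being generated by a Python program: the standard `html.escape` function converts each ampersand in the finished URL into its entity form. The link text is invented for the example.

```python
from html import escape

# The URL as produced by the first two steps, with bare "&" separators.
url = "http://some.site.dom/path/to/action.cgi?name1=value1&name2=value2"

# escape() replaces "&" with "&amp;"; quote=True also escapes quote marks,
# which matters because the result goes inside a quoted attribute value.
href = escape(url, quote=True)
link = f'<a href="{href}">run the query</a>'
print(link)
```

The escaping is applied once, to the complete URL, and only at the moment it is written into the HTML; the URL that the browser eventually requests still contains plain ampersands.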

The just-cited note in HTML4.01 also repeats some earlier advice to implementers, that they should accept some other character - it suggests semicolon ";" - as delimiter as an alternative to ampersand, in order to ease the formulation of such URLs: few authors seem to have taken advantage of this long-standing advice, but it is supported by the well-reputed CGI.pm Perl module, and if you are in control of writing your own CGI scripts then I can definitely recommend it: not only does it sidestep the bother of "entifying" the ampersand character, but also it circumvents any bugs (not that I have seen one since around 1999) there might be in some obscure client software.

Use of the semicolon as alternative separator is also supported by PHP (via its arg_separator.input configuration setting); see also this note at the W3C.
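For anyone generating such URLs programmatically, a semicolon-delimited query string could be built along these lines. The helper name `semicolon_query` is hypothetical, and the sketch assumes that the receiving script really does accept ";" as a separator.

```python
from urllib.parse import quote_plus

def semicolon_query(pairs):
    """Join encoded name=value pairs with ';' instead of '&'."""
    return ";".join(f"{quote_plus(k)}={quote_plus(v)}" for k, v in pairs)

query = semicolon_query([("request", "print"), ("lang", "en"), ("copy", "2")])
print(query)

# A semicolon that is part of the data is %-encoded, so it cannot be
# mistaken for a separator.
print(semicolon_query([("note", "a;b")]))
```

Since the result contains no ampersands, it can be placed in an HREF value without any entity notation for the separators.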

On the WWW will be found a very large number of instances where the ampersand has been included in URLs as-is, without following this final step of the procedure. While it's true that this technically-incorrect procedure will usually give the appearance of working (except for the error reports from HTML syntax validation), there will be cases where it will fail, as will be shown shortly. Where it appears to work, it is because client agents confronted with unknown entity representations usually treat the whole construct as a literal string.

In XHTML, when properly parsed, the consequences are more serious: many so-called XHTML documents on the web today are relying on browser fixups per HTML tag-soup parsing, and would fail if they were taken seriously as the XHTML which they purport to be.
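The stricter XML behaviour can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on two hypothetical one-element fragments; it illustrates well-formedness checking only, and is not a model of any particular browser.

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments: one with a bare ampersand, one correctly coded.
bare = '<a href="action.cgi?request=print&lang=en">report</a>'
escaped = '<a href="action.cgi?request=print&amp;lang=en">report</a>'

def parses_as_xml(fragment):
    """Return True if the fragment is well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(bare))     # the bare "&" makes the fragment ill-formed
print(parses_as_xml(escaped))  # the entity form is accepted
```

An XML parser does not guess at the author's intention: the fragment with the bare ampersand is rejected outright, rather than being repaired.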

Example and tests

Consider the following GET query, which might be a request to print out two copies of the English version of some documentation, and notify somebody when it's ready:

request=print&lang=en&copy=2&notify=fred

(OK, it's slightly contrived, but it will make the point).

Here's the query coded as a proper form (with those values pre-set as HIDDEN inputs). Submit it to get a report, and check the QUERY_STRING that's shown.

Here's the same thing correctly coded as a link, follow that link to get the report.

Here, for comparison, is the same link, but wrongly coded with bare ampersands; follow that link to see what happens. This brought a number of oddities out of the woodwork! (Validation of this document, at e.g HTML4.01, reports this URL to be in error at just one position: attentive readers will already know which that is, and why there is only one, yes?).

It should be noted that although it's always recommended to terminate character entities explicitly with a semicolon, which is treated as part of the construct, there are situations where the HTML rules do not insist on that, and the entity may be terminated with some other (punctuation) character, which is not supposed to be treated as part of the construct. For example, &copy= should be rendered as a copyright sign followed by an equals sign: ©= (N.B - it's different in XHTML, where the terminating semicolon is not optional.)
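To get a feel for what entity decoding does to the example query, one can run it through Python's standard `html.unescape`. A caveat: that function follows the HTML5 rules for ordinary text, which differ in detail from HTML4 and from the special rules that current parsers apply inside attribute values, so treat this as an approximation of the principle rather than a prediction for any given browser.

```python
from html import unescape

# The example query, first with bare ampersands, then correctly coded.
bare = "request=print&lang=en&copy=2&notify=fred"
escaped = "request=print&amp;lang=en&amp;copy=2&amp;notify=fred"

decoded_bare = unescape(bare)
decoded_escaped = unescape(escaped)

print(decoded_bare)     # "copy" and the "not" of "notify" are decoded
print(decoded_escaped)  # the intended query string comes back intact
```

Only the correctly coded form survives decoding unchanged; in the bare form the field names copy and notify have been damaged before the URL is ever requested.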

Here are some results with various browsers, with my comments. The results of the form submission were all OK; the results from submitting the first (correctly coded) link were also all OK on the browsers that normal users might be expected to use (and Amaya has also handled this correctly since 1999, although Amaya still seemed to have problems with composing such HREFs in its editor). The second, incorrectly coded, link produced a mixed bag of behaviours, differing from browser to browser and from version to version.

Results from the incorrectly-coded link:

Win Mozilla 1.7.11

"Bare ampersands" test result: request=print&lang=en%A9=2&notify=fred
The copyright sign has been interpreted per the specification, and expressed in the %hex-coded format.

MS IE6 (in XP SP2)

"Bare ampersands" test result: request=print&lang=en©=2&notify=fred
The copyright sign has been interpreted per the specification, and submitted as-is.

Win Netscape 3.01

"Bare ampersands" test result: request=print&lang=en©=2¬ify=fred
The copy has been interpreted as a "copyright" entity, as indeed it should be according to HTML rules; Netscape 3 has also picked out the not as a defined entity (which is not really correct) and passed the residue, ify, along as text.

Win Netscape 4.5 (4.79 is the same)

"Bare ampersands" test result: request=print&lang=en©=2&notify=fred
At version 4, Netscape stopped picking out substrings that happened to match a known entity. This behaviour is now correct, for an HTML3.2 browser. If it had any claim to be an HTML4 browser, it would have recognised the entity lang, but it didn't. Here is that HTML4 character entity, &lang;: "〈" - NS4.* versions don't even recognise that (nor indeed most of the HTML4 character references) in normal text.

Win MSIE 3.03 (16-bit)

"Bare ampersands" test result: request=print&lang=en©2&notify=fred
Interprets the copy character entity per HTML specification, but eats the following "=" sign, which is wrong (this didn't happen in normal text).

Win MSIE 4

"Bare ampersands" test result: request=print&lang=en©=2¬ify=fred
In this regard it seems MSIE4 was bug-compatible with NS3. Furthermore, although MSIE4 recognises and attempts to render the character entity &lang; in normal text, it seems to have decided not to understand it here, in spite of the fact that it is understanding and interpreting copy.

Win MSIE 5 and 5.5

"Bare ampersands" test result: request=print&lang=en©=2&notify=fred
Unlike MSIE4, this version is not picking up the substring not as an entity name. Otherwise it's behaving as MSIE4 behaved.

Opera 3.60

"Bare ampersands" test result: request=print&lang=en©=2&notify=fred

Win Mosaic 3.0

"Bare ampersands" test result: request=print&lang=en©2&notify=fred
Similar to Netscape 3.01, but the "=" sign has been eaten. (This is incorrect behaviour, which was also exhibited in normal text).

Lynx (2.8.1dev.9, 2.8.5dev.7, confirmed for other versions by various informants)

The correctly formulated query worked, as with other browsers tested.

Curiously, on the bare ampersands test, this browser "appeared to work", i.e it did not conform at all to the HTML specification in this regard. I suppose that a study of the developers' discussions would reveal why they decided to make it Do What Authors Meant, instead of doing what the spec indicates that it should. But this should not, of course, be taken as any kind of excuse for authors to continue perpetrating the error.

emacs-w3 3.0.62 (rather old!)

"Bare ampersands" test result: request=print&lang=en©=2&notify=fred

Amaya (prior to version 2.2, Oct 1999)

Now here, to be honest, we had a problem. There was a long-standing bug in the handling of this kind of URL in Amaya. However, they did accept that this was a bug and that they would work to correct it - and indeed it is corrected in version 2.2. Amaya would, I think it's fair to say, not be regarded as a normal everyday browser, so I think this anomaly can be ruled out as any kind of excuse for not coding properly.

Amaya version 2.2, which, as I just said, correctly handled the correctly-coded link, seems to have made a fine mess of the wrongly-coded URL with bare ampersands. But since the URL is defective, this is no criticism of Amaya 2.2.

iCab (report from brian d foy)

Handled the correct HREF, correctly, of course.
Complained about the incorrect HREF (it bickered about every missing semicolon, which isn't technically correct, although admittedly it is poor style to code character entities without the semicolon, even where it's technically not mandatory).

WebTV Viewer (2.0 build 551)

"Bare ampersands" test result: request=print&lang=en=2ify=fred
(what a mess!).

Acorn's Browse on RISCOS - Matthew Somerville writes:

It will not work in Acorn's own web browser, Browse, on RISC OS. It's a bug in Browse, and as it isn't being developed, this will probably never be fixed, but a few people still use it.

However, he went on to say that he would not advocate writing URLs wrongly just in order to appeal to this old browser. If you're writing web pages for your own scripts, then you can follow the W3C's advice and use semicolon as delimiter, thus bypassing the problem. If you're writing for someone else's script, then you'd be stuck with whatever their script may support.

A wider picture

An email correspondent reports an extensive analysis of responses to a link which was correctly formulated as shown in this page, with the intention of identifying any clients (browsers, indexing robots etc.) which interpreted it wrongly. He reports that the responses were overwhelmingly correct, with a very short list of clients which failed. Some of these defective clients are evidently presenting a fake user agent string (commonly pretending to be some version of MSIE, but behaving in a way that differed from the real thing), so one has no idea what they really are: others appear to be Java-based or other special-purpose clients, or obsolete development or preview versions (for one instance, Firefox/0.8).

My correspondent concluded:

There are a small number of user-agents that will misinterpret correctly-encoded ampersands. However, most agents (including Google) interpret them correctly, so I am quite comfortable with advising people to make the fix.

Similar contexts

Are there other HTML constructs that behave in this same way? Yes, sure. IMG SRC=, LINK HREF=, and various other HTML constructs contain attribute values denoted in the HTML specification as %URI just like A HREF=, and all behave in the same way. Or are supposed to, unless you know some extra browser bugs.

XHTML note

In XHTML, the terminating semi-colon of a character entity is not optional: in the above example, even though (just as in HTML4) the copy and lang would not raise a validator alert for being undefined entities, they would (unlike HTML) raise an alert for lacking their closing semi-colon.

So does this mean ampersand should always be `&amp;`?

Not quite! If you want to include an ampersand character as data in your URL, corresponding to a form value with an ampersand in it, for example Mom&Pop like in this form (just try submitting it):

then you'll need to encode that ampersand as %26 when forming the URL, as in this example.
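The two levels of encoding can be seen side by side in a short Python sketch. The field names `shop` and `lang` and the relative URL `action.cgi` are invented for the illustration.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical form data: the value of "shop" contains an ampersand as data.
query = urlencode([("shop", "Mom&Pop"), ("lang", "en")])

# URL level: the data ampersand has become %26, the separator is a bare "&".
print(query)

# HTML level: only the separator ampersand is left to be written as "&amp;".
href = escape("action.cgi?" + query, quote=True)
print(href)
```

The ampersand that is data is dealt with at the URL level (%26), the ampersand that is a separator is dealt with at the HTML level, and neither encoding substitutes for the other.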

`CGI.pm` Perl module

The respected CGI.pm module provides a method, self_url, for returning a URL string that will re-establish the current environment: this is subject to the same considerations as we are discussing above. Its documentation contains examples along the following lines:

$myself = $query->self_url;
print "<a href=\"$myself\">I'm talking to myself.</a>";

This is fine so long as the -newstyle_urls pragma is in effect, i.e semicolons used as delimiters in the query string, which has in fact been the default for quite a few versions now. I see that recent versions of the documentation explain the implications for -oldstyle_urls, and recommend applying escapeHTML() to the URL for use in href values.

No matter which style of URLs CGI.pm is configured to generate, it always parses both ampersands and semi-colons as separators in the submitted URLs; so if you need either of these characters in your data, be sure to encode them in %hh hexadecimal URL-encoded format in your submitted query strings (they are, after all, "reserved characters" in the sense of RFC1738 etc., so CGI.pm isn't asking for anything which wasn't already called for by the RFC).
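The receiving side of that arrangement, a parser that treats both characters as separators, might look roughly like this in Python. It is an illustrative sketch with a hypothetical function name, and makes no claim to reproduce how CGI.pm itself is implemented.

```python
import re
from urllib.parse import unquote_plus

def parse_query(query_string):
    """Split on '&' or ';', then decode each name and value."""
    pairs = []
    for field in re.split(r"[&;]", query_string):
        if not field:
            continue  # skip empty fields, e.g. from a trailing separator
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

# Mixed separators; "&" and ";" occurring as data arrive %-encoded.
print(parse_query("shop=Mom%26Pop;note=a%3Bb&lang=en"))
```

Splitting happens before decoding, which is exactly why an ampersand or semicolon that is data has to be sent in %hh form.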

Summary

When the URL was correctly coded into HTML, there was no problem with any halfways-recent browser versions that were tried. When it was not, the most common "problem" was that many browsers did what the HTML rules specify, rather than what the author had intended, in regard to the copy entity; it's assumed that the same would be true for any other form field name which happened to match one of the Latin-1 entity names.

It's true that my test parameters had been deliberately contrived to provoke these anomalies, but the same thing could always happen by accident when composing such URLs; what's more, there is no guarantee that a new browser release will not suddenly start interpreting all of the HTML4 entities according to the published spec, and thus cause previously "working" (but defective) forms to stop working.

Some browsers also caused problems if the initial substring of the form field name happened to match an entity name. Curiously, though, none of the browsers was triggered by the HTML4 entity name that had been included, not even those that recognised this entity when it appeared in normal text.
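Authors who want to know whether their own field names are exposed to this could check them against the HTML4 entity list. The sketch below uses the table that ships with Python as `html.entities.name2codepoint`; the function name `risky_field_names` is hypothetical.

```python
from html.entities import name2codepoint

def risky_field_names(names):
    """Return field names that equal, or begin with, an HTML4 entity name."""
    risky = {}
    for name in names:
        hits = sorted(e for e in name2codepoint if name.startswith(e))
        if hits:
            risky[name] = hits
    return risky

# The field names from the example query on this page.
print(risky_field_names(["request", "lang", "copy", "notify"]))
```

Such a check only says where bare ampersands would be dangerous; the remedy is still to code the separators correctly, whatever the field names happen to be.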

This is not legal advice

(and I am not a lawyer!)

Some advertisers have a contractual requirement to place their exact snippet of HTML into one's web pages in order to qualify for payment on their advertising scheme. Often this snippet contains invalid syntax of the kind that is described above. Some free web-hosting sites also inject snippets of invalid HTML into their "customers'" pages for advertising purposes. I can't really advise people what they could or should do about that: I can only point out what HTML's syntax rules are, and leave them to decide whether it matters to them or not.

Conclusion

I was unable to find any even halfways-convincing technical reason for not coding the HTML in accordance with the rules given in the HTML4 spec, and several reasons for not copying the misbehaviour that is widespread on the WWW. So, if you're in control of the CGI script, make sure it supports the semicolon ";" as an alternative separator, and use that; if you can't do that, then code the ampersand correctly as &amp;.