What characters are allowed unencoded in query strings?
A couple of months ago I advised people to “Be careful with non-ascii characters in URLs”. We’ve been discussing that at work lately, more specifically whether characters like “:” and “/” are allowed unencoded in query strings or not.
I may well have made mistakes trying to understand the specification, so any help correcting errors in the following would be appreciated.
The summary of my previous post is this:
In essence this means that the only characters you can reliably use for the actual name parts of a URL are a-z, A-Z, 0-9, -, ., _, and ~. Any other characters need to be percent-encoded.
But what about those query strings? After studying RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax I’ve come to the following conclusions.
In section 2.2 Reserved Characters, the following characters are listed:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The spec then says:
If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Next, in section 2.3 Unreserved Characters, the following are listed:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
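To see the two character classes side by side, here is a small Python sketch. The constant names are my own, and it assumes Python 3.7 or later, where `urllib.parse.quote` treats "~" as unreserved. With `safe=""`, `quote` should leave the unreserved set alone and percent-encode every reserved character.

```python
import string
from urllib.parse import quote

# Character sets transcribed from RFC 3986 sections 2.2 and 2.3.
# The constant names are illustrative, not part of any library.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# safe="" removes quote()'s default exemption for "/", so only
# unreserved characters are left unencoded (Python 3.7+).
encoded_unreserved = quote(UNRESERVED, safe="")
encoded_reserved = quote(RESERVED, safe="")

print(encoded_unreserved == UNRESERVED)  # True: nothing was encoded
print(encoded_reserved)                  # every reserved character as %XX
```

Note that this is the most conservative encoding. It says nothing yet about which reserved characters a particular component is allowed to carry unencoded.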
Ok, so let’s look at what is allowed in the path component of a URL. Section 3.3 Path has a bunch of rules that should be used by URI parsers. The last rule defines which characters are allowed:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Unreserved, percent-encoded, sub-delimiters, “:”, and “@”. Seems pretty clear.
What about the query component then? According to section 3.4 Query, these characters are allowed:
query = *( pchar / "/" / "?" )
Ok, so from the earlier definition of “pchar” we have unreserved, percent-encoded, sub-delimiters, “:”, and “@”. And for query strings “/” and “?” are allowed as well.
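The two grammar rules translate fairly directly into a regular expression. The following Python sketch is my own transcription of `pchar` and `query`, and the helper name `is_valid_query` is hypothetical. It only checks a query component against the grammar and makes no claim about how any particular server or browser behaves.

```python
import re

# Building blocks transcribed from RFC 3986; the names are illustrative.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# query = *( pchar / "/" / "?" )
QUERY_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query(query: str) -> bool:
    """Return True if the string matches the RFC 3986 query rule."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("uri=http://user:password@example.com/?foo=bar"))  # True
print(is_valid_query("q=a b"))   # False: a space must be percent-encoded
print(is_valid_query("q=a#b"))   # False: "#" is a gen-delim not allowed here
print(is_valid_query("q=[1]"))   # False: "[" and "]" are not allowed either
```

The string passed in is the query component only, without the leading "?" that separates it from the path.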
The conclusion is that something like http://example.com/document/?uri=http://user:password@example.com/?foo=bar is valid, since “/” and “?” do not need to be percent-encoded in query strings, and neither do “:” and “@”.
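As a quick sanity check, here is how Python's standard library splits that example URL. This is only one parser's behaviour, not proof of what the specification requires, but it agrees with the reading above: everything after the first "?" ends up in the query component.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://example.com/document/?uri=http://user:password@example.com/?foo=bar"
parts = urlsplit(url)

print(parts.netloc)  # example.com
print(parts.path)    # /document/
print(parts.query)   # uri=http://user:password@example.com/?foo=bar

# parse_qs splits on "&" and then on the first "=" of each pair, so the
# embedded URL survives as a single value.
params = parse_qs(parts.query)
print(params["uri"])  # ['http://user:password@example.com/?foo=bar']
```

The embedded URL comes through intact here because it contains no "&" or "#". If it did, those characters would conflict with their role as delimiters and would need to be percent-encoded, which is the case the quote from section 2.2 describes.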
Did I get it right? If not, a comment explaining where I’m mistaken would be much appreciated.