When whitespace isn't whitespace, but it is white [:space:]

After much wrestling with hexdumps, Matthew highlighted an issue for us today of the stealthy ninja linebreak. Here it is. Are you ready? Right: "
"

Did you spot it? Unlike all the other linebreaks in this Wordpress post, it hasn't been converted to a <br/> or <p/> tag, because Wordpress didn't recognize it as a linebreak at all. Not entirely fair of me to expect it to, though, as strictly speaking it's the line separator, \u2028. It has a nonidentical twin brother, the paragraph separator, \u2029. A shady pair of characters, these two: intended for printing use rather than computer use, like many of the other (horizontal) spacings in the General Punctuation code chart.
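You can see how Unicode itself distinguishes the pair with a quick check (a Python 3 sketch for illustration, not part of the original investigation): they live in the "Separator" General Categories, not among the ASCII control characters.

```python
import unicodedata

# U+2028 and U+2029 behave like linebreaks but are categorized as
# separators: Zl (line) and Zp (paragraph).
for ch in ("\u2028", "\u2029"):
    print(hex(ord(ch)), unicodedata.name(ch), unicodedata.category(ch))
```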

The reason that Matthew even noticed something was going funny in the first place was that Coldfusion's JSStringFormat doesn't escape it, but some output streams filter it out entirely. Javascript would see whitespace and sometimes die with formatting problems, but text dumps of the database revealed nothing at the command line. It was a sort of Mandelbreak, appearing and disappearing as if at random until he revealed it by dumping the actual bytes.
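Dumping the bytes is easy to reproduce. In UTF-8 each separator is a three-byte sequence that a terminal may render as nothing at all; a Python 3 sketch (not Matthew's actual hexdump) makes the invisible visible:

```python
# repr() escapes the non-printable separator, and encode() shows the
# three UTF-8 bytes hiding in the middle of the string.
s = "one\u2028two"
print(repr(s))             # 'one\u2028two'
print(s.encode("utf-8"))   # b'one\xe2\x80\xa8two'
```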

Ultimately, though, if you're filtering out high-end codepoints, why should you care? Well, imagine someone typed this into your site:

ja
vascr
ipt:alert("foo!");

If that looks like a blatant Javascript link to you (it does to me, on Firefox 3), then check the source. So: on the round trip from user input, through filters, into the database, back out again and then to whatever rendering system you're using: can you be certain that those line separators wouldn't just magically disappear, leaving you with some cross-site scripting?
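To make the failure mode concrete, here's a Python 3 sketch of a hypothetical lossy output path: if anything in the round trip silently drops the separators rather than escaping or normalizing them, the dangerous string reassembles itself.

```python
# What the user submitted: "javascript:" broken up by U+2028 and U+2029.
payload = 'ja\u2028vascr\u2029ipt:alert("foo!");'

# A hypothetical filter that strips codepoints it doesn't understand,
# instead of escaping or normalizing them.
leaked = payload.replace("\u2028", "").replace("\u2029", "")

print(leaked)  # javascript:alert("foo!");
```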

Normalizing whitespace (reducing all whitespace clusters down to a single space) is a useful way of at least taking the sting out of such odd characters. Matthew initially raised this as a problem:

Firefox's Javascript parser (and possibly others) treats it as an end-of-line if it encounters one

But this can be put to good use: assuming it's consistent in its parsing at all levels, Javascript should have an innate understanding of whitespace beyond the ASCII character set, so it should spot the stealth spaces and be able to normalize them. Putting '\s' in the input field on this tester script does indeed return a list with both \u2028 and \u2029 on it.
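The same enumeration can be run outside the browser. As a sketch, in Python 3 (whose re module treats '\s' as Unicode-aware on text patterns, unlike the older behaviour noted later) a brute-force scan of every codepoint turns up both separators:

```python
import re
import sys

ws_re = re.compile(r"\s")

# Collect every codepoint that '\s' treats as whitespace; the list
# includes 0x2028 and 0x2029 alongside the ASCII suspects.
ws = [cp for cp in range(sys.maxunicode + 1) if ws_re.match(chr(cp))]

print([hex(cp) for cp in ws])
```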

So if you're accepting user input then your first set of gatekeepers could be this code, running when a form is submitted:

// Collapse every run of whitespace (Unicode separators included) to one space
jQuery(":input").each(function() {
  jQuery(this).val(jQuery(this).val().replace(/\s+/g, " "));
});

You should never depend solely on browser-side sanitization, of course, as whatever you get the browser to do, a cross-site scripter can fake having done.

An equivalent server-side solution would be to use a regex engine that supports POSIX character classes, whether implicitly as '\s' or explicitly as '[[:space:]]'. These are regular expression terms written inside a bracket expression with a special colon-delimited marker: "[[:space:]]" means "any one of all the ASCII whitespace characters, plus any character which has the Unicode \p{Z} character property" (character properties are Unicode's equivalent of character classes). Picking a handful of technologies: Coldfusion claims to support POSIX classes; Python will support them in version 2.7 (a quick test suggests that '\s' in the re module is limited to catching only ASCII whitespace); and Opera was patched in version 9.52 only a couple of months ago to prevent XSS attacks utilizing these very characters.
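For what it's worth, Python 3 later changed the picture: on text patterns its re module's '\s' is Unicode-aware and does catch the separators, so a server-side normalizer can be sketched in a couple of lines (an illustration, not the post's original Coldfusion setup):

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace, including U+2028 and U+2029,
    down to a single ASCII space."""
    return re.sub(r"\s+", " ", text)

print(normalize_whitespace('ja\u2028vascr\u2029ipt:alert("foo!");'))
# ja vascr ipt:alert("foo!");
```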

Until there's a straightforward way, in a large and distributed project like blogging software or a framework, to catch all the look-alike high-codepoint whitespace, only whitelisting (of characters and of markup) will really guarantee that the content you let into the system resembles, both in security terms and byte by byte, the content you finally release. So how safe is your site, and how robust are your offline content workflows?