This space intentionally left blank

I’ve been asked a couple of times recently, as part of separate projects, to split the results of a SQL query on whitespace within. Simply put, how does one go from:

foo
foo bar
quux
blort wuu spong

to the expanded form:

foo
foo
bar
quux
blort
wuu
spong

efficiently and cleanly, only using SQL? (In case anyone’s worried, I’ve scrubbed the data sets of any personal details they might have previously contained: any resemblance to the real Blort Wuu-Spong is entirely coincidental.)

I finally decided it wasn’t possible, and although without the pure mathematics to back me up I could have kept hunting—partial solutions involving a self-join for each whitespace splitting kept rearing their heads—what finally convinced me was comparing the behaviour of SQL with that of XSL(T). The two are more alike than you might think; and no, I don’t mean SQL and XQuery, although that easy comparison provides a clue for the underlying similarity.

In XSL(T), the XML node in your original document(s) is in a sense king: it’s considered bad form (and is at any rate inefficient) to do data management on some transient data set, created within the template. Loops work best over nodesets rather than with some sort of conditional or from/to structure. This stems from XSL(T)’s underlying functional paradigm, where each nodeset is created

Of course, it’s always possible to twist non-functional behaviour out of the stylesheet (and most real-world solutions have to take a pragmatic approach to such programmatic purity) and interpreter-specific kluges exist to node-ize strings based on some non-XML token, but the language works fastest and cleanest when it’s hanging functions off nodes.

In SQL, the equivalent to the node in an XML document is the row in a query. Rows are passed around, compared with other rows based on the content of some of their cells, tied together and discarded, but very rarely can rows be created out of thin air. The closest one gets is the LEFT/RIGHT OUTER JOIN where the ON-condition is not satisfied: then the left-hand row, rather than being discarded as in the INNER JOIN, is in a sense tied to a row of NULLs. Although that equates to it being tied to no row at all, then when the SQL99 dust settles and post-processing can begin, NULLs can be reinterpreted (Coldfusion does this without being asked, for example).

So to create new rows, one can UNION two rowsets, or entangle the rowsets with some sort of a JOIN, but in simplest, non-iterative SQL, there ought to be no easy way to make one row magically split into two, or maybe three, or maybe four, based on its textual content. It breaks the underlying principle, that rows should flow through the SQL into bit-buckets or the STDOUT tray, but shouldn’t be tossed into the stream with flamboyant verve like chillis into a stir-fry.

Exit gracefully: regardless of the data itself, the data model that a given language’s designers had in mind can have the most effect on what’s plausible to do in the language. Almost all languages evolve through proprietary extensions until they can do associative arrays, every kind of loop structure and, if left alone for long enough, GOTOs, but being able to complete a task with a given language is not the same as being able to complete it, for a sufficiently large data set, before the death of your server, your development team or the universe.