whitespace

Blosxom to WordPress: tying up loose ends

A busy few weeks, but they’ve included an import from a Blosxom blog to a WordPress blog which is worth describing. There are a couple of established methods for importing the data, and I opted for the one that seemed the most modular. This was Eric Davis’ Import-Blosxom method, consisting of a PHP script on the WordPress side and a set of Blosxom flavour files which produce a feed compatible with RSS 2.0. This separation of Blosxom and WordPress behaviours meant that I could thoroughly test the former before proceeding with the latter.

It worked very well with practically no configuration or edits, but there were a few issues with the out-of-the-box behaviour of the import script:

  1. Unicode character entities were being escaped in titles, leading to the exposure of the alphanumeric code e.g. “Zürich” instead of “Zürich”.
  2. Whitespace in post bodies is converted to hard newlines by WordPress, and so must be excised to avoid tags being broken e.g. ‘<a [newline] href=”…”>’ becoming ‘<a <br/> href=”…”>’.
  3. Multiple hierarchical categories are not supported (a known problem).
  4. Although categories are created and posts are linked to them, the number of posts that a category is used in is not incremented and hence the list of categories on the front-end has zero posts for each category(possibly owing to a change between WordPress versions of how this has been handled).

I’ve come up with a number of fixes that I’ve mentioned both to Davis and on the WordPress support forums. As they’ve been greeted with an eerie silence that I’ve found typical of such forums, I’ll put them up here instead.

To fix the first three problems I created rss_to_wp, a Blosxom plugin that, along with the standard interpolate_fancy package, you can use to wrap your title and category processing bits in the RSS2.0 flavour templates. Respectively, this plugin tackles the above problems by:

  1. Providing an interpolate_fancy method to unescape entities
  2. Normalizing any whitespace in the body of your Blosxom posts to single spaces
  3. Providing an interpolate_fancy method to convert a Blosxom-style category path into a set of category tags

You’ll need to change the Davis-recommended story.rss20 template to implement the two interpolation methods. I’ve made a sample available.

The final issue was a more knotty problem, as it was a bug in the script (possibly caused by WordPress’ handling of categories changing over time). It’s easily fixed by adding a few lines to the category-handling part of import-blosxom.php as follows:

294    if (!$exists)
295    {
296        $wpdb->query("INSERT INTO $wpdb->post2cat (post_id, category_id)
297                      VALUES ($post_id, $cat_id)");
298    }
299
300    // JPS' addition - increment count if cat ID exists
301    if ($cat_id) {
302        $wpdb->query("UPDATE $wpdb->categories SET category_count = category_count + 1 WHERE cat_ID = $cat_id");
303    }
304    // End JPS' addition

Exit gracefully: exporting and then importing—transporting?—works well if the two tasks are separable. That way the integrity of the exported data can be checked in its transitory state and any bugs worked out, before it’s imported into the new system. It’s certainly worthwhile backing up the target database for the import, as this lets you preserve any quirks of your target database if you have to dump all the imported data and start again. The standard WordPress install includes a plugin for doing this, but the command-line tool mysqldump is arguably more powerful.

This space intentionally left blank

I’ve been asked a couple of times recently, as part of separate projects, to split the results of a SQL query on whitespace within. Simply put, how does one go from:

foo
foo bar
quux
blort wuu spong

to the expanded form:

foo
foo
bar
quux
blort
wuu
spong

efficiently and cleanly, only using SQL? (In case anyone’s worried, I’ve scrubbed the data sets of any personal details they might have previously contained: any resemblance to the real Blort Wuu-Spong is entirely coincidental.)

I finally decided it wasn’t possible, and although without the pure mathematics to back me up I could have kept hunting—partial solutions involving a self-join for each whitespace splitting kept rearing their heads—what finally convinced me was comparing the behaviour of SQL with that of XSL(T). The two are more alike than you might think; and no, I don’t mean SQL and XQuery, although that easy comparison provides a clue for the underlying similarity.

In XSL(T), the XML node in your original document(s) is in a sense king: it’s considered bad form (and is at any rate inefficient) to do data management on some transient data set, created within the template. Loops work best over nodesets rather than with some sort of conditional or from/to structure. This stems from XSL(T)’s underlying functional paradigm, where each nodeset is created

Of course, it’s always possible to twist non-functional behaviour out of the stylesheet (and most real-world solutions have to take a pragmatic approach to such programmatic purity) and interpreter-specific kluges exist to node-ize strings based on some non-XML token, but the language works fastest and cleanest when it’s hanging functions off nodes.

In SQL, the equivalent to the node in an XML document is the row in a query. Rows are passed around, compared with other rows based on the content of some of their cells, tied together and discarded, but very rarely can rows be created out of thin air. The closest one gets is the LEFT/RIGHT OUTER JOIN where the ON-condition is not satisfied: then the left-hand row, rather than being discarded as in the INNER JOIN, is in a sense tied to a row of NULLs. Although that equates to it being tied to no row at all, then when the SQL99 dust settles and post-processing can begin, NULLs can be reinterpreted (Coldfusion does this without being asked, for example).

So to create new rows, one can UNION two rowsets, or entangle the rowsets with some sort of a JOIN, but in simplest, non-iterative SQL, there ought to be no easy way to make one row magically split into two, or maybe three, or maybe four, based on its textual content. It breaks the underlying principle, that rows should flow through the SQL into bit-buckets or the STDOUT tray, but shouldn’t be tossed into the stream with flamboyant verve like chillis into a stir-fry.

Exit gracefully: regardless of the data itself, the data model that a given language’s designers had in mind can have the most effect on what’s plausible to do in the language. Almost all languages evolve through proprietary extensions until they can do associative arrays, every kind of loop structure and, if left alone for long enough, GOTOs, but being able to complete a task with a given language is not the same as being able to complete it, for a sufficiently large data set, before the death of your server, your development team or the universe.

Subscribe to RSS - whitespace