<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Graceful Exits &#187; import/export</title>
	<atom:link href="http://www.jpstacey.info/blog/category/importexport/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jpstacey.info/blog</link>
	<description>Garbage collection, in a very real sense</description>
	<pubDate>Tue, 30 Sep 2008 20:10:32 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<item>
		<title>Spliticket running again with BeautifulSoup</title>
		<link>http://www.jpstacey.info/blog/2008/09/07/spliticket-running-again-with-beautifulsoup/</link>
		<comments>http://www.jpstacey.info/blog/2008/09/07/spliticket-running-again-with-beautifulsoup/#comments</comments>
		<pubDate>Sun, 07 Sep 2008 18:10:27 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[formats]]></category>

		<category><![CDATA[hacking]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[projects]]></category>

		<category><![CDATA[beautifulsoup]]></category>

		<category><![CDATA[html]]></category>

		<category><![CDATA[interface]]></category>

		<category><![CDATA[library]]></category>

		<category><![CDATA[mashup]]></category>

		<category><![CDATA[matthew]]></category>

		<category><![CDATA[parser]]></category>

		<category><![CDATA[rule]]></category>

		<category><![CDATA[sax]]></category>

		<category><![CDATA[somerville]]></category>

		<category><![CDATA[spliticket]]></category>

		<category><![CDATA[state]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/?p=204</guid>
		<description><![CDATA[Or, how I learned to stop parsing and love the soup]]></description>
			<content:encoded><![CDATA[<p>Ages ago Matthew Somerville emailed me to say that <a href="http://www.jpstacey.info/applications/spliticket/" >spliticket</a> had fallen over. It&#8217;s <a href="http://www.jpstacey.info/blog/2008/03/13/cheaper-rail-journeys-with-matthew-and-spliticket/" >my hacky interface</a> to <a href="http://www.dracos.co.uk/wiki/Trains/SplitTickets" >his wiki page documenting split tickets</a>, and ultimately it found the vagaries of even wiki-generated HTML a bit too hard to cope with.</p>
<p>At the time I built the HTML parser using core SAX-based HTML parsing, and it was horrible. SAX works in a basic sense, but you have to build your own internal state engine, track which elements have gone past while working out what to do with the current context, and even write rules for what to do when the underlying dumb parser encounters HTML entities: no mean feat when the document is peppered with &amp;#8211; en dashes. </p>
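<p>To give a flavour of that hand-rolled state tracking, here is a minimal sketch in Python using the standard library&#8217;s event-driven <code>html.parser</code>. It is not the original spliticket parser: the class name, the choice of table cells as the target and the <code>extract_cells</code> helper are all assumptions made for illustration.</p>

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every td element, tracking parser state by hand."""

    def __init__(self):
        # convert_charrefs=False makes the parser report character
        # references as separate events, as a bare event parser would.
        super().__init__(convert_charrefs=False)
        self.in_cell = False   # hand-rolled state: are we inside a td?
        self.current = []      # text fragments of the cell being read
        self.cells = []        # finished cells

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.cells.append("".join(self.current).strip())
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.current.append(data)

    def handle_charref(self, name):
        # Numeric references arrive here as "8211" or "x2013".
        # Named entities would need a handle_entityref method as well.
        if self.in_cell:
            code = int(name[1:], 16) if name[0] in "xX" else int(name)
            self.current.append(chr(code))


def extract_cells(html):
    """Hypothetical helper: feed a document through the collector."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells
```

<p>Even this toy needs a flag, a buffer and a separate entity handler for one element type; every new rule adds more of each.</p>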
<p>Not only was writing the rules initially a pain in the rear, but adding new rules and bugfixing the existing ones was even worse. But I lived with SAX because I was deploying on shared hosting: I presumed that this was the best option available if I couldn&#8217;t install any new shared libraries.</p>
<p><em>Not true!</em> I&#8217;ve just rebuilt the entire parsing layer with <a href="http://crummy.com/software/BeautifulSoup" >Beautiful Soup</a>, a Python HTML/XML parser library which (a) is available as a single file and (b) works out a decent HTML DOM tree from pretty much anything you throw at it.</p>
<p>Try it yourself, if you have to do any HTML parsing. It&#8217;s astonishing; beautiful, in fact. I will never write another SAX parser ever again, which I&#8217;m sure I&#8217;ve said before.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2008/09/07/spliticket-running-again-with-beautifulsoup/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Firefox/Sage bookmarks to Google Reader import</title>
		<link>http://www.jpstacey.info/blog/2008/07/17/firefoxsage-bookmarks-to-google-reader-import/</link>
		<comments>http://www.jpstacey.info/blog/2008/07/17/firefoxsage-bookmarks-to-google-reader-import/#comments</comments>
		<pubDate>Thu, 17 Jul 2008 19:04:54 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[formats]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[projects]]></category>

		<category><![CDATA[quickies]]></category>

		<category><![CDATA[bookmarks]]></category>

		<category><![CDATA[DTD]]></category>

		<category><![CDATA[export]]></category>

		<category><![CDATA[firefox]]></category>

		<category><![CDATA[google]]></category>

		<category><![CDATA[import]]></category>

		<category><![CDATA[opml]]></category>

		<category><![CDATA[reader]]></category>

		<category><![CDATA[sage]]></category>

		<category><![CDATA[standard]]></category>

		<category><![CDATA[transform]]></category>

		<category><![CDATA[xml]]></category>

		<category><![CDATA[xslt]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/?p=186</guid>
		<description><![CDATA[When OPML is OPML but it isn't OPML]]></description>
			<content:encoded><![CDATA[<p>Want to migrate your RSS bookmarks from Firefox (or its RSS-reading addon, Sage) to Google Reader? I did, just now.</p>
<p>Christopher Hinze has written a great <a href="https://addons.mozilla.org/en-US/firefox/addon/2625" >Firefox addon that exports bookmarks to OPML 1.0</a>. Unfortunately, OPML is a bit of an <a href="http://www.opml.org/spec1" >anything-goes specification</a>. So although Hinze&#8217;s plugin produces valid OPML, it isn&#8217;t the same sort of valid OPML that Google Reader expects. Google Reader, in fact, gags and chokes on Hinze&#8217;s OPML, and refuses to import it.</p>
<p>The main problem is that the &lt;outline/&gt; element, the basic hierarchical building block for OPML, <a href="http://www.opml.org/spec1#limits" >will take <em>any attributes</em></a>. What does that mean in practice? Well, here&#8217;s what Hinze&#8217;s export produces:</p>
<blockquote class="code"><p>
&lt;outline text=&quot;Coding&quot;&gt;<br />
&nbsp;&nbsp;&lt;outline type=&quot;link&quot; text=&quot;Joel on Software&quot; url=&quot;http://www.joelonsoftware.com/rss.xml&quot; /&gt;<br />
&lt;/outline&gt;
</p></blockquote>
<p>and here&#8217;s the result of Google Reader exporting its own store of RSS bookmarks:</p>
<blockquote class="code"><p>
&lt;outline title=&quot;Coding&quot; text=&quot;Coding&quot;&gt;<br />
&nbsp;&nbsp;&lt;outline text=&quot;drupal.org - Community plumbing&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;title=&quot;drupal.org - Community plumbing&quot; type=&quot;rss&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;xmlUrl=&quot;http://drupal.org/node/feed&quot; htmlUrl=&quot;http://drupal.org&quot;/&gt;<br />
&lt;/outline&gt;
</p></blockquote>
<p>To a computer, these are fundamentally two different data formats: the URLs are stored in different attributes, and there are attributes on each that either have different values or are not present on the other. Someone did a <a href="http://static.userland.com/gems/radiodiscuss/opmlDtd.txt" >DTD for OPML</a>: looking at those two apparently analogous fragments above you have to ask yourself why they bothered.</p>
<p>Help is at hand, though. This sort of problem is bread and butter to XSLT, and <a href="/applications/google/ff2gr_opml.xsl" >here&#8217;s an XSL transform for converting Firefox OPML to Google Reader OPML</a>. If you have <code>xsltproc</code> installed on your system, you would type:</p>
<blockquote class="code"><p>
xsltproc http://www.jpstacey.info/applications/google/ff2gr_opml.xsl bookmarks.opml > fixed_bookmarks.opml
</p></blockquote>
<p>Or download the XSLT&#8212;it&#8217;s released under GPL2&#8212;and run it locally, changing that URL there to a local file location.</p>
<p>One thing to note: the XSLT will remove an outline wrapped around your bookmarks with title &#8220;Sage Feeds&#8221; (case-sensitive). So you can export that branch of your bookmarks, and the XSLT will strip the wrapper off and you <em>won&#8217;t</em> import a load of bookmarks tagged &#8220;Sage Feeds&#8221;. If you don&#8217;t like this behaviour then either rename your Sage bookmark container, or learn XSLT: it won&#8217;t kill you.</p>
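<p>For anyone without <code>xsltproc</code>, the same attribute shuffle can be sketched in Python with the standard library&#8217;s <code>xml.etree.ElementTree</code>. This is not the XSLT linked above, only an approximation of the behaviour described in this post: the function name <code>firefox_to_reader</code> is made up, the wrapper is assumed to be identified by its <code>text</code> or <code>title</code> attribute, and no <code>htmlUrl</code> is written because the Firefox export does not carry one.</p>

```python
import xml.etree.ElementTree as ET


def firefox_to_reader(opml_text, wrapper_title="Sage Feeds"):
    """Rewrite Firefox-export outlines into the attribute layout Google Reader exports."""
    root = ET.fromstring(opml_text)
    body = root.find("body")

    # Unwrap any top-level container named "Sage Feeds" (case-sensitive),
    # promoting its children so they are imported without that folder.
    promoted = []
    for outline in list(body):
        if wrapper_title in (outline.get("text"), outline.get("title")):
            promoted.extend(list(outline))
        else:
            promoted.append(outline)
    for child in list(body):
        body.remove(child)
    body.extend(promoted)

    for outline in body.iter("outline"):
        # Google Reader's own export carries both text and title.
        if outline.get("title") is None:
            outline.set("title", outline.get("text", ""))
        # Feeds: type="link" plus url becomes type="rss" plus xmlUrl.
        if outline.get("type") == "link":
            outline.set("type", "rss")
            outline.set("xmlUrl", outline.attrib.pop("url", ""))

    return ET.tostring(root, encoding="unicode")
```

<p>Read the exported file into a string, pass it through the function and write the result back out; whether Google Reader accepts the output without <code>htmlUrl</code> is untested here.</p>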
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2008/07/17/firefoxsage-bookmarks-to-google-reader-import/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Library of Congress, Flickr&#8217;d to the max</title>
		<link>http://www.jpstacey.info/blog/2008/01/17/library-of-congress-flickrd-to-the-max/</link>
		<comments>http://www.jpstacey.info/blog/2008/01/17/library-of-congress-flickrd-to-the-max/#comments</comments>
		<pubDate>Thu, 17 Jan 2008 10:43:22 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[culture]]></category>

		<category><![CDATA[formats]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[information]]></category>

		<category><![CDATA[news]]></category>

		<category><![CDATA[archive]]></category>

		<category><![CDATA[collections]]></category>

		<category><![CDATA[congress]]></category>

		<category><![CDATA[copyright]]></category>

		<category><![CDATA[flickr]]></category>

		<category><![CDATA[free]]></category>

		<category><![CDATA[library]]></category>

		<category><![CDATA[metadata]]></category>

		<category><![CDATA[museum]]></category>

		<category><![CDATA[online]]></category>

		<category><![CDATA[photographs]]></category>

		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2008/01/17/library-of-congress-flickrd-to-the-max/</guid>
		<description><![CDATA[Flickr is working with the Library of Congress on new project The Commons. Currently there are around three thousand photographs up there from two collections, and according to the Commons homepage they&#8217;re all copyright-free. More information in the relevant post on the Flickr blog.
This is wonderful news, especially because the collection is being released through [...]]]></description>
			<content:encoded><![CDATA[<p>Flickr is working with the Library of Congress on new project <a href="http://flickr.com/commons">The Commons</a>. Currently there are around <strong>three thousand</strong> photographs up there from two collections, and according to the Commons homepage they&#8217;re all copyright-free. More information in <a href="http://blog.flickr.com/en/2008/01/16/many-hands-make-light-work/">the relevant post on the Flickr blog</a>.</p>
<p>This is wonderful news, especially because the collection is being released through a slightly adapted version of Flickr&#8217;s existing website. This means, apart from it being an established interface that millions of people already know vaguely how to use, that you can do all the Flickry things with the photos&#8212;dedicated Flickr-heads will hopefully give a more qualified response in due course&#8212;and that third-party tools should already be set up to work with the content. The meta information storage won&#8217;t particularly excite any Dublin-Core enthusiasts&#8212;a block of unstructured HTML in the standard Flickr notes field, plus of course Flickr tagging&#8212;but the whole project is still a fascinating experiment, and interesting for even the casual observer of American history. How exciting does it get? More exciting than the <a href="http://www.flickr.com/photos/library_of_congress/2179047088/in/set-72157603671370361/">World of Mirth Shows</a>?</p>
<p>Thinking offline for a moment, this hopefully presages more leaps forward in <abbr title="Museums, Libraries and Archives">MLA</abbr> culture. One of the first would be to remove the &#8220;NO PHOTOGRAPHS&#8221; signs from all museums. At the very least such signs could be more honest, and instead read &#8220;NO PHOTOGRAPHS; unless our security guards don&#8217;t catch you at it, in which case we&#8217;ll be blissful in our ignorance. Anyway, in five years&#8217; time it&#8217;ll all be online so we don&#8217;t know why we&#8217;re bothering, to be honest&#8230;.&#8221; On reflection, I suppose they would need bigger signs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2008/01/17/library-of-congress-flickrd-to-the-max/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Blosxom to WordPress: tying up loose ends</title>
		<link>http://www.jpstacey.info/blog/2006/11/06/blosxom-to-wordpress-tying-up-loose-ends/</link>
		<comments>http://www.jpstacey.info/blog/2006/11/06/blosxom-to-wordpress-tying-up-loose-ends/#comments</comments>
		<pubDate>Mon, 06 Nov 2006 10:56:36 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[formats]]></category>

		<category><![CDATA[hacking]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[blosxom]]></category>

		<category><![CDATA[bug]]></category>

		<category><![CDATA[categories]]></category>

		<category><![CDATA[export]]></category>

		<category><![CDATA[import]]></category>

		<category><![CDATA[interpolate_fancy]]></category>

		<category><![CDATA[plugin]]></category>

		<category><![CDATA[whitespace]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2006/11/06/blosxom-to-wordpress-tying-up-loose-ends/</guid>
		<description><![CDATA[A busy few weeks, but they&#8217;ve included an import from a Blosxom blog to a WordPress blog which is worth describing. There are a couple of established methods for importing the data, and I opted for the one that seemed the most modular. This was Eric Davis&#8217; Import-Blosxom method, consisting of a PHP script on [...]]]></description>
			<content:encoded><![CDATA[<p>A busy few weeks, but they&#8217;ve included an import from a Blosxom blog to a WordPress blog which is worth describing. There are a <a href="http://codex.wordpress.org/Importing_Content#Blosxom" title="Blosxom">couple of established methods</a> for importing the data, and I opted for the one that seemed the most modular. This was <a href="http://www.insanum.com/">Eric Davis&#8217;</a> <a href="http://www.insanum.com/downloads/Wordpress/import-blosxom.php.gz">Import-Blosxom</a> method, consisting of a PHP script on the WordPress side and a set of Blosxom flavour files which produce a feed compatible with RSS 2.0. This separation of Blosxom and WordPress behaviours meant that I could thoroughly test the former before proceeding with the latter.</p>
<p>It worked very well with practically no configuration or edits, but there were a few issues with the out-of-the-box behaviour of the import script:</p>
<ol>
<li>Unicode character entities were being escaped in titles, leading to the exposure of the alphanumeric code e.g. &#8220;Z&amp;amp;#252;rich&#8221; instead of &#8220;Z&#252;rich&#8221;.</li>
<li>Whitespace in post bodies is converted to hard newlines by WordPress, and so must be excised to avoid tags being broken e.g. &#8216;&lt;a [newline] href=&#8221;&#8230;&#8221;&gt;&#8217; becoming &#8216;&lt;a &lt;br/&gt; href=&#8221;&#8230;&#8221;&gt;&#8217;.
  </li>
<li>Multiple hierarchical categories are not supported (a known problem).</li>
<li>Although categories are created and posts are linked to them, the number of posts that a category is used in is not incremented, and hence the list of categories on the front-end has zero posts for each category (possibly owing to a change between WordPress versions of how this has been handled).</li>
</ol>
<p>I&#8217;ve come up with a number of fixes that I&#8217;ve mentioned both to Davis and <a href="http://wordpress.org/support/topic/32515" title="Migrating from Blosxom">on the WordPress support forums</a>. As they&#8217;ve been greeted with an eerie silence that I&#8217;ve found typical of such forums, I&#8217;ll put them up here instead.</p>
<p>To fix the first three problems I created <a title="a Blosxom plugin to fix a number of RSS2.0-to-WordPress issues" href="/blog/files/code/blosxom/rss_to_wp">rss_to_wp</a>, a Blosxom plugin that, along with the standard <code>interpolate_fancy</code> package, you can use to wrap your title and category processing bits in the RSS2.0 flavour templates. Respectively, this plugin tackles the above problems by:</p>
<ol>
<li>Providing an <code>interpolate_fancy</code> method to unescape entities</li>
<li>Normalizing any whitespace in the body of your Blosxom posts to single spaces</li>
<li>Providing an <code>interpolate_fancy</code> method to convert a Blosxom-style category path into a set of category tags</li>
</ol>
<p>You&#8217;ll need to change the Davis-recommended <code>story.rss20</code> template to implement the two interpolation methods. I&#8217;ve made a <a title="Blosxom story flavour for RSS 2.0" href="/blog/files/code/blosxom/story.rss20">sample</a> available.</p>
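<p>The plugin itself is a Blosxom (Perl) plugin, but the three transformations are simple enough to sketch in Python for anyone checking an export by hand. The function names below are invented for illustration and are not part of <code>rss_to_wp</code>; only the behaviour follows the list above.</p>

```python
import html
import re


def unescape_title(title):
    """Undo one layer of entity escaping in a title."""
    # html.unescape makes a single pass, so a double-escaped
    # reference comes back as a normal character reference.
    return html.unescape(title)


def normalize_whitespace(body):
    """Collapse every run of whitespace, newlines included, to one space."""
    return re.sub(r"\s+", " ", body).strip()


def category_names(path):
    """Split a Blosxom-style category path into its individual categories."""
    return [part for part in path.split("/") if part]
```

<p>One caveat with the whitespace step: it also flattens any preformatted blocks in the post body, so posts containing code listings may need to be treated separately.</p>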
<p>The final issue was a more knotty problem, as it was a bug in the script (possibly caused by WordPress&#8217; handling of categories changing over time). It&#8217;s easily fixed by adding a few lines to the category-handling part of <code>import-blosxom.php</code> as follows:</p>
<blockquote class="code"><pre>
// import-blosxom.php, category handling (around line 294)
if (!$exists)
{
    $wpdb-&gt;query("INSERT INTO $wpdb-&gt;post2cat (post_id, category_id)
                  VALUES ($post_id, $cat_id)");
}

// JPS' addition - increment count if cat ID exists
if ($cat_id) {
    $wpdb-&gt;query("UPDATE $wpdb-&gt;categories SET category_count = category_count + 1 WHERE cat_ID = $cat_id");
}
// End JPS' addition
</pre>
</blockquote>
<p><b>Exit gracefully:</b> exporting and then importing&#x2014;<i>trans</i>porting?&#x2014;works well if the two tasks are separable. That way the integrity of the exported data can be checked in its transitory state and any bugs worked out, before it&#8217;s imported into the new system. It&#8217;s certainly worthwhile backing up the target database for the import, as this lets you preserve any quirks of your target database if you have to dump all the imported data and start again. The standard WordPress install includes a plugin for doing this, but the command-line tool <a href="http://del.icio.us/jp.stacey/mysql+backup+restore+command-line" title="J-P Stacey's del.icio.us bookmarks for MySQL backup/restore">mysqldump</a> is arguably more powerful.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2006/11/06/blosxom-to-wordpress-tying-up-loose-ends/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Now that&#8217;s magic (quotes)</title>
		<link>http://www.jpstacey.info/blog/2006/08/15/now-thats-magic-quotes/</link>
		<comments>http://www.jpstacey.info/blog/2006/08/15/now-thats-magic-quotes/#comments</comments>
		<pubDate>Tue, 15 Aug 2006 20:37:01 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[efficiency]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[layers]]></category>

		<category><![CDATA[addslashes]]></category>

		<category><![CDATA[CGI]]></category>

		<category><![CDATA[cookies]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[object]]></category>

		<category><![CDATA[persistence]]></category>

		<category><![CDATA[php]]></category>

		<category><![CDATA[POST]]></category>

		<category><![CDATA[security]]></category>

		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2006/08/15/now-thats-magic-quotes/</guid>
		<description><![CDATA[If your web application ensures that all your incoming CGI variables are free of the most common source of malicious site damage, can you stop worrying?
I wondered this as I got far enough into a PHP publishing system that I had to start thinking about adding new content through the system (rather than just jamming [...]]]></description>
			<content:encoded><![CDATA[<p>If your web application ensures that all your incoming CGI variables are free of the most common source of malicious site damage, can you stop worrying?</p>
<p>I wondered this as I got far enough into a PHP publishing system that I had to start thinking about adding new content through the system (rather than just jamming it into the database by hand, which is why the previous incarnation has sadly fallen into disuse). As it&#8217;s typically configured, PHP will add backslashes to anything it doesn&#8217;t trust: hence the comment &#8220;it&#8217;s a great site you&#8217;ve got here&#8221; will, when submitted by a POST request, become &#8220;it\&#8217;s a great site you\&#8217;ve got here&#8221;. Whether or not your server does this automatically can be checked by calling the function <code>get_magic_quotes_gpc()</code> (I realised only the other day that &#8220;gpc&#8221; stood for &#8220;GET, POST and cookies&#8221;: I probably have some catching up to do). In performing this blanket adding of slashes, PHP prevents the unwary coder from leaving his site open to both unintentional database hiccups and intentional malevolent attacks, the <a href="http://www.unixwiz.net/techtips/sql-injection.html" title="SQL Injection Attacks by Example">SQL injection attack</a>. </p>
<p>All well and good, but my application is heavily object-oriented. Such objects store whatever content you give them, as well as optionally writing it to the database. If I want these objects to persist (even for the course of a single request) then any access to their internal storage must yield sensible data: <b>those slashes have to disappear</b> before the articles appear in an RSS feed, or on the website itself. So when the CGI environment gives slash-added content to an object, the object needs to know to both add it to the database verbatim and to produce it for viewing with the slashes removed. It can either do this by storing it in a slash-removed state or by placing a filter on its outputs.</p>
<p>There&#8217;s a further complication, in that content can also be written to an object by the PHP application itself: the publishing of all my unpublished articles, for example, would change the status of their accompanying objects without reference to any CGI variable. If I assumed all of this content had had its slashes escaped, then this article, for example, would lose all of its \&#8217; text, because the object would assume the backslashes had been added by PHP&#8217;s internals: in my second paragraph, the &#8220;after&#8221; string would look like the &#8220;before&#8221; string, and the &#8220;before&#8221; string would instead break the database insertion. In addition, what if the server is reconfigured? Can I trust my hosting company to never change the configuration of PHP, even accidentally during an upgrade?</p>
<p>I found myself lost in a maze of adding, removing and then adding slashes, with no clear way of deciding. Suddenly I decided: why not use one of PHP&#8217;s major downsides&#x2014;that it doesn&#8217;t support persistence of objects from one request to the next very well, and hence each action is fighting against the overhead of constantly recreating and recompiling code&#x2014;to ascertain which input/output processes were the most frequent (and most public) and hence needed to be the fastest? I drew <a href="/blog/files/image/magic_sql_slashes.gif" title="GIF image of a typical object's workflow">a flowchart of a typical object&#8217;s behaviour</a> and, by identifying which channels could be safely bottlenecked, arrived at a reasonable solution to the problem.</p>
<p>From my phrasing it&#8217;s clear that it was a foregone conclusion: I wanted, more than anything else, for content to flow straight from the database (through the object if applicable) to the user. This content needed to stay in any object in a simple, de-slashed form, so it could flow and flow as long as the object was in existence. That meant that incoming CGI content could not be stored with its added slashes intact. Counter-intuitively, then, my solution was to <b>undo PHP&#8217;s default safety mechanisms</b>, unescaping the CGI content and storing it raw, and then without fail adding slashes to anything that CGI or my application wanted to add to the database. This would be my bottleneck: everything else would be as fast as it could be.</p>
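<p>The shape of that decision can be sketched outside PHP. The Python below is an illustration only, not the publishing system&#8217;s code: <code>strip_slashes</code> roughly imitates PHP&#8217;s <code>stripslashes()</code>, the <code>Article</code> class and <code>articles</code> table are hypothetical, and parameter binding in <code>sqlite3</code> stands in for the &#8220;add slashes on the way into the database&#8221; step.</p>

```python
import re
import sqlite3


def strip_slashes(value):
    """Remove one layer of backslash escaping, roughly like PHP's stripslashes()."""
    return re.sub(r"\\(.?)", r"\1", value, flags=re.DOTALL)


class Article:
    """Holds content in raw, de-slashed form so it can be output directly."""

    def __init__(self, body):
        # Content set by the application itself arrives here untouched.
        self.body = body

    @classmethod
    def from_request(cls, posted_body, magic_quotes_on=True):
        # Incoming CGI content is unescaped exactly once, at the boundary,
        # and only if the server really did add the slashes.
        if magic_quotes_on:
            posted_body = strip_slashes(posted_body)
        return cls(posted_body)

    def save(self, connection):
        # The single bottleneck: escaping happens only on the way into
        # the database, whatever the origin of the content.
        connection.execute("INSERT INTO articles (body) VALUES (?)", (self.body,))
```

<p>Reads and output never touch the escaping logic, which is the point: the frequent, public path stays fast and the rare write path pays the cost.</p>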
<p><b>Exit gracefully:</b> ensuring all incoming content can be added to the database safely is not necessarily the most efficient or desirable long-term solution. By examining the likely workflows for content, it&#8217;s possible to make pragmatic decisions on where content should be pre-processed and where it should be left alone. Consider all your overheads, including that of short-term programming and long-term cumulative processing time: this will vary depending on your environment. Also, if you&#8217;re aware of a safety net, over the presence of which you have minimal control, account for the possibility that someone might one day remove it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2006/08/15/now-thats-magic-quotes/feed/</wfw:commentRss>
		</item>
		<item>
		<title>You can skip this one (and that five, and that seventeen&#8230;)</title>
		<link>http://www.jpstacey.info/blog/2006/08/11/you-can-skip-this-one-and-that-five-and-that-seventeen/</link>
		<comments>http://www.jpstacey.info/blog/2006/08/11/you-can-skip-this-one-and-that-five-and-that-seventeen/#comments</comments>
		<pubDate>Fri, 11 Aug 2006 14:02:02 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[diagnostics]]></category>

		<category><![CDATA[formats]]></category>

		<category><![CDATA[hacking]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[blue]]></category>

		<category><![CDATA[cell]]></category>

		<category><![CDATA[column]]></category>

		<category><![CDATA[excel]]></category>

		<category><![CDATA[hide]]></category>

		<category><![CDATA[non-contiguous]]></category>

		<category><![CDATA[number]]></category>

		<category><![CDATA[row]]></category>

		<category><![CDATA[sequence]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2006/08/11/you-can-skip-this-one-and-that-five-and-that-seventeen/</guid>
		<description><![CDATA[Programmers: know your Excel!
People who don&#8217;t know how to hide rows and columns still do hide them. They just find&#8230; innovative and unexpected ways of doing so. Changing row height and column width, then protecting the cells against being resized (how they can do that and not know how to hide the cells is beyond [...]]]></description>
			<content:encoded><![CDATA[<p>Programmers: know your Excel!</p>
<p>People who don&#8217;t know how to hide rows and columns still do hide them. They just find&#8230; innovative and unexpected ways of doing so. Changing row height and column width, then protecting the cells against being resized (how they can do that and not know how to hide the cells is beyond me). Or using a filter, and then hiding the dropdowns that show that they&#8217;re filtering: it&#8217;s handy to note, as I finally discovered, that doing this results in the <a href="http://exceltips.vitalnews.com/Pages/T0875_Changing_Coordinate_Colors.html">non-contiguous row or cell numbers turning blue</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2006/08/11/you-can-skip-this-one-and-that-five-and-that-seventeen/feed/</wfw:commentRss>
		</item>
		<item>
		<title>This space intentionally left blank</title>
		<link>http://www.jpstacey.info/blog/2006/07/11/this-space-intentionally-left-blank/</link>
		<comments>http://www.jpstacey.info/blog/2006/07/11/this-space-intentionally-left-blank/#comments</comments>
		<pubDate>Tue, 11 Jul 2006 11:22:23 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[efficiency]]></category>

		<category><![CDATA[import/export]]></category>

		<category><![CDATA[paradigms]]></category>

		<category><![CDATA[declarative]]></category>

		<category><![CDATA[efficient]]></category>

		<category><![CDATA[functional]]></category>

		<category><![CDATA[sql]]></category>

		<category><![CDATA[whitespace]]></category>

		<category><![CDATA[xsl]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2006/07/11/this-space-intentionally-left-blank/</guid>
		<description><![CDATA[I&#8217;ve been asked a couple of times recently, as part of separate projects, to split the results of a SQL query on whitespace within. Simply put, how does one go from:
foo
foo bar
quux
blort wuu spong

to the expanded form:
foo
foo
bar
quux
blort
wuu
spong

efficiently and cleanly, only using SQL? (In case anyone&#8217;s worried, I&#8217;ve scrubbed the data sets of any personal details [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been asked a couple of times recently, as part of separate projects, to split the results of a SQL query on whitespace within. Simply put, how does one go from:</p>
<blockquote><p>foo<br />
foo bar<br />
quux<br />
blort wuu spong</p>
</blockquote>
<p>to the expanded form:</p>
<blockquote><p>foo<br />
foo<br />
bar<br />
quux<br />
blort<br />
wuu<br />
spong</p>
</blockquote>
<p>efficiently and cleanly, <em>only using SQL? </em>(In case anyone&#8217;s worried, I&#8217;ve scrubbed the data sets of any personal details they might have previously contained: any resemblance to the real Blort Wuu-Spong is entirely coincidental.)</p>
<p>I finally decided it wasn&#8217;t possible, and although without the pure mathematics to back me up I could have kept hunting&#8212;partial solutions involving a self-join for each whitespace splitting kept rearing their heads&#8212;what finally convinced me was comparing the behaviour of SQL with that of XSL(T). The two are more alike than you might think; and no, I don&#8217;t mean SQL and XQuery, although that easy comparison provides a clue for the underlying similarity.</p>
<p>In XSL(T), the XML node in your original document(s) is in a sense king: it&#8217;s considered bad form (and is at any rate inefficient) to do data management on some transient data set, created within the template. Loops work best over nodesets rather than with some sort of conditional or from/to structure. This stems from XSL(T)&#8217;s underlying functional paradigm, where each nodeset is created</p>
<p>Of course, it&#8217;s always possible to twist non-functional behaviour out of the stylesheet (and most real-world solutions have to take a pragmatic approach to such programmatic purity) and interpreter-specific kluges exist to node-ize strings based on some non-XML token, but the language works fastest and cleanest when it&#8217;s hanging functions off nodes.</p>
<p>In SQL, the equivalent to the node in an XML document is the row in a query. Rows are passed around, compared with other rows based on the content of some of their cells, tied together and discarded, but very rarely can rows be created out of thin air. The closest one gets is the LEFT/RIGHT OUTER JOIN where the ON-condition is not satisfied: then the left-hand row, rather than being discarded as in the INNER JOIN, is in a sense tied to a row of NULLs. That equates to it being tied to no row at all, but once the SQL99 dust settles and post-processing can begin, NULLs can be reinterpreted (ColdFusion does this without being asked, for example).</p>
<p>So to create new rows, one can UNION two rowsets, or entangle the rowsets with some sort of a JOIN, but in simplest, non-iterative SQL, there <em>ought to be </em>no easy way to make one row magically split into two, or maybe three, or maybe four, based on its textual content. It breaks the underlying principle, that rows should flow through the SQL into bit-buckets or the STDOUT tray, but shouldn&#8217;t be tossed into the stream with flamboyant verve like chillis into a stir-fry.</p>
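<p>If the split cannot happen inside plain SQL, the pragmatic alternative is to let the rows flow out unchanged and split them in the layer that receives them. A minimal sketch in Python with the standard <code>sqlite3</code> module follows; the function name is an assumption for illustration, and the query is expected to return a single text column.</p>

```python
import sqlite3


def split_rows_on_whitespace(connection, query):
    """Run a one-column query and yield one value per whitespace-separated token."""
    for (value,) in connection.execute(query):
        if value is None:
            # A NULL row contributes nothing rather than an empty token.
            continue
        # str.split() with no argument splits on any run of whitespace.
        yield from value.split()
```

<p>The database still does what it is good at, selecting and ordering rows, and the one-to-many expansion happens afterwards in a language that is comfortable creating values out of thin air.</p>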
<p><strong>Exit gracefully:</strong> regardless of the data itself, the data <em>model </em>that a given language&#8217;s designers had in mind can have the most effect on what&#8217;s plausible to do in the language. Almost all languages evolve through proprietary extensions until they can do associative arrays, every kind of loop structure and, if left alone for long enough, GOTOs, but being able to complete a task with a given language is not the same as being able to complete it, for a sufficiently large data set, before the death of your server, your development team or the universe.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2006/07/11/this-space-intentionally-left-blank/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Very flat, Excel</title>
		<link>http://www.jpstacey.info/blog/2006/03/22/very-flat-excel/</link>
		<comments>http://www.jpstacey.info/blog/2006/03/22/very-flat-excel/#comments</comments>
		<pubDate>Wed, 22 Mar 2006 14:16:40 +0000</pubDate>
		<dc:creator>jps</dc:creator>
		
		<category><![CDATA[import/export]]></category>

		<category><![CDATA[customer]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[export]]></category>

		<category><![CDATA[relational]]></category>

		<category><![CDATA[schema]]></category>

		<category><![CDATA[uml]]></category>

		<guid isPermaLink="false">http://www.jpstacey.info/blog/2006/03/22/very-flat-excel/</guid>
		<description><![CDATA[Database design, as an intellectual exercise, can be tremendously satisfying. As one builds up relations and constraints, making your database just rigid enough to support the more intelligent model soon to be layered over it, it&#8217;s possible to feel a sense of future-problems-solved: this check will prevent the model from trying to call a service [...]]]></description>
			<content:encoded><![CDATA[<p>Database design, as an intellectual exercise, can be tremendously satisfying. As one builds up relations and constraints, making your database <em>just </em>rigid enough to support the more intelligent model soon to be layered over it, it&#8217;s possible to feel a sense of <em>future-problems-solved</em>: this check will prevent the model from trying to call a service a finished product; this trigger will roll back any unpassworded changes to a given provider. At the end, a complex relational system, which is normalized just to the point where normalization yields diminishing returns, is a work of craft if not of art.</p>
<p>So it&#8217;s always unsettling when the client then asks for such heavily relational data to be exported to a spreadsheet.</p>
<p>There are many excellent reasons why applications sit on top of a database instead of a spreadsheet, but most of them are hard to explain to clients, many of whom tend to use spreadsheets only as highfalutin Word tables. I calculated that I could express the data relevant to a single provider in one spreadsheet (multiple worksheets); it would therefore take around 1000 files, which the client understandably declined to accept.</p>
<p>Clearly a compromise was necessary, and it came about by asking the client what they actually needed. There were two separate requirements pulling in opposite directions that, singly, were easy to solve: their technical advisor wanted a schema, or at any rate a <a title="Object Management Group, maintainers of UML" href="http://www.uml.org/"><acronym title="Unified Modeling Language">UML</acronym></a> diagram, so they could build a companion application along similar lines to ours; their project co-ordinator wanted a checklist of the provider data we had, to avoid duplication during import and see how to proceed with their own data audits.</p>
<p><strong>Exit gracefully:</strong> After having finally teased out the two separate requirements, and dealt with one by a simple dump of the database schema, I identified the tables which could, just about, be flattened. I planned ahead by checking exactly how many were involved in each many-to-one relationship (no more than two, in the areas I agreed to flatten) and eventually was able to promise the equivalent of an address book for providers. For the task in hand, this was more than adequate.</p>
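<p>The checking and flattening steps might look something like the sketch below. It is not the original export code: the <code>provider</code> and <code>contact</code> tables, their columns and the example addresses are all hypothetical, and Python with an in-memory SQLite database stands in for whatever the real system used.</p>

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE provider (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contact (provider_id INTEGER, email TEXT);
    INSERT INTO provider VALUES (1, 'Alpha');
    INSERT INTO provider VALUES (2, 'Beta');
    INSERT INTO contact VALUES (1, 'a1@example.org');
    INSERT INTO contact VALUES (1, 'a2@example.org');
    INSERT INTO contact VALUES (2, 'b1@example.org');
""")

# Step 1: find the most child rows hanging off any one provider.
# This is how many extra columns the flat sheet will need.
max_children = conn.execute("""
    SELECT MAX(n)
    FROM (SELECT COUNT(*) AS n FROM contact GROUP BY provider_id)
""").fetchone()[0]

# Step 2: one flat row per provider, padded with blanks so that
# every row has the same number of cells.
providers = conn.execute("SELECT id, name FROM provider ORDER BY id").fetchall()
flat = []
for provider_id, name in providers:
    emails = [
        email
        for (email,) in conn.execute(
            "SELECT email FROM contact WHERE provider_id = ? ORDER BY email",
            (provider_id,),
        )
    ]
    padding = [""] * (max_children - len(emails))
    flat.append((name, *emails, *padding))

print(max_children)
print(flat)
```

<p>Because the count in step 1 stays small, the flattened rows remain a manageable width; a relationship with dozens of children per provider would not flatten nearly so politely.</p>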
<p>When building a database for a client, and one which might see some reuse or multiple simultaneous use, always make sure you can justify each constraint or relation not just in terms you can understand but in ones that stress the benefit to the client&#8217;s well-tended data: there will be a benefit to the data, so this isn&#8217;t as hard as it sounds. But, as you build, look for where you might need to make quick simplifications in future. Identify the quasi-flat areas you might be able to isolate, because sooner or later the client is bound to ask for data in that shape.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jpstacey.info/blog/2006/03/22/very-flat-excel/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
