Webform integrates with Entity Token...

but only from 4.x onwards, it seems.

We have a specific client requirement on Drupal 7, to use webforms to update both a third-party CRM system and local “Drupal storage”: whatever that ends up meaning.

It turns out that the excellent Profile2 module provides us with better storage than D7’s core profile module (which doesn’t even use Field API, unlike the rest of D7!) We were always expecting to have to write the CRM/local updating ourselves, but piping those profile values - Profile2 doesn’t modify the $user object in the same way as Profile - into the webforms was a chunk of work we wanted to avoid.

Luckily, as of the 4.x branch, Webform seems to support token replacement; the Entity Token module (part of the Entity API project) lets entities expose themselves as tokens; and Profile2 uses this to hook itself up to Webform. It all works pretty well, although we’re crossing our fingers for a non-alpha release on Webform’s 4.x branch.


Removing checked-in client data from a git repository - permanently

During a recent migration, we checked in some client data to the sites/default git repository. This was a mistake: not least because the client’s data was some four or five times the size of the rest of the codebase; but also because there were non-ASCII characters in some of their filenames (“Pannetoné”, anyone?) These played havoc in folders shared between a Linux Vagrant box and a Mac OS X host.

Removing files permanently from version control means removing them not just from the current revision, but also rewriting every revision in the repository so they’re not there either (otherwise Git keeps them hanging around in its object repository, in case those old revisions are ever checked out again.) While there are lots of fragments of solutions online, very few of them encompassed the full complexity of our requirements: many branches, many tags, all needing to be preserved, along with trialling the changes by pushing them all to a temporary remote repository.

Our solution to this problem has ended up a little more long-winded than some, but it’s safer and works harder to preserve the integrity of the version history; when we work on behalf of clients, that’s our biggest priority. And applying this to a 27MB repository clone reduced its size to under 6MB!
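As a rough illustration of what "rewriting every revision" involves, here is a minimal Python sketch that only builds the relevant git command lines rather than running them. It is not the solution described above: it assumes the stock `git filter-branch` approach, and the path `files/client-data` is a hypothetical stand-in.

```python
import shlex


def filter_branch_command(path):
    """Build a git filter-branch invocation that drops `path` from every commit."""
    index_filter = "git rm -r --cached --ignore-unmatch " + shlex.quote(path)
    return [
        "git", "filter-branch",
        "--index-filter", index_filter,  # rewrite each commit's index without the path
        "--prune-empty",                 # drop commits that end up empty
        "--tag-name-filter", "cat",      # keep tags, re-pointed at rewritten commits
        "--", "--all",                   # every branch and tag, not just the current one
    ]


def cleanup_commands():
    """Commands that let git discard objects nothing refers to any more."""
    return [
        ["git", "reflog", "expire", "--expire=now", "--all"],
        ["git", "gc", "--prune=now", "--aggressive"],
    ]


if __name__ == "__main__":
    # "files/client-data" is a hypothetical path, used purely for illustration.
    for argv in [filter_branch_command("files/client-data")] + cleanup_commands():
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first, and trying them on a throwaway clone, fits the cautious approach the post argues for. Note that filter-branch keeps backup refs under refs/original/, which would also need removing before the repository actually shrinks.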


Form and content at Oxford Geek Night #13

The future of textual rendering and data visualizations at OGN13.

Those of you who subscribe to the Oxford Geek Nights Google group hopefully need no reminder that Oxford Geek Night 13 is on Wednesday 15 July. But, more excitingly, the two keynotes are now confirmed.

Bruce Lawson, Open Web Standards evangelist at Opera, is no stranger to Oxford Geek Nights, and covered new developments in accessibility back in OGN10. This time he'll be discussing the forthcoming new standard for hypertext markup, HTML5, and what effects it will have on web-browsing as we know it.

Andrew Walkingshaw, co-founder of Inkling Software, will present the rise (and further rise) of their service for data visualization and storage, Timetric. He'll also be discussing recent work by the Guardian which has incorporated Timetric visualizations, including a recent article on the relative purity of illegal drugs seizures over time.

We still need microslot talks, though, so if you're interested then do volunteer.

Tonight we're gonna parse like it's 1997

Opinions are like closing angle brackets: everyone's got one, but some stick out more than others, depending on your kerning

Via Sean McGrath comes a reasonably lucid and comprehensible redux of the argument about whether or not the XML standard should stipulate (or should have stipulated) draconian error handling. I hope I'm not misrepresenting Avery when I boil a lot of it down to his three broad "real-world" examples:

  1. Not well-formed XML, produced by a legacy application that takes ages to fix, is rejected by draconian parser
  2. Not well-formed XML is accepted by a permissive parser
  3. Well-formed XML is accepted by draconian parser

and I hope he's also happy for me to then state that his argument consists broadly of the suggestion that 1 and 2 are together more likely than 3, hence permissive parsers obtain for you the lion's share of the "real world" parsing instances; or, if you prefer, via a slightly more complicated profit-and-loss argument, that making your parser permissive, and sanctioning permissive parsers, contributes a lower overall cost through lumbering us with poor legacy applications, divided up among all the parsing events, than having to fix those legacy applications.

However, an application of Postel's Law to the process of implementation should not be confused with being able to apply it to the original specification. And besides, do those examples really portray the real world, nearly twelve years after the argument first took place? How much XML is out there, and how much of it is bad XML, and how much of it remains bad XML for long enough for it to cause a problem? I don't think it's clear that draconian error handling in the wild has held back RSS syndication, Google Maps, web services, or RDF so much as that, beyond a certain tipping-point (say, 2002?) they've ensured the rapid takeup thereof (with the possible exception of RDF until recently, for its own reasons).

XML is unbelievably popular today, so popular and routine in its use that you almost don't know it's there in most applications. And I think---purely from my own experience---it's plausible to suggest that that's at least partly because consumption of XML is easy; in turn, this is because basic production quality is enforced.
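To make the draconian/permissive distinction concrete, here is a small Python sketch using only the standard library. The sample markup and the function names are my own, chosen for illustration: the XML parser refuses mismatched tags outright, while the HTML parser carries on and reports whatever it saw.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN = "<p>Unclosed <b>bold text</p>"  # mismatched tags: not well-formed XML


def parse_draconian(markup):
    """Return the root tag name, or None if the XML parser rejects the input."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        return None


class TagCollector(HTMLParser):
    """Permissive parser: records every start tag it sees and never complains."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_permissive(markup):
    """Return the list of start tags found, however broken the markup is."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

The draconian function gives the consumer one simple failure mode to handle; the permissive one hands back something for any input, and leaves the consumer to decide what the author probably meant.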

HTML (SGML) is a format (specification) that, because of its messiness (its complex rules), its parsing permissiveness (its potential for misunderstandings), and a whole host of cultural reasons (ditto), was terribly hard to write reliable consumption software for. Even now, there's around half a dozen good browsers, and in part that's surely because of the entry barrier to writing browsers: permissive parsing of real-world mistakes remains a complex task.

I also have dim and partial knowledge of SGML in the old-skool publishing industry, where a licence for fully-featured SGML software could set you back tens of thousands of pounds six or seven years ago, and that price didn't seem to be heading down under market pressure. In comparison, XML parsing is cheap, easy, and ubiquitous. There are free and open-source CMS and blogging packages that can do it; I have access to dozens of command-line tools that can do it; publication, syndication and webservice consumption are things that happen, almost as though nature intended it that way. A lot of that must surely be down to XML's combination of rule simplicity and parser rigour. As Dave Winer says on the subject of Postel's Law and XML:

I yearn for just one market with low barriers to entry, so that products are differentiated by features, performance and price; not compatibility. Compatibility should be expressed in terms of formats, not products.... Anyway, the other half of Postel's Law is just as interesting, but so far no one is commenting. Think about it, if everyone followed the second half, the first half would be a no-op. You could be fully liberal in an afternoon or less.

Mark Pilgrim's history of draconianism versus tolerance seems to consist of a lot of tolerantists pontificating about what they've decided the "draconian" argument is: I can't believe that Tim Bray, even if he really were a lone voice, would have been such a reluctant paper tiger. But like the 1997 tolerantists, I've thus far waded in with my own interpretation of events. And despite dealing with XML on a daily basis, I find that during so many of the tasks I have to accomplish the XML layer is able to fade almost completely into the background.

Of all the problems I encounter at work, well-formedness of XML crops up very rarely compared to those concerning the quality and stability of my own algorithms, application control flow, scaling and coping with heavy load, and logging and bailing out. Whether XML's ease of use in 2009 is a result of the small rule set in XML making well-formedness easy, or the initial decision in favour of draconian parsing, all decided back in 1997, we'll probably never be able to tell. All that's certain is that there'll always be opinions about it, and somewhere in the rambling above is mine.

A WTF at the heart of your Drupal feed aggregation

Do try this at home, kids: but please have the decency to feel a little dirty about it.

Embedding JSON in XML. Hah, that's ridiculous, right? Almost as ridiculous as running a successful blog in .NET/ASP. Well, RSS can combine with JSON to quickly get a Drupal site to consume complex data structures over a webservice.

Drupal's core Aggregator module understands RSS2.0 with no tweaking, putting the text in the <description/> element into the content of quasi-node objects, so you can aggregate all sorts of syndicated content. You could build your own Google Reader if you liked that sort of thing, with articles from the BBC sitting alongside those from the Guardian.

So far so boring. And, on one level, it doesn't get much more interesting than that: Aggregator understands neither Atom XML (rich content) nor RSS that contains Dublin Core fields. There's therefore a limit to how much you can extend the actual XML format.

But what if you get a remote application to produce an RSS feed like this:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Hello, world</title>
    <link>http://example.com</link>
    <description>Recent updates</description>
    <language>en</language>
    <item>
      <title>Sample JSON encoded content</title>
      <link>Foo</link>
      <description>
        {"text": "This is some lovely JSON text"}
      </description>
      <pubDate>Mon, 24 Nov 2008 22:07:03 +0000</pubDate>
      <guid isPermaLink="false">none</guid>
    </item>
  </channel>
</rss>

"What if?" Well, you get a quasi-node of content whose body contains the literal JSON text. Not terribly exciting. But Drupal's powerful theming system means you can override the way that such content is rendered.

Drop a file into your theme's directory called aggregator-item.tpl.php and containing the following:

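<?php
// $content is the item's <description> text: here, a JSON string.
$data = json_decode($content);
// json_decode() returns NULL on malformed input, so check before printing.
if (isset($data->text)) {
  print $data->text;
}
?>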

Voilà! You've unpacked the JSON data packet and accessed the content. And the packet, being JSON, can contain however much hierarchical data that you want. You could essentially encode whatever you liked at the webservice side and unpack it at the webconsumer side. You can't pickle objects very easily, unfortunately, but my recommendation is to avoid doing that sort of thing.
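The same unpacking works outside Drupal too. As a sketch only, here is how a Python consumer might pull the JSON back out of a feed shaped like the one above, using just the standard library; the function name `decode_items` is my own invention.

```python
import json
import xml.etree.ElementTree as ET

# A cut-down copy of the sample feed, with the XML declaration omitted.
FEED = """<rss version="2.0">
  <channel>
    <title>Hello, world</title>
    <item>
      <title>Sample JSON encoded content</title>
      <description>
        {"text": "This is some lovely JSON text"}
      </description>
    </item>
  </channel>
</rss>"""


def decode_items(feed_xml):
    """Return the decoded JSON payload of each <item>, or None where it fails."""
    payloads = []
    for item in ET.fromstring(feed_xml).iter("item"):
        raw = item.findtext("description", default="")
        try:
            payloads.append(json.loads(raw))  # surrounding whitespace is tolerated
        except ValueError:
            payloads.append(None)  # empty or non-JSON description
    return payloads
```

One thing to watch at the producing end: any `<` or `&` inside the JSON has to be XML-escaped (or wrapped in CDATA) for the feed to stay well-formed. The XML parser undoes that escaping before the JSON decoder sees the text.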

(You might need to empty your cache, if you've got any sort of zealous caching switched on. And this specific example will only work on PHP 5.2 or later, unfortunately: json_decode() is a recent addition to the already-polluted default PHP namespace. You could use the PHP serialize() format if you've got an older version of PHP, or some other serialized data format that PHP can understand.)

If you were building all this from scratch, then of course you'd use either XML or JSON throughout, and not this weird hybrid solution. If you were building it from scratch. And if you are building it from scratch: let me know when you're done.

Spliticket running again with BeautifulSoup

Or, how I learned to stop parsing and love the soup

Ages ago Matthew Somerville emailed me to say that spliticket had fallen over. It's my hacky interface to his wiki page documenting split tickets, and ultimately it found the vagaries of even wiki-generated HTML a bit too hard to cope with.

At the time I built the HTML parser using core SAX-based HTML parsing, and it was horrible. SAX works in a basic sense, but you have to build your own internal state engine, track which elements have gone past while working out what to do with the current context, and even write rules for what to do when the underlying dumb parser encounters HTML entities: no mean feat when the document is peppered with &#8211; en dashes.
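For a flavour of that hand-rolled state tracking, here is a minimal sketch in modern Python using the standard library's event-driven `html.parser`. It is not the original spliticket code, and the table layout it expects is an assumption: I have not reproduced the wiki page's real markup.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of each <td>, row by row, tracking state by hand."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#8211; into characters.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # abandon any cell left open by sloppy markup


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Even this toy needs two pieces of state and a special case for unclosed cells, and every new quirk in the source markup means another branch.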

Not only was writing the rules initially a pain in the rear, but adding new rules and bugfixing the existing ones was even worse. But I lived with SAX, because I was deploying on shared hosting: I presumed that this was the best option available if I couldn't install any new shared libraries.

Not true! I've just rebuilt the entire parsing layer with Beautiful Soup, a Python HTML/XML parser library which (a) is available as a single file and (b) works out a decent HTML DOM tree from pretty much anything you throw at it.

Try it yourself, if you have to do any HTML parsing. It's astonishing; beautiful, in fact. I will never write another SAX parser ever again, which I'm sure I've said before.

Improving REST performance is all about negotiation

Ceci n'est pas un objet... nécessairement.

Web architects must understand that resources are just consistent mappings from an identifier to some set of views on server-side state. If one view doesn’t suit your needs, then feel free to create a different resource that provides a better view (for any definition of “better”). These views need not have anything to do with how the information is stored on the server, or even what kind of state it ultimately reflects. It just needs to be understandable (and actionable) by the recipient.

--- Roy Fielding on creating new resources for REST architectures.

Firefox/Sage bookmarks to Google Reader import

When OPML is OPML but it isn't OPML

Want to migrate your RSS bookmarks from Firefox (or its RSS-reading addon, Sage) to Google Reader? I did, just now.

Christopher Hinze has written a great Firefox addon that exports bookmarks to OPML 1.0. Unfortunately, OPML is a bit of an anything-goes specification. So although Hinze's plugin produces valid OPML, it isn't the same sort of valid OPML that Google Reader expects. Google Reader, in fact, gags and chokes on Hinze's OPML, and refuses to import it.

The main problem is that the <outline/> element, the basic hierarchical building block for OPML, will take any attributes. What does that mean in practice? Well, here's what Hinze's export produces:

<outline text="Coding">
  <outline type="link" text="Joel on Software" url="http://www.joelonsoftware.com/rss.xml" />
</outline>

and here's the result of Google Reader exporting its own store of RSS bookmarks:

<outline title="Coding" text="Coding">
  <outline text="drupal.org - Community plumbing"
    title="drupal.org - Community plumbing" type="rss"
    xmlUrl="http://drupal.org/node/feed" htmlUrl="http://drupal.org"/>
</outline>

To a computer, these are fundamentally two different data formats: the URLs are stored in different attributes, and there are attributes on each that either have different values or are not present on the other. Someone did a DTD for OPML: looking at those two apparently analogous fragments above you have to ask yourself why they bothered.

Help is at hand, though. This sort of problem is bread and butter to XSLT, and here's an XSL transform for converting Firefox OPML to Google Reader OPML. If you have xsltproc installed on your system, you would type:

xsltproc http://www.jpstacey.info/applications/google/ff2gr_opml.xsl bookmarks.opml > fixed_bookmarks.opml

Or download the XSLT---it's released under GPL2---and run it locally, changing that URL there to a local file location.

One thing to note: the XSLT will remove an outline wrapped around your bookmarks with title "Sage Feeds" (case-sensitive). So you can export that branch of your bookmarks, and the XSLT will strip the wrapper off and you won't import a load of bookmarks tagged "Sage Feeds". If you don't like this behaviour then either rename your Sage bookmark container, or learn XSLT: it won't kill you.
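If XSLT really isn't your thing, the same attribute shuffling can be sketched in Python. This is not a port of the stylesheet, just my reading of the two fragments above: it assumes the wrapper is identified by either its `title` or `text`, copies `text` into `title`, and does not invent the `htmlUrl` attribute, which the Firefox export has no data for. It is untested against Google Reader itself.

```python
import xml.etree.ElementTree as ET


def convert_outline(node):
    """Rewrite one <outline> in place, then recurse into its children."""
    node.set("title", node.get("text", ""))
    if node.get("type") == "link" and "url" in node.attrib:
        node.set("type", "rss")
        node.set("xmlUrl", node.attrib.pop("url"))  # move url to xmlUrl
    for child in node:
        convert_outline(child)


def firefox_to_reader(opml):
    """Return Reader-style OPML text for a Firefox/Sage OPML export."""
    root = ET.fromstring(opml)
    body = root.find("body")
    if body is None:
        body = root
    # Strip a top-level "Sage Feeds" wrapper, promoting its children.
    unwrapped = []
    for node in list(body):
        if node.tag == "outline" and "Sage Feeds" in (node.get("title"), node.get("text")):
            unwrapped.extend(list(node))
        else:
            unwrapped.append(node)
    for node in list(body):
        body.remove(node)
    body.extend(unwrapped)
    for node in body:
        if node.tag == "outline":
            convert_outline(node)
    return ET.tostring(root, encoding="unicode")
```

Calling `firefox_to_reader()` on the contents of bookmarks.opml and writing the result to a new file would mirror the xsltproc command above.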

Cheaper rail journeys with Matthew and Spliticket

The rail industry's biggest fares secret: exposed, and now given an interface.

I’ve built something, just in time for me to crowbar it awkwardly into conversations at OKCon.

Matthew Somerville is well known for his accessible takes on rubbish websites: most useful is Traintimes, a layer on top of one of the equally poor commercial British rail sites; most notorious is Accessible Odeon, a fixing of the Odeon Cinema’s website that put their own substandard development to shame so much that in 2004 they had Matthew’s version taken down with legal threats. Remember: just because you’re successful doesn’t mean you’re not stupid.

Recently, Matthew gave an OGN talk on split tickets. These are train journeys which one cross-country train company sells at exorbitant rates, whereas the components of the journey can be bought separately from the local companies for considerably less. This is what we infer; the rail websites try to keep it all quiet, in the hope that you might lose interest and stump up. The whole headshakeworthy situation is either a stinging indictment of the stupidity of privatizing a rail network and making passengers jump through ridiculous hoops to glean even the tiniest advantage, or a perfect demonstration of how the choice-enriched consumer can leverage capitalism in action: take your pick.

To come to the point: Matthew has set up a wiki where people have been adding the split tickets they’ve worked out in an ad-hoc fashion. Last Sunday I added K’s astonishing 47% saving for Oxford–Cardiff to the end of it, and there’s been little activity since. The page hasn’t got fantastic Google juice (“split tickets” means too many other things) and is just non-dynamic HTML, editable through the wiki but otherwise searchable only by eye.

At the moment, of course, there’s no reason for Matthew to expend any more effort on such a small and barely popular data set. But last Sunday I had a sudden instinctive jolt, to the effect that: more people would be likely to take advantage of split tickets if there was an easier way of looking up other people’s discoveries (my colleagues were recently trying to find split tickets and the wiki page was harder to use than Traintimes, for that very purpose).

With this in mind, here’s my take on accessificating the wiki page: Spliticket. It accesses a cached version of the page and then, using Python’s HTMLParser to hack away at the HTML, returns either an HTML or XML representation of any journey it finds. The idea is that it’s a bit easier to use on your mobile device, and lets you pin down journeys better. I’ve also included an option for searching strictly for the same journey: I’m not sure if split tickets, even returns, always work when the journey is reversed.

Spliticket also supports friendly URLs (inspired by Traintimes itself), so York to Edinburgh in HTML becomes <http://www.jpstacey.info/applications/spliticket/html/york/edinburgh/> and the aforementioned Oxford to Cardiff route in XML becomes: <http://www.jpstacey.info/applications/spliticket/xml/oxford/cardiff/> .
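Picking such a URL apart is simple enough. The following is a hypothetical Python sketch, not Spliticket's actual routing code, which takes the last three path segments as the format, origin and destination.

```python
def parse_route(path):
    """Split '/.../<format>/<from>/<to>/' into (format, origin, destination)."""
    parts = [segment for segment in path.split("/") if segment]
    if len(parts) < 3:
        raise ValueError("expected <format>/<from>/<to> in %r" % path)
    fmt, origin, destination = parts[-3:]
    if fmt not in ("html", "xml"):
        raise ValueError("unknown output format %r" % fmt)
    return fmt, origin.lower(), destination.lower()
```

So `parse_route("/applications/spliticket/xml/oxford/cardiff/")` yields the format and the two stations, ready to be matched against the journeys scraped from the wiki page.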

Mostly this was just something to fill idle hours, and also to convince myself of some ideas I’ve had recently about loose coupling, data reuse and open data. Hopefully at the same time (a) Matthew won’t take it as a dig, (b) others might find some casual use for it, and (c) a select few might see it as a testament to the ease of freeing information from solid if unsemantic markup. And maybe in six months’ time we’ll all be booking split tickets as a matter of course; by then, of course, Spliticket will probably be gratefully obsolete, replaced by the fully-fledged application that poor, embittered rail passengers deserve. I’m sure Matthew will build it if so.

Last.fm on Ubuntu Gutsy: smooth as rabbit fur

One of my resolutions this year is to try to cut down on the carbon I spend on music. Notwithstanding my purchase of the In Rainbows discbox, I’ve amassed an awful number of discs of metallized plastic in barely-recyclable containers. (I say “barely” because K. got me a pencil for Christmas made out of old CD boxes, and a pen from dead car parts. But there’s only so many pencils the world can use.)

As I spend the scraps and offcuts of January and February evenings ripping and filing my 2007’s CDs—some of which I won’t listen to very often once they’re fossilized in the collection—I’m aware of a tremendous weight of madeness and invested time and energy on the part of the manufacturers, and of a sort of casual luxuriating in my first-world lifestyle on my own part. You prepare a playlist before me in the presence of my enemies. You anoint my tapeheads with crude oil; my CD tray overflows. So in 2008 I hope to buy as few CDs as possible (none is the target) while also avoiding DRM-crippled music and staying legal.

To this end I’ve been seeking free and semi-free online music (free as in beer, semi-free as in of limited choice) since the new year. So far, outside of bittorrenting (which is obviously of variable legality, depending on what you’re downloading), I’m having some success with Last.fm. Until recently they offered a sort of customized “radio station”, where your input into the choice of the next track was limited to an intelligent deduction by Last.fm based on what you told it you enjoyed in the past. Now, alongside this potluck service, they’ve just started offering three free streamings of any explicitly chosen track before requiring you to buy the track from a commercial partner.

I’ve yet to try the former service (I think you might have to subscribe to be on the beta wagon: I’ll look into that later), but the latter has so far provided our house with unlimited, free access to a radio station for our very own target market. While such slightly sinister profiling might make it harder for me to discover truly new music, it does at least permit me to expand the boundaries of my comfort zone slowly, and cast a critical eye over my friends’ music preferences, while at the same time giving artists their due and most importantly avoiding physical recordings unless I really want them.

Most commercial support for Linux distributions still consists of monolithic installations, wrapped up with checksums to prevent you tampering with them, and installing themselves on your computer in whatever location and potentially harmful fashion they fancy. Until upgrading to Gutsy this was largely my experience (painfully and often repeated) with such packages as nVidia and wireless drivers, and interesting software that barely gave a second thought to existing Feisty users.

After a spot of Googling I was expecting to have to go through the same palaver with Last.fm’s client, and crossed my fingers that nothing would go horribly wrong. But I needn’t have worried: the Linux client for Last.fm is

  1. free of cost, as in beer
  2. free of restrictions, as in open source
  3. free of hard work, as in a no-sweat installation utilizing the Debian packages and apt package management that’s core to Ubuntu

To install it on Gutsy, you first want to add the GPG key for the repository for security reasons. At a command line, type:

wget -q http://apt.last.fm/last.fm.repo.gpg -O- | sudo apt-key add -

You’ll be asked by sudo for your password. Then, open Synaptic Package Manager (under “System > Administration” in the GNOME menus); then, via “Settings > Repositories”, add the following new third-party repository:

deb http://apt.last.fm/ debian stable

You can then search for the Last.fm widget’s package in the manager (hint: it’s called lastfm) and install it. When you first run it after installation it’ll ask for your Last.fm account, so best have one of those in advance. And that’s it: you’ve now got Last.fm’s widget on your Ubuntu PC.

All of the above is explained briefly on the very URL of the apt repository. Not only that, but they have a free bonus photo of a very cute bunny in case all the apt stuff bores you rigid. Like a TV licence for the Flash version of the BBC iPlayer, all of this is practically worth a subscription alone. As I type, my mouse sits over the very location of the link to do so in a separate tab. I just need to know first: how many more rabbits do I get when I join?
