php

Feeds objects within feeds objects

The Drupal Feeds module consists of layers of objects, tunnelling between each other, like a pearl onion on a cocktail stick

We've been doing a lot of work with the Drupal Feeds module recently. The frontend is nice enough, although the sub-navigation was rendered almost illegible by our theme's CSS. The online tutorials need work, and the admin navigation needs to be made a bit more robust to layout changes; but then it will be the de facto way for people to consume feeds on their Drupal sites.

The most recent work we've been doing involved custom integration with RSS feeds arriving effectively as PHP string variables containing all the XML. This is different from either a file on disk or a remote URL: in fact, we had a Python program creating the RSS file for us via a shell (which in turn, horribly, was hitting a remote Oracle database using cx_Oracle). Feeds was definitely up to the job in terms of power. In fact, it was quite a toolkit of useful functionality, which is Drupal code for "incredibly powerful but almost incomprehensible".

It's not that the developer documentation for Feeds isn't decent: it's pretty good. But it's limited in scope: it tells you roughly how to expose your own Feeds-like objects to the admin interface, but not really how all those objects interact. Most importantly, we wanted to know what happened on a cron run: this is the bedrock of how Feeds works on your site, after all.

I poked around a bit and this is what I discovered:

 

Workflow of a Feeds cron run

Here's a summary of the above diagram to give you some idea of what's going on.

 

  1. Drupal's cron creates a FeedsScheduler object and passes it a "job", which is all the configuration for a feed call, including any configuration that was attached originally to the particular node which defines the Feed. The scheduler creates a FeedsImporter and passes it the job; the importer then creates a FeedsSource and embeds itself in it as a parent. In each case, the method ::work() is called to create the child/helper object.
  2. The Source object is what now runs the three phases of feed consumption, via its parent Importer. The Source asks the Importer for the relevant Fetcher, Parser and Processor objects: for example, the HTTP Fetcher, the RSS Parser and the Node Processor objects are strung together to turn an RSS feed at an HTTP URL into a set of nodes, one per entry. Each of these has a suitably named, verb-like method: ::fetch() for the Fetcher, and so on. The common currency is a FeedsBatch object, which gets passed around and needs to have methods that make it feel like a batch of feed objects.
  3. After the three phases have run, the Source calls hook_feeds_after_import() to do any tidying, then quits to the Importer, which quits to the Scheduler, which then runs its ::finished() method on the job, and the cron run for this particular feed is done.
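To make the three phases concrete, here is a toy sketch of the fetch, parse and process chain sharing one batch object. None of these classes belong to Feeds: ToyBatch, ToyFetcher, ToyParser, ToyProcessor and toy_import() are all made-up stand-ins, and the real objects carry far more configuration.

```php
<?php
// Toy stand-ins for the three phases; none of these classes are part of Feeds.

// The shared "currency": holds the raw payload and the parsed items.
class ToyBatch {
  public $raw = '';
  public $items = array();
}

class ToyFetcher {
  // Fetch phase: obtain the raw feed (here, a hard-coded string).
  public function fetch(ToyBatch $batch) {
    $batch->raw = "first entry\nsecond entry";
  }
}

class ToyParser {
  // Parse phase: split the raw payload into individual items.
  public function parse(ToyBatch $batch) {
    $batch->items = explode("\n", $batch->raw);
  }
}

class ToyProcessor {
  public $saved = array();

  // Process phase: turn each item into something stored (here, an array entry).
  public function process(ToyBatch $batch) {
    foreach ($batch->items as $item) {
      $this->saved[] = strtoupper($item);
    }
  }
}

// Plays the role of the Source: runs the three phases in order.
function toy_import(ToyFetcher $fetcher, ToyParser $parser, ToyProcessor $processor) {
  $batch = new ToyBatch();
  $fetcher->fetch($batch);
  $parser->parse($batch);
  $processor->process($batch);
  return $processor->saved;
}

$result = toy_import(new ToyFetcher(), new ToyParser(), new ToyProcessor());
```

Swapping any one of the three objects for another with the same method leaves the other two untouched, which is the property that makes the plugin approach worthwhile.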

 

When you build a new plugin, you need to implement hook_feeds_plugins() in a module and reference a class file: this class will be selectable in the admin interface for one of the three consumption phases, depending on what class it's ultimately based on. You should therefore extend existing classes rather than start from scratch: there are abstract PHP classes in the feeds module directories, which give you skeleton "interfaces" which you can then flesh out with relevant functions. But what's better is to extend e.g. the HTTP fetcher to fetch from a command on disk (which is what we did) or, say, extend the CSV parser to interrogate JSON.
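A rough sketch of what such a hook might look like, for a hypothetical module called mymodule exposing a hypothetical MyCommandFetcher class. The array keys follow the shape I recall from the Feeds developer documentation, so check them against the version you have installed before relying on them.

```php
<?php
/**
 * Sketch of hook_feeds_plugins() for a hypothetical module "mymodule".
 */
function mymodule_feeds_plugins() {
  $info = array();
  $info['MyCommandFetcher'] = array(
    'name' => 'Command fetcher',
    'description' => 'Fetches a feed from the output of a local command.',
    'handler' => array(
      // The class we extend decides which phase the plugin appears under.
      'parent' => 'FeedsHTTPFetcher',
      'class' => 'MyCommandFetcher',
      'file' => 'MyCommandFetcher.inc',
      // In a real module this would usually be
      // drupal_get_path('module', 'mymodule').
      'path' => dirname(__FILE__),
    ),
  );
  return $info;
}

$plugins = mymodule_feeds_plugins();
```

The class file named here (MyCommandFetcher.inc, also an assumption) would then contain the class that extends the parent and overrides only what needs to change.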

Class hierarchies mean you don't have to spend a lot of time reinventing the wheel or hacking existing modules until they become unupgradeable; instead you can take existing classes and tweak them through inheritance, experimenting as you develop.

A WTF at the heart of your Drupal feed aggregation

Do try this at home, kids: but please have the decency to feel a little dirty about it.

Embedding JSON in XML. Hah, that's ridiculous, right? Almost as ridiculous as running a successful blog in .NET/ASP. Well, RSS can combine with JSON to quickly get a Drupal site to consume complex data structures over a webservice.

Drupal's core Aggregator module understands RSS2.0 with no tweaking, putting the text in the <description/> element into the content of quasi-node objects, so you can aggregate all sorts of syndicated content. You could build your own Google Reader if you liked that sort of thing, with articles from the BBC sitting alongside those from the Guardian.

So far so boring. And, on one level, it doesn't get much more interesting than that: Aggregator understands neither Atom XML (rich content) nor RSS that contains Dublin Core fields. There's therefore a limit to how much you can extend the actual XML format.

But what if you get a remote application to produce an RSS feed like this:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Hello, world</title>
    <link>http://example.com</link>
    <description>Recent updates</description>
    <language>en</language>
    <item>
      <title>Sample JSON encoded content</title>
      <link>Foo</link>
      <description>
        {"text": "This is some lovely JSON text"}
      </description>
      <pubDate>Mon, 24 Nov 2008 22:07:03 +0000</pubDate>
      <guid isPermaLink="false">none</guid>
    </item>
  </channel>
</rss>

"What if?" Well, you get a quasi-node of content whose body contains the literal JSON text. Not terribly exciting. But Drupal's powerful theming system means you can override the way that such content is displayed.

Drop a file into your theme's directory called aggregator-item.tpl.php and containing the following:

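<?php
// $content holds the item's description, which here is a JSON string.
$data = json_decode(trim($content));
// Only print when decoding succeeded and the expected key exists.
if (is_object($data) && isset($data->text)) {
  print $data->text;
}
?>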

Voilà! You've unpacked the JSON data packet and accessed the content. And the packet, being JSON, can contain as much hierarchical data as you want. You could essentially encode whatever you liked at the webservice side and unpack it at the webconsumer side. You can't pickle objects very easily, unfortunately, but my recommendation is to avoid doing that sort of thing.
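As a quick illustration of the hierarchical case, here is a sketch of both ends of the exchange. The author and tags fields are invented for the example; only the text field appears in the feed above.

```php
<?php
// Producer side: encode a nested structure (the extra field names are made up).
$packet = json_encode(array(
  'text' => 'This is some lovely JSON text',
  'author' => array(
    'name' => 'Example Author',
    'tags' => array('drupal', 'rss'),
  ),
));

// Consumer side: passing TRUE decodes to nested associative arrays
// rather than objects.
$data = json_decode($packet, TRUE);
$first_tag = $data['author']['tags'][0];
```

Decoding to arrays rather than objects is a matter of taste; the template above uses the object form.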

(You might need to empty your cache, if you've got any sort of zealous caching switched on. And this specific example will only work on PHP 5.2 or later, unfortunately: json_decode() is a recent addition to the already-polluted default PHP namespace. You could use the PHP serialize() format if you've got an older version of PHP, or some other serialized data format that PHP can understand.)

If you were building all this from scratch, then of course you'd use either XML or JSON throughout, and not this weird hybrid solution. If you were building it from scratch. And if you are building it from scratch: let me know when you're done.

The Straight Edge minimalist WordPress theme

The Straight Edge theme is now available for download.

As promised, I'm releasing the Straight Edge theme used on this blog under GPL2.

There's a brief README.txt in the zipped archive linked above, but the theme's main features are:

  • XHTML compatible (in core theme files)
  • Minimal, semantic markup
  • No sidebar
  • Excerpts on archive and category pages
  • Implicit RSS feeds: the only orange icon is in your browser chrome
  • Adaptive top navigation
  • Separate pages for archives, categorisation and blogrolls
  • Next/previous rel links in header
  • Support for special pages e.g. blogroll, tag cloud

The todo list includes:

  • Implement my Blogthis! plugin, while trying to keep minimalist
  • Unobtrusive hiding of elements, using jQuery
  • Improve styling

The theme is in a fairly alpha state. The PHP is fairly straightforward, apart from some neat theme functions, but don't blame me if everything goes bang.

Save our servers!

Sick and tired of getting a million hits, all to the same page, which more often than not hasn’t been updated in the mean time? Want to reduce your bandwidth and server-time loads without necessarily impairing your visitors’ experience of your site?

If you haven’t ever had cause to use it, there’s a standard called ETag out there which you can probably implement using existing technology that can boost the efficiency of your content delivery a hundredfold. Along with the longer-standing HTTP header Last-Modified it can be used with compliant browser/aggregator software to drastically lower your overheads, while scarcely impacting on your non-compliant users. And although the two standards see most use in the blogosphere, they can be used for anything else from a company’s record in a directory to enormous high-resolution image feeds from astronomy laboratories.

The idea is that you embed in every outgoing page response a couple of HTTP header lines. That’s easier for the total n00b than it sounds: you can do it in one line each with, say, the PHP function header() or the ColdFusion tag <cfheader>, or even in the HTML if you don’t have that level of access, using <meta http-equiv=.../>. The point is that you set the following two flags on the “envelope” that surrounds the page you send to the browser:

Last-Modified: Wed, 15 Nov 2006 18:20:54 GMT
ETag: "78c4d3d8-1834-11dc-8314-0800200c9a66"

Typically for the first field you’ll want e.g. the latest date from your RSS feed, or the date on which a semi-static page was last edited. The second field is really up to you: if you never go back and edit posts without changing the published date then (a) well done you—there’s a space in heaven already reserved—and (b) you can just calculate ETag from the published date. It can actually be the published date if you’re a bit slack, although you might want to hash it instead, in the way that you might with passwords, to avoid any sloppy client software depending on ETags being dates.
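A minimal sketch of producing both values from a single "last changed" timestamp, assuming the published date really is the only thing that changes. The function name build_validators() is made up, and md5() is just one convenient way of hashing the date.

```php
<?php
// Build the two validator headers from a "last changed" Unix timestamp.
function build_validators($timestamp) {
  return array(
    // HTTP dates are always expressed in GMT.
    'Last-Modified' => gmdate('D, d M Y H:i:s', $timestamp) . ' GMT',
    // Hash the date so clients cannot rely on the ETag being a date;
    // the surrounding double quotes are part of the header value.
    'ETag' => '"' . md5((string) $timestamp) . '"',
  );
}

$validators = build_validators(gmmktime(18, 20, 54, 11, 15, 2006));

// To send them from a real page, call header() before any output:
// foreach ($validators as $name => $value) { header("$name: $value"); }
```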

What happens next? Well, you won’t see anything at first. But compliant software that’s visited your site before will start sending you two headers that correspond to your original submissions:

If-Modified-Since: Wed, 15 Nov 2006 18:20:54 GMT
If-None-Match: "78c4d3d8-1834-11dc-8314-0800200c9a66"

The idea is that you make sure that you can compare these to the values you’re about to send out, quite early on in your workflow. That way, if they match, you can immediately terminate all further work and just send a “304 Not Modified” HTTP status. The result? Well, with quite complex pages, involving the computation of tag clouds and term hierarchies and archive structures, you can work out quite early on whether it’s worth bothering, or whether the remote client will know exactly what to do if you just tell it, concisely, that nothing has changed since it last looked.

Word to the wise, though: if you are going to send a 304, you should also send your two headers as you always would. If you don’t then you only win out every other time, because the remote client will see the absence of ETag and Last-Modified headers and duly forget the ones it had in its cache.
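The comparison and the early exit might be sketched as follows. Both function names are made up, and the matching is deliberately simplistic: it ignores weak validators and lists of ETags, and it lets If-None-Match take precedence when both request headers are present.

```php
<?php
// Decide whether a 304 is appropriate. $request holds the incoming header
// values (in a real page these would come from
// $_SERVER['HTTP_IF_NONE_MATCH'] and $_SERVER['HTTP_IF_MODIFIED_SINCE']).
function is_not_modified($request, $etag, $last_modified) {
  if (isset($request['If-None-Match'])) {
    // When the client sends an ETag, it takes precedence over the date.
    return trim($request['If-None-Match']) === $etag;
  }
  if (isset($request['If-Modified-Since'])) {
    $since = strtotime($request['If-Modified-Since']);
    return $since !== FALSE && $since >= strtotime($last_modified);
  }
  return FALSE;
}

// Hypothetical early-exit helper: resend both validators along with the 304.
function send_not_modified($etag, $last_modified) {
  header('HTTP/1.1 304 Not Modified');
  header('ETag: ' . $etag);
  header('Last-Modified: ' . $last_modified);
}

$etag = '"78c4d3d8-1834-11dc-8314-0800200c9a66"';
$modified = 'Wed, 15 Nov 2006 18:20:54 GMT';
$hit = is_not_modified(array('If-None-Match' => $etag), $etag, $modified);
$miss = is_not_modified(array('If-None-Match' => '"stale"'), $etag, $modified);
```

In a real page you would call is_not_modified() before any expensive work, and on a hit call send_not_modified() and exit.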

Taking Drupal to pieces

Since listening to Garrett Coakley speak at the first Geek Night on the topic of Drupal, I’ve been sniffing round that open-source CMS. He kindly came to speak to us again, and very inspiring it was too. We’re now having a deeper look at it, seeing what it can do, what are its strengths and weaknesses; that sort of thing.

Drupal is certainly very interesting. Its notion of presentation is remarkable in that, at a certain level, all content consists of homogeneous nodes, whether that consists of uploaded files, images, blog posts, taxonomy categories, or embedded YouTube videos. In addition its API for templating, both as a library of functions and as a workflow that one can hook into, probably rivals WordPress in its scope and power. At the same time, though, the implicit homogeneity makes it hard to structure fundamentally heterogeneous sites; and the API hooks are very difficult to unravel: frequently you’ll want to get at a function some ten levels deep, and probably three of those levels can be overridden by your own code, but which, and how?

I want to mention more at a later date, to do Drupal justice, but suffice it to say for now that the complex hierarchy of the hook-in workflow is almost entirely opaque in PHP, a language that provides rather terse error reporting, without the function debug_print_backtrace(). Well worth a look if you’re debugging spaghetti-code, especially when all you can see is the White Screen of Death. Sprinkle it around as the gentle British rain from heaven: lightly, but often.
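For anyone who hasn’t met it, a small sketch of the function in action. The three nested functions are invented stand-ins for a deep chain of hooks.

```php
<?php
// Three nested calls standing in for a deep chain of hooks (names made up).
function innermost() {
  // Prints the chain of calls that led here, innermost first.
  debug_print_backtrace();
}

function middle_hook() {
  innermost();
}

function outer_hook() {
  middle_hook();
}

// Capture the output so it can be logged instead of printed straight away.
ob_start();
outer_hook();
$trace = ob_get_clean();
```

Capturing the output with ob_start() is optional, but it helps when the page itself never makes it to the browser.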

CMS wanted

I’m looking for a simple website management system (not necessarily a CMS, just something that can handle templates and a consistent look and feel) and an even simpler blogging system. The latter would have to be in PHP, but I’m easy either way otherwise.

Does anyone have any recommendations?

Now that's magic (quotes)

If your web application ensures that all your incoming CGI variables are free of the most common source of malicious site damage, can you stop worrying?

I wondered this as I got far enough into a PHP publishing system that I had to start thinking about adding new content through the system (rather than just jamming it into the database by hand, which is why the previous incarnation has sadly fallen into disuse). As it’s typically configured, PHP will add backslashes to anything it doesn’t trust: hence the comment “it’s a great site you’ve got here” will, when submitted by a POST request, become “it\’s a great site you\’ve got here”. Whether or not your server does this automatically can be checked by calling the function get_magic_quotes_gpc() (I realised only the other day that “gpc” stood for “GET, POST and cookies”: I probably have some catching up to do). In performing this blanket adding of slashes, PHP prevents the unwary coder from leaving his site open to both unintentional database hiccups and intentional malevolent attacks such as SQL injection.
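A small sketch of the effect, using addslashes(), which the PHP manual describes as doing the same escaping that magic quotes performs. The check itself is spelled get_magic_quotes_gpc() in the manual, and it has been removed from recent PHP versions, so the call is guarded here.

```php
<?php
// addslashes() applies the same escaping that magic quotes applied to
// incoming GET, POST and cookie data.
$submitted = "it's a great site you've got here";
$escaped = addslashes($submitted);

// get_magic_quotes_gpc() was removed in PHP 8, so guard the call.
$magic_quotes_on = function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc();
```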

All well and good, but my application is heavily object-oriented. Such objects store whatever content you give them, as well as optionally writing it to the database. If I want these objects to persist (even for the course of a single request) then any access to their internal storage must yield sensible data: those slashes have to disappear before the articles appear in an RSS feed, or on the website itself. So when the CGI environment gives slash-added content to an object, the object needs to know to both add it to the database verbatim and to produce it for viewing with the slashes removed. It can either do this by storing it in a slash-removed state or by placing a filter on its outputs.

There’s a further complication, in that content can also be written to an object by the PHP application itself: the publishing of all my unpublished articles, for example, would change the status of their accompanying objects without reference to any CGI variable. If I assumed all of this content had had its slashes escaped, then this article, for example, would lose all of its \’ text, because the object would assume they’d been added by PHP’s internals: in my second paragraph, the “after” string would look like the “before” string, and the “before” string would instead break the database insertion. In addition, what if the server is reconfigured? Can I trust my hosting company to never change the configuration of PHP, even accidentally during an upgrade?

I found myself lost in a maze of adding, removing and then adding slashes, with no clear way of deciding. Suddenly I decided: why not use one of PHP’s major downsides—that it doesn’t support persistence of objects from one request to the next very well, and hence each action is fighting against the overhead of constantly recreating and recompiling code—to ascertain which input/output processes were the most frequent (and most public) and hence needed to be the fastest? I drew a flowchart of a typical object’s behaviour and, by identifying which channels could be safely bottlenecked, arrived at a reasonable solution to the problem.

From my phrasing it’s clear that it was a foregone conclusion: I wanted, more than anything else, for content to flow straight from the database (through the object if applicable) to the user. This content needed to stay in any object in a simple, de-slashed form, so it could flow and flow as long as the object was in existence. That meant that incoming CGI content could not be stored with its added slashes intact. Counter-intuitively, then, my solution was to undo PHP’s default safety mechanisms, unescaping the CGI content and storing it raw, and then without fail adding slashes to anything that CGI or my application wanted to add to the database. This would be my bottleneck: everything else would be as fast as it could be.
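In outline, the arrangement looks something like the sketch below. ToyArticle is an invented class, not the code from my application, and addslashes() stands in for whatever escaping your database layer provides.

```php
<?php
// Hypothetical content object: holds raw text, escapes only on the way to SQL.
class ToyArticle {
  private $body;

  // $from_cgi and $magic_quotes_on say whether PHP has already added slashes.
  public function __construct($body, $from_cgi, $magic_quotes_on) {
    $this->body = ($from_cgi && $magic_quotes_on) ? stripslashes($body) : $body;
  }

  // Fast path: output needs no filtering because storage is already raw.
  public function body() {
    return $this->body;
  }

  // The single bottleneck: everything bound for the database is escaped here.
  // addslashes() stands in for whatever escaping the database layer provides.
  public function sqlValue() {
    return "'" . addslashes($this->body) . "'";
  }
}

// Arrived from a form with slashes already added by PHP.
$from_form = new ToyArticle("it\\'s here", TRUE, TRUE);
// Set directly by the application, so no slashes were ever added.
$from_app = new ToyArticle("it's here", FALSE, TRUE);
```

Both objects end up holding identical raw text, so reads never need to know where the content came from, and only sqlValue() pays the escaping cost.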

Exit gracefully: ensuring all incoming content can be added to the database safely is not necessarily the most efficient or desirable long-term solution. By examining the likely workflows for content, it’s possible to make pragmatic decisions on where content should be pre-processed and where it should be left alone. Consider all your overheads, including that of short-term programming and long-term cumulative processing time: this will vary depending on your environment. Also, if you’re aware of a safety net, over the presence of which you have minimal control, account for the possibility that someone might one day remove it.
