Feeds objects within feeds objects

We've been doing a lot of work with the Drupal Feeds module recently. The frontend is nice enough, although the sub-navigation was rendered almost illegible by our theme's CSS. The online tutorials need work, and the admin navigation needs to be made a bit more robust to layout changes; but then it will be the de facto way for people to consume feeds on their Drupal sites.

The most recent work we've been doing involved custom integration with RSS feeds arriving effectively as PHP string variables containing all the XML. This is different from either a file on disk or a remote URL: in fact, we had a Python program creating the RSS file from us via a shell (which in turn, horribly, was hitting a remote Oracle database using cx_Oracle). Feeds was definitely up to the job in terms of power. In fact, it was quite a toolkit of useful functionality, which is Drupal code for "incredibly powerful but almost incomprehensible.

It's not that the developer documentation for Feeds isn't decent: it's pretty good. But it's limited in scope: it tells you roughly how to expose your own Feeds-like objects to the admin interface, but not really how all those objects interact. Most importantly, we wanted to know what happened on a cron run: this is the bedrock of how Feeds works on your site, after all.

I poked around a bit and this is what I discovered:

 

Workflow of a Feeds cron run

Here's a summary of the above diagram to give you some idea of what's going on.

 

  1. Drupal's cron creates a FeedsScheduler object and passes it a "job", which is all the configuration for a feed call, including any configuration that was attached originally to the particular node which defines the Feed. The scheduler creates a FeedsImporter and passes it the job; the importer then creates a FeedsSource and embeds itself in it as a parent. In each case, the method ::work() is called to create the child/helper object.
  2. The Source object is what now runs the three phases of feed consumption, via its parent Importer. The Source asks the Importer for the relevant Fetcher, Parser and Processor objects: for example, the HTTP Fetcher, the RSS Parser and the Node Processor objects are strung together to turn an RSS feed at a HTTP URL into a set of nodes, one per entry. Each of these have a relevant, verb-like named method: so ::fetch() for the Fetcher etc. The common currency is a FeedsBatch object, which gets passed around and needs to have methods that make it feel like a batch of feed objects.
  3. After the three phases have run, the Source calls hook_feeds_after_import() to do any tidying, then quits to the Importer, which quits to the Scheduler, which then runs its ::finished() method on the job, and the cron run for this particular feed is done.

 

When you build a new plugin, you need to implement hook_feeds_plugins() in a module and reference a class file: this class will be selectable in the admin interface for one of the three consumption phases, depending on what class it's ultimately based on. You should therefore extend existing classes rather than start from scratch: there are abstract PHP classes in the feeds module directories, which give you skeleton "interfaces" which you can then flesh out with relevant functions. But what's better is to extend e.g. the HTTP fetcher to fetch from a command on disk (which is what we did) or, say, extend the CSV parser to interrogate JSON.

Class hierarchies mean you don't have to spend a lot of time reinventing the wheel or hacking existing modules until they become unupgradeable; instead you can take existing classes and tweak them through inheritance, experimenting as you develop.