Feeds objects within feeds objects

The Drupal Feeds module consists of layers of objects, tunnelling between each other, like a pearl onion on a cocktail stick

We've been doing a lot of work with the Drupal Feeds module recently. The frontend is nice enough, although the sub-navigation was rendered almost illegible by our theme's CSS. The online tutorials need work, and the admin navigation needs to be made a bit more robust to layout changes; but then it will be the de facto way for people to consume feeds on their Drupal sites.

The most recent work we've been doing involved custom integration with RSS feeds arriving effectively as PHP string variables containing all the XML. This is different from either a file on disk or a remote URL: in fact, we had a Python program creating the RSS file for us via a shell (which in turn, horribly, was hitting a remote Oracle database using cx_Oracle). Feeds was definitely up to the job in terms of power. In fact, it was quite a toolkit of useful functionality, which is Drupal code for "incredibly powerful but almost incomprehensible".

It's not that the developer documentation for Feeds isn't decent: it's pretty good. But it's limited in scope: it tells you roughly how to expose your own Feeds-like objects to the admin interface, but not really how all those objects interact. Most importantly, we wanted to know what happened on a cron run: this is the bedrock of how Feeds works on your site, after all.

I poked around a bit and this is what I discovered:

 

Workflow of a Feeds cron run

Here's a summary of the above diagram to give you some idea of what's going on.

 

  1. Drupal's cron creates a FeedsScheduler object and passes it a "job", which is all the configuration for a feed call, including any configuration that was attached originally to the particular node which defines the Feed. The scheduler creates a FeedsImporter and passes it the job; the importer then creates a FeedsSource and embeds itself in it as a parent. In each case, the method ::work() is called to create the child/helper object.
  2. The Source object is what now runs the three phases of feed consumption, via its parent Importer. The Source asks the Importer for the relevant Fetcher, Parser and Processor objects: for example, the HTTP Fetcher, the RSS Parser and the Node Processor objects are strung together to turn an RSS feed at an HTTP URL into a set of nodes, one per entry. Each of these has a suitably verb-like method name: ::fetch() for the Fetcher, and so on. The common currency is a FeedsBatch object, which gets passed from phase to phase and needs to have methods that make it feel like a batch of feed items.
  3. After the three phases have run, the Source calls hook_feeds_after_import() to do any tidying, then quits to the Importer, which quits to the Scheduler, which then runs its ::finished() method on the job, and the cron run for this particular feed is done.
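
To make the shape of step 2 a bit more concrete, here's a toy model of the three phases, written in JavaScript rather than PHP. Everything in it (ToyBatch, ToySource, toyImporter, the run() method and the newline-separated "feed") is a made-up stand-in for illustration, not the Feeds module's own classes or method signatures.

```javascript
// Toy model only: hypothetical JavaScript stand-ins, not the Feeds module's PHP classes.
var log = [];

// The "batch" is the common currency handed from phase to phase.
function ToyBatch(raw) {
    this.raw = raw;     // what the fetcher retrieved
    this.items = [];    // what the parser extracted
}

// The importer owns one object per consumption phase.
var toyImporter = {
    fetcher: {
        fetch: function (source) {
            log.push('fetch');
            return new ToyBatch(source.config.payload);
        }
    },
    parser: {
        parse: function (batch) {
            log.push('parse');
            batch.items = batch.raw.split('\n');
        }
    },
    processor: {
        process: function (batch) {
            log.push('process');
            return batch.items.length;   // pretend one node is made per item
        }
    }
};

// The source runs the three phases via its parent importer.
function ToySource(importer, config) {
    this.importer = importer;
    this.config = config;
}
ToySource.prototype.run = function () {
    var batch = this.importer.fetcher.fetch(this);
    this.importer.parser.parse(batch);
    var created = this.importer.processor.process(batch);
    log.push('after import');   // roughly where hook_feeds_after_import() would fire
    return created;
};

var created = new ToySource(toyImporter, { payload: 'first entry\nsecond entry' }).run();
```

The point is only that the source never needs to know which fetcher, parser or processor it has been given: it asks the importer and passes the batch along.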

 

When you build a new plugin, you need to implement hook_feeds_plugins() in a module and reference a class file: this class will be selectable in the admin interface for one of the three consumption phases, depending on what class it's ultimately based on. You should therefore extend existing classes rather than start from scratch: there are abstract PHP classes in the feeds module directories, which give you skeleton "interfaces" which you can then flesh out with relevant functions. But what's better is to extend e.g. the HTTP fetcher to fetch from a command on disk (which is what we did) or, say, extend the CSV parser to interrogate JSON.

Class hierarchies mean you don't have to spend a lot of time reinventing the wheel or hacking existing modules until they become unupgradeable; instead you can take existing classes and tweak them through inheritance, experimenting as you develop.
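
As a rough illustration of tweaking through inheritance, here's the same idea in JavaScript. BaseFetcher, StringFetcher and getRaw() are hypothetical names invented for this sketch; the real Feeds plugins are PHP classes with their own methods.

```javascript
// Hypothetical classes for illustration; not the Feeds module's own code.
function BaseFetcher() {}

// Subclasses are expected to replace this.
BaseFetcher.prototype.getRaw = function (source) {
    throw new Error('getRaw() must be provided by a subclass');
};

// Shared behaviour lives in the parent: trim whatever was retrieved.
BaseFetcher.prototype.fetch = function (source) {
    return this.getRaw(source).replace(/^\s+|\s+$/g, '');
};

function StringFetcher() {
    BaseFetcher.call(this);
}
StringFetcher.prototype = Object.create(BaseFetcher.prototype);
StringFetcher.prototype.constructor = StringFetcher;

// Only the retrieval step is overridden; fetch() is inherited untouched.
StringFetcher.prototype.getRaw = function (source) {
    return source.xml;
};

var fetched = new StringFetcher().fetch({ xml: '  <rss></rss>\n' });
```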

How to write a Javascript file

Now I know the title sounds presumptuous, but there’s a certain methodology I’ve settled into that seems to work really well for encouraging Javascript that’s legible and safe. I thought I’d share it with anyone that doesn’t consider themselves a JS playa, in case it’s of some use to you too.

Most Javascript libraries these days are written in a similar way, so it seems to be de facto recognised best practice, but it’s worth showing the anatomy of the simple case so you can build on it rather than having to work out what’s going on from an enormous, somewhat crufty sprawl.

/*
    @description Javascript template
    @createdBy JPS
    @createdOn 2006-10-03
    @notes Standard template
*/
    
var MyObjectWithHandyName = {
    
    // Properties
    p: {
        // HTML IDs
        i: { },
    
        // HTML classes
        c: { },
    
        // Something else we might need to reference
        sthg: {}
    
        // Be wary of accidental trailing commas here, as it’s the end of
        // the object literal and IE doesn’t like a comma at that position
    },
    
    // Methods here - use hierarchy if large object
    DOM: {},
    eyeCandy: { dropdowns: {}, errors: {} },
    httpReq: {},
    
    // Window onload method - instantiates everything
    go: function() {
        alert('OK!');
    }
    
};
    
// Now add onload handler to do anything your object needs to do when page loads
    
// Prototype
// Event.observe(window, 'load', MyObjectWithHandyName.go, false);
    
// Mochikit & others
// addLoadEvent(MyObjectWithHandyName.go);
    
// No library?
window.onload = MyObjectWithHandyName.go;

What’re the advantages of the above? Well, first of all, it just formalizes what you’ve already decided to do: that is, to encapsulate all the functionality to do with a certain something in one file. This just puts it all in one object, which you could call DHTML, or iFoo, or GoogleHack, or MyApp. It prevents collisions with standard Javascript functions, library functions you might include etc. Also, if in future you want to know if a function has been defined on a page, but from a different Javascript file, it’s sufficient for a smallish project to check the top-level object exists.
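
Checking for the top-level object from another file might look something like the sketch below. MyApp and SomeOtherApp are hypothetical names; the only thing being shown is that typeof is safe to use on a name that was never declared.

```javascript
// MyApp is a hypothetical top-level object, built like the template above.
var MyApp = {
    go: function () { return 'started'; }
};

// typeof does not throw on an undeclared name, so this check is safe
// even if the file that defines the object never loaded.
var hasMyApp = typeof MyApp !== 'undefined';
var hasOtherApp = typeof SomeOtherApp !== 'undefined';

// Only call into the object once we know it is there.
var result = hasMyApp ? MyApp.go() : 'MyApp not loaded';
```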

Secondly, the system is very extensible, and tidy with it. If configuration variables go in the hierarchical p(roperties) block at the top of the file, then you can re-use your code by, say, including a second Javascript file on certain pages, that rewrites this configuration. You can even change methods like this, if you know where they’re going to be, in a safe, extensible way. The hierarchy of the whole object means you can nest methods as far down as you want: then, if you find yourself repeating much of the hierarchy, you can use the with(object) control structure to tidy your code:

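foo: { bar: { quux: { a: function() {…}, b: function() {…} } } },
    
blort: function() {
    // with() needs an object, so reach foo via the top-level object
    with(MyObjectWithHandyName.foo.bar.quux) {
        a();
        b();
        a(b(a()));
    }
    // outside the with block, the full path is needed again
    MyObjectWithHandyName.foo.bar.quux.a();
}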
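
The re-use by rewriting configuration mentioned above could be sketched like this, assuming two separate files included in order. The ID, class name and errorClass() method are all invented for the example.

```javascript
// First file: the object with its default configuration.
var MyObjectWithHandyName = {
    p: {
        i: { form: 'signup' },           // hypothetical HTML ID
        c: { error: 'error' }            // hypothetical HTML class
    },
    errorClass: function () {
        return MyObjectWithHandyName.p.c.error;
    }
};

// Second file, included only on certain pages: rewrite one setting...
MyObjectWithHandyName.p.c.error = 'error-highlight';

// ...or wrap a method, as long as you know where it lives.
var originalErrorClass = MyObjectWithHandyName.errorClass;
MyObjectWithHandyName.errorClass = function () {
    return originalErrorClass() + ' verbose';
};

var cls = MyObjectWithHandyName.errorClass();
```

Because the methods read their settings from p at call time rather than copying them, the second file's change is picked up without touching the first file.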

Thirdly, it’s easy to maintain. Encapsulation and a certain predictability, and the encouragement to make methods small and put them somewhere that makes sense rather than build e.g. sprawling validation methods that, oh, do a bit of browser sniffing as well, and a bit of alert() calling… this definitely forces me to be careful in what I write, and that puts me in a good position to fix things later on.

I can’t say I’ve done any serious testing, but this way of building functionality seems far more robust, and exits more gracefully (on most decent browsers), than other paradigms for Javascript design. It’s possible this is how the whole Javascript community is now coding and I’m teaching my grannies to suck eggs: certainly it’s not how you’d code given ten minutes on Google, so it probably bears repeating.

A few caveats, of course, because there is no silver bullet:

  • All function definitions go in MyObjectWithHandyName: nothing outside that apart from the onload to do any actual function calls.
  • Any text used more than once (URLs, HTML classes, alert text, replacement text etc.) goes in p at the top of the script.
  • Any event handlers should be wary of what they get as their first parameter, and what the this object refers to: depending on how they get called, that might change from event to element to their hierarchical container in MyObjectWithHandyName.
  • Trailing commas: at the end of an object literal, don’t leave a trailing comma as Firefox will quietly ignore it but IE will give one of its typically opaque syntax errors.
  • Similarly, don’t omit commas between elements—{ a: {} b: {} } is wrong—as all browsers will die. Easily done, if you’re writing a new method and you forget the go() already exists.
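
The caveat about this is the one that bites most often, so here's a small sketch of it. The whoAmI() and label() methods and the element-like { id: 'button' } object are made up; call() is just standing in for whatever a library does when it invokes your handler.

```javascript
var MyObjectWithHandyName = {
    p: { label: 'handy' },               // hypothetical property
    eyeCandy: {
        // Relies on `this`, so the answer depends on how it is called.
        whoAmI: function () {
            return this === MyObjectWithHandyName.eyeCandy ? 'container' : 'something else';
        },
        // Refers to the top-level object by name, so the call site doesn't matter.
        label: function () {
            return MyObjectWithHandyName.p.label;
        }
    }
};

// Called through its container: `this` is eyeCandy.
var asMethod = MyObjectWithHandyName.eyeCandy.whoAmI();

// Called the way a library might, with an element-like object as `this`.
var detached = MyObjectWithHandyName.eyeCandy.whoAmI;
var asHandler = detached.call({ id: 'button' });

// The name-based version gives the same answer either way.
var safeLabel = MyObjectWithHandyName.eyeCandy.label.call({ id: 'button' });
```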

Anyway, give it a go and see what you think.

Now that's magic (quotes)

If your web application ensures that all your incoming CGI variables are free of the most common source of malicious site damage, can you stop worrying?

I wondered this as I got far enough into a PHP publishing system that I had to start thinking about adding new content through the system (rather than just jamming it into the database by hand, which is why the previous incarnation has sadly fallen into disuse). As it’s typically configured, PHP will add backslashes to anything it doesn’t trust: hence the comment “it’s a great site you’ve got here” will, when submitted by a POST request, become “it\’s a great site you\’ve got here”. Whether or not your server does this automatically can be checked by calling the function get_magic_quotes_gpc() (I realised only the other day that “gpc” stood for “GET, POST and cookies”: I probably have some catching up to do). In performing this blanket adding of slashes, PHP prevents the unwary coder from leaving his site open to both unintentional database hiccups and intentional malevolent attacks such as SQL injection.

All well and good, but my application is heavily object-oriented. Such objects store whatever content you give them, as well as optionally writing it to the database. If I want these objects to persist (even for the course of a single request) then any access to their internal storage must yield sensible data: those slashes have to disappear before the articles appear in an RSS feed, or on the website itself. So when the CGI environment gives slash-added content to an object, the object needs to know to both add it to the database verbatim and to produce it for viewing with the slashes removed. It can either do this by storing it in a slash-removed state or by placing a filter on its outputs.

There’s a further complication, in that content can also be written to an object by the PHP application itself: the publishing of all my unpublished articles, for example, would change the status of their accompanying objects without reference to any CGI variable. If I assumed all of this content had had its slashes added, then this article, for example, would lose all of its \’ text, because the object would assume the backslashes had been added by PHP’s internals: in my second paragraph, the “after” string would look like the “before” string, and the “before” string would instead break the database insertion. In addition, what if the server is reconfigured? Can I trust my hosting company to never change the configuration of PHP, even accidentally during an upgrade?

I found myself lost in a maze of adding, removing and then adding slashes, with no clear way of deciding. Suddenly I decided: why not use one of PHP’s major downsides—that it doesn’t support persistence of objects from one request to the next very well, and hence each action is fighting against the overhead of constantly recreating and recompiling code—to ascertain which input/output processes were the most frequent (and most public) and hence needed to be the fastest? I drew a flowchart of a typical object’s behaviour and, by identifying which channels could be safely bottlenecked, arrived at a reasonable solution to the problem.

From my phrasing it’s clear that it was a foregone conclusion: I wanted, more than anything else, for content to flow straight from the database (through the object if applicable) to the user. This content needed to stay in any object in a simple, de-slashed form, so it could flow and flow as long as the object was in existence. That meant that incoming CGI content could not be stored with its added slashes intact. Counter-intuitively, then, my solution was to undo PHP’s default safety mechanisms, unescaping the CGI content and storing it raw, and then without fail adding slashes to anything that CGI or my application wanted to add to the database. This would be my bottleneck: everything else would be as fast as it could be.
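
Here's a minimal sketch of that flow, in JavaScript rather than PHP so it matches the other examples on this page. addSlashes() and stripSlashes() are simplified stand-ins for PHP's addslashes() and stripslashes(), and the Article object with its fromRequest(), forDisplay() and forDatabase() methods is hypothetical, not my actual publishing code.

```javascript
// Simplified stand-ins for PHP's addslashes() and stripslashes().
function addSlashes(s) {
    return s.replace(/[\\'"]/g, '\\$&');   // backslash-escape \, ' and "
}
function stripSlashes(s) {
    return s.replace(/\\(.)/g, '$1');      // drop one level of escaping
}

// Hypothetical content object: holds raw text, escapes only on the way to the database.
function Article(text) {
    this.text = text;                      // always held de-slashed
}

// Incoming CGI content: undo the automatic escaping, but only if it happened.
Article.fromRequest = function (value, magicQuotesOn) {
    return new Article(magicQuotesOn ? stripSlashes(value) : value);
};

// Fast path: content flows straight out with no filtering.
Article.prototype.forDisplay = function () {
    return this.text;
};

// The single bottleneck: everything bound for the database is escaped here.
Article.prototype.forDatabase = function () {
    return addSlashes(this.text);
};

var fromCgi = Article.fromRequest("it\\'s a great site", true);
var fromApp = new Article("it's a great site");
```

Whether the text arrived from a request or from the application itself, both objects hold the same raw string and produce the same escaped string for the database.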

Exit gracefully: ensuring all incoming content can be added to the database safely is not necessarily the most efficient or desirable long-term solution. By examining the likely workflows for content, it’s possible to make pragmatic decisions on where content should be pre-processed and where it should be left alone. Consider all your overheads, including that of short-term programming and long-term cumulative processing time: this will vary depending on your environment. Also, if you’re aware of a safety net, over the presence of which you have minimal control, account for the possibility that someone might one day remove it.
