framework

User loading and saving in Drupal 6.x

Nearly a year ago I broke down user_load() and user_save() in Drupal 5. I had to put together workflows for a number of jobs, specifically integrating the creation, instantiation and updating of users with an external system. Fast forward nearly twelve months, and we have to do it all over again for D6, for different work. So here's a PDF of user_load() and user_save() in Drupal 5 and 6.

flowcharts of user_load and user_save

The flowcharts have been especially useful in coding in the most Drupalish way possible. Drupal core (and well-behaved modules) is built with a hook-based architecture. That means that before and/or after important events, Drupal calls every function that follows a particular naming convention: in effect, every module function that implements the relevant hook. Your code can therefore tag along with Drupal's powerful core, which makes hooks essential to developing modules efficiently.
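As a rough illustration of that naming convention, here is a minimal, self-contained sketch of dispatching a hook by function name. It is not Drupal's own implementation (Drupal's real equivalent is module_invoke_all()): the dispatcher demo_invoke_all(), the module names alpha, beta and gamma, and the "greet" hook are all hypothetical, invented for this example.

```php
<?php
// Hypothetical list of enabled modules; Drupal keeps its own list elsewhere.
$demo_modules = array('alpha', 'beta', 'gamma');

// Module "alpha" implements the hypothetical "greet" hook by naming convention.
function alpha_greet($name) {
  return array("alpha says hello to $name");
}

// Module "beta" implements it too. Module "gamma" does not, so it is skipped.
function beta_greet($name) {
  return array("beta says hello to $name");
}

// Toy dispatcher: call MODULE_HOOK() for every module that defines it,
// and merge the returned arrays into a single result.
function demo_invoke_all(array $modules, $hook) {
  // Anything after the first two arguments is passed on to the hooks.
  $args = array_slice(func_get_args(), 2);
  $results = array();
  foreach ($modules as $module) {
    $function = $module . '_' . $hook;
    if (function_exists($function)) {
      $result = call_user_func_array($function, $args);
      if (is_array($result)) {
        $results = array_merge($results, $result);
      }
    }
  }
  return $results;
}

$greetings = demo_invoke_all($demo_modules, 'greet', 'Bob');
print implode("\n", $greetings) . "\n";
```

The point of the sketch is that "gamma" needs no registration step to opt out, and "alpha" and "beta" need none to opt in: defining a function with the right name is the whole contract.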

What's changed between Drupals 5 and 6? Not much, to be honest:

  • Loading now tries to grab an object, rather than checking if an ID has been returned by the database first
  • Updating clears the sessions for newly-blocked users, effectively kicking them out; it also sends notification emails through _user_mail_notify
  • Creating doesn't grab a new ID for the user, pre-creation, owing to D6's better database abstractions

For your convenience and mine, all six workflows are now in the same PDF. That makes it easier to compare 5 and 6 side by side, but it also makes clear some of the very minor errors I made in the original Drupal 5 diagrams. Well, best let them stand, for transparency's sake. And besides: if a man's errors are his portals of discovery, you'd be lucky to fit the chipmunk of serendipity through these.

The problem of many types of content

All snowflakes are unique, but some are more unique than others

David Yelvington mentioned back in December 2008 that his Drupal site had over 30 content types:

Why on earth so many content types? It's easy to see good reasons for news items to be structurally more complex than a simple blog post. But we also have some types of content you probably wouldn't think about at first. Wire stories are an interesting case.... Promos are another.... Other content includes special types for various video players, feeds from other technology and content partners, items aggregated from websites in the community, podcasts, cartoons, Soundslides shows, and Tweets.... Drupal also creates content types for internal purposes, such as representing user groups, webforms, etc.

To be honest, thirty seems about right for a medium-sized site, so my first reaction wasn't one of surprise at all. David's explained briefly what a content type is in his own post, so I'll not repeat that here. Safe to say that what makes a job advert on a site distinguishable---and capable of holding different information---from a publication or organizational event is that each is derived from a different content type. Some CMSes call these "templates", as does Torchbox's proprietary alternative.

Really, really big sites (especially multi-domain sites, sites with organic groups, sites with microsites, sites with just a lot of transitory projects going on) need a lot of types of content. Even a medium-sized site can easily have two dozen content types, just to satisfy things like news, PR, departmentalization, events, products, staff, directories.... The most recent Drupal builds I've been involved in have all had over twenty content types, and Torchbox's biggest non-Drupal site has literally hundreds of "templates", serving many tens of subdomains and several years of legacy content for many different projects.

What Drupal's content-type maintenance (especially in Drupal 5) highlights is that implicit in Drupal's interface design has been a small number of content types: or at any rate sufficiently few to be listed without much scrolling of your browser window. I'd hate to see an unfortunate user maintaining a Drupal site containing two hundred content types. Yet again (yawn), we're working on a module---currently only for Drupal 5---to categorize content types. It currently only improves the tabular content-type admin page by grouping similar content types into separate tables, but the hooks should then be available in order to have more segmented views of content types.

You can never predict where usability (as opposed to programmatic) scalability will bite you. At least Drupal's modular nature---rich in overrides and workflow interrupts---more often than not allows you to solve some of these problems in hindsight.

Inline edit links, but not editing inline

Squaring the circle of simple CMS usability with complex content representations, with a neat low-footprint Drupal module

It's heartwarming, really encouraging to see that Drupal 7 is undergoing a usability review. Drupal's a massively functional CMS, but all the functionality in the world won't help you when the average (for which read: can't write HTML, let alone PHP) CMS user can't discover it. There's a common misconception that usability is the finishing touches you add to an application if you've got time, the icing on the cake; but if your application lays any claim to maturity then its usability is the cake, and all that functionality you were so proud of is, without usability, just eggs and flour.

One of the main usability improvements suggested by the usability team---and largely shouted down by the technical team---is the ability to edit inline on the page: that is, to log in as an admin, then have any bit of the page "active", so that if you click on it then it becomes an edit box with the text inside. Flickr does this especially well, letting you edit title and description on photo pages and lists of photos by just clicking on the apparently uneditable text. But Flickr has the advantage that there's very little form on top of its content: it's a delivery mechanism for the raw metadata about photos, and the photo itself.

The other end of the spectrum---which complex CMS sites have every right to sit on---is a rich and complicated mapping between the storage of a node's content in the database and the eventual display of it in the browser. To take a page from a recent Torchbox project at random: how would you expect areas of this page from the Joseph Rowntree Foundation's website to behave when you clicked on them? If you have to hardcode print statements in your PHP templates, what do you print? How do you get editing inline to work? What happens when content is brought in from other, related nodes, and mixed in with the other content before display?

I can appreciate both sides to this story of user experience versus technical practicality, although it's not sufficient to expect the usability team to discard the idea merely because there's no correspondence between page content and database content: that's only an argument for why Drupal doesn't currently have edit-on-page. The usability project is moving forwards rapidly, and while there's clearly a tension between usability for the CMS user and feasible technical limitations---usability for the developer, if you like---it will need to be resolved soon for this marvellous work, and a great opportunity, not to end up wasted. And resolving that conflict will involve some sort of compromise, for both sides.

One possible compromise would be to offer edit links, when Drupal can spot a sort-of 1-to-1 correspondence between a fragment of page content and the node that supports it. Page templates and views---specifically hook_preprocess_node and hook_views_pre_render---know full well that what they're processing is a node. And they generally know what field the node title will be in. So let Drupal rewrite the title, to add an "edit inline" link. If anyone clicks on this link, then pop the node-edit form up in a lightbox for editing.

Here are some screenshots of what I've been working on, in an attempt to get people interested. Firstly, here's what the anonymous site visitor sees:

Homepage for an anonymous site visitor

Next, here's what happens when a user has just logged in. Note that the brilliant Admin menu module kicks in, giving the user a black navigation bar across the top. But, more pertinently, each node title also now has an "[edit inline]" link beside it:

Homepage for a logged-in admin user

If the logged-in user clicks on one of these new links, then our edit-inline module kicks in and, using the equally brilliant Drupal Thickbox wrapper module, provides a stripped-down version of the node-edit page in a Thickbox overlay, both speeding up node editing using AJAX calls and also letting the user cancel the node-edit procedure and return to the webpage they were on quickly:

Effect of clicking on an 'edit inline' link

To reiterate, you don't have to be on a node's page to edit it. All that matters is that the title of the node you want to edit passes through one of the supported pre-render hooks. Currently, clicking on save/preview/cancel takes you elsewhere rather than being trapped within the Thickbox, and we're also wrestling with getting CSS and Javascript into the Thickbox overlay to support the nattier bits of node editing, but it's functional and, I hope, gives you some idea of how it would all work given a few more hours of bashing away at keyboards.

Anyway, there it is. A possible compromise. I've mentioned it in a comment on the d7ux blog but I fear I might have been eaten by a spamtrap. If anyone's interested in the project then email me, jp.stacey, either at gmail.com or torchbox.com, and say hello.

How to not cache a particular Drupal page

Sometimes you don't want every random visitor seeing the same thing on your cached site.

Edit: if you know reasonably in advance e.g. at the start of a given page request that the page is never going to be cached, there is a better way (thanks to Stack Overflow!)

In a recent Drupal project we turned on standard caching to help site performance. With this in place, however, we found that certain visitor-sensitive details might be revealed. For example, if a submission via the webform module contains an email address, and this is included somehow in the acknowledgement page (through custom code), then this custom page can be guessed for other users. The cause is an interaction between webform and some fairly understandable custom module code. Webform's confirmation page is a GET URL of the form:

http://example.com/form_page/done?sid=1418

where sid is the submission ID: the unique identifier of the data. This is fine with out-of-the-box webform, which just gives all of your site visitors the same confirmation message. But if the message is personalized based on the submission, e.g. to say "Thanks, Bob! We've sent an acknowledgement of your gift of a cheese pastie to your email address, which is..." then we're in trouble. The cache is set to slurp up the response to any HTTP GET request, which while it doesn't affect forms does include confirmation pages, however personalized.

As a matter of course, we firstly made the confirmation page customization contingent on a $_SESSION variable, which was set when the form was processed, and unset when the confirmation page was viewed: without the variable, the page would not be customized. In the uncached situation, this solved the problem of discovery; however, the cache would just serve up the cached version regardless, as it just grabs the raw HTML from the database, never touching the code which checks $_SESSION.

One option would have been to change the webform's confirmation URL to have a random second parameter, e.g.:

http://example.com/form_page/done?sid=1418&random=0d1803d0-fdf7-11dd-87a...

This would still put an entry in the cache, but it becomes hard to stumble across by trial and error! However, while OK in practice, this felt like a bit of a hack: fundamentally, it's safest not to have any cache entry. With this in mind, we took a different route.

In Drupal 6, hook_exit() is called across all modules, immediately before the end of a page request. This happens both in the absence of caching and the presence of standard caching, in drupal_page_footer() and _drupal_bootstrap() respectively. The order of execution is, with some omissions:

  • drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL)
    • Cached version?
      • Set headers and send cache contents to browser
      • module_invoke_all('exit'): execute the exit hooks
      • quit!
    • Otherwise, carry on with page execution
  • ... page execution...
  • drupal_page_footer()
    • Cacheing on?
      • page_set_cache(): cache this page
    • Regardless, module_invoke_all('exit'): execute the exit hooks

So either the page is served from the cache, or it's created by code execution and then cached. That means that we can build a module which uses hook_exit to clear the page that's just been cached every time the code is executed. The page content therefore never stays in the cache, i.e. it is always dynamic, and the $_SESSION trick ensures security of submissions.

We use cache_clear_all to clear the page out. If we inspect page_set_cache, we can see the page-specific key under which it stores the cached content using cache_set. We can therefore clear out just the entry under this key, for this page, leaving the rest of the cache intact. Here's some sample code that accomplishes this.

/**
 * Implementation of hook_nodeapi
 */
function mymodule_nodeapi(&$node, $op) {
  if ($node->type == "webform" && ($op == "load") && (arg(2) == "done")) {
    // If sid doesn't match session, then quit unless the current user has admin access
    if ( ($_GET['sid'] != $_SESSION['submission']['sid']) && !user_access("access webform results")) {
      $node->webform["confirmation"] = "Thank you for your submission.";
      return;
    }

    // Flag the submission for deletion in hook_exit
    $_SESSION['submission']['#delete'] = TRUE;

    /* . . . thankyou customization code . . .*/
  }
}

/**
 * Implementation of hook_exit
 */
function mymodule_exit() {
  global $base_root;

  // Have we just processed a submission? (!empty avoids a notice when the flag was never set)
  if (!empty($_SESSION['submission']['#delete'])) {
    // Firstly remove the submission entirely from the session, just in case
    unset($_SESSION['submission']);
    // Then clear the cache for this page
    cache_clear_all($base_root . request_uri(), 'cache_page');
  }
}
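To see why this keeps the confirmation page out of the cache while leaving other pages alone, here is a toy model of the request flow listed above. It is only a sketch under loud assumptions: the ToySite class and everything in it are hypothetical stand-ins (an array for the cache_page table, a private flag for the $_SESSION marker), not Drupal code.

```php
<?php
// Toy model of standard page caching plus a hook_exit() eviction.
// Every name here is hypothetical; none of it is Drupal's real code.
class ToySite {
  public $cache = array();   // stands in for the cache_page table: URL => HTML
  public $builds = 0;        // how many times page code actually ran
  private $evict = FALSE;    // stands in for $_SESSION['submission']['#delete']

  // Stands in for full page execution, including our hook_nodeapi().
  private function build($url) {
    $this->builds++;
    if (strpos($url, '/done?sid=') !== FALSE) {
      $this->evict = TRUE;   // personalized confirmation page: flag for eviction
    }
    return "page for $url (build {$this->builds})";
  }

  // Stands in for our hook_exit(): evict the entry that was just cached.
  private function exitHook($url) {
    if ($this->evict) {
      $this->evict = FALSE;
      unset($this->cache[$url]);
    }
  }

  public function request($url) {
    if (isset($this->cache[$url])) {
      $html = $this->cache[$url];   // cached version: no page code runs
    }
    else {
      $html = $this->build($url);
      $this->cache[$url] = $html;   // page_set_cache() equivalent
    }
    $this->exitHook($url);          // exit hooks run on both paths
    return $html;
  }
}

$site = new ToySite();
$site->request('/about');
$site->request('/about');                     // second hit comes from the cache
$site->request('/form_page/done?sid=1418');
$site->request('/form_page/done?sid=1418');   // rebuilt: the entry was evicted
print $site->builds . "\n";                   // one build for /about, two for the confirmation page
```

In the model, the ordinary page is built once and then served from the cache, while the confirmation page is rebuilt on every request because its entry is removed immediately after being written.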

Note on aggressive caching

Later on, owing to performance issues, you might want to increase caching from standard to aggressive on your site. At that point, the site will warn you that your new module is "incompatible with aggressive mode caching and might not function properly." I think this happens purely because of the presence of hook_exit(): it's a warning because, in drupal_bootstrap(), aggressive cacheing exits before the _exit hooks are actioned. But this only happens if Drupal finds a cached version. It isn't omitted on the first visit to a given URL. So when webform creates a particular visitor's thankyou page, aggressive cacheing can't find a cache item for that sid: so it still executes the page, puts it in the cache, and executes hook_exit() to clear the page out of the cache! Result: hook_exit() is still called on pages which need it, even with aggressive caching switched on. Note: unlike Knuth, not only have I not tested this yet, but I've barely proven it works, in an aggressively cached Drupal install. Use it in that environment at your peril.

Google now lets you pay for Google App Engine

But can you buy the bits you want to deploy a Django application?

Google are introducing paid-for extensions to Google App Engine quotas, which is great as it lets you build more complex applications if you're willing to pay the rates. At the same time they're reducing the baseline free quotas. That's a shame, but only to be expected in a recession: at least there's still wiggle room there for the casual developer to play with the service.

As far as I can see there's still no SLA in the Terms and Conditions (11.3 seems to be fairly clear on this), which is a thorny issue, not just for our clients. There's also no option I can see to increase your file number quota (as opposed to increasing the quota for total disk space) which means that deploying a Django app of any complexity presumably remains a pain in the arse.

The Joseph Rowntree Foundation's new site

I was senior developer on the recent project at Torchbox, to rebuild the JRF's site in Drupal with ApacheSolr.

At Torchbox we've recently completely rebuilt the website for the Joseph Rowntree Foundation using Drupal. I was technical lead for the project and I'm incredibly proud of the work that the whole team, both at Torchbox and at the JRF, has put into launching such a great site.

Search indexing and retrieval is powered by Apache Solr on a Tomcat server; the Apache Tika extensions are used to scrape PDFs for their plain text. Torchbox chose a Lucene-based product like Solr because we've got a history of experience with the technology: our own proprietary CMS has used Lucene for some years, and we've built site spiders using it. So it was quite a surprise to see that the drupal.org redevelopment has chosen Apache Solr as its search technology. Also, Lucid Imagination recently decloaked, with big players aligning themselves with Lucene. Although some of it is still a bit nebulous, this news really bodes well for support and development of this open-source search technology going forwards.

We integrated Drupal with the search stack in the first instance using the apachesolr module, but found that the requirements of the site---category-based and year-based searching, archived publications, visitor-configurable number-per-page and three different search pages---very quickly outstripped the capabilities of either the core module or optional submodules within it: Tika submission and serialization/deserialization between structured searches and Lucene syntax are almost entirely handled by custom code. The apachesolr module was very friendly to our extensions, though, constituting a useful library for us to work with, although its alpha status (it's since moved to beta) meant that we had some difficulty upgrading it to try to get extra functionality, and it doesn't seem to support such Lucene edge cases as on-the-fly weighting.

Performance has been great so far (although I shouldn't jinx that sort of thing by even talking about it) and offloading search to Solr was definitely the right decision: Lucene/Tika solves a whole domain of problems related to free-text and vocabulary searches---fast indexing, stemming, stop words, conditional logic, weighting, text extraction---that Drupal's core search was always going to struggle with. The stack has been a (sometimes infuriating) pleasure to work with, and I look forward to using it in future projects.

Any Drupal site can be an Acquia Drupal site

From tomorrow onwards.

A New Year's present from Dries Buytaert:

It didn't take long for us to realize that people wanted more than Acquia Drupal: they wanted support for everything Drupal 6.x -- all modules, themes and custom code. The good news is that Acquia is a nimble company so the last weeks we worked on changing our support model to address customer demands. Starting tomorrow, we will support everything Drupal 6.x -- not just Acquia Drupal but all modules and themes available on drupal.org as well as custom code. I'm still a firm believer in Drupal distributions so Acquia Drupal still has a role as a packaged on-ramp for people getting started with Drupal. However, anyone will be able to connect any Drupal 6.x site to the Acquia Network.


The multiple magics of Drupal search

Form API is magical; core Drupal search is a twist on that magic; hooking onto that twist puts your code on yet another level of weird.

Drupal's Form API handles so much work for you that you'd be a fool not to use it as much as possible. This code snippet:

function myform_some_form($form_state) {
  $form['text'] = array(
    '#type' => 'textfield',
    '#title' => t('Your submission'),
    '#default_value' => t('Enter some text'),
    '#description' => t('Please use this field to submit some text'),
    '#required' => TRUE,
  );
  return $form;
}

creates a form with:

  • A single textfield element
  • Accessible XHTML with form labels
  • Potentially localized labels, translated into any number of languages
  • A bit of similarly localized help text below the element
  • Validation of the form submission, with the field content marked as required

That's a separate item of form functionality for each array key. And as long as you use Form API, Drupal handles validation and input sanitization for you, thus massively reducing the risk of attack by SQL injection or XSS.

Bookmarkable search URLs with POSTed search terms

But there's a catch. To encourage best practice in terms of form submission and friendly URLs, Form API defaults to HTTP POST. If site searching used Form API (which it does) then what impact would that have? Successful searches could never be bookmarked, because the URL on its own doesn't capture the POST submission.

The search module tackles this by adding an extra twist to Form API. At the end of submission processing are the following two actions:

  • Call either the function named in $form['#submit'] or $ID_submit, where $ID is typically the name of the original form creation function ("myform_some_form" above)
  • Finally, either return to the original action page of the form, or redirect to any URL specified in $form_state['redirect']

The search module therefore uses a function called search_form_submit to grab the POSTed search terms, and redirect the user to search/$SEARCH_TYPE/$SEARCH_TERMS. $SEARCH_TYPE is "node" for Drupal's out-of-box textual node searching, but if you install some other search module e.g. Apache Solr then it'll be e.g. "apachesolr_search" instead. Result: bookmarkable search URLs.

Writing your own module to handle searches

This has important ramifications if you're trying to piggyback off core search somehow: if, say, you're still using core search or a third-party module for the actual result-finding, but then you want a page other than core search to display the results.

If you want the main site search form to redirect to your own pages, for example, then you have to (a) add your own $form['#submit'] function to the stack and then (b) use that to change the core search's $form_state['redirect']:

// Implementation of hook_form_alter(), adding an extra submit callback to
// search forms identified by their existing callback
function mysearch_form_alter(&$form, $form_state, $form_id) {
  $submits = array(
    'box' => 'search_box_form_submit',
    'form' => 'search_form_submit',
  );
  if (isset($form['#submit']) && is_array($form['#submit'])) {
    // Only add our callback if one of the core search callbacks is present
    if (array_intersect($submits, $form['#submit'])) {
      $form['#submit'][] = 'mysearch_form_mysubmit';
    }
  }
}
// Submit callback, which changes the redirect using a regular-expression replace.
// $form_state must be taken by reference, or the new redirect is thrown away.
function mysearch_form_mysubmit($form, &$form_state) {
  $form_state['redirect'] = preg_replace('/^search\/[^\/]+/', 'search/my_special_search',
    $form_state['redirect']);
}

Now you've got all your site search forms redirecting to a bookmarkable page at search/my_special_search/$SEARCH_TERMS. All you have to do now is write a menu callback for that page: from here on in you're on your own for now.
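To check what that regular expression does to different redirect paths, here is a small standalone sketch that can be run outside Drupal. The helper demo_rewrite_redirect() and the sample paths are hypothetical; only the pattern and replacement are taken from the submit callback above.

```php
<?php
// Same pattern and replacement as the submit callback above, wrapped in a
// hypothetical helper so it can be run outside Drupal.
function demo_rewrite_redirect($redirect) {
  return preg_replace('/^search\/[^\/]+/', 'search/my_special_search', $redirect);
}

// Sample redirect paths (assumed for illustration).
$examples = array(
  'search/node/cheese pastie',
  'search/apachesolr_search/drupal',
  'node/12',
);
foreach ($examples as $redirect) {
  print $redirect . ' => ' . demo_rewrite_redirect($redirect) . "\n";
}
```

Only the search type (the second path segment) is swapped; the search terms after it are left untouched, and paths that don't begin with search/ pass through unchanged.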

A WTF at the heart of your Drupal feed aggregation

Do try this at home, kids: but please have the decency to feel a little dirty about it.

Embedding JSON in XML. Hah, that's ridiculous, right? Almost as ridiculous as running a successful blog in .NET/ASP. Well, RSS can combine with JSON to quickly get a Drupal site to consume complex data structures over a webservice.

Drupal's core Aggregator module understands RSS 2.0 with no tweaking, putting the text in the <description/> element into the content of quasi-node objects, so you can aggregate all sorts of syndicated content. You could build your own Google Reader if you liked that sort of thing, with articles from the BBC sitting alongside those from the Guardian.

So far so boring. And, on one level, it doesn't get much more interesting than that: Aggregator understands neither Atom XML (rich content) nor RSS that contains Dublin Core fields. There's therefore a limit to how much you can extend the actual XML format.

But what if you get a remote application to produce an RSS feed like this:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Hello, world</title>
    <link>http://example.com</link>
    <description>Recent updates</description>
    <language>en</language>
    <item>
      <title>Sample JSON encoded content</title>
      <link>Foo</link>
      <description>
        {"text": "This is some lovely JSON text"}
      </description>
      <pubDate>Mon, 24 Nov 2008 22:07:03 +0000</pubDate>
      <guid isPermaLink="false">none</guid>
    </item>
  </channel>
</rss>

"What if?" Well, you get a quasi-node of content whose body contains the literal JSON text. Not terribly exciting. But Drupal's powerful themeing system means you can override the way that such content is .

Drop a file into your theme's directory called aggregator-item.tpl.php and containing the following:

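<?php
// $content holds the feed item's <description/>: here, a JSON string
$data = json_decode($content);
if (is_object($data) && isset($data->text)) {
  print $data->text;
}
?>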

Voilà! You've unpacked the JSON data packet and accessed the content. And the packet, being JSON, can contain however much hierarchical data that you want. You could essentially encode whatever you liked at the webservice side and unpack it at the webconsumer side. You can't pickle objects very easily, unfortunately, but my recommendation is to avoid doing that sort of thing.
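To make the "hierarchical data" point concrete, here is a standalone sketch that unpacks a nested JSON payload from an RSS item without involving Drupal at all. It assumes PHP's SimpleXML and JSON extensions are available, and the extra "tags" and "author" fields are hypothetical additions to the sample feed; on a real site Aggregator does the fetching and parsing, and only the json_decode() step would live in your template.

```php
<?php
// A cut-down copy of the sample feed, with a nested (hypothetical) payload.
$rss = <<<'XML'
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>Hello, world</title>
    <link>http://example.com</link>
    <description>Recent updates</description>
    <item>
      <title>Sample JSON encoded content</title>
      <description>
        {"text": "This is some lovely JSON text", "tags": ["drupal", "json"], "author": {"name": "Bob"}}
      </description>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);
$decoded = array();
foreach ($feed->channel->item as $item) {
  // Passing TRUE asks for nested associative arrays rather than objects.
  $data = json_decode(trim((string) $item->description), TRUE);
  if (!is_array($data)) {
    continue; // not a JSON object: skip the item rather than print garbage
  }
  $decoded[] = $data;
  print $data['text'] . "\n";
  print implode(', ', $data['tags']) . "\n";
  print $data['author']['name'] . "\n";
}
```

Lists and nested objects survive the round trip as ordinary PHP arrays, which is usually all a template needs.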

(You might need to empty your cache, if you've got any sort of zealous cacheing switched on. And this specific example will only work on PHP 5.2 or later, unfortunately: json_decode() is a recent addition to the already-polluted default PHP namespace. You could use the PHP serialize() format if you've got an older version of PHP, or some other serialized data format that PHP can understand.)

If you were building all this from scratch, then of course you'd use either XML or JSON throughout, and not this weird hybrid solution. If you were building it from scratch. And if you are building it from scratch: let me know when you're done.

"One of a kind" is not necessarily a compliment

Assembler programmers rarely wire their own hardware; C programmers rarely write assembly language; Python programmers rarely compile C binaries. The creation of a website should not be delayed by having to work out how to write website construction systems.

If you want someone to build you a website, don't let them build you a bespoke CMS to help you manage it. I've fallen prey to this very temptation, although in my defence it was as much an investigation into technology and the structure of my own content as a solution to the problem of managing said content. But after having spent two years struggling in vain---completely and utterly, to the point of not merely writing zero new code but also writing much less content---I've moved to Drupal. Since doing so I've been playing with blocked-out chunks of reusable code and content, and little related-content lists, and automatically generated RSS feeds, and a free-text search that just works. Beat that, my own programming skills!

This sort of slip-up, though---asking someone to build you a thing, be it website, web application, or offline electronic service, and letting them instead build you the tools to build it with---is tremendously common, and I can see why. You want to think that your particular site is the unique snowflake, that your way of working only fits a certain publishing workflow, or moderation structure. Of course you do: to suggest that your notions could be implemented using off-the-shelf tools is to suggest that they too are off the shelf.

But however innovative the idea, eighty percent of its implementation is, if not the same, then of the same form. How many newly-invented white goods couldn't be fitted with a standard wrench, and how many plumbers got customers to pay for them to invent a new, never-before-seen custom tool to fit them? It's not a flattering comparison, but so it is with new ideas for using the web. Every publishing system needs ways of filtering, searching, and unpublishing content; every website with dynamic content needs to address cross-site scripting, error catching and logging, and formatting raw content; and every successful site needs to consider cacheing, spooling, and load balancing or throttling.

And while web development companies can and do solve all of these problems (you didn't necessarily ask, but Torchbox does it quite nicely, thank you very much), they have to justify to you as a client why they're better placed to solve them than people whose core business is to solve them. The same goes for application frameworks: if the company's meant to be building you an application, but they're not recommending a framework like Ruby on Rails or Django, then you should ask why. And they should be able to tell you.

Maybe the system you end up with won't quite let you moderate content the way you're used to, or the way you're expecting to. There may well be good reasons for that---and the framework or CMS has just stopped you making a very silly mistake---but if it was just a case of it being a reasonable tradeoff of functionality and developer time in the original framework, then that's the point where your web company of choice can start to extend and build on the framework. By that time, that eighty percent is already there, and secure, and the bespoke twenty percent that every site needs can sit on a solid, reliable base.

I can't tell you how happy I've been since moving from my homegrown system---much as I loved its slickness and elegance---to Drupal---much as you can always argue about the quality of other people's code, essentially just whining that it's not quite how you would have done it. Content, after all, is born free; yet it is everywhere in custom-built chains.
