Bulk updates across many nodes using Entity Field Query and Entity Metadata Wrapper

I was recently trying to solve a Drupal problem: how do you update a large number of nodes, swapping the long-text version of a text field into the summary version of it, and removing the long-text version? Those two stalwarts of bulk updates in the admin interface, Rules and Views Bulk Operations, were only getting me so far: either both versions of the field would end up empty, or copying into the summary field would strip the markup because of how Rules "treats" summary as a field without its own filter format.

In the end, I bit the bullet and wrote a command-line script with Drush. You can save any PHP file as a Drush script, and as long as you run it with:

drush php-script [SCRIPTNAME]

then it will inherit the whole of the current Drupal site's code and configuration before being run. You might want to use a non-executable file suffix for the [SCRIPTNAME]. That way, if you do accidentally leave it lying around on your production filesystem, a curious or malicious site visitor can't "accidentally" run it. I usually give such scripts a name like body_swap.drush.

To write my script, I took advantage of the fact that, from Drupal 7 onwards, nodes are a particular breed of the abstraction known as entities, and there are two very handy coding concepts in Drupal for dealing with entities, either in bulk or singly:

In bulk: Entity Field Query (Drupal core)
Because entities are an abstraction, in D7 you can't reliably run SQL statements directly against database tables to work out which entities match which criteria. Entity Field Query understands that abstraction, though, and provides db_select-like syntax for building up a query. To avoid performance problems, the query only returns the absolute minimum of data: typically, in the case of nodes, node ID, revision ID and content type. This EFQ quick overview is really helpful and worth reading.
Singly: Entity Metadata Wrapper (in the third-party Entity API module)
In Drupal, core entities like nodes aren't very object-oriented: they're really just object-like arrays of arrays of arrays, and so on. Entity Metadata Wrapper wraps around an entity, and provides more object-oriented routes to reading and modifying its properties: fields, author UID, status etc. It especially removes the pain of guessing: what language key do I use for this field? Is it "und"? Is it "en"? This EMW quick overview is slightly less helpful, but only because there's a lot to cover in EMW. For complex fields like links or rich text, a handful of examples really helps.
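
To make that language-key guesswork concrete, here is a rough sketch of what raw field access looks like without EMW. It runs outside Drupal, so the node is a hand-built stand-in: its nid, field contents and text format are made up, and example_first_field_item() is a hypothetical helper of my own, not a Drupal or Entity API function.

```php
<?php

// Stand-in for a loaded Drupal 7 node (assumed shape, built by hand).
// Field data is nested as field name -> language key -> delta -> columns.
$node = new stdClass();
$node->nid = 42;
$node->field_body = array(
  'und' => array(
    0 => array(
      'value' => '<p>Full text of the report.</p>',
      'summary' => '',
      'format' => 'filtered_html',
    ),
  ),
);

/**
 * Hypothetical helper: returns the first item of a field, trying each
 * language key in turn. This is the guessing that EMW saves you from.
 */
function example_first_field_item($entity, $field_name, array $langcodes) {
  if (!isset($entity->{$field_name}) || !is_array($entity->{$field_name})) {
    return NULL;
  }
  foreach ($langcodes as $langcode) {
    if (isset($entity->{$field_name}[$langcode][0])) {
      return $entity->{$field_name}[$langcode][0];
    }
  }
  return NULL;
}

// Is it "en"? Is it "und"? Try both, in that order.
$item = example_first_field_item($node, 'field_body', array('en', 'und'));
print $item['value'] . "\n";
```

On a real site with the Entity API module enabled, the script further down gets the same item array (with its "value" and "summary" keys) from $wrapped_node->field_body->value(), with no language key or delta in sight.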

In summary: EFQ provided me with a list of node IDs to iterate over; EMW wrapped around each loaded node in turn so I could modify the fields. Below is the finished script, with comments explaining aspects of it.

<?php
 
// Search for publication nodes of publication type "report".
$efq = new EntityFieldQuery();
$efq
  // Conditions on the entity - its type and its bundle ("sub-type")
  ->entityCondition('entity_type', 'node')
  ->entityCondition('bundle', 'publication')
  // Conditions on the entity's fields
  // * a "sub-sub-type" we use for publications.
  ->fieldCondition('field_publication_type', 'value', 'report')
  // * check for a full-text version of the body: once processed, this is empty.
  ->fieldCondition('field_body', 'value', '', '!=');
 
// Execute, returning an array of arrays.
$result = $efq->execute();
 
// Ensure we've got some node results.
if (!isset($result['node'])) {
  drush_log("No nodes to process.", "ok");
  return;
}
 
// Iterate over the result, loading one node at a time.
foreach ($result['node'] as $nid => $stub_node) {
  // Load the full node and wrap it with entity_metadata_wrapper().
  $node = node_load($nid);
  $wrapped_node = entity_metadata_wrapper("node", $node);
 
  // If there's a full-text field_body, swap it into the summary;
  // then delete that full version, so it's blank, and save.
  $full_body = $wrapped_node->field_body->value();
  if ($full_body["value"]) {
    $full_body["summary"] = $full_body["value"];
    $full_body["value"] = "";
    $wrapped_node->field_body->set($full_body);
    $wrapped_node->save();
  }
 
  // Log our progress.
  drush_log("Processed nid={$node->nid}, title={$node->title}", "ok");
}

EFQ and EMW both worked together really well, and it felt clear to me what was going on, and where bugs or problems might arise. I can heartily recommend using them both, separately or together.

Comments

Just for the sake of optimization it's probably worth using node_load_multiple() here instead of node_load in a loop. Aside from that a very interesting read - definitely bookmarking!

That's certainly a good point, but I would be wary of using node_load_multiple() if you're going to have very large result sets. It's fine IMHO to use it for maybe 5 or 10 nodes (e.g. a nodereference or entityreference field) but the data I've been working on has returned 100s or 1000s of them.

At that point, you would want to: load each node in turn; keep overwriting the $node and $wrapped_node variables; and let PHP's garbage collection remove the older nodes and EMW objects from memory for you (it might need an unset() hint, I'm not sure.) However, if you're loading the whole data set at the start, then doing more complex manipulations on each node in turn, then on big, complex sites you're liable to run out of memory before you finish.

It's definitely worth looking at on a case-by-case basis, though.
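
One possible middle ground is to load nodes in small batches rather than one at a time or all at once. What follows is only a sketch under stated assumptions: example_process_in_batches() and its two callbacks are hypothetical names, not part of Drupal or Drush, and the loader here is a stand-in so the code runs outside Drupal. On a real Drupal 7 site the loader would be something like node_load_multiple(), and because Drupal 7 keeps loaded nodes in a static cache, it may also be worth calling entity_get_controller('node')->resetCache() after each batch.

```php
<?php

/**
 * Hypothetical helper: loads and processes IDs in fixed-size batches.
 *
 * $load_multiple takes an array of IDs and returns an array of objects;
 * $process takes one loaded object. $batch_size must be at least 1.
 */
function example_process_in_batches(array $ids, $batch_size, $load_multiple, $process) {
  $processed = 0;
  foreach (array_chunk($ids, $batch_size) as $batch) {
    // This function only holds one batch of loaded objects at a time.
    $entities = call_user_func($load_multiple, $batch);
    foreach ($entities as $entity) {
      call_user_func($process, $entity);
      $processed++;
    }
    unset($entities);
  }
  return $processed;
}

// Stand-in loader (assumption): in Drupal 7 this role would be played by
// node_load_multiple($nids), fed with array_keys($result['node']) from EFQ.
$loader = function (array $nids) {
  $nodes = array();
  foreach ($nids as $nid) {
    $nodes[$nid] = (object) array('nid' => $nid);
  }
  return $nodes;
};

// Stand-in processing step: just records which nids were visited.
$seen = array();
$collector = function ($node) use (&$seen) {
  $seen[] = $node->nid;
};

$count = example_process_in_batches(range(1, 7), 3, $loader, $collector);
print "Processed $count nodes\n";
```

With seven IDs and a batch size of three, the loader is called three times (for three, three and one IDs). Whether this actually beats a plain node_load() loop on a given site is something to measure rather than assume.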

Nice and simple article but

third-party Entity API module

sounds too commercial, when that module (like almost all others) is a contributed one :)

Thanks, but I've found from my clients that people who aren't used to Drupal's terminology don't generally understand what "contributed" means. People "contribute" to Drupal core, after all, so Drupal core is "contributed" as far as they're concerned.

"Third party" makes it clearer that it's separate from the main Drupal project, and it doesn't sound at all commercial to me: but then I very rarely work with any commercial software.