Extend RSS in Drupal with arbitrary extra data

RSS 2.0 (which is what we usually mean when we talk about RSS, although other formats are available) is a great format for syndicating, but in itself it isn't very flexible. However, within certain limits, you can customize RSS with extra XML elements and attributes. Not every RSS reader will be able to interpret your custom additions, but as long as you "play the game" properly, your extensions won't get in the way of those third parties, and they'll still be able to parse the underlying plain RSS.

Many extensions already exist, and you can use them as long as you declare their namespaces at the top of your RSS file. For example, many existing RSS feeds include Dublin Core authorial metadata, in tags usually prefixed "dc":

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- ... -->
  <item>
    <dc:creator>Brian Blessed</dc:creator>
    <!-- ... -->
  </item>
</rss>

GeoRSS is a separate RSS-extension standard, to provide RSS items with simple location data:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:georss="http://www.georss.org/georss">
  <!-- ... -->
  <item>
    <georss:point>53.3836 -1.4669</georss:point>
    <!-- ... -->
  </item>
</rss>

As long as your element and attribute names begin with the same namespace prefix that you declare at the top, then as far as a basic RSS parser is concerned it's like your extensions live in a parallel world; whereas RSS parsers that understand the extensions (and many understand the above two examples) can delve into that parallel world and find the information. Think of your extensions as being written with an ultraviolet marker, and only people with special UV lights can see them: everyone else can still see all the other writing.
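As a rough illustration of that parallel world, here is a minimal standalone sketch in plain PHP (outside Drupal, using PHP's bundled SimpleXML extension; the feed content and variable names are invented for the example). A lookup that ignores namespaces finds the plain title but not the creator, while a lookup that asks for the Dublin Core namespace finds it:

```php
<?php

// A tiny RSS document with one Dublin Core extension element.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Hello world</title>
      <dc:creator>Brian Blessed</dc:creator>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
$item = $xml->channel->item[0];

// Plain RSS view: only un-namespaced children are visible here.
$plain_title = (string) $item->title;
$plain_creator = (string) $item->creator;

// Namespace-aware view: ask explicitly for the Dublin Core namespace URI.
$dc = $item->children('http://purl.org/dc/elements/1.1/');
$dc_creator = (string) $dc->creator;

var_dump($plain_title, $plain_creator, $dc_creator);
```

The plain lookup of `creator` comes back empty, which is the "no UV light" case: the element is in the document, but only visible to code that asks for its namespace.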

Below we're going to show how you can create your own namespace, your own parallel storage engine, for keeping track of which Drupal vocabulary a given RSS category belongs to. This serves as an example of some of the things you'd need to do to extend both RSS syndication and RSS consumption. For complete success you need some control over both the syndicating and the consuming website, but other consumers, outside of your control, will still understand the result.

Syndicator server: extending outputted RSS to add vocabularies to categories

Let's imagine your module is called rsshierarchy in the following examples. We're going to add a custom namespace "rh" to our RSS:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:rh="http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data">
  <!-- ... -->
  <item>
    <category rh:vocabulary="ingredients">Baking powder</category>
    <!-- ... -->
  </item>
</rss>

The URL you define for your namespace doesn't actually have to resolve to anything (it just has to be unique), but it's good if it does. You can point it at human-readable documentation, or at a DTD or an XML Schema (XSD): something defining your extensions is useful.
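To see why the uniqueness of the URI is what matters, here is a minimal standalone sketch (plain PHP with SimpleXML, not Drupal code; the constant `EXAMPLE_RH_NS`, the function `example_vocabulary_attribute` and the `other` prefix are all invented for illustration). A namespace-aware parser matches on the URI rather than the prefix, so a document that binds a different prefix to the same URI carries the same data, while the same prefix bound to a different URI does not:

```php
<?php

define('EXAMPLE_RH_NS', 'http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data');

/**
 * Hypothetical helper: read the "vocabulary" attribute in our namespace.
 *
 * Returns the attribute value as a string, or NULL if it is absent.
 */
function example_vocabulary_attribute($category_xml) {
  $category = simplexml_load_string($category_xml);
  // attributes() is given the namespace URI, not the prefix.
  foreach ($category->attributes(EXAMPLE_RH_NS) as $name => $value) {
    if ($name === 'vocabulary') {
      return (string) $value;
    }
  }
  return NULL;
}

// Same URI, two different prefixes.
$with_rh = '<category xmlns:rh="' . EXAMPLE_RH_NS . '" rh:vocabulary="ingredients">Baking powder</category>';
$with_other = '<category xmlns:other="' . EXAMPLE_RH_NS . '" other:vocabulary="ingredients">Baking powder</category>';
// Same prefix, but bound to a different URI.
$wrong_uri = '<category xmlns:rh="http://example.com/something-else" rh:vocabulary="ingredients">Baking powder</category>';

$from_rh = example_vocabulary_attribute($with_rh);
$from_other = example_vocabulary_attribute($with_other);
$from_wrong = example_vocabulary_attribute($wrong_uri);

var_dump($from_rh, $from_other, $from_wrong);
```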

We want to be able to use this RSS extension to only map certain terms into certain recipient vocabularies. For example, imagine your syndicating website, a big repository of world cuisine recipes, has "region" categories and "ingredient" categories: how do we make sure that, in your consuming website, the right categories end up in the right vocabularies?

As we'll see below, outputting these extra attributes is actually the easy part, although first you have to decide which of the two main methods of displaying RSS you're going to use.

1. Using core's RSS display

If you're just using Drupal's core RSS display, the following will quickly add extra attributes to the category tags, and an XML namespace so that they can be understood:

<?php
 
// Our custom XML namespace.
define('RSSHIERARCHY_NS', 'http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data');
 
/**
 * Implements hook_field_attach_view_alter().
 *
 * Add custom rh:vocabulary attributes to RSS <category> elements.
 */
function rsshierarchy_field_attach_view_alter(&$output, &$context) {
  // Need to have an entity with RSS elements context already.
  if (!isset($context['entity']) || !isset($context['entity']->rss_elements)) {
    return;
  }
 
  // Add custom attributes to categories.
  foreach ($context['entity']->rss_elements as $elementKey => &$element) {
    if ($element['key'] !== 'category') {
      continue;
    }
 
    // Retrieve each term's tid from the <category domain="..."> URL.
    // Turn a path alias into an internal path if necessary.
    $domain = parse_url($element['attributes']['domain']);
    $path = isset($domain['path']) ? trim($domain['path'], '/') : '';
    if (strpos($path, 'taxonomy/term') === FALSE) {
      $path = (string) drupal_lookup_path('source', $path);
    }
    // Match against the resolved path (not the original URL, which may be
    // an alias), and allow term IDs of more than one digit.
    $element['tid'] = 0;
    if (preg_match('!(?:^|/)taxonomy/term/(\d+)$!', $path, $matches)) {
      $element['tid'] = (int) $matches[1];
    }
    if (!$element['tid']) {
      continue;
    }
 
    // Load the term and add its vocabulary's machine name as the custom
    // attribute.
    $term = taxonomy_term_load($element['tid']);
    if ($term) {
      $element['attributes']['rh:vocabulary'] = $term->vocabulary_machine_name;
      // Also conditionally add the namespace to the <rss> root element.
      $context['entity']->rss_namespaces['xmlns:rh'] = RSSHIERARCHY_NS;
    }
  }
} 

And that's it! Your RSS feed will now output categories with vocabularies attached.
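The term-ID extraction is the fiddliest step above, so here is a minimal standalone sketch of just that part (plain PHP, no Drupal; the function name `example_tid_from_domain` is hypothetical, and it assumes the URL already uses the internal `taxonomy/term/N` path rather than an alias, since alias lookup needs Drupal):

```php
<?php

/**
 * Hypothetical helper: extract a term ID from a <category domain="..."> URL.
 *
 * Returns 0 when the URL's path does not end in taxonomy/term/N.
 */
function example_tid_from_domain($domain_url) {
  $path = parse_url($domain_url, PHP_URL_PATH);
  if (!is_string($path)) {
    // No path component at all (or an unparseable URL).
    return 0;
  }
  // (\d+) rather than (\d), so that multi-digit term IDs are matched.
  if (preg_match('!(?:^|/)taxonomy/term/(\d+)$!', trim($path, '/'), $matches)) {
    return (int) $matches[1];
  }
  return 0;
}

$tid_simple = example_tid_from_domain('http://example.com/taxonomy/term/42');
$tid_subdir = example_tid_from_domain('http://example.com/drupal/taxonomy/term/7/');
$tid_alias = example_tid_from_domain('http://example.com/tags/baking-powder');

var_dump($tid_simple, $tid_subdir, $tid_alias);
```

An aliased URL yields 0 here, which is why the hook looks up the alias's source path before matching.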

2. Using Views RSS fields

Views RSS does much more (it allows you to override the outgoing author field, or the published date, for example), so it's necessarily a more complex framework to integrate your code with. It also has some slightly odd logic in its handling of preprocessing hooks: if you define one, as we do below, then you have to build the XML tag entirely from scratch. This is OK if only one module plays this trick; otherwise, whoever gets there last wins.

Here's an example using Views RSS:

<?php
 
// Our custom XML namespace.
define('RSSHIERARCHY_NS', 'http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data');
 
/**
 * Implements hook_views_rss_namespaces().
 *
 * We must add the rh namespace, even if we might not need it.
 */
function rsshierarchy_views_rss_namespaces() {
  $namespaces['rh'] = array(
    'prefix' => 'xmlns',
    'uri' => RSSHIERARCHY_NS,
  );
  return $namespaces;
}
 
/**
 * Implements hook_views_rss_item_elements_alter().
 *
 * This adds our own custom preprocess function to Views RSS's theme layer.
 */
function rsshierarchy_views_rss_item_elements_alter(&$elements) {
  $elements['views_rss_core']['category']['preprocess functions'][] =
    'rsshierarchy_views_rss_preprocess_item_category';
}
 
/**
 * Preprocess function for Views RSS item:category.
 */
function rsshierarchy_views_rss_preprocess_item_category(&$variables) {
  $variables['elements'] = array();
 
  // Implementing a preprocess hook changes the internal logic of views_rss,
  // so we have to completely rebuild each <category> XML from scratch.
  foreach ($variables['raw'] as $tid => $term_wrapper) {
    // We do at least get each term's information, as an array.
    $term = $term_wrapper['raw'];
    $variables['elements'][] = array(
      'key' => 'category',
      'value' => $term['name'],
      'attributes' => array(
        'domain' => url($term['path'], array('absolute' => TRUE)),
        'rh:vocabulary' => $term['vocabulary_machine_name'],
      ),
    );
  }
}
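To make the data shape concrete, here is a rough standalone sketch of how an element array like the ones built above could end up as a `<category>` tag. This is not Views RSS's or Drupal's actual rendering code: `example_render_element` is a hypothetical stand-in, and the sample term data is invented.

```php
<?php

/**
 * Hypothetical renderer for a single array('key', 'value', 'attributes').
 */
function example_render_element(array $element) {
  $attributes = '';
  foreach ($element['attributes'] as $name => $value) {
    // Attribute names (including the "rh:" prefix) are output as-is;
    // values are escaped for XML.
    $attributes .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
  }
  return '<' . $element['key'] . $attributes . '>'
    . htmlspecialchars($element['value'], ENT_QUOTES, 'UTF-8')
    . '</' . $element['key'] . '>';
}

$element = array(
  'key' => 'category',
  'value' => 'Salt & pepper',
  'attributes' => array(
    'domain' => 'http://example.com/taxonomy/term/3',
    'rh:vocabulary' => 'ingredients',
  ),
);

$rendered = example_render_element($element);
echo $rendered, "\n";
```

The prefixed attribute name is only meaningful because the `rh` namespace is declared on the root `<rss>` element, which is what the namespaces hook above takes care of.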

Consumer website: consuming RSS so that categories are understood

Even with the slight added complexity of Views RSS, that seemed fairly simple, right? Yes, but while you've been able to extend outgoing RSS with your own namespace, the hard part is convincing an existing RSS parsing system to actually find and use your extensions.

In Drupal, we usually use Feeds to consume RSS. It's a great framework and pretty extensible: a lot of it is object-oriented code, so the objects can be extended to minimize the extra code required to make something work. However, extensions by default appear as entirely new feed processors etc. and our preference here is to merely augment the existing RSS Parser and Node Processor plugins.

To get around the limitation, we override the existing Feeds plugins using an alter hook. That way they silently replace the existing plugins and no further work is needed.

First, ensure your module's .info file contains references to the classes as follows:

files[] = plugins/FeedsNodeProcessorRH.inc
files[] = plugins/FeedsSyndicationParserRH.inc

The code that follows might work without these two lines, but adding them is being a good Drupal citizen: anyone having to override your work in future will thank you for making your classes autoloadable.

It's time to work on the module code itself. First, in the .module file, extend the Feeds "field mapping" UI, so that for category ("tags") mappings, you can also store a "source vocabulary": the expected vocabulary name that will be used to filter out any incoming categories that don't declare themselves in that vocabulary:

<?php
/**
 * Implements hook_form_FORM_ID_alter().
 *
 * Alter the expanded chunk of the Feeds UI mapping form which lets you
 * specify criteria for mapping into elements.
 */
function rsshierarchy_form_feeds_ui_mapping_form_alter(&$form, &$form_state) {
  foreach ($form['config'] as $config_key => &$config) {
    // We have to have both config settings, and also separately
    // a mapping already registered so we can check it's from "tags"
    if (!isset($config['settings'])) {
      continue;
    }
    if (!isset($form['#mappings'][$config_key]) || ($form['#mappings'][$config_key]['source'] !== 'tags') ) {
      continue;
    }
 
    // Add our extra configuration item to the configuration here and it will
    // automatically be saved into the #mappings settings.
    $config['settings']['vocabulary_name'] = array(
      '#type' => 'textfield',
      '#title' => t('Source vocab name to map'),
      '#description' => t('The rh:vocabulary extension to RSS will carry a vocabulary name we can search for.'),
      '#default_value' => isset($form['#mappings'][$config_key]['vocabulary_name']) ? $form['#mappings'][$config_key]['vocabulary_name'] : '',
    );
  }
}

Next, register your two custom classes with Feeds, by replacing the existing classes via the following hook:

<?php
/**
 * Implements hook_ctools_plugin_pre_alter().
 *
 * Switch plugins to use our extended classes.
 */
function rsshierarchy_ctools_plugin_pre_alter(&$plugin, &$info) {
  if ($plugin['name'] === 'Node processor') {
    $plugin['handler']['class'] = 'FeedsNodeProcessorRH';
    $plugin['handler']['file'] = 'FeedsNodeProcessorRH.inc';
    $plugin['handler']['path'] = drupal_get_path('module', 'rsshierarchy') . '/plugins';
  }
 
  if ($plugin['name'] === 'Common syndication parser') {
    $plugin['handler']['class'] = 'FeedsSyndicationParserRH';
    $plugin['handler']['file'] = 'FeedsSyndicationParserRH.inc';
    $plugin['handler']['path'] = drupal_get_path('module', 'rsshierarchy') . '/plugins';
  }
}

Now we have to extend the two Feeds-bundled classes with our own new ones. First we extend the RSS parser, so that it can understand your transmitted RSS extensions. The class relies on the RSSHIERARCHY_NS constant defined earlier, so make sure the module on the consuming site defines it too. Inside your module, create a folder plugins/ and add the following file as plugins/FeedsSyndicationParserRH.inc:

<?php
 
/**
 * @class
 * FeedsSyndicationParserRH.
 */
class FeedsSyndicationParserRH extends FeedsSyndicationParser {
  /**
   * Implements FeedsParser::parse().
   */
  public function parse(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
    // Call parent class to do most of the heavy lifting.
    $result = parent::parse($source, $fetcher_result);
 
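    // Have to re-parse the XML as we only get a data array from Feeds.
    $xml = @simplexml_load_string($fetcher_result->getRaw(), NULL, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOCDATA);
    if ($xml === FALSE) {
      // Unparseable XML: fall back to whatever the parent parser produced.
      return $result;
    }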
 
    // Extract rh:* attributes and add to the original $result.
    foreach ($xml->xpath('//item') as $i => $item) {
      foreach ($item->xpath('category') as $category) {
        // Store in temporary array so we always set *something* for each
        // category; that way 'tags' and 'categoryToVocabulary' match up
        // 1-to-1, even if empty.
        $categoryToVocabulary = array();
        foreach($category->attributes(RSSHIERARCHY_NS) as $key => $value) {
          $categoryToVocabulary[$key] = (string)$value;
        }
        $result->items[$i]['categoryToVocabulary'][] = $categoryToVocabulary;
      }
    }
 
    return $result;
  }
}
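The extraction loop can be tried outside Feeds. The following standalone sketch (plain PHP with SimpleXML; the function name `example_category_vocabularies`, the constant `EXAMPLE_PARSER_NS` and the sample feed are invented for illustration) applies the same approach to an RSS string and returns, for each item, one attribute array per `<category>`:

```php
<?php

define('EXAMPLE_PARSER_NS', 'http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data');

/**
 * Hypothetical helper: collect namespaced attributes per item category.
 */
function example_category_vocabularies($rss_string) {
  $items = array();
  $xml = simplexml_load_string($rss_string);
  if ($xml === FALSE) {
    return $items;
  }
  foreach ($xml->xpath('//item') as $i => $item) {
    $items[$i] = array();
    foreach ($item->xpath('category') as $category) {
      // Always append an array, even an empty one, so positions line up
      // with the order of the <category> elements.
      $attributes = array();
      foreach ($category->attributes(EXAMPLE_PARSER_NS) as $key => $value) {
        $attributes[$key] = (string) $value;
      }
      $items[$i][] = $attributes;
    }
  }
  return $items;
}

$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:rh="http://www.jpstacey.info/blog/2015-11-25/extend-rss-drupal-arbitrary-extra-data">
  <channel>
    <item>
      <title>Scones</title>
      <category rh:vocabulary="ingredients">Baking powder</category>
      <category rh:vocabulary="region">Britain</category>
      <category>Untagged</category>
    </item>
  </channel>
</rss>
XML;

$vocabularies = example_category_vocabularies($rss);
print_r($vocabularies);
```

The third category has no `rh:vocabulary` attribute, so it contributes an empty array; that placeholder is what keeps the list aligned with the list of tags.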

Finally, add the following code to plugins/FeedsNodeProcessorRH.inc. This will extend the node processor, so that when it maps RSS categories into a taxonomy field, it will check to see if there are any vocabularies on each <category/> tag; if so, and if the taxonomy field is configured to only receive certain vocabularies (using your RSS extension above) then the following will only map terms whose rh:vocabulary attribute matches:

<?php
/**
 * @class
 * FeedsNodeProcessorRH.
 */
class FeedsNodeProcessorRH extends FeedsNodeProcessor {
  /**
   * Overrides parent::map().
   *
   * By default, $item is just an array of text categories. Extend this to
   * an array containing a key 'tags' with a sub-array, plus arbitrary
   * other keys. We then override mapToTarget() to unwrap this.
   */
  protected function map(FeedsSource $source, FeedsParserResult $result, $target_item = NULL) {
    // Smuggling categoryToVocabulary inside tags.
    $item =& $result->current_item;
    if (isset($item['tags']) && isset($item['categoryToVocabulary'])) {
      $item['tags'] = array(
        'tags' => $item['tags'],
        'categoryToVocabulary' => $item['categoryToVocabulary'],
      );
    }
 
    return parent::map($source, $result, $target_item);
  }
 
  /**
   * Overrides parent::mapToTarget().
   *
   * Unwrap the 'tags envelope' from above if we need to, then
   * pass everything on to the parent method to do the heavy lifting.
   */
  protected function mapToTarget(FeedsSource $source, $target, &$target_item, $value, array $mapping) {
    // Unwrap any value that map() wrapped in the 'tags envelope', whether
    // or not this mapping has a source vocabulary configured.
    if (is_array($value) && isset($value['categoryToVocabulary'])) {
      // Only filter if a (non-empty) source vocabulary name has been set.
      if (!empty($mapping['vocabulary_name'])) {
        foreach ($value['tags'] as $valueKey => $valueData) {
          // Maybe this tag hasn't been extended with rh:vocabulary data?
          if (!isset($value['categoryToVocabulary'][$valueKey]['vocabulary'])) {
            continue;
          }
          // If it has, and the vocabulary doesn't match our setting, unset it.
          if ($value['categoryToVocabulary'][$valueKey]['vocabulary'] !== $mapping['vocabulary_name']) {
            unset($value['tags'][$valueKey]);
          }
        }
      }
      // Now unpack tags again, after having smuggled it through with extras.
      $value = $value['tags'];
    }
 
    return parent::mapToTarget($source, $target, $target_item, $value, $mapping);
  }
}

And that's it: you should now be able to map terms from vocabulary to vocabulary, across RSS.
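The filtering rule itself is easier to see without the Feeds classes around it. Here is a minimal standalone sketch (plain PHP; `example_filter_tags` and the sample data are hypothetical) that takes the two parallel arrays and a configured vocabulary name, and drops only those tags that declare a different vocabulary:

```php
<?php

/**
 * Hypothetical helper: keep tags whose vocabulary matches, or is undeclared.
 */
function example_filter_tags(array $tags, array $category_to_vocabulary, $vocabulary_name) {
  foreach ($tags as $key => $tag) {
    // Tags with no declared vocabulary are left alone.
    if (!isset($category_to_vocabulary[$key]['vocabulary'])) {
      continue;
    }
    // Tags declaring a different vocabulary are dropped.
    if ($category_to_vocabulary[$key]['vocabulary'] !== $vocabulary_name) {
      unset($tags[$key]);
    }
  }
  // Re-index so the result is a plain list.
  return array_values($tags);
}

$tags = array('Baking powder', 'Britain', 'Untagged');
$category_to_vocabulary = array(
  array('vocabulary' => 'ingredients'),
  array('vocabulary' => 'region'),
  array(),
);

$ingredients = example_filter_tags($tags, $category_to_vocabulary, 'ingredients');
$regions = example_filter_tags($tags, $category_to_vocabulary, 'region');

print_r($ingredients);
print_r($regions);
```

Note that the untagged category survives both filters. That mirrors the processor above, which only removes a tag when it positively declares a non-matching vocabulary.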

Summary

RSS is so simple it's almost simplistic, to the point where Atom and RDF were designed to try to provide a more complex and configurable system. But if you need to get most of your point across in RSS 2.0, you can always extend it, either with off-the-shelf namespaces like Dublin Core, or custom namespaces that you build yourself.

Not everyone will be able to understand the data in your custom namespaces, but if you control both ends of the RSS pipeline then you will be able to pass rich information between websites by piggybacking on RSS. And if you use namespaces properly, other sites will still be able to consume the resulting RSS.