Unicode, accented characters, Drupal Views Data Export and Excel

If you need to assemble listings of content in Drupal, Views is what you use. And if you need to export such a listing, into offline formats like CSV, Views Data Export is a definite contender for how to do it. However, when you open the output in Microsoft Excel, you can end up—intentionally or otherwise—learning a great deal about the internals of Unicode encoding.

Excel's handling of Unicode characters

Excel has a problem. Well, Excel has many problems, but handling Unicode is one of them. It's unable to reliably detect when a file is Unicode rather than ASCII or one of the old Latin encodings.

I say "reliably", because I've had some exports that were fine (accented characters in Spanish names) and others that were broken (pound symbols in currency amounts.) The problem also only occurred on double-click opening rather than importing and thus selecting encoding options; but as exports sometimes need to be available to random members of the public, or transient staff or volunteers, user training was not an option.

We only had to deal with Excel for Windows, which simplified things slightly: but we did still need a fix.

The encoding hint at the start of Unicode files

As the table on the Stack Overflow issue above suggests, we can solve this problem by beginning our CSV files with a Byte Order Marker (BOM). This solves a specific problem, whereby Unicode characters can be represented by more than one byte on disk, and those bytes can be in either ascending or descending order. Arguments about which byte should come first are generally likened to those between characters from Gulliver's Travels. But Unicode as a standard remains strictly byte-order agnostic: it just states that, if you're using an order-sensitive format, you should definitely start the file with a BOM.

In theory, BOMs are therefore recommended in UTF-16 but certainly not required in UTF-8: its special, "compressed" encoding means that it rapidly becomes clear from the file which order the bytes are in, because only certain byte combinations are permitted. But adding a BOM to UTF-8 files at least hints at the fact that (a) the file is Unicode and (b) each character's byte sequence is ordered in one or other direction. This hint is at least useful for badly-behaved software; like, well, Excel!

Overriding Views Data Export's theme templates

Each VDE file format uses three files in the theme/ subfolder of the module:

  • views-data-export-[FORMAT]-header.tpl.php
  • views-data-export-[FORMAT]-body.tpl.php
  • views-data-export-[FORMAT]-footer.tpl.php

Because we're only modifying the beginning of the file, we therefore want to copy views-data-export-csv-header.tpl.php to our site's custom theme, and add a single executable line (added below, with two comment lines) to write the BOM:

<?php
 
// Print out header row, if option was selected.
if ($options['header']) {
  // Begin file with UTF-8 BOM.
  print "\xEF\xBB\xBF";
  // Now continue to output header as normal.
  print implode($separator, $header) . "\r\n";
}

You will need to clear the theme-registry cache; you might also need to edit the view and, under "Advanced > Theme information", rescan templates for theme files (I'm never completely certain of how Views' independent cache interacts with the theme registry.) And you could rename your file with the name of the view suffixed to it, if you really only wanted certain views to have the BOM fix.

But that's all you need to prefix a BOM to CSV files with Views Data Export, almost...

The catch: if you're using an admin theme

... that is: for anonymous users. However, some data exports are understandably restricted to only certain users, and we found that a particular admin-only export wasn't beginning with a BOM: our custom theme files weren't being used.

The clue was that the view's path(s) began /admin/*—in the site's administration backend—and this section of the site was using a different theme: an off-the-shelf admin theme, that we didn't really want to hack. Instead, I created a "shell" sub-theme of that admin theme, in order to put the templates in that.

As the sub-theme inherited everything in its parent, the only file I needed alongside my csv-header template was a .info file containing:

  • Sub-theme name, description and core=7.x compatibility declaration.
  • Reference to the other theme as base theme.
  • Duplication of all the regions from the base theme.

(I also added a screenshot.png, for completeness.) Then, with a copy of our custom csv-header template (as printed inline above) in this new admin theme; that admin theme selected on the "Appearance" page in Drupal; the fix finally worked for everyone: even admin users could download CSV with a BOM at the start, and Excel was finally happy.

Summary

It takes three bytes of output, and around a hundred extra characters of PHP in a file, to categorically fix Views Data Export for Excel. But the extra fiddliness in Drupal's theme layer, and the admin theme, means that you might need to duplicate that PHP file in a couple of places: you might even need to create a new theme to hold it in. But it's still a straightforward fix. If only the rest of Excel were so straightforwardly fixable!