xsl

Pretending that Javascript is XSL, part 3: hCard to vCard

In previous posts (part 1, part 2) I established the possibility that there were advantages to making Javascript more functional, to bring it in line with CSS and XSL. I didn’t say what these were, particularly, but I then provided a few bits and pieces on top of jQuery to make Javascript just that: functional and quasi-XSL in its behaviour.

Now I’d like to start exploiting that behaviour, and I’m going to use the hCard microformat to illuminate its use. Briefly: a microformat is a set of agreed HTML classes used to invisibly encode structured semantic data in HTML; hCard is the implementation in HTML of the vCard specification for “virtual business card” files, using microformat classes. If you mark up people’s addresses using the hCard classes, then it’s possible to automate the conversion from hCard-enabled HTML to vCards, meaning you can click on buttons on webpages and have a vCard served up to you containing the contact information present in the webpage, verbatim, in a format you can put into your address book of choice.

One of the most-used conversion methods, Brian Suda’s X2V (a web service which converts XHTML with hCard markup into vCards and then presents them to the site visitor), uses XSL. In fact, that was what got me thinking about this whole system. Brian’s work is neat, although his own server takes a hit every time someone uses the web service, and it only works on XHTML, not non-XML HTML. What if, I thought, we could get the browser to do it instead, by implementing template-like functional Javascript?

Anyway, below we find a couple of hCards, culled more or less directly from the Microformats examples page.

Frank Dawson
Lotus Development Corporation
work address (mail and packages):
6544 Battleford Drive

Raleigh NC 27613-3502

U.S.A.
+1-919-676-9515 (w, vm)
+1-919-676-9564 (wf)

Netscape Communications Corp.
work address:
501 E. Middlefield Rd.

Mountain View, CA 94043

U.S.A.
+1-415-937-3419 (w, vm)
+1-415-528-4164 (wf)

They look like slightly unstructured HTML, don’t they? That’s sort of the point. But hidden in the HTML are hCard classes. How do we tease them out with Javascript?

Well, there’s a question to be asked before that, I suppose, which is: why would we follow your method, and not someone else’s? What’s so good about functional Javascript? Good question. Well, if every hCard looked like the above, then you could write some completely procedural Javascript to turn it into a vCard. No problem.

But what if the order of the content was changed? The hCard—indeed, microformats in general—has quite a malleable structure, with some classes sometimes appearing on elements inside other elements, and sometimes not. What if there were more telephone numbers and email addresses, and what if they turned up in all sorts of different orders? These are just HTML classes, after all. With procedural Javascript you could start writing switch/case statements to cover every opportunity, and essentially come up with one big unavoidably recursive function. It’ll be hard to structure, hard to maintain and completely unmodular. A document-driven method of extracting the vCard, on the other hand, doesn’t need to worry about all the various different combinations of nested elements: it would just keep one eye on context and process whatever it found. Also, the development cycle could be faster, because templates could be overridden without breaking existing behaviour: just use the template() command to override existing behaviour.
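To make the document-driven idea concrete, here is a minimal DOM-free sketch. It assumes nodes are plain objects rather than real elements, and the names rules, walkChildren and applyRules are illustrative only; they are not the treewalker() and template() functions from part 2.

```javascript
// DOM-free sketch: nodes are plain objects { cls, text, children }.
// `rules` maps a class name to a rule function (hypothetical names throughout).
const rules = {
  fn:  (node, walk) => "FN:" + node.text + "\n" + walk(node),
  org: (node, walk) => "ORG:" + node.text + "\n" + walk(node),
};

// Default behaviour: concatenate whatever the children return.
function walkChildren(node) {
  return (node.children || []).map(child => applyRules(child)).join("");
}

// Use the rule registered for this node's class, or fall back to the default.
function applyRules(node) {
  const rule = rules[node.cls];
  return rule ? rule(node, walkChildren) : walkChildren(node);
}

// Two cards carrying the same data with different nesting.
const flat = { cls: "vcard", children: [
  { cls: "fn", text: "Frank Dawson" },
  { cls: "org", text: "Lotus Development Corporation" },
]};
const nested = { cls: "vcard", children: [
  { cls: "wrapper", children: [
    { cls: "fn", text: "Frank Dawson" },
    { cls: "org", text: "Lotus Development Corporation" },
  ]},
]};
```

Both applyRules(flat) and applyRules(nested) give the same two lines, because no rule ever needs to know how deep it sits: each one only handles its own node and hands the rest back to the walk.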

Let’s instead assume you’re following my every word. For this next bit, you’ll need Firefox and Firebug, or to stuff all these instructions into a single file. Otherwise, you’ll have to take my word for it. Firstly, I’ve included jQuery on every page of my blog, so if you’ve got the ‘bug then you don’t have to resort to my insert-JS bookmarklet to squirt it in.

So: first, create the treewalker() and template() functions from part 2. Next, assign treewalker() to body and everything below it:

template("body, body *", treewalker);
template("body, body *", treewalker, "default");

You could restrict this assignment to everything within the .vcard elements, by giving the relevant CSS specifier instead, if there were a lot of content outside the hCards. It would speed up the initial setup phase, but it does complicate the demonstration so I’ve left that refinement out.

Remember we ran the treewalking before? Do that now:

var result = document.body.treewalk();

All being well, you should get a blank string back. Now it’s time to start adding some alternative rules with template(). Try this:

template(".vcard", function() { return "BEGIN:VCARD\n" + this.default() + "END:VCARD\n"; });

Now run the treewalker again. Oh, each hCard has just given you a vCard! An… empty vCard. Isn’t that great? Um. We can add to that, though:

template(".vcard .fn", function() { return "FN:" + $(this).text() + "\n" + this.default(); });
template(".vcard .org", function() { return "ORG:" + $(this).text() + "\n" + this.default(); });

Now document.body.treewalk() doesn’t just return a vCard for every hCard, but it knows about names and organisations. Also, because we keep including the call to this.default() in our overrides, we still treewalk into any element inside the FN or ORG containers.

What about emails? Well, in the source we can spot an a.email element up there, so let’s give the following a whirl:

template(".vcard .email", function() { return "EMAIL;TYPE=internet:" + this.href.replace(/mailto:/, "") + "\n" + this.default(); });

Try running document.body.treewalk() again. Hm. I don’t know about you, but I’m getting an error from that. Ah, wait: sometimes we have span.email rather than a.email. Spans don’t have @href attributes. Well, we could change the above rule and immediately reapply it using template() with no ill effects. But instead let’s keep it in place, and use a more specific specifier to override it just on spans:

template(".vcard span.email", function() { return "EMAIL;TYPE=internet:" + $(this).find(".value").text() + "\n" + this.default(); });

Re-run the treewalker. It now finds all email hCard elements and brings them out into the vCards!
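It’s worth being clear about why the span rule wins. As template() is written in part 2, it simply assigns a function to each matched element, so the override takes effect because it was applied afterwards, not because of CSS-style specificity. Here is a DOM-free sketch of that behaviour; the plain objects, the predicate in place of a CSS specifier, the name templateSketch and the example.com address are all assumptions made for illustration.

```javascript
// Stand-ins for DOM elements; in the post these would be real nodes selected by jQuery.
const anchor = { tag: "a", cls: "email", href: "mailto:frank@example.com" };
const span = { tag: "span", cls: "email", value: "frank@example.com" };
const elements = [anchor, span];

// Hypothetical analogue of template(): a predicate replaces the CSS specifier.
function templateSketch(matches, fn, property = "treewalk") {
  elements.filter(matches).forEach(el => { el[property] = fn; });
}

// General rule: every .email element reads its href.
templateSketch(el => el.cls === "email", function () {
  return "EMAIL;TYPE=internet:" + this.href.replace(/mailto:/, "") + "\n";
});

// Narrower rule, applied afterwards: spans read their value instead.
templateSketch(el => el.cls === "email" && el.tag === "span", function () {
  return "EMAIL;TYPE=internet:" + this.value + "\n";
});
```

After both calls, anchor.treewalk() and span.treewalk() each return a well-formed EMAIL line. Swap the two calls round and the general rule would overwrite the span rule, bringing the original error back.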

I’ll leave you with one more demonstration, for the slightly more complex TELephone field. As you can see above, there are lots of “types” for this field (Work, VoiceMail, etc.) and these sit in child elements of the telephone element. So we need to assign overrides to both the telephone element and its children.

Here’s a rule for the TELephone container:

template(".vcard .tel", function() {
  var t = "TEL";
  // Run defaults to get types where appropriate
  t += this.default().replace(/,/, ";") + ":";
  // See if we’ve got a “value” child
  var val = $(this).find(".value");
  return t + (val.length ? val.text() : $(this).text()) + "\n";
});

This method is a bit more complex because we need default() to just get the .type children, and then we reach down to get the .value child ourselves. Maybe if we could give a specifier argument to the default behaviour, e.g. default('.type') first, then default('.value')… But that’s a project for another day, I think. Right now, let’s assign a rule to the type children and then run our treewalker:

template(".vcard .tel .type", function() {
  var jQ = $(this);
  return "," + (jQ.attr("title") ? jQ.attr("title") : jQ.text());
});
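To see how the container rule and the type rule fit together, here is the string handling on its own, with no DOM involved. The function name telLine and the type names in the usage note are assumptions for illustration; the real rules read the types from title attributes or element text.

```javascript
// types: the strings each ".type" child would contribute.
// value: the text of the ".value" child, or of the whole .tel element.
function telLine(types, value) {
  // Each type rule returns "," + type, so the default walk yields e.g. ",work,voice".
  const fromChildren = types.map(t => "," + t).join("");
  // Only the first comma becomes ";", separating TEL from its list of types.
  return "TEL" + fromChildren.replace(/,/, ";") + ":" + value + "\n";
}
```

So telLine(["work", "voice"], "+1-919-676-9515") gives "TEL;work,voice:+1-919-676-9515" plus a newline, and a telephone element with no type children falls through to a bare "TEL:" prefix.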

Result? You should now have Javascript which can produce vCards (currently without geographical address support, as I don’t have time and you might get bored) from the hCard microformat. It’s easy to extend, easy to maintain and, in my opinion, fairly concise. Here’s the whole shebang, less the two framework functions from my previous posts:

// Start with body
template("body, body *", treewalker);
template("body, body *", treewalker, "default");
// vCard wrapper
template(".vcard", function() { return "BEGIN:VCARD\n" + this.default() + "END:VCARD\n"; });
// FN and ORG
template(".vcard .fn", function() { return "FN:" + $(this).text() + "\n" + this.default(); });
template(".vcard .org", function() { return "ORG:" + $(this).text() + "\n" + this.default(); });
// Email - A and SPAN tags
template(".vcard .email", function() { return "EMAIL;TYPE=internet:" + this.href.replace(/mailto:/, "") + "\n" + this.default(); });
template(".vcard span.email", function() { return "EMAIL;TYPE=internet:" + $(this).find(".value").text() + "\n" + this.default(); });
// TEL
template(".vcard .tel", function() {
  var t = "TEL";
  // Run defaults to get types where appropriate
  t += this.default().replace(/,/, ";") + ":";
  // See if we’ve got a “value” child
  var val = $(this).find(".value");
  return t + (val.length ? val.text() : $(this).text()) + "\n";
});
// TEL types
template(".vcard .tel .type", function() {
  var jQ = $(this);
  return "," + (jQ.attr("title") ? jQ.attr("title") : jQ.text());
});

And that’s it. I hope the approach comes in useful. By next year, you’ll have hCard-enabled pages, with vCard conversion in the browser. Happy Christmas!

Pretending that Javascript is XSL, part 2: jQuery++

If you’re here, then you probably came from here, and you want to make Javascript more functional. If you didn’t come from there—and this is getting a bit like a Choose-Your-Own-Adventure book, isn’t it?—then you might want to go there first, to see if you want to be here.

So: functional Javascript. Not just functional, but with all the automation of XSL transformations and CSS applications, where you can set it all running and it’ll produce something and hopefully throw no errors. Let’s start with jQuery.

jQuery provides Javascript with a functional framework. Here’s the equivalent of the examples in XSL and CSS, supported by the inclusion of jquery.js:

jQuery("p.intro").each(
  function() { this.style.color = "green"; }
);

I hope the similarities are clear, and not too strained. Now all three languages do implicit looping over sets of element nodes, and no longer require checks for missing elements; that’s evidence that it’s starting to behave functionally. There’s still a few pieces missing, though. We’d like to be able to iterate over the tree with a set of default rules, and also replace the default rules with our own where necessary.

What would the default rule look like? Well, we can pass around all sorts of objects—this being an object-oriented language—but for now let’s play it safe and follow XSL’s lead, and have each node return the concatenated text returned by all its child nodes. That means that, by default, the whole HTML document would return an empty string. It might be nice to return an array of equivalent objects, or even some transformed tree, but let’s remain old-skool. Anyway, we can always serialize any HTML elements we want to include as text, and then stick them back into the DOM later. There’s probably a way of doing some of these tasks with core jQuery, but as we’re also passing result data around as well as input data, I’m going to step outside the framework (its extension model typically takes a jQuery object in, and returns a modified jQuery object, which isn’t quite what we’re after).

Here’s a default rule: it says “call the default rule (i.e. me) on all my children”. We’ll call this rule treewalker, because that’s what it does. We’ll also assume that we’re going to assign this function as the .treewalk method on each element:

var treewalker = function(i) {
  var t = "";
  $(this).children().each(function(i) { t += this.treewalk(); } );
  return t;
}

And here’s a way of assigning rules to elements. It’ll assign the rule as the .treewalk method unless we specify otherwise.

var template = function(specifier, fn, property) {
  if (typeof property == "undefined") property = "treewalk";
  $(specifier).each(function(i) { this[property] = fn; });
}

It looks a bit clunky, because falling back on the default property means we have to have an if-exists check. That’s to be avoided where possible in functional programming, but bear in mind that we’re still looking under the bonnet (or “hood” if you like), not at the actual functional code. We’ll get to the fully-functional bit shortly.

We’ve got one last bit and we’re done. We want to put the default rule on every element within a certain scope: we’ll assume for now that the whole HTML document body is to be treated; that might be computationally heavy for big documents, but we could change that. We’ve already defined a way of putting rules onto things, so let’s use that to put the treewalker function in as both .treewalk and .default. That way, we have a copy of the method hanging around, that we can fall back on if we overwrite it:

template("body, body *", treewalker);
template("body, body *", treewalker, "default");

That’s it. We’re now ready to pretend our Javascript is XSL. Here’s how we run it:

var result = document.body.treewalk();

Try it. “But that’s just an empty string!” you might, if you were feeling ungrateful, complain. Are you never satisfied? More on this later.
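If you’d like to see why the result is empty without opening a browser, here is a DOM-free sketch of the same mechanics. It assumes plain objects in place of elements, and assignAll is a hypothetical stand-in for template("body, body *", ...); neither is the jQuery-backed code above.

```javascript
// Default rule: concatenate whatever each child's treewalk returns.
function treewalker() {
  return (this.children || []).map(child => child.treewalk()).join("");
}

// Assign a rule to a node and all of its descendants.
function assignAll(node, fn, property = "treewalk") {
  node[property] = fn;
  (node.children || []).forEach(child => assignAll(child, fn, property));
}

const body = { tag: "body", children: [
  { tag: "h1", text: "Title", children: [] },
  { tag: "p", text: "Hello", children: [{ tag: "em", text: "world", children: [] }] },
]};

assignAll(body, treewalker);
assignAll(body, treewalker, "default");
const before = body.treewalk(); // leaves have no children, so nothing is ever emitted

// Override one node, still walking into its children through default().
body.children[1].treewalk = function () {
  return "[" + this.text + "]" + this.default();
};
const after = body.treewalk();
```

Here before is the empty string, because every node only ever asks its children and the leaves have nothing to say. Once a single rule is overridden, after becomes "[Hello]": output appears only where a rule chooses to produce some.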

Pretending that Javascript is XSL, part 1: XSL, CSS and JS side by side

There are three main technologies that your browser employs to present HTML for you: XSL, CSS and Javascript. The first two of these are functional: that is, they turn your incoming (X)HTML documents into a set of functions, or behaviours if you like. Because CSS isn’t generally considered a language, let alone a functional one, then it’s worth seeing an example in both languages. Here’s the CSS:

p.intro { color: green; }

And here’s a sort-of XSL equivalent:

<xsl:template match="p[@class='intro']">
  <p color="green"><xsl:apply-templates /></p>
</xsl:template>

They both take place in the context of some generic processor, which rattles through the document executing default rules (XSL: strip out all but text nodes; CSS: apply the plain styles of your browser) unless your program—a list of disconnected rules, really—tells it differently. The combination of (XSL/CSS)+(X)HTML+defaults is thus turned into an explicit script for the browser to run.

So far, so reasonable. But what about the third technology, Javascript? Well, plain Javascript is an object-oriented procedural language. It orders the browser around for a bit, and then when you want to do something to the current page, Javascript manipulates the (X)HTML tree by grasping hold of it with both hands and giving it a tug, using DOM methods like .getElementById(id) and attributes like .parentNode. This procedural approach expects the tree to have a certain structure, or at the very least has to keep checking if the structure has changed and coping with that. This means that the programmer generally has to construct a lot of loops over, say, child elements, and also check for existence a lot. There’s a slight anomaly, in that you could think of the event-driven aspect of Javascript as being functional—it turns the user’s input through clicks, mouse movement and keypresses into browser behaviour, remaining otherwise dormant—but by and large Javascript’s meat is procedural.

There’s two routes you can take at this point. You can either say that, because Javascript is meant to be object-oriented, the best way to work with it is to augment its functionality and simplify object construction, but ultimately leave it as that: if it weren’t object-oriented, then it wouldn’t be Javascript. Or you can say that, given the advantages that XSL and CSS gain by being functional (a kind of “safety”, some scalability, and document-driven processing), Javascript might want to have a piece of that too, while sacrificing some of its object orientation.

The first route is entirely laudable, because some problems are object-shaped and some are function-shaped. But, in the spirit of adventure, let’s investigate the second route for a while: pack some sandwiches and get some stout shoes on, and I’ll meet you in my next blog post.

CFJavaXML - a component for cached XML transformations

Mark Mandel wrote his own version of Coldfusion’s XmlTransform() function, using the underlying Java transform classes. Although one of his annoyances—that XmlTransform() won’t take any parameters—has been solved in CFMX7, XmlTransform() is nonetheless a slow operation, as the transform engine has to be cranked up, the XSL compiled, the transform effected and then everything garbage-collected, each call to the function, each request.

To improve Coldfusion for dedicated XSL programmers, I’ve turned Mark’s one-off function into a more granular component for cacheable, Java-based XSL transformations, called CFJavaXML. You can cache this component from request to request in a persisting scope. You can also compile an XSL transformation once, then store that in a persisting scope too and re-use it without having to keep accessing the XSL file (and compiling it, which can take time). It’ll bring in all its xsl:import references at compilation too, so you needn’t worry about having to keep track of your XSL directory from transformation to transformation.

The component needs no initialization, so create it as follows:

<cfset comp_cfjx = createObject("component", "#PATH_TO_COMPONENTS#.cfjavaxml")>

You can cache this in a persisting scope at this point.

Using CFJavaXML is always a two-stage process, so that you get a compiled transformation object you can store and re-use. In the following example, XSL and XML can be either local file://(/) references, http:// URLs or even strings of valid XML:

<cfscript>
    t = comp_cfjx.XslTextToTransformer( XSL );
    xmltext = comp_cfjx.XmlTransformFromTextAndJava( XML, t [, params]);
</cfscript>

(The transform t is the compiled transform that you can cache from request to request.) Note also where any optional parameters can be inserted, using a params struct.
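The compile-once, reuse-many pattern is not specific to Coldfusion. As a rough language-neutral illustration, here it is sketched in Javascript; compileStylesheet, transformCache and getTransform are hypothetical names, and the "compile" step is a stand-in that only counts its calls rather than doing any real XSL work.

```javascript
// Hypothetical stand-in for an expensive compile step; it just counts its calls.
let compileCount = 0;
function compileStylesheet(xslSource) {
  compileCount += 1;
  // The "compiled transform" here is only a function that tags its input.
  return xml => "<!-- via " + xslSource + " -->" + xml;
}

// Stand-in for a persisting scope: a Map that outlives individual requests.
const transformCache = new Map();

function getTransform(xslSource) {
  if (!transformCache.has(xslSource)) {
    transformCache.set(xslSource, compileStylesheet(xslSource));
  }
  return transformCache.get(xslSource);
}

// Three "requests" using the same stylesheet compile it only once.
const outputs = ["<a/>", "<b/>", "<c/>"].map(xml => getTransform("style.xsl")(xml));
```

After the three requests compileCount is still 1, which is the saving in miniature: the cost of compilation is paid once per stylesheet rather than once per call.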

Depending on the transformation (and how many times you use it) the speed increase of using CFJavaXML has been quite striking: up to ten times with certain transformations. Benchmarking is quite difficult because it depends so much on the intricacy of the XSL: your mileage may vary.

(Thanks to my employers, Torchbox, who’ve given me permission to make this code available, and thanks to Mark for showing how to do it in the first place!)

What's not in a name?

If you’re working with XML, as I currently am, XSLT can sometimes be a godsend. Something that would take ages to do in a structured, procedural way can be reduced to two or three lines of functional XSL code.

So it was with a growing sense of consternation that I noticed that adding XML namespaces to the original document seemed to break XSL’s ability to recognise elements! Consider:

<elem/>

This document can be processed by an out-of-the-box <xsl:template match="elem"/> instruction. However, if we add a default namespace:

<elem xmlns="http://example.com/foobar" />

then suddenly the template can no longer see the “elem” element. What’s gone wrong? If we replace the contents of our @match attribute with “*[name(.) = 'elem']” then the XSL template works as before, so there’s clearly an element there and it’s clearly “called” “elem”. So how do we get the template to match it again?

The solution is: add to the xsl:stylesheet element at the top of the XSL an attribute of @xmlns:[something]="http://example.com/foobar" to match the incoming document.

The confusion (on XSL’s part and mine) arises because of the way that the default namespace is treated. When an XSL stylesheet’s tags are carted off to the xsl: namespace and we define xmlns:xsl, we’re keeping them out of the way of the original document’s own default namespace (the tag name “elem” might be considered to be equivalent to “[blank]:elem”). This means that the document can flow through the XSL parser without having to worry about the XSL tags getting in the way: input and output namespaces can both be undefined.

However, when the incoming document has an @xmlns defining the default namespace for all its tags, then XSL no longer sees this as the “default” namespace, unless you tell it to. If you don’t provide an @xmlns:[something] attribute for XSL, it won’t recognise the tags at all. It can still see them as having a tagname “elem”, but no XPath search will find that tag directly.

Because you don’t necessarily want your output to have @xmlns="http://example.com/foobar" peppered throughout (maybe your output is just plain (X)HTML?), you should tell XSL to treat the incoming default namespace as actually having a prefix, so when it reads “elem” it actually thinks of it as, say, “in:elem”. Then change your references to anything in the incoming document to have “in:” in front of the nodes and it all works:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:in="http://example.com/foobar">

    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="in:elem">
        <found/>
    </xsl:template>
</xsl:stylesheet>
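One way to picture what the processor is doing is to treat every element name as a pair of namespace URI and local name. The sketch below models that comparison in Javascript with plain objects; resolveNameTest and matchesNameTest are hypothetical helpers written for this illustration, not part of any XSL processor’s API.

```javascript
// An element's identity is its expanded name: (namespace URI, local name).
const plainElem = { ns: null, local: "elem" };                             // <elem/>
const namespacedElem = { ns: "http://example.com/foobar", local: "elem" }; // <elem xmlns="http://example.com/foobar"/>

// Prefix bindings declared on the stylesheet (xmlns:in="...").
const stylesheetPrefixes = { "in": "http://example.com/foobar" };

// Resolve a name test such as "elem" or "in:elem" to an expanded name.
// An unprefixed name test gets no namespace at all: no default is consulted.
function resolveNameTest(test, prefixes) {
  const parts = test.split(":");
  if (parts.length === 1) return { ns: null, local: parts[0] };
  return { ns: prefixes[parts[0]], local: parts[1] };
}

// A name test matches only when both halves of the expanded name agree.
function matchesNameTest(test, element, prefixes) {
  const wanted = resolveNameTest(test, prefixes);
  return wanted.ns === element.ns && wanted.local === element.local;
}
```

With these definitions, "elem" matches plainElem but not namespacedElem, while "in:elem" matches namespacedElem only. The prefix itself is arbitrary; what matters is the URI it is bound to.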

Thanks to Dimitre Novatchev via Mark Bosley, for the solution to this particularly knotty puzzle.

This space intentionally left blank

I’ve been asked a couple of times recently, as part of separate projects, to split the results of a SQL query on whitespace within. Simply put, how does one go from:

foo
foo bar
quux
blort wuu spong

to the expanded form:

foo
foo
bar
quux
blort
wuu
spong

efficiently and cleanly, only using SQL? (In case anyone’s worried, I’ve scrubbed the data sets of any personal details they might have previously contained: any resemblance to the real Blort Wuu-Spong is entirely coincidental.)
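Just to pin down the target transformation, here it is outside SQL, in a language that is happy to conjure new items out of old ones. This is a sketch in Javascript of the desired result only, not a suggestion for how to get there in SQL.

```javascript
const rows = ["foo", "foo bar", "quux", "blort wuu spong"];

// Split each row on runs of whitespace and flatten the pieces into one list.
const expanded = rows.flatMap(row => row.trim().split(/\s+/));
```

Four rows go in and seven come out, which is exactly the step that turns out to be awkward when rows are the unit of currency.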

I finally decided it wasn’t possible, and although without the pure mathematics to back me up I could have kept hunting—partial solutions involving a self-join for each whitespace splitting kept rearing their heads—what finally convinced me was comparing the behaviour of SQL with that of XSL(T). The two are more alike than you might think; and no, I don’t mean SQL and XQuery, although that easy comparison provides a clue for the underlying similarity.

In XSL(T), the XML node in your original document(s) is in a sense king: it’s considered bad form (and is at any rate inefficient) to do data management on some transient data set, created within the template. Loops work best over nodesets rather than with some sort of conditional or from/to structure. This stems from XSL(T)’s underlying functional paradigm, where each nodeset is created

Of course, it’s always possible to twist non-functional behaviour out of the stylesheet (and most real-world solutions have to take a pragmatic approach to such programmatic purity) and interpreter-specific kluges exist to node-ize strings based on some non-XML token, but the language works fastest and cleanest when it’s hanging functions off nodes.

In SQL, the equivalent to the node in an XML document is the row in a query. Rows are passed around, compared with other rows based on the content of some of their cells, tied together and discarded, but very rarely can rows be created out of thin air. The closest one gets is the LEFT/RIGHT OUTER JOIN where the ON-condition is not satisfied: then the left-hand row, rather than being discarded as in the INNER JOIN, is in a sense tied to a row of NULLs. Although that equates to it being tied to no row at all, then when the SQL99 dust settles and post-processing can begin, NULLs can be reinterpreted (Coldfusion does this without being asked, for example).

So to create new rows, one can UNION two rowsets, or entangle the rowsets with some sort of a JOIN, but in simplest, non-iterative SQL, there ought to be no easy way to make one row magically split into two, or maybe three, or maybe four, based on its textual content. It breaks the underlying principle, that rows should flow through the SQL into bit-buckets or the STDOUT tray, but shouldn’t be tossed into the stream with flamboyant verve like chillis into a stir-fry.

Exit gracefully: regardless of the data itself, the data model that a given language’s designers had in mind can have the most effect on what’s plausible to do in the language. Almost all languages evolve through proprietary extensions until they can do associative arrays, every kind of loop structure and, if left alone for long enough, GOTOs, but being able to complete a task with a given language is not the same as being able to complete it, for a sufficiently large data set, before the death of your server, your development team or the universe.
