
YO DAWG I HERD YOU LIKE XSL

I spent most of this week writing XSL templates for transforming XML messages into other, similar, XML messages. Ah, bless.

Thanks to the practice of companies (including us!) making minor changes to standard XSDs, several of the input/output format combinations are identical apart from namespaces and a couple of tag names. Copy-pasting a template six times seemed like a bad idea – it contains some non-obvious bits, and the formats are bound to change since the requirements aren’t yet nailed down.

The first thing I did was to split the templates apart and use <xsl:import href="/xsl/lib/foo.xsl" /> to pull them together again. This didn’t work the first time because the XSLT processor looked for foo.xsl on the filesystem rather than in the Java classpath. Solving this was easy enough – I wrote a javax.xml.transform.URIResolver like the one below to look up any URIs that don’t have a “scheme://” on the classpath instead. People can still write “file:///xsl/lib/foo.xsl” if they really do want to reference the filesystem.

public Source resolve(String href, String base) throws TransformerException {
    URI uri = URI.create(href);
    if (uri.getScheme() == null) {
        InputStream is = getClass().getResourceAsStream(href);
        if (is == null) {
            return null;
        }
        return new StreamSource(is);
    } else {
        return parentResolver.resolve(href, base);
    }
}
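For reference, here is a minimal self-contained sketch of how that method might sit inside a complete class. The class name ClasspathURIResolver and the constructor are assumptions added for illustration; only the resolve method comes from the snippet above. The sketch also tolerates a null parent, since TransformerFactory#getURIResolver may return null when nothing has been set.

```java
import java.io.InputStream;
import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper class; the name and constructor are illustrative.
final class ClasspathURIResolver implements URIResolver {
    // Resolver used for hrefs that carry a scheme (file:, http:, ...). May be null.
    private final URIResolver parentResolver;

    ClasspathURIResolver(URIResolver parentResolver) {
        this.parentResolver = parentResolver;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        if (URI.create(href).getScheme() != null) {
            // Returning null tells the processor to fall back to its default resolution.
            return parentResolver == null ? null : parentResolver.resolve(href, base);
        }
        // No scheme: treat the href as a classpath resource name.
        InputStream is = getClass().getResourceAsStream(href);
        return is == null ? null : new StreamSource(is);
    }
}
```

It would typically be installed with something like `factory.setURIResolver(new ClasspathURIResolver(factory.getURIResolver()))` before compiling any templates.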

Ok, so that worked, but I had an uncomfortable feeling that the InputStream would never get closed. It makes sense that it’d be the caller’s responsibility to close it, but there’s no Source#close() method. The javax.xml.transform javadocs were no help, and a web search showed up plenty of people doing the same thing I was, but without any discussion of whether it is safe. Stepping through a debugger told me that at least the XSLT processor implementation we’re using (Saxon) does close the InputStream, and in any case Class#getResourceAsStream is returning a FileInputStream which has a documented finalize method.

It can be a bit dangerous making conclusions about libraries on the basis of runtime debugging information rather than documented behaviour and interfaces – what happens in the future when I’m no longer on the project and someone switches to a newer version of Saxon or to a different XSLT processor? A few weeks ago I wrote a post denying the benefits of unit testing (but promoting automated end-to-end functional tests) for projects like mine, however this case – verifying a technical question – is a perfect candidate for a unit test. In fact I would have been better off creating a unit test in the first place instead of using the debugger.

public void testCallerClosesInputStream() throws Exception {
    // Setup mocks
    IMocksControl mockControl = EasyMock.createControl();

    final boolean[] closed = { false };
    InputStream in = new ByteArrayInputStream(buildXSL("").getBytes()) {
        @Override
        public void close() {
            assertFalse("Already closed", closed[0]);
            closed[0] = true;
        }
    };

    URIResolver uriResolver = mockControl.createMock(URIResolver.class);
    expect(uriResolver.resolve("fake-href.xsl", "")).andReturn(new StreamSource(in));

    mockControl.replay();

    // Execute
    String xsl = buildXSL("<import href=\"fake-href.xsl\"/>");
    String xml = buildXML("<root/>");

    transformerFactory.setURIResolver(uriResolver);

    Templates template = transformerFactory.newTemplates(stringToSource(xsl));
    Transformer transformer = template.newTransformer();
    transformer.transform(stringToSource(xml), new StreamResult(new StringWriter())); 

    // Verify
    mockControl.verify();
    assertTrue("InputStream#close method should have been called", closed[0]);
}

// buildXML and buildXSL are simple helpers that wrap the standard "<?xml ..." and
// "<stylesheet ..." declarations around their arguments.
//
// stringToSource turns a String into a Source via a StreamSource and StringReader.
// new StreamSource(String) is a trap for the uninitiated; the argument is taken as
// an id rather than contents.
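The three helpers described in those comments are not shown, so here is a rough sketch of what they might look like. The class name XslTestHelpers and the exact declarations wrapped around the arguments are assumptions; the point is mainly the stringToSource detail, where the String goes through a StringReader rather than the StreamSource(String) constructor.

```java
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

// Hypothetical holder class for the helpers mentioned in the test comments.
final class XslTestHelpers {
    // Wraps a body in an XML declaration.
    static String buildXML(String body) {
        return "<?xml version=\"1.0\"?>\n" + body;
    }

    // Wraps a body in a stylesheet element whose default namespace is the XSLT
    // namespace, so an unprefixed <import .../> in the body is an XSLT instruction.
    static String buildXSL(String body) {
        return "<?xml version=\"1.0\"?>\n"
            + "<stylesheet version=\"2.0\" xmlns=\"http://www.w3.org/1999/XSL/Transform\">\n"
            + body + "\n"
            + "</stylesheet>";
    }

    // new StreamSource(String) would treat the argument as a system id (a URL),
    // so the content has to be handed over as a Reader instead.
    static Source stringToSource(String content) {
        return new StreamSource(new StringReader(content));
    }
}
```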

Why didn't I use EasyMock for the InputStream? First, java.io.InputStream is an abstract class rather than an interface. That's not a problem; all I need to do is use org.easymock.classextension.EasyMock. Initially I did this, and expect()'ed a single read() method that returned -1. The XSLT processor complained though because an empty document is not a valid XSL. I could have used andStubAnswer instead of andReturn in order to delegate read calls to a ByteArrayInputStream but having to do this for both read() and read(byte[], int, int) methods makes this pretty wordy. Instead I simply create an anonymous inner class that subclasses ByteArrayInputStream.

I should also note that despite my description of this as a perfect example of where a unit test is more suitable than an integration test, what you can't see is that this test lives in the project's engine-integration module rather than in our transformation library, and there's a set-up call to get the Spring bean named "xsltTransformerFactory". I've done this because I want to ensure I'm testing the exact same TransformerFactory implementation that the application is using, rather than just whatever implementation the transformation library's tests happen to use. Or perhaps I just really like integration tests.

Right, so all that helps a bit. Instead of one big template for which I need to vary namespaces I've now got three smaller templates for which I need to vary the namespaces. What to do. I was already dealing with one case around namespaces - many of the tags in input were copied exactly to output (including their sub-elements), but with a different namespace.

<xsl:apply-templates mode="copy-dest-ns" select="SomeElement">
    <xsl:with-param name="copy.dest.ns.namespace" select="$whatever.namespace" />
</xsl:apply-templates>

<xsl:apply-templates mode="copy-dest-ns" select="AnotherElement">
    <xsl:with-param name="copy.dest.ns.namespace" select="$whatever.namespace" />
</xsl:apply-templates>

copy-dest-ns is a template I based off the identity transformation. Writing out the parameter all the time makes the template hard to read, so I use the XSL “tunnel parameter” feature to allow me to simply write:

<xsl:apply-templates mode="copy-dest-ns" select="SomeElement" />
<xsl:apply-templates mode="copy-dest-ns" select="AnotherElement" />

Tunnel parameters are like dynamic scoping in programming languages. The parameter is passed to a higher-up template and is “magically” available to the called template.
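The copy-dest-ns template itself isn't shown, so the following is only a sketch of the general idea, not the template from the project. To keep it runnable on the XSLT 1.0 processor bundled with the JDK it passes the namespace parameter explicitly at every level; with XSLT 2.0 tunnel parameters (tunnel="yes" on both xsl:with-param and xsl:param) the inner xsl:with-param would go away. The class name, the default namespace value and the root template are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: copies a tree into a different namespace.
final class CopyDestNsDemo {
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:param name='whatever.namespace' select=\"'urn:some:namespace1'\"/>"
      + "<xsl:template match='/'>"
      + "  <xsl:apply-templates mode='copy-dest-ns' select='*'>"
      + "    <xsl:with-param name='copy.dest.ns.namespace' select='$whatever.namespace'/>"
      + "  </xsl:apply-templates>"
      + "</xsl:template>"
      // Identity-style template, except that each element is recreated in the
      // destination namespace. Text nodes fall through to the built-in rule.
      + "<xsl:template mode='copy-dest-ns' match='*'>"
      + "  <xsl:param name='copy.dest.ns.namespace'/>"
      + "  <xsl:element name='{local-name()}' namespace='{$copy.dest.ns.namespace}'>"
      + "    <xsl:copy-of select='@*'/>"
      + "    <xsl:apply-templates mode='copy-dest-ns' select='node()'>"
      + "      <xsl:with-param name='copy.dest.ns.namespace' select='$copy.dest.ns.namespace'/>"
      + "    </xsl:apply-templates>"
      + "  </xsl:element>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The serialised output may use generated prefixes rather than a default namespace declaration, depending on the processor, so it is safer to compare results with a namespace-aware parser than as strings.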

So I’ve got things to the point where I’ve effectively got the following two XSLs that I want to turn into one:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:some:namespace1" xpath-default-namespace="urn:some:namespace2">

    ... output and parameter declarations ...

    <xsl:template match="Foo">
        <xsl:apply-templates mode="copy-dest-ns" select="SomeElement" />
        <xsl:apply-templates mode="copy-dest-ns" select="AnotherElement" />
        <More>
            <xsl:value-of select="Moo" />
        </More>
        <SomeStatus>GOOD</SomeStatus>
        ... lots more elements ...
    </xsl:template>
</xsl:stylesheet>

and

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:some:namespace3" xpath-default-namespace="urn:some:namespace4">
    ... rest of the contents are the same ...
</xsl:stylesheet>

What I’d really like to write is xmlns="$some.variable". That’s not going to fly though – XSLs are XML documents and xmlns is part of XML, whereas the dollar-for-variable syntax is part of XSL. It’s at the wrong level of abstraction for me to use an XSL variable, and XML doesn’t have any parameterisation concept. Maybe I can set some namespace substitutions on the XSLT processor? Nope, nothing in there. The XSLT 2.0 specification, while quite readable, didn’t have any clues that I could spot (although <xsl:namespace-alias> was tempting).

It’s about this time that I start thinking that, well, I’ve got two XSLs that are almost the same. XSL documents are XML documents. XSLs are good for transforming between XML documents. Hmmm. Those I’ve worked with in NZ may not be surprised to hear that I started chuckling at this point about the “evil” and the “muhahahahwahah”. Actually I was trying really hard to avoid doing this. Having an XSL that transforms another XSL is cute and is a correct way of solving this problem robustly but I actually do have sympathy for the person who’s going to have to come along and maintain this stuff. Jason, the Python fan who I sat next to for six months, would probably enjoy hearing that over the last year and a half of working in this environment I have come to appreciate explicit-over-implicit and the avoidance of magic. “Discoverability” is a word I like to use.

What else can we do? How about instead of writing the output elements “literally”, we use <xsl:element>, the very point of which is to support dynamic element outputs.

    ...
        <xsl:element name="More" namespace="{$some.variable}">
            <xsl:value-of select="Moo" />
        </xsl:element>
        <xsl:element name="SomeStatus" namespace="{$some.variable}">GOOD</xsl:element>
    ...

That’s not too bad, and it’s fair to say this is the “right” way of solving this problem in XSL. I started converting the template to use this style but it ended up hard to read and awfully verbose. Not very nice at all. What else can we do? I had a chat to Jason, who’d had the same kind of issue on the project he’s working on. He ended up making it so his XSLs didn’t care about namespaces at all, by stripping them out before passing them to the XSL, and adding back a fixed header on output. Unfortunately my output documents contain elements from a mixture of namespaces, so I really need my templates to be namespace aware.

I went back and forth and decided that finally, yes, I would write an XSL that transforms an XSL. Fine. Sigh. I’d have to make something that also somehow transforms any XSLs that are included via <xsl:include> or <xsl:import>. No problem. Recall the URIResolver I wrote previously to load documents from the classpath instead of the filesystem. I can also write a TransformingURIResolver that first delegates to the parent ClasspathURIResolver to get the StreamSource then applies a transformation and returns the result. I implemented this, no problem, and thanks to the fact that we are “compiling” the templates with TransformerFactory#newTemplates, this transformation stage happens entirely at startup. By the time we get around to busily transforming lots of large XML files we’ve got the resultant XSLs all ready in memory in an efficient form.
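As a rough sketch of that arrangement (not the project's actual class): the resolver below asks its parent for the Source, runs it through a pre-compiled Templates object, and hands the processor the rewritten stylesheet. The constructor shape and the choice to buffer the result in a String are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of a resolver that rewrites every stylesheet it resolves.
final class TransformingURIResolver implements URIResolver {
    private final URIResolver parent;      // e.g. the classpath resolver
    private final Templates metaTemplate;  // the XSL-to-XSL stylesheet, compiled once

    TransformingURIResolver(URIResolver parent, Templates metaTemplate) {
        this.parent = parent;
        this.metaTemplate = metaTemplate;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        Source original = parent.resolve(href, base);
        if (original == null) {
            return null; // let the processor fall back to its default behaviour
        }
        // Templates is thread-safe; a Transformer is not, so create one per call.
        StringWriter rewritten = new StringWriter();
        metaTemplate.newTransformer().transform(original, new StreamResult(rewritten));
        // Keep the original system id so nested imports still see the same base.
        return new StreamSource(new StringReader(rewritten.toString()), original.getSystemId());
    }
}
```

Because the processor calls the resolver while compiling, this work happens once per imported file at newTemplates time rather than once per transformed message.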

The next step was to write the XSL to XSL XSL. I had fun naming the file, and especially the XSL inputs to the template which I called “xxx001-to-yyyyyy-xxx002-header-meta-template-input.xsl” (xxx and yyy added to anonymise things a bit). Deliciously wicked.

So how do we change the value of “xmlns”? I wrote something like the following:

    <xsl:template match="xsl:stylesheet">
        ...
    </xsl:template>

    <xsl:template match="@xmlns">
        <xsl:attribute> ... </xsl:attribute>
    </xsl:template>

It didn’t work. A quick test showed that if I renamed “xmlns” to “xmlns2” then everything worked fine. Aha! A special-case bug in the XSLT processor. Oh really? Nope. The problem is that “xmlns” and “xmlns:foo” etc. aren’t actually attributes, although they look like them. They’re namespace declarations. “xmlns” is an XML thing, not an XSL thing. It helps if you imagine there’s special syntax for namespaces instead of having just a naming convention to differentiate them from attributes.

<?xml version="1.0"?>
<xsl:stylesheet &#91;xmlns="..."&#93; &#91;xmlns:foo="..."&#93; xpath-default-namespace="...">
    ...
</xsl:stylesheet>

From this we can see that we could match “@xpath-default-namespace” but not “@xmlns”. How do we define namespaces on an element? <xsl:element> has a namespace attribute so that’ll do the trick. We can also use <xsl:namespace> to define “xmlns:foo” and so on.

    <xsl:template match="xsl:stylesheet">
        <xsl:element name="xsl:stylesheet" namespace="{$some.variable}">
            <xsl:apply-templates mode="copy-attributes" select="@*" />
        </xsl:element>
    </xsl:template>

    <xsl:template mode="copy-attributes" match="@*">
        <xsl:copy />
    </xsl:template>
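The attribute-versus-namespace-declaration distinction described above can be checked outside XSL as well. This sketch uses the XPath 1.0 engine that ships with the JDK rather than Saxon (an assumption, so that it runs without extra libraries), and the class and method names are made up. In the XPath data model the attribute axis excludes namespace declarations, so only xpath-default-namespace should be counted by @*.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper that evaluates a numeric XPath expression against an XML string.
final class XmlnsIsNotAnAttribute {
    static final String SAMPLE =
        "<root xmlns='urn:a' xmlns:foo='urn:b' xpath-default-namespace='urn:c'/>";

    static int count(String xpath, String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // namespaces must be processed, not treated as plain text
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Double n = (Double) XPathFactory.newInstance().newXPath()
            .evaluate(xpath, doc, XPathConstants.NUMBER);
        return n.intValue();
    }
}
```

For example, `count("count(/*/@*)", SAMPLE)` looks at the attribute axis of the root element, while `count("count(/*/@xmlns)", SAMPLE)` asks for an attribute that, as far as XPath is concerned, does not exist.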

Having finally decided to bite the bullet and solve this problem as an XSL, despite my misgivings about “magic”, by the time I’d got the template to a state where it was more or less complete I ended up thinking that it was so ugly and hard to follow that I wasn’t really willing to live with it. XSL documents are quite readable when your templates resemble the structure of your output document, but they get quite messy at other times.

Damnit. So I went back and did what I should have done in the first place – working out exactly what transformations I’d need between the different templates. Do I need to rename any elements, or is it just the namespace declarations on the root stylesheet element? With that information in hand it was clear that the XSL solution, while robust, was way more powerful than I needed. I started to ponder that a simple solution – straightforward text replacement – might actually be the way to go.

The problem with this is that this is (again) working at the wrong abstraction level. XML is a text format, yes, but it’s a structured text format. If I replace all occurrences of “urn:foo:bar” with “urn:baz:quux” then what happens if the text just happens to appear somewhere else in the document outside the namespace declaration I’m intending to match? The namespace names are pretty distinctive, so this isn’t too likely, although $copy.dest.ns.namespace parameter values are good candidates for accidental replacement.

Programming taste is often a matter of balancing good and evil. The evil of treating XML as unstructured text, the good of a simple understandable solution. The evil of a complicated XSL-transforming XSL, the good of a robust, theoretically safe solution. I decided that the former was the lesser evil, but added comments over the XSLs noting that text replacements are applied to this file, so beware. Finally, instead of having the template be correct for one format, and transformed for the others, I decided to make it really explicit, and used xmlns="TEXT_REPLACEMENT_FOO1_NAMESPACE" instead of xmlns="urn:foo:bar". That makes it really obvious to anyone coming along that any modifications to this template will impact multiple formats. (And gives a string they can search for in the source tree to figure out what’s going on.)

For implementing text replacement, I’d like to use a regular expression with “a|b|c|d” matching all the keys I’m going to replace, and use the regexp API that lets you iterate over match information in order to find exactly which key was matched. Java’s regexp API has no Pattern.escape method for this (the equivalent is named Pattern.quote, which is easy to miss). Very annoying. I wrote my own simplistic implementation instead that doesn’t bother to ensure that different keys are replaced “simultaneously” and doesn’t care about performance. Good enough for this situation.
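For comparison, here is a sketch of the regular-expression version described at the start of that paragraph. It is not the simplistic implementation that was actually committed. It relies on Pattern.quote to escape the keys and Matcher.quoteReplacement to escape the values, both of which are in the standard library; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility: replaces all keys in a single pass over the text.
final class PlaceholderReplacer {
    static String replaceAll(String text, Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            return text;
        }
        // Longest keys first, so a key that is a prefix of another never wins.
        List<String> keys = new ArrayList<>(replacements.keySet());
        keys.sort(Comparator.comparingInt(String::length).reversed());

        // Build "a|b|c|d" with every key quoted as a literal.
        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(key));
        }

        Matcher matcher = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // matcher.group() is exactly the key that matched at this position.
            String value = replacements.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Because the input is scanned once, a replacement value is never itself re-scanned, which is what makes the substitutions effectively simultaneous.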

Sorted. Cleaned it up, wrote tests, committed. If I was doing this a lot I’d probably write a robust namespace changing transformer, using a non-XSL mechanism like the DOM API. I’ve spent enough time on this already, so I’m ready to move on.

Update 2009-03-29 13:13 – After writing this post, I realised that it actually wouldn’t be hard to write a generic namespace-transforming XSL, where one of the input documents is an XSL and the other is of the form

<namespaceMappings>
    <mapping>
        <old>urn:foo</old>
        <new>urn:bar</new>
    </mapping>
    <mapping>
        <old>urn:foo2</old>
        <new>urn:bar2</new>
    </mapping>
</namespaceMappings>

One of those cases where writing a generic but limited solution (transforming namespaces only) is simpler than writing a meta-template for a specific case that tries to know too much about the input document. I’ll stick with the text replacement solution for now though.
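Whichever mechanism ends up applying the mappings, the mapping document is simple enough to read with the DOM API. The following is just a sketch of that reading step, assuming the element names from the example above; the class name and the choice of an ordered Map are illustrative.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for a <namespaceMappings> document.
final class NamespaceMappings {
    // Returns old-namespace -> new-namespace pairs in document order.
    static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList mappings = doc.getElementsByTagName("mapping");
        for (int i = 0; i < mappings.getLength(); i++) {
            Element mapping = (Element) mappings.item(i);
            String oldNs = mapping.getElementsByTagName("old").item(0).getTextContent().trim();
            String newNs = mapping.getElementsByTagName("new").item(0).getTextContent().trim();
            result.put(oldNs, newNs);
        }
        return result;
    }
}
```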
