Node and xml-stream

For the last couple of months, I’ve been experimenting more and more with node.js. My impression is that most people who have heard of node but not really used it think of it as a server technology for building web applications using JavaScript. Of course, it does that and there is good support for node.js hosted on Microsoft Azure Web Sites. But you can also use node as a scripting language for local tasks. There are lots of popular scripting languages like Python and Ruby but if you’re already a JavaScript developer then node is a convenient choice.

I guess most scripting environments have a package management framework, and node's is called npm. I recently wanted to do some scripting against an XML file and discovered the xml-stream module by searching on npmjs.org. One of the helpful things about npmjs.org is that it tells you how often a particular package has been downloaded, which gives you an idea of whether it is a mainstream module or just someone's hobby project that might not work so well yet.

Installing xml-stream on Windows

Installing modules is easy and the command npm install xml-stream should take care of installing the module into the node_modules folder below the current directory.

However, when I first tried this, I ran into some problems. First of all, this module needs Python to be available. I installed the latest version of Python (v3.4.0) and tried again. This time it complained because Python 2.7 was needed. I installed Python 2.7.6 too, but now there was a new problem: how would npm know which version of Python to use? You can specify this each time or you can use the npm config command to tell npm where to look:

npm config set python "C:\Python27\python.exe"
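Alternatively, if you only want to override Python for a single install, you can pass it as a flag, which npm forwards to node-gyp (the path shown assumes the default Python 2.7 install location):

```shell
npm install xml-stream --python="C:\Python27\python.exe"
```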

You can also configure the version of Visual Studio tools you have installed so that npm knows how to use the compilers:

npm config set msvs_version 2013

You can check that this is configured correctly by typing:

npm config list

With this configuration in place, issuing the command npm install xml-stream successfully downloaded and built the xml-stream module.

Using xml-stream

Now that I had xml-stream installed, I could try it out. The W3C publishes a list of all of their published documents in an RDF/XML file. I wanted to parse this file and identify the latest version of each document.

The first thing to do is to import the http and xml-stream modules and to download the XML file:

"use strict";

var http = require('http');
var XmlStream = require('xml-stream');
var url = "http://www.w3.org/2002/01/tr-automation/tr.rdf";

var request = http.get(url).on('response', function (response) {
    //TODO: process response here
});

The xml-stream module allows you to set up event listeners for different elements in the document. The W3C file has different elements for Working Draft (WD), Last Call (LastCall), Candidate Recommendation (CR), and so on. Here is the code that listens for each document type.

"use strict";

var http = require('http');
var XmlStream = require('xml-stream');
var url = "http://www.w3.org/2002/01/tr-automation/tr.rdf";

var request = http.get(url).on('response', function (response) {
    // Collection to store documents in
    var documents = {};

    var processDocument = function (item) {
        //TODO: process document
    };

    var xml = new XmlStream(response, 'utf8');

    // Process each type of document
    xml.on('updateElement: WD', processDocument);
    xml.on('updateElement: LastCall', processDocument);
    xml.on('updateElement: CR', processDocument);
    xml.on('updateElement: PR', processDocument);
    xml.on('updateElement: REC', processDocument);
    xml.on('updateElement: NOTE', processDocument);

    xml.on('end', function () {
        // Write out JSON data of documents collection
        console.log(JSON.stringify(documents));
    });
});

Finally, we can add a definition for the processDocument function, which gathers all the documents into the documents collection:

    var processDocument = function (item) {
        // Collect document properties
        var document = {};
        document.type = item.$name;
        document.title = item['dc:title'];
        document.date = item['dc:date'];
        document.verURL = item.$['rdf:about'];
        document.trURL = item['doc:versionOf'].$['rdf:resource'];

        // If we have already seen a version of this document
        if (documents[document.trURL]) {
            // Check to see if this one is newer and if so overwrite it
            var old = documents[document.trURL];
            if (old.date < document.date) {
                documents[document.trURL] = document;
            }
        } else {
            // Store the new entry
            documents[document.trURL] = document;
        }
    };

At the end, the script writes out the JSON data to the console.
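One detail worth noting: the comparison old.date < document.date in processDocument works on plain strings. Assuming the feed's dc:date values are ISO-8601 dates (YYYY-MM-DD), as they were when I looked, lexicographic order matches chronological order, so no date parsing is needed:

```javascript
// ISO-8601 date strings (YYYY-MM-DD) sort lexicographically in
// chronological order, so plain string comparison is enough here.
var older = { date: '2013-11-07' };
var newer = { date: '2014-03-20' };
console.log(older.date < newer.date); // true: the later document wins
```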

Of course, this script is a little fragile because it doesn't map any of the namespace prefixes based on their declarations, but it does the job I needed. It's also a good example of having a powerful JavaScript scripting environment coupled with a wide array of packages to help you get tasks done.
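If you did need to be robust to prefix remapping, one approach is to read the xmlns declarations into a prefix-to-URI map and expand qualified names before using them as keys. Here's a minimal sketch of just the expansion step — the namespaces map and expandName helper are my own illustration, not part of xml-stream, and in practice you would populate the map from the document's actual xmlns attributes:

```javascript
// Example prefix map; in practice, read these from the document's
// xmlns declarations rather than hard-coding them.
var namespaces = {
    'dc': 'http://purl.org/dc/elements/1.1/',
    'doc': 'http://www.w3.org/2000/10/swap/pim/doc#'
};

// Expand a prefixed name like "dc:title" into its namespace-qualified
// form, e.g. "http://purl.org/dc/elements/1.1/title".
function expandName(qname, ns) {
    var parts = qname.split(':');
    if (parts.length === 2 && ns[parts[0]]) {
        return ns[parts[0]] + parts[1];
    }
    return qname; // no prefix, or unknown prefix: leave as-is
}

console.log(expandName('dc:title', namespaces));
```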