Lots of little trees, part 2

I just noticed that the Lawrence Berkeley National Laboratory's Nux library provides streaming XQuery functionality that makes it very easy to do the kind of XML processing that I described in this post last week.

Using Scala, for example, we can start with some imports:

import nu.xom.{ Builder, Element, Nodes }
import nux.xom.xquery.{ StreamingPathFilter, StreamingTransform, XQueryUtil }

Next we write the "transformer" that we want to apply to every record element:

val processor = new StreamingTransform {
  def transform(record: Element) = {
    val id = XQueryUtil.xquery(record, "IndexCatalogueID").get(0)
    val placeResults = XQueryUtil.xquery(record, "//Place")
    val places = (0 until placeResults.size) map placeResults.get

    println(id.getValue + " " + places.map(_.getValue).mkString(", "))
    new Nodes()

We're not really transforming anything here, of course—just performing a side effect as we iterate through the records. We could just as easily be adding some representation of the record to a mutable collection, sending a message to an actor, etc.

Continue reading

Lots of little trees

In my field (computational humanities), people like to distribute databases as enormous XML files. These are often very flat trees, with the root element containing hundreds of thousands (or millions) of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or DOM-like structure.

This is exactly the kind of problem that streaming XML parsers are designed to solve. There are two dominant approaches to parsing XML streams: push-based models (like SAX, the "Simple API for XML"), and pull-based models (like StAX, or—shudderscala.xml.pull). Both of these approaches save memory by producing streams of events (BeginElement, Comment, etc.) instead of reconstructing a tree-based representation of the file in memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem when you have four gigs of memory and your XML files are approaching a gigabyte in size.)

Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events, and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees the events as an iterator or lazy collection that he or she is responsible for iterating through. Newer frameworks that support streaming XML processing tend to provide pull-based APIs, and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).

Continue reading