Lots of little trees, part 2
I just noticed that the Lawrence Berkeley National Laboratory's Nux library provides streaming XQuery functionality that makes it very easy to do the kind of XML processing that I described in this post last week.
Using Scala, for example, we can start with some imports:
import nu.xom.{ Builder, Element, Nodes }
import nux.xom.xquery.{ StreamingPathFilter, StreamingTransform, XQueryUtil }
Next we write the "transformer" that we want to apply to every record element:
val processor = new StreamingTransform {
def transform(record: Element) = {
val id = XQueryUtil.xquery(record, "IndexCatalogueID").get(0)
val placeResults = XQueryUtil.xquery(record, "//Place")
val places = (0 until placeResults.size) map placeResults.get
println(id.getValue + " " + places.map(_.getValue).mkString(", "))
new Nodes()
}
}
We're not really transforming anything here, of course—just performing a side effect as we iterate through the records. We could just as easily be adding some representation of the record to a mutable collection, sending a message to an actor, etc.
Now we create and run our query:
val recordPath = "/IndexCatalogueRecordSet/IndexCatalogueRecord"
val factory = new StreamingPathFilter(recordPath, null).createNodeFactory(null, processor)
new Builder(factory).build(new java.io.File("IndexCatalogueSeries1.xml"))
And we're done. Like my conduit-based implementation, this will iterate through the records in a constant amount of memory. It's less elegant than that solution, but it works, it's easy, and it seems to be significantly faster.