home

Lots of little trees

In my field (computational humanities), people like to distribute databases as enormous XML files. These are often very flat trees, with the root element containing hundreds of thousands (or millions) of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or DOM-like structure.

This is exactly the kind of problem that streaming XML parsers are designed to solve. There are two dominant approaches to parsing XML streams: push-based models (like SAX, the "Simple API for XML"), and pull-based models (like StAX, or—shudderscala.xml.pull). Both of these approaches save memory by producing streams of events (BeginElement, Comment, etc.) instead of reconstructing a tree-based representation of the file in memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem when you have four gigs of memory and your XML files are approaching a gigabyte in size.)

Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events, and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees the events as an iterator or lazy collection that he or she is responsible for iterating through. Newer frameworks that support streaming XML processing tend to provide pull-based APIs, and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).

Continue reading