We've recently started using the W3C's banana-rdf library
at MITH, and it's allowing us to make
a lot of our code for working with RDF graphs
both simpler and less tightly coupled to a specific RDF store. It's a
young library, but also very clever and well-designed, and it does an excellent job of
exploiting advanced features of the Scala language to make its
users' lives easier. Alexandre Bertails
and his collaborators deserve a lot of credit for what they've accomplished
in just a little over a year.
One of the least pleasant aspects of working with any RDF library is writing
bindings for particular vocabularies. For example, if we wanted to use the
Open Archives Initiative's Object Reuse and Exchange vocabulary
in our banana-rdf application, we'd need to write something like the following:
Continue reading
In my field (computational humanities), people like to distribute databases as enormous XML files.
These are often very flat trees, with the root element containing hundreds of thousands (or millions)
of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or
DOM-like structure.
This is exactly the kind of problem that streaming XML parsers are designed to solve.
There are two dominant approaches to parsing XML streams:
push-based models (like SAX, the "Simple API for XML"),
and pull-based models (like StAX, or, shudder, scala.xml.pull).
Both of these approaches save memory by producing streams of events (BeginElement, Comment, etc.)
instead of reconstructing a tree-based representation of the file in
memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem
when you have four gigs of memory and your XML files are approaching a gigabyte in size.)
Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events,
and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees
the events as an iterator or lazy collection that he or she is responsible for iterating through.
Newer frameworks that support streaming XML processing tend to provide pull-based APIs,
and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).
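
To make the contrast concrete, here is a minimal sketch that counts elements with a given name in both styles, using only the SAX and StAX implementations that ship with the JDK. The StreamingCount object, the sample document, and the record element name are made up for illustration, and a real file would be read from an InputStream rather than a String.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

object StreamingCount {
  // Hypothetical sample standing in for a very large, very flat file.
  val sample: String =
    "<records><record id=\"a\"/><record id=\"b\"/><other/></records>"

  // Push (SAX): the parser drives and calls our handler for each event,
  // so any state we want to keep has to live in a mutable variable.
  def countWithSax(xml: String, name: String): Int = {
    var count = 0
    val handler = new DefaultHandler {
      override def startElement(
        uri: String,
        localName: String,
        qName: String,
        attrs: Attributes
      ): Unit = if (qName == name) count += 1
    }
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(xml)), handler)
    count
  }

  // Pull (StAX): we drive, asking the reader for the next event
  // whenever we are ready for it.
  def countWithStax(xml: String, name: String): Int = {
    val reader =
      XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    var count = 0
    try {
      while (reader.hasNext()) {
        val event = reader.next()
        if (event == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName() == name) count += 1
      }
    } finally reader.close()
    count
  }
}
```

Neither version builds a tree, so memory use stays roughly constant no matter how many record elements the file contains. The difference is only in who owns the loop: the parser in the SAX version, our own code in the StAX version.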