Macro-supported DSLs for schema bindings

We've recently started using the W3C's banana-rdf library at MITH, and it's allowing us to make a lot of our code for working with RDF graphs both simpler and less tightly coupled to a specific RDF store. It's a young library, but also very clever and well-designed, and it does an excellent job of exploiting advanced features of the Scala language to make its users' lives easier. Alexandre Bertails and his collaborators deserve a lot of credit for what they've accomplished in just a little over a year.

One of the least pleasant aspects of working with any RDF library is writing bindings for particular vocabularies. For example, if we wanted to use the Open Archives Initiative's Object Reuse and Exchange vocabulary in our banana-rdf application, we'd need to write something like the following:


Lots of little trees

In my field (computational humanities), people like to distribute databases as enormous XML files. These are often very flat trees, with the root element containing hundreds of thousands (or millions) of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or DOM-like structure.

This is exactly the kind of problem that streaming XML parsers are designed to solve. There are two dominant approaches to parsing XML streams: push-based models (like SAX, the "Simple API for XML"), and pull-based models (like StAX, or, shudder, scala.xml.pull). Both of these approaches save memory by producing streams of events (BeginElement, Comment, etc.) instead of reconstructing a tree-based representation of the file in memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem when you have four gigs of memory and your XML files are approaching a gigabyte in size.)

Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events, and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees the events as an iterator or lazy collection that he or she is responsible for iterating through. Newer frameworks that support streaming XML processing tend to provide pull-based APIs, and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).
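To make the contrast concrete, here is a minimal sketch that counts elements both ways, using only the SAX and StAX implementations that ship with the JDK. The object name, the tiny `sample` document, and the `countWithSax` / `countWithStax` helpers are all made up for illustration; a real file would be read from disk rather than from a string.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.SAXParserFactory
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object StreamingSketch {
  // A tiny stand-in (hypothetical) for a very flat file with many record elements.
  val sample: String =
    "<records><record id=\"1\"/><record id=\"2\"/><record id=\"3\"/></records>"

  // A fresh stream for each parse, since a stream can only be consumed once.
  private def input =
    new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))

  // Push (SAX): we hand the parser a callback and the parser drives the loop.
  def countWithSax(elementName: String): Int = {
    var count = 0
    val handler = new DefaultHandler {
      override def startElement(
        uri: String, localName: String, qName: String, attributes: Attributes
      ): Unit = if (qName == elementName) count += 1
    }
    SAXParserFactory.newInstance().newSAXParser().parse(input, handler)
    count
  }

  // Pull (StAX): we drive the loop and ask for the next event ourselves.
  def countWithStax(elementName: String): Int = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(input)
    var count = 0
    try {
      while (reader.hasNext) {
        val event = reader.next()
        if (event == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName == elementName) count += 1
      }
    } finally reader.close()
    count
  }
}
```

Neither version ever holds more than the current event in memory. In the SAX version the state (`count`) has to live outside the callback, which is the imperative flavor described above; in the StAX version the loop is ours, so we could just as easily stop early or wrap the reader in an iterator.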
