home

Yet another iteratee library

I'll start with the story of how I got saved, since it's kind of relevant. Back when I was an English Ph.D. student, I worked on a number of projects that involved natural language processing, which meant doing a lot of counting trigrams or whatever in tens of thousands of text files in giant messy directory trees. I was working primarily in Ruby at the time, after years of Java, and at least back in 2008 it was a pain in the ass to do this kind of thing in either Ruby or Java. You really want a library that provides the following features:

  1. Resource management: you don't want to have to worry about running out of file handles.
  2. Streaming: you shouldn't ever have to have all of the data in memory at once.
  3. Fusion: two successive mapping operations shouldn't need to traverse the data twice.
  4. Graceful error recovery: these tasks are all off-line, but you still don't want to have to restart a computation that's been running for ten minutes just because the formatting in one file is wrong.

Maybe there was such a library for Ruby or Java back then, but if there was I didn't know about it. I did have some experience with Haskell, though, and at some point in 2010 I heard about iteratees, and they were exactly what I'd always wanted. I didn't really understand how they worked at first, but with iteratee (and later John Millikin's enumerator) I was able to write code that did what I wanted and didn't make me think about stuff I didn't want to think about. I started picking Haskell instead of Ruby for new projects, and that's how I accepted statically-typed functional programming into my life.

Continue reading

Foldable considered confusing

Tomasz Nurkiewicz recently published an article arguing that the fold on Option (new in Scala 2.10) is unreadable and inconsistent and should be avoided. I personally disagree about the unreadability part and the should be avoided part, which isn't too surprising, since I generally disagree with Tomasz. I have a lot of respect for him, though, and I can actually get on board with a good chunk of what he says in the article. I can understand why you might want to avoid fold in some projects, and I agree that the way the standard library provides folds is inconsistent and frustrating—I still get a little angry when I think about the fact that Try doesn't have a fold even though it was introduced in the same version of the language that gave us the fold on Option.

So this post isn't about why Tomasz is wrong about readability, etc.—it's about how much I hate the name Foldable.

Continue reading

Lists are even easier

This is a quick follow-up to my post last night about stream processing with iteratees. It's worth pointing out that we can accomplish much the same thing even more concisely using Haskell's lists:

import Data.Char (isSpace)
import Data.List (mapAccumL)
import Data.List.Split (splitWhen)

nextPage (page, _)    = (page + 1, 0)
nextLine (page, line) = (page, line + 1)

locator = snd . mapAccumL go (0, 0)
  where
    go loc c = let nextLoc = advance c loc in (nextLoc, (c, nextLoc))
    advance '\f' = nextPage
    advance '\n' = nextLine
    advance _    = id

tokenizer = map unzip . filter (not . null) . splitWhen (isSpace . fst)

poem =
  "It is the same! - For, be it joy or sorrow,\n" ++
  "The path of its departure still is free:\f" ++
  "Man's yesterday may ne'er be like his morrow;\n" ++
  "Nought may endure but Mutability.\n"

main = mapM_ print $ tokenizer . locator $ poem

(We could make this even more concise by using the standard library's words in our definition of tokenizer, but I'm trying to stick to the same basic structure as the iteratee version.)

Continue reading

Iteratees are easy

This blog post is a short response to my MITH colleague Jim Smith, who several weeks ago published a blog post about a stream processing language that he's developing. His post walks through an example of how this language could allow you to take a stream of characters, add some location metadata to each, and then group them into words, while still holding onto the location metadata about the characters that make up the words.

The process he describes sounds a little like the functionality that iteratees provide, so I decided I'd take a quick stab at writing up an iteratee implementation of his example in Haskell. I'm using John Millikin's enumerator package, since that's the iteratee library that I'm most comfortable with.

Continue reading

Lots of little trees

In my field (computational humanities), people like to distribute databases as enormous XML files. These are often very flat trees, with the root element containing hundreds of thousands (or millions) of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or DOM-like structure.

This is exactly the kind of problem that streaming XML parsers are designed to solve. There are two dominant approaches to parsing XML streams: push-based models (like SAX, the "Simple API for XML"), and pull-based models (like StAX, or—shudderscala.xml.pull). Both of these approaches save memory by producing streams of events (BeginElement, Comment, etc.) instead of reconstructing a tree-based representation of the file in memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem when you have four gigs of memory and your XML files are approaching a gigabyte in size.)

Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events, and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees the events as an iterator or lazy collection that he or she is responsible for iterating through. Newer frameworks that support streaming XML processing tend to provide pull-based APIs, and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).

Continue reading