
Yet another iteratee library

I'll start with the story of how I got saved, since it's kind of relevant. Back when I was an English Ph.D. student, I worked on a number of projects that involved natural language processing, which meant doing a lot of counting trigrams or whatever in tens of thousands of text files in giant messy directory trees. I was working primarily in Ruby at the time, after years of Java, and at least back in 2008 it was a pain in the ass to do this kind of thing in either Ruby or Java. You really want a library that provides the following features:

  1. Resource management: you don't want to have to worry about running out of file handles.
  2. Streaming: you shouldn't ever have to have all of the data in memory at once.
  3. Fusion: two successive mapping operations shouldn't need to traverse the data twice.
  4. Graceful error recovery: these tasks are all off-line, but you still don't want to have to restart a computation that's been running for ten minutes just because the formatting in one file is wrong.

Maybe there was such a library for Ruby or Java back then, but if there was, I didn't know about it. I did have some experience with Haskell, though, and at some point in 2010 I heard about iteratees, and they were exactly what I'd always wanted. I didn't really understand how they worked at first, but with the iteratee package (and later John Millikin's enumerator) I was able to write code that did what I wanted and didn't make me think about stuff I didn't want to think about. I started picking Haskell instead of Ruby for new projects, and that's how I accepted statically-typed functional programming into my life.
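For anyone who hasn't run into iteratees, here's a rough sketch of the core idea using nothing but base. This is not the API of iteratee or enumerator: the names `Iteratee`, `feed`, `run`, `mapInput`, `count`, and `sumI` are all made up for illustration. It only covers streaming and fusion from the list above; resource management and error recovery are left out.

```haskell
-- A stream consumer: either finished with a result,
-- or waiting for the next input (Nothing signals end of input).
data Iteratee a b
  = Done b
  | Continue (Maybe a -> Iteratee a b)

-- Push a chunk of input into an iteratee, one element at a time.
-- The result can be fed again, so the data never has to be in memory at once.
feed :: [a] -> Iteratee a b -> Iteratee a b
feed _      (Done b)     = Done b
feed []     it           = it
feed (x:xs) (Continue k) = feed xs (k (Just x))

-- Signal end of input and extract the result, if the iteratee produces one.
run :: Iteratee a b -> Maybe b
run (Done b)     = Just b
run (Continue k) = case k Nothing of
  Done b     -> Just b
  Continue _ -> Nothing -- the iteratee ignored end of input

-- Count the elements seen so far.
count :: Iteratee a Int
count = go 0
  where
    go n = Continue step
      where
        step Nothing  = Done n
        step (Just _) = go $! n + 1

-- Sum the elements seen so far.
sumI :: Iteratee Int Int
sumI = go 0
  where
    go acc = Continue step
      where
        step Nothing  = Done acc
        step (Just x) = go $! acc + x

-- Transform each input before the inner iteratee sees it.
-- Stacking two of these still touches each element only once.
mapInput :: (a -> b) -> Iteratee b r -> Iteratee a r
mapInput _ (Done r)     = Done r
mapInput f (Continue k) = Continue (mapInput f . k . fmap f)

main :: IO ()
main = do
  -- Two separate chunks, one running count.
  print (run (feed [4, 5] (feed [1, 2, 3 :: Int] count)))
  -- Double, then add one, then sum, in a single pass.
  print (run (feed [1, 2, 3] (mapInput (* 2) (mapInput (+ 1) sumI))))
```

The first line prints `Just 5` and the second prints `Just 15`. The real libraries add a monad for effects, chunked input, and error states on top of this shape.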


Iteratees are easy

This blog post is a short response to my MITH colleague Jim Smith, who several weeks ago published a blog post about a stream processing language that he's developing. His post walks through an example of how this language could allow you to take a stream of characters, add some location metadata to each, and then group them into words, while still holding onto the location metadata about the characters that make up the words.

The process he describes sounds a little like the functionality that iteratees provide, so I decided I'd take a quick stab at writing up an iteratee implementation of his example in Haskell. I'm using John Millikin's enumerator package, since that's the iteratee library that I'm most comfortable with.
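To make the shape of that example concrete, here's a small sketch of the pipeline using plain lazy lists rather than iteratees. It is neither Jim's language nor the enumerator version: `Located`, `Token`, `locate`, and `tokens` are hypothetical names, and I'm assuming "location metadata" just means a character offset and that words are separated by whitespace.

```haskell
import Data.Char (isSpace)

-- A character paired with its offset in the input (the location metadata).
data Located = Located { offset :: Int, char :: Char }
  deriving (Eq, Show)

-- A word, plus the located characters it was built from.
data Token = Token { text :: String, parts :: [Located] }
  deriving (Eq, Show)

-- Step 1: attach an offset to every character in the stream.
locate :: String -> [Located]
locate = zipWith Located [0 ..]

-- Step 2: group located characters into whitespace-separated words,
-- keeping the original characters (and so their offsets) with each word.
tokens :: [Located] -> [Token]
tokens ls = case dropWhile (isSpace . char) ls of
  []   -> []
  rest ->
    let (w, rest') = break (isSpace . char) rest
    in Token (map char w) w : tokens rest'

main :: IO ()
main = mapM_ print (tokens (locate "to be"))
```

For the input `"to be"` this yields two tokens, and the second one, `"be"`, still carries offsets 3 and 4 from the original stream. Because both steps are lazy, each word is produced as soon as its trailing whitespace shows up, which is the behavior an iteratee version would make explicit.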
