I'll start with the story of how I got saved, since it's kind of relevant. Back when I was an
English Ph.D. student, I worked on a number of projects that involved natural language
processing, which meant doing a lot of counting trigrams or whatever in tens of thousands of text
files in giant messy directory trees. I was working primarily in Ruby at the time, after years
of Java, and at least back in 2008 it was a pain in the ass to do this kind of thing in
either Ruby or Java. You really want a library that provides the following features:
- Resource management: you don't want to have to worry about running out of file handles.
- Streaming: you shouldn't ever have to have all of the data in memory at once.
- Fusion: two successive mapping operations shouldn't need to traverse the data twice.
- Graceful error recovery: these tasks are all off-line, but you still don't want to have to
restart a computation that's been running for ten minutes just because the formatting in one file
is wrong.
Maybe there was such a library for Ruby or Java back then, but if there was I didn't know about it.
I did have some experience with Haskell, though, and at some point in 2010 I heard about
iteratees, and they were exactly what I'd always wanted. I didn't really
understand how they worked at first, but with iteratee (and later
John Millikin's enumerator) I was able to write code that did what I wanted
and didn't make me think about stuff I didn't want to think about. I started picking Haskell
instead of Ruby for new projects, and that's how I accepted statically-typed functional programming
into my life.
Continue reading
Tomasz Nurkiewicz recently published
an article
arguing that the fold on Option
(new in Scala 2.10) is unreadable and
inconsistent and should be avoided. I personally disagree about the unreadability part
and the should be avoided part, which isn't too surprising, since I
generally disagree with Tomasz.
I have a lot of respect for him, though, and I can actually get on board with a
good chunk of what he says in the article.
I can understand why you might want to avoid fold
in some projects, and I
agree that the way the standard library provides folds is inconsistent and
frustrating—I still get a little angry when I
think about the fact that Try
doesn't have a fold even though it was introduced
in the same version of the language that gave us the fold on Option
.
So this post isn't about why Tomasz is wrong about readability, etc.—it's about how
much I hate the name Foldable
.
Continue reading
This is a quick follow-up to my post last night
about stream processing with iteratees. It's worth pointing out that we can
accomplish much the same thing even more concisely using Haskell's lists:
import Data.Char (isSpace)
import Data.List (mapAccumL)
import Data.List.Split (splitWhen)
nextPage (page, _) = (page + 1, 0)
nextLine (page, line) = (page, line + 1)
locator = snd . mapAccumL go (0, 0)
where
go loc c = let nextLoc = advance c loc in (nextLoc, (c, nextLoc))
advance '\f' = nextPage
advance '\n' = nextLine
advance _ = id
tokenizer = map unzip . filter (not . null) . splitWhen (isSpace . fst)
poem =
"It is the same! - For, be it joy or sorrow,\n" ++
"The path of its departure still is free:\f" ++
"Man's yesterday may ne'er be like his morrow;\n" ++
"Nought may endure but Mutability.\n"
main = mapM_ print $ tokenizer . locator $ poem
(We could make this even more concise by using the standard library's words
in our definition of tokenizer
, but I'm trying to stick to the same basic
structure as the iteratee version.)
Continue reading
This blog post is a short response to my MITH colleague
Jim Smith, who several weeks ago published
a blog post
about a stream processing language that he's developing.
His post walks through an example of how this language
could allow you to take a stream of characters,
add some location metadata to each, and then group them into words, while still
holding onto the location metadata about the characters that make up the words.
The process he describes sounds a little like the functionality that iteratees provide,
so I decided I'd take a quick stab at writing up an
iteratee implementation of his example in Haskell.
I'm using John Millikin's
enumerator package,
since that's the iteratee library that I'm most comfortable with.
Continue reading
In my field (computational humanities), people like to distribute databases as enormous XML files.
These are often very flat trees, with the root element containing hundreds of thousands (or millions)
of record elements, and they can easily be too big to be parsed into memory as a DOM (Document Object Model) or
DOM-like structure.
This is exactly the kind of problem that streaming XML parsers are designed to solve.
There are two dominant approaches to parsing XML streams:
push-based models (like SAX, the "Simple API for XML"),
and pull-based models (like StAX, or—shudder—scala.xml.pull
).
Both of these approaches save memory by producing streams of events (BeginElement
, Comment
, etc.)
instead of reconstructing a tree-based representation of the file in
memory. (Such a representation can be 5-10 times the size of the file on disk, which quickly becomes a problem
when you have four gigs of memory and your XML files are approaching a gigabyte in size.)
Push-based APIs like SAX are inherently imperative: we register callbacks with the parser that specify how to handle events,
and then it calls them as it parses the XML file. With a pull parser, on the other hand, the programmer sees
the events as an iterator or lazy collection that he or she is responsible for iterating through.
Newer frameworks that support streaming XML processing tend to provide pull-based APIs,
and many developers find pull parsing more intuitive than SAX (or at least slightly less miserable).
Continue reading