JSON numbers in circe 0.3.0
I'm publishing this article as a blog post (rather than as part of the circe documentation) because it's intended to be a discursive overview of the representation of JSON numbers in circe 0.3.0, not an up-to-date description of the implementation in whatever the current circe version happens to be when you're reading this. For information about JSON numbers in circe versions after 0.3.0, please see the project documentation (in particular the changelogs and API docs).
The syntax of JSON numbers
JSON numbers have a fairly straightforward grammar—you've got an optional sign, an
integral part, an optional decimal part, and an optional exponent (with 10 as the base). There are
some details about where e.g. + and 0 are allowed, but these are mostly uninteresting. Given a
representation (for example a 4-tuple of strings), writing a complete, correct, and reasonably
efficient parser for JSON number expressions is unlikely to take an experienced programmer more than
half an hour or so. In short: this part is boring and well-defined by the spec.
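To make that claim a little more concrete, here's a minimal sketch of the parsing step using the 4-tuple-of-strings representation. The names JsonNumberSyntax, JsonNumberParts, and parseJsonNumber are made up for illustration and aren't part of circe. The regular expression is one reading of the number grammar in RFC 7159, and a regular expression is unlikely to be the most efficient approach for enormous inputs.

```scala
object JsonNumberSyntax {
  // Hypothetical representation: the four parts of a JSON number, as strings.
  final case class JsonNumberParts(
    sign: String,     // "" or "-" (a leading "+" is not allowed)
    integral: String, // "0", or a nonzero digit followed by more digits
    fraction: String, // digits after the decimal point, "" if absent
    exponent: String  // optionally signed exponent digits, "" if absent
  )

  // sign, integral part, optional decimal part, optional exponent.
  private val Pattern =
    """(-?)(0|[1-9][0-9]*)(?:\.([0-9]+))?(?:[eE]([+-]?[0-9]+))?""".r

  def parseJsonNumber(input: String): Option[JsonNumberParts] =
    input match {
      // The pattern must match the whole input; groups that did not
      // participate in the match come back as null.
      case Pattern(sign, integral, fraction, exponent) =>
        Some(
          JsonNumberParts(
            sign,
            integral,
            Option(fraction).getOrElse(""),
            Option(exponent).getOrElse("")
          )
        )
      case _ => None
    }
}
```

With this sketch, parseJsonNumber("-1.5E+10") yields the parts "-", "1", "5", and "+10", while inputs like +1, 01, and 1. are rejected.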
Representing JSON numbers
Whether we're writing a JSON library or an application that processes JSON, we'll generally want some kind of normalized form for JSON numbers that collapses distinctions between some pairs of valid JSON number strings. For example, it's unlikely that we'll want to preserve a distinction between these two expressions:
1e100
1E100
Or these:
100.0
100.00000000
Or maybe even these:
100
1e2
Other cases are more difficult. What about numbers with and without decimal parts that are made up entirely of zeros?
100
100.00000000
Or signed zeros?
0.0
-0.0
Or numbers that have the same double-precision floating-point representation?
0.00...imagine a few hundred more zeros here...001
0.0
In circe only the last two cases are distinguished (assuming the JSON parser supports these distinctions; Jawn does, and Scala.js supports signed zeros but does lose precision). The following rules summarize the distinctions that circe supports:
- Precision is never lost (assuming the JSON parser doesn't lose it)
- If there's an exponent, the case of the e is irrelevant.
- More generally, whether the same number is written with an exponent or not is irrelevant.
- Negative and positive zero are different.
The general principle is that if there are reasonable use cases for making a distinction, circe
should support it. Signed zeros are useful for some numerical applications, so we preserve the sign.
If someone wants to make a case for distinguishing 100.0 and 100, that could potentially happen
in a future version. Distinguishing 1e2 and 1E2 is probably a nonstarter.
Implementing the JSON number representation
The last section says that circe never loses precision, but clearly there have to be some limits on the size of the numbers we can represent. According to the JSON grammar, JSON numbers can be arbitrarily large—the grammar would happily accept a number with trillions of digits in the exponent. RFC 7159 is a little more grounded:
This specification allows implementations to set limits on the range and precision of numbers accepted.
circe follows Argonaut in aiming to make this limit really, really high—something more or less like "does the expression fit in memory?". In both circe and Argonaut (6.1) we can do something like the following:
import scalaz._, Scalaz._, argonaut.Parse
val \/-(x) = Parse.parse(s"""1e${ "9" * 1000 }""")
val \/-(y) = Parse.parse(s"""10e${ "9" * 999 }8""")
x.nospaces // will be s"""1e${ "9" * 1000 }"""
x === y // will be true
In Argonaut (and circe before 0.3.0) this is accomplished by representing large numbers as a pair of
a BigDecimal and a BigInt exponent, with the BigDecimal either being zero or having a single
decimal digit to the right of the decimal point.
This works great for equality, but unfortunately that's all Argonaut uses this representation for. If we try to do anything with these large number values except print them or compare them, they start to break in different ways:
scala> x.number.map(_.toBigDecimal)
java.lang.NumberFormatException
at java.math.BigDecimal.parseExp(BigDecimal.java:638)
...
scala> x.number.map(_.toLong)
java.lang.NumberFormatException
at java.math.BigDecimal.parseExp(BigDecimal.java:638)
...
(On the current Argonaut head toDouble fails similarly, but this is a regression from 6.1, where
it returns positive infinity.)
Runtime exceptions are one thing, but it gets worse: user input can actually cause a thread to hang pretty much forever:
Parse.parse("1e2147483647").map(_.number.map(_.toBigInt))
This attempts to create a BigInt with Int.MaxValue digits, which takes… a very, very long time.
circe 0.3.0 tries to make this situation a little less horrible by introducing a new big number
type, which I've named BiggerDecimal. This type is a lot like java.math.BigDecimal except that the
scale is a BigInteger instead of an int, and the unscaled value is constrained to have no trailing
zeros (for the sake of making equality easy to determine). It also provides a much more limited set
of operations than BigDecimal, and (most importantly) the operations it does provide are guaranteed
not to have godawful resource requirements.
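As a rough illustration of why the no-trailing-zeros constraint makes equality easy, here's a simplified sketch of such a type. It is not circe's actual BiggerDecimal: the names SketchDecimal and normalized are hypothetical, and the sketch ignores negative zero entirely.

```scala
import java.math.BigInteger

// Hypothetical type: the value represented is unscaled * 10^(-scale).
final case class SketchDecimal(unscaled: BigInteger, scale: BigInteger)

object SketchDecimal {
  // Strip trailing zeros from the unscaled value, adjusting the scale to
  // compensate, so that equal numbers always end up as equal pairs.
  def normalized(unscaled: BigInteger, scale: BigInteger): SketchDecimal =
    if (unscaled.signum() == 0) SketchDecimal(BigInteger.ZERO, BigInteger.ZERO)
    else {
      var u = unscaled
      var s = scale
      var qr = u.divideAndRemainder(BigInteger.TEN)
      // qr(0) is the quotient and qr(1) the remainder of dividing by ten.
      while (qr(1).signum() == 0) {
        u = qr(0)
        s = s.subtract(BigInteger.ONE)
        qr = u.divideAndRemainder(BigInteger.TEN)
      }
      SketchDecimal(u, s)
    }
}
```

With this normalization, 1 with an exponent of a thousand nines and 10 with an exponent one smaller end up as the same pair, so plain structural equality is enough and no arithmetic on huge values is needed.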
All of this means that we can write the following in circe:
import cats.data.Xor, io.circe.jawn.parse
val Xor.Right(Some(x)) = parse(s"""1e${ "9" * 1000 }""").map(_.asNumber)
val Xor.Right(Some(y)) = parse("1e2147483647").map(_.asNumber)
And then:
scala> x.toBigDecimal
res0: Option[BigDecimal] = None
scala> x.toLong
res1: Option[Long] = None
scala> x.toDouble
res2: Double = Infinity
scala> y.toBigDecimal
res3: Option[BigDecimal] = Some(1E+2147483647)
scala> y.toBigInt
res4: Option[BigInt] = None
The conversions of x to BigDecimal and y to BigInt fail (immediately and safely) because
they are determined to be too expensive. You can still round-trip these values back to JSON, compare
them for equality, ask whether they're whole, etc.—you just can't convert them to these types.
The conversion of x to Long fails for a simpler reason: its value is outside the range of the
long integer type. In accordance with the horrible nature of Double, x.toDouble is more lossy
than the other operations: it returns the nearest Double value or one of the infinities if the
value is out of range.
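The kind of guard described here might look something like the following sketch. It assumes a value given as unscaled * 10^(-scale) with no trailing zeros in the unscaled part. The SafeConversions object and the MaxBigIntDigits threshold are assumptions for illustration, not circe's actual code or limit.

```scala
import java.math.{BigDecimal, BigInteger}

object SafeConversions {
  // Hypothetical cap on how many zeros we are willing to append.
  val MaxBigIntDigits: BigInteger = BigInteger.valueOf(1L << 18)

  private val MinInt = BigInteger.valueOf(Int.MinValue.toLong)
  private val MaxInt = BigInteger.valueOf(Int.MaxValue.toLong)

  // java.math.BigDecimal stores its scale as an int, so the conversion is
  // only possible when our scale fits in that range.
  def toBigDecimal(unscaled: BigInteger, scale: BigInteger): Option[BigDecimal] =
    if (scale.compareTo(MinInt) >= 0 && scale.compareTo(MaxInt) <= 0)
      Some(new BigDecimal(unscaled, scale.intValue()))
    else None

  // Fails fast when the value is not whole, or when writing it out would
  // require more than MaxBigIntDigits trailing zeros.
  def toBigInteger(unscaled: BigInteger, scale: BigInteger): Option[BigInteger] =
    if (unscaled.signum() == 0) Some(BigInteger.ZERO)
    else if (scale.signum() > 0) None // no trailing zeros, so not whole
    else if (scale.negate().compareTo(MaxBigIntDigits) > 0) None
    else Some(unscaled.multiply(BigInteger.TEN.pow(scale.negate().intValue())))
}
```

The point of the design is that every check is a comparison on the scale, which is cheap no matter how large the number is. The expensive multiplication only happens once we know the result is small enough.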
Representing JSON numbers: practical considerations
We could simply represent JSON numbers as BiggerDecimal values, but circe uses a slightly more
complex representation for practical reasons. If we're constructing a JSON number value from a
Long or Double, for example, we might as well make it possible to avoid converting those numbers
to BiggerDecimal values. The following is a simplified version of circe's JSON number ADT (note
that these constructors are not part of the public API):
case class JsonDecimal(value: String) extends JsonNumber
case class JsonBiggerDecimal(value: BiggerDecimal) extends JsonNumber
case class JsonBigDecimal(value: BigDecimal) extends JsonNumber
case class JsonLong(value: Long) extends JsonNumber
case class JsonDouble(value: Double) extends JsonNumber
The JsonDecimal constructor is provided for cases where our parser hands us a string that has
already been validated as a JSON number and we want to parse it into a BiggerDecimal lazily. The
final three constructors are provided solely for the sake of efficiency.
Decoding
One of the goals of circe is to make it possible for users to avoid ever interacting with types like
Json or JsonNumber. This means that everything above is (ideally) just a bunch of implementation
details. Typically users will work with JSON numbers by asking for them to be decoded into
meaningful types that have nothing to do with JSON. circe 0.3.0 introduces a few changes in this
respect.
The most important of these changes is a more clear-cut distinction between three groups of numeric
types. The first group is the arbitrary-precision types: BigInt and BigDecimal. circe's decoders
for these types will succeed under two conditions:
- an exact representation is possible (e.g. non-whole JSON numbers will never be decoded into BigInt values); and
- coming up with an exact representation wouldn't consume too many resources.
The second group is the integral types: Byte, Short, Int, and Long. In Argonaut (and in
circe before 0.3.0), the decoders for these types would happily truncate JSON number values:
scala> argonaut.Parse.decodeOption[Int]("0.99999999")
res0: Option[Int] = Some(0)
scala> argonaut.Parse.decodeOption[Short]("32768")
res1: Option[Short] = Some(32767)
This is no longer the case in circe—these decoders now succeed only if an exact representation is possible.
scala> io.circe.jawn.decode[Int]("0.99999999")
res0: cats.data.Xor[io.circe.Error,Int] = Left(io.circe.DecodingFailure: Int)
scala> io.circe.jawn.decode[Short]("32768")
res1: cats.data.Xor[io.circe.Error,Short] = Left(io.circe.DecodingFailure: Short)
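For a sense of what "exact representation only" means in practice, here's a small sketch that uses the exact conversions on java.math.BigDecimal rather than circe's decoders. The ExactIntegral object is hypothetical, and note that the BigDecimal constructor accepts some strings that aren't valid JSON numbers (a leading + for example), so this isn't a substitute for a real parser.

```scala
import scala.util.Try

object ExactIntegral {
  // intValueExact and shortValueExact throw an ArithmeticException when the
  // value has a nonzero fractional part or does not fit in the target type.
  // Malformed input raises a NumberFormatException from the constructor.
  def toIntExact(input: String): Option[Int] =
    Try(new java.math.BigDecimal(input).intValueExact()).toOption

  def toShortExact(input: String): Option[Short] =
    Try(new java.math.BigDecimal(input).shortValueExact()).toOption
}
```

In this sketch both "0.99999999" as an Int and "32768" as a Short come back as None, while "100.0" as an Int succeeds because no information is lost.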
The third group is the floating-point types: Float and Double. These behave the same as they
always have—they will always succeed for any JSON number we can parse, but they do this by
truncating and losing precision:
scala> io.circe.jawn.decode[Double](s"""1e${ "9" * 1000 }""")
res2: cats.data.Xor[io.circe.Error,Double] = Right(Infinity)
scala> io.circe.jawn.decode[Double]("1e-10000000")
res3: cats.data.Xor[io.circe.Error,Double] = Right(0.0)
If you want to recover the old behavior for e.g. Int, it's of course possible to decode your JSON
into a Double and then truncate or round as desired.
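The truncate-or-round step could be as simple as the following sketch, which takes a Double that has already been decoded. LossyInt is a hypothetical helper, not something circe provides. On the JVM, out-of-range values saturate at Int.MinValue or Int.MaxValue rather than failing, which may or may not be what you want.

```scala
object LossyInt {
  // Drops the fractional part, rounding toward zero.
  def truncate(d: Double): Int = d.toInt

  // Rounds to the nearest whole number first (ties go to the even neighbor,
  // which is what math.rint does), then converts.
  def round(d: Double): Int = math.rint(d).toInt
}
```

For 0.99999999 the first method gives 0 (the old behavior) and the second gives 1.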
Conclusion
All of the above adds some complexity to circe's number implementation, but I think the result is a
much more consistent and reliable API. This is brand new work (I merged the BiggerDecimal stuff
yesterday morning and released circe 0.3.0 last night), so there may still be bugs (although the
test coverage is pretty good), and I'm sure there are aspects of the implementation that could be
improved. Any feedback—as a GitHub issue, in chat on Gitter,
complaints directed at me on Twitter, etc.—would be greatly appreciated.