Performance debugging in aeson

Posted on October 8, 2017

Making aeson run fast

Ideally, whether we are using Template Haskell or Generics, we would like automatically derived code to be as fast as code we could have written and optimized manually. To understand what it takes to achieve such a result, I’ve recently started to work on the performance of aeson, a JSON library in Haskell. In this blog post, I walk through the process of finding a simple performance bug.

Benchmarking

To get an idea of where to look, we need some benchmarks. The aeson repository has a few benchmarks for encoders and decoders derived using Template Haskell and Generics. Interestingly, even without any handwritten implementation for reference, this is sufficient to reveal issues: if the two derived implementations perform differently, then there is room for optimization in the slower one.

During development, compiler optimizations are disabled to reduce compile times. For benchmarking, we must remember to enable them again. Using stack, here is a typical-looking sequence of commands to build and run a benchmark:

# Step 0: Check that stack.yaml doesn't disable optimizations,
# or else override its settings with the right flags in the commands below

# Rebuild with optimizations
$ stack build

$ cd benchmarks/

# Install whatever the benchmark needs
$ stack install criterion

$ stack ghc -- AesonCompareAutoInstances -O2

$ ./AesonCompareAutoInstances
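
For reference, a minimal criterion benchmark has the following shape. This is a sketch with a made-up record type P; the actual benchmark files define much larger types and compare TH-derived and Generic-derived encoders side by side:

{-# LANGUAGE DeriveGeneric #-}
import Criterion.Main (bench, defaultMain, nf)
import Data.Aeson (ToJSON, encode)
import GHC.Generics (Generic)

-- A made-up record type standing in for the benchmark's records.
data P = P { px :: Int, py :: Bool } deriving Generic
instance ToJSON P  -- Generic-derived encoder

main :: IO ()
main = defaultMain
  -- nf fully evaluates the resulting lazy ByteString on each run
  [ bench "generic/encode" (nf encode (P 0 True)) ]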

Generics is usually not faster than Template Haskell. Indeed, Template Haskell is a straightforward way of generating arbitrary code, so it should be easier to produce optimized code with it. It is therefore surprising to see numbers that contradict this intuition: the benchmarks show Template Haskell to be slower than Generics at encoding records directly to byte strings.

Let us illustrate what encode does. With the default options, it would convert a record Record { x = 0, y = "lambda", z = True } to the string:

{"x":0,"y":"lambda","z":true}

For a big record of 25 fields in the benchmark above, this takes 3.6 μs for TH, against 3.0 μs for Generics. For a smaller record of 5 fields, the relative difference is even greater: 940 ns vs. 590 ns. (SmallRecord.hs, source)

The measured run times are summarized in the following table, in μs.

fields      5     25
orig. TH    0.94  3.6
generics    0.59  3.0

Analyzing code

We need to take a closer look at the functions thBigRecordEncode and gBigRecordEncode. Instead of staring aimlessly at the library code behind them, we can ask GHC to output the Core terms it optimized.

GHC Core

Core is the intermediate representation on which GHC performs optimizations. It is a minimalistic functional language; the main type representing it has only 10 constructors!

Straight from the source code of GHC:

data Expr b
  = Var   Id
  | Lit   Literal
  | App   (Expr b) (Arg b)
  | Lam   b (Expr b)
  | Let   (Bind b) (Expr b)
  | Case  (Expr b) b Type [Alt b]
  | Cast  (Expr b) Coercion
  | Tick  (Tickish Id) (Expr b)
  | Type  Type
  | Coercion Coercion

For the purpose of understanding performance issues, we can ignore the last four, because they are erased before run time. The remaining six constructors are a familiar bunch common to most functional programming languages: variables, literals, function applications, anonymous functions, local definitions, and pattern matching. So if you can already read Haskell, Core is pretty easy to pick up. We’ll see some output soon enough.
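
For example, with optimizations on, a small definition such as

double :: Int -> Int
double n = n + n

comes out of the simplifier roughly as follows (a hand-written approximation, modulo variable uniques; the exact output varies with the GHC version):

double = \ n_a1 -> case n_a1 of { I# x -> I# (+# x x) }

The case expression forces and unboxes the argument, and +# is primitive addition on unboxed integers.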

Reducing the test case

To make the final output easier to navigate, I simplified the original AesonCompareAutoInstances.hs from the aeson repository to SmallRecord.hs (source), with a single small record type (still named BigRecord).

data BigRecord = BigRecord
  { field01 :: !Int,
    field02 :: !Int,
    field03 :: !Int,
    field04 :: !Int,
    field05 :: !Int } deriving (Show, Eq, Generic)

Here is an example of encoding such a record as JSON:

> let bigRecord = BigRecord 1 2 3 4 5

> ByteString.putStrLn (gBigRecordEncode bigRecord)

{"field01":1,"field02":2,"field03":3,"field04":4,"field05":5}

Dump options

The incantation to obtain optimized Core together with the Template Haskell output is the following:

$ stack ghc -- SmallRecord -O2 -ddump-splices -ddump-simpl   \
                               -dsuppress-all -ddump-to-file

Let us briefly explain the new options.

Template Haskell

-ddump-splices outputs the code fragments generated by Template Haskell. Here, we are trying to figure out why the mkToEncoding splice (at line 61 of SmallRecord.hs) is slow. It corresponds to the following output in SmallRecord.dump-splices:

(...)

  mkToEncoding opts ''BigRecord
======>
  \ value_aeD3
   -> case value_aeD3 of {
    BigRecord arg1_aeD4 arg2_aeD5 arg3_aeD6 arg4_aeD7 arg5_aeD8
     -> wrapObject
      (commaSep
       [((string "field01") >< (colon >< (toEncoding arg1_aeD4))),
        ((string "field02") >< (colon >< (toEncoding arg2_aeD5))),
        ((string "field03") >< (colon >< (toEncoding arg3_aeD6))),
        ((string "field04") >< (colon >< (toEncoding arg4_aeD7))),
        ((string "field05") >< (colon >< (toEncoding arg5_aeD8)))]) }

An object is a comma-separated list of fields (commaSep), surrounded by braces (wrapObject). Each field is represented by its name and contents, separated by a colon. Doesn’t it look fine?

Core dump

-ddump-simpl outputs the optimized (“simplified”) Core; -dsuppress-all hides a lot of type information that is irrelevant to us; -ddump-to-file, as its name indicates, makes GHC write the output to files (SmallRecord.dump-splices, SmallRecord.dump-simpl) instead of flooding standard output.

Excerpts from the Core for thBigRecordEncode follow below; we won’t need to look past the first 12 lines.

Even with -dsuppress-all, the Core output by GHC is quite an eyeful: gBigRecordEncode and thBigRecordEncode take about 500 lines each! We can see that all the low-level details of writing a ByteString got inlined. Indeed, inlining is key to enabling other compiler optimizations, and one of the main ways to improve performance is to ensure that inlining happens.
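
As an aside, when a function that should be inlined is not, a pragma can nudge GHC. A generic illustration with a hypothetical helper (not from aeson):

-- Ask GHC to inline step at every call site.
step :: Int -> Int -> Int
step acc x = acc * 10 + x
{-# INLINE step #-}

-- INLINABLE is a weaker hint: it keeps the definition available in
-- the module's interface so GHC may still inline it across modules.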

Spot the not inlined

Conversely, values that don’t get inlined are a common source of inefficiency. We can just look for those, without paying much attention to what the program is actually doing. User-defined values that aren’t inlined easily stand out if the source code has descriptive names, since Core reuses them (a bit mangled sometimes) and they contrast with the short names that are made up for fresh local variables.

Look, there’s one non-inlined function just at the top of thBigRecordEncode:

thBigRecordEncode
thBigRecordEncode =
  \ x_agtJ ->
    toLazyByteString     -- <- not inlined
      (...
       )

Oh, sorry, that’s a false positive. toLazyByteString is marked NOINLINE in the bytestring library. Let’s trust that it’s there for a good reason. Furthermore, the Generic variant gBigRecordEncode begins identically, so it seems unlikely to be the cause of the performance gap we observed earlier.

Dig just a bit deeper, and here’s another one:

thBigRecordEncode
thBigRecordEncode =
  \ x_agtJ ->
    toLazyByteString
      (case x_agtJ
       of _ { BigRecord dt_dgnD dt1_dgnE dt2_dgnF dt3_dgnG dt4_dgnH ->
       let {
         e_aktT
         e_aktT =
           commaSep_$scommaSep            -- <- not inlined
             (...
              ) ...
            } ...
       })

commaSep_$scommaSep is basically commaSep (the $s prefix marks a specialized copy created by GHC), which appears in the Template Haskell snippet from earlier. It inserts a comma between consecutive elements of a list. Its definition is (source):

commaSep :: [E.Encoding] -> E.Encoding
commaSep [] = E.empty
commaSep [x] = x
commaSep (x : xs) = x E.>< E.comma E.>< commaSep xs

It is recursive, and GHC does not inline recursive functions. In this case, its argument is essentially the list of record fields, which is known statically, so it would be safe to unroll the definition of commaSep here.
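
To make the unrolling concrete, here is the same idea modeled on ByteString Builders rather than aeson's internal Encoding type (an illustration, not the actual patch):

import Data.ByteString.Builder (Builder, charUtf8)

-- Recursive, so GHC will not inline it.
commaSepList :: [Builder] -> Builder
commaSepList []       = mempty
commaSepList [x]      = x
commaSepList (x : xs) = x <> charUtf8 ',' <> commaSepList xs

-- Unrolled for exactly five fields: no recursion, so GHC is free to
-- inline it and optimize it together with its arguments.
commaSep5 :: Builder -> Builder -> Builder -> Builder -> Builder -> Builder
commaSep5 a b c d e =
  a <> charUtf8 ',' <> b <> charUtf8 ',' <> c
    <> charUtf8 ',' <> d <> charUtf8 ',' <> e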

Looking at Core here may seem somewhat overkill, as commaSep is one of only six functions that appear in the Template Haskell splice, so it wouldn’t have taken too long to figure out the problem either way. But reading Core is a reliable method: it could also reveal non-inlining of functions that are not immediately visible in the source code.

Bug fixed

Let us have the Template Haskell code unroll the insertion of commas; the result now looks like this:

(...)

  mkToEncoding opts ''BigRecord
======>
  \ value_aeXm
   -> case value_aeXm of {
    BigRecord arg1_aeXn arg2_aeXo arg3_aeXp arg4_aeXq arg5_aeXr
     -> fromPairs
      ((pair "field01" (toEncoding arg1_aeXn))
       <>
        ((pair "field02" (toEncoding arg2_aeXo))
         <>
          ((pair "field03" (toEncoding arg3_aeXp))
           <>
            ((pair "field04" (toEncoding arg4_aeXq))
             <> (pair "field05" (toEncoding arg5_aeXr)))))) }

The monoid operation (<>) takes care of inserting a comma between its non-empty operands; crucially, unlike commaSep, it is not recursive, so nothing stops GHC from inlining it.
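
Here is a minimal model of such a monoid, in the spirit of aeson's internal Series type, using Builder again (an illustration; aeson's actual definitions differ in detail):

import Data.ByteString.Builder (Builder, charUtf8)

data Series = Empty | Value Builder

-- A comma is inserted only between two non-empty operands.
instance Semigroup Series where
  Empty   <> s       = s
  s       <> Empty   = s
  Value a <> Value b = Value (a <> charUtf8 ',' <> b)

instance Monoid Series where
  mempty = Empty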

With the recursive function out of the way, we get the speed-up we were looking for (times in μs; see the third row):

fields      5     25
orig. TH    0.94  3.6
generics    0.59  3.0
fixed TH    0.61  2.5

For small records, TH now performs as well as Generics; for large records, it performs even better. This is because GHC fails to fully optimize Generics for large types. That will be another story to tell.

Time for a pull request.