Towards monadic bidirectional serialization, a monadic example

Posted on October 17, 2016 — 2000 words (10 minutes)

This is written in Literate Haskell.

{-# LANGUAGE InstanceSigs #-}
{-# LANGUAGE ScopedTypeVariables #-}

module Bidirectional.Serialization.Three where

import Control.Applicative
import Control.Monad (void, replicateM)
import Data.Int
import Data.Binary.Get (Get, getInt32be, getLazyByteString)
import Data.Binary.Put (Put, PutM, putInt32be, putLazyByteString)
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as BS
import Data.Profunctor (Profunctor(..))

-- Intentional synonyms of undefined.

(...) :: omittedForBrevity
(...) = (...)

(???) :: can'tSolveThis
(???) = (???)

A simple (de)serialization problem

A common way to encode messages of variable lengths in binary is to prefix them with their lengths. This is for example how the WeeChat Relay protocol encodes long integers, strings, pointers.

More particularly, strings are prefixed with their length on 4 bytes.

┌────┬────┬────┬────╥────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 05 ║ 68 │ 65 │ 6C │ 6C │ 6F │ ────► "hello"
└────┴────┴────┴────╨────┴────┴────┴────┴────┘
 └─────────────────┘ └──────────────────────┘
       length         'h'  'e'  'l'  'l'  'o'

– String format in the WeeChat Relay protocol.

Naive (de)serializer

A serializer and a deserializer are easily written using the binary library.

type Raw = ByteString

getInt32AsInt64 :: Get Int64
getInt32AsInt64 = fmap fromIntegral getInt32be

putInt32AsInt64 :: Int64 -> Put
putInt32AsInt64 = putInt32be . fromIntegral

getString :: Get Raw
getString =
  getInt32AsInt64 >>=
  getLazyByteString

putString :: Raw -> Put
putString bs =
  putInt32AsInt64 (BS.length bs) >>
  putLazyByteString bs

The binary library provides the following:

runGet :: Get a -> ByteString -> a
runPut :: Put -> ByteString

Pickler Combinators

The code above is quite repetitive due to the dual nature of (de)serialization. Instead, we could have a single type containing both the Get and the Put. This is basically PU as in the Pickler Combinators paper, but it works in contexts Get and PutM instead of using the more naive parser type String -> (a, String).

-- type Put = PutM ()

data GetPut a = GetPut
  { get :: Get a
  , put :: a -> Put
  }

We may wrap the binary primitives for (de)serialization as a separate library:

int32 :: GetPut Int32
int32 = GetPut getInt32be putInt32be

-- Bytestrings of fixed length.
byteString :: Int64 -> GetPut Raw
byteString n = GetPut
  { get = getLazyByteString n
  , put = \bs ->
      if BS.length bs == n
      then putLazyByteString bs
      else fail "Incorrect length."
  }

GetPut supports a couple of operations. We can map over them with a bijection:

mapGetPut :: (a -> b) -> (b -> a) -> GetPut a -> GetPut b
mapGetPut to from gp = GetPut
  { get = to <$> get gp
  , put = put gp . from
  }

We can bind GetPut values in a monadic fashion, by providing an additional mapping from :: b -> a:

bindGetPutWith
  :: (b -> a)
  -> GetPut a -> (a -> GetPut b) -> GetPut b
bindGetPutWith from a_ b_ = GetPut
  { get = get a_ >>= \a -> get (b_ a)
  , put = \b -> let a = from b in
      put a_ a >> put (b_ a) b
  }

Bidirectional (de)serializer

int32AsInt64 :: GetPut Int64
int32AsInt64 = mapGetPut fromIntegral fromIntegral int32

-- A bytestring prefixed by its length on 4 bytes.
string :: GetPut Raw
string =
  bindGetPutWith BS.length
    int32AsInt64
    byteString

Invertible Syntax Descriptions

The paper Invertible Syntax Descriptions describes a typeclass-based interface for (de)serializers in Applicative style. The monadic bindGetPutWith above could be provided as a subclass.

class Syntax f => MonadicSyntax f where
  bindWith :: (b -> a) -> f a -> (a -> f b) -> f b

Common abstractions

We can actually obtain mapGetPut and bindGetPutWith using Profunctor and Monad instances of a more general type.

data GetPut' b a = GetPut'
  { get' :: Get a
  , put' :: b -> PutM a
  }

-- Functor Get, Functor PutM.
instance Functor (GetPut' b) where

  fmap f a = GetPut'
    { get' = fmap f (get' a)
    , put' = fmap f . put' a
    }

-- Applicative Get, Applicative PutM.
instance Applicative (GetPut' b) where

  pure x = GetPut' (pure x) (\_ -> pure x)

  f <*> x = GetPut'
    { get' = get' f <*> get' x
    , put' = liftA2 (<*>) (put' f) (put' x)
    }

-- Monad Get, Monad PutM.
instance Monad (GetPut' b) where

  x >>= f = GetPut'
    { get' = get' x >>= get' . f
    , put' = \b -> put' x b >>= \a -> put' (f a) b
    }

instance Profunctor GetPut' where

  lmap :: (b1 -> b0) -> GetPut' b0 a -> GetPut' b1 a
  lmap f a = GetPut'
    { get' = get' a
    , put' = put' a . f
    }

  rmap = fmap

Indeed, it generalizes GetPut, with some modifications to erase/preserve the value returned by a put.

type GetPut_ a = GetPut' a a

toGetPut :: GetPut_ a -> GetPut a
toGetPut (GetPut' get put) = GetPut get (void . put)

fromGetPut :: GetPut a -> GetPut_ a
fromGetPut (GetPut get put) = GetPut' get (\a -> a <$ put a)


-- Primitives

byteString' :: Int64 -> GetPut_ Raw
byteString' = fromGetPut . byteString

int32' :: GetPut_ Int32
int32' = fromGetPut int32

Profunctor provides this generalization of mapGetPut.

dimap :: Profunctor f
      => (b1 -> b0) -> (a0 -> a1) -> f b0 a0 -> f b1 a1
dimap :: (b -> a) -> (a -> b) -> GetPut_ a -> GetPut_ b
dimap f g = lmap f . rmap g

A more principled (de)serializer

int32AsInt64' :: GetPut_ Int64
int32AsInt64' = dimap fromIntegral fromIntegral int32'

string' :: GetPut_ Raw
string' =
  lmap BS.length int32AsInt64' >>=
  byteString'

-- With do notation
string'_ :: GetPut_ Raw
string'_ = do
  n <- lmap BS.length int32AsInt64'
  byteString' n

It looks a bit underwhelming. The programmer must still provide the same three elements, a (de)serializer for the length (int32AsInt64), a (de)serializer for the rest of the data (byteString), and a mapping from the data back to the length (length).

The difference is that instead of an ad-hoc combinator bindGetPutWith, they now have access to a more familiar interface consisting of Profunctor and Monad instances, with increased flexibility.

A GetPut' can be seen as a Get (from the binary package), with annotations (acting on the first type parameter b) to handle the inverse Put at the same time. Trying to implement as close an interface to Get as possible may help make it simpler to migrate to GetPut': fewer changes are necessary (some of them could even be derived automatically).

Splitting bindGetPutWith as a composition of lmap and (>>=) also makes explicit the fact that the from parameter (in the definition of bindGetPutWith) is only used to (co)map over the put' component. Hopefully, this clarifies the shared structure between the serializer and the deserializer, while isolating the additional mappings used by the latter.

However, the usefulness of Monad and Applicative instances for this (de)serializer type remains rather limited in some respects. Indeed, manipulations on the contravariant type parameter b gets in the way of composing actions using just Monad.

Applicative and monadic combinators

Replicate

Given a GetPut_ a, parse a list of n elements. Let us try to use replicateM for this task:

-- Failed.
replicateGetPut0 :: forall a. Int -> GetPut_ a -> GetPut_ [a]
replicateGetPut0 n a = replicateM n ((???) :: GetPut' [a] a)

There is no way to fill the hole (???) :: GetPut' [a] a correctly (the a parameter has type GetPut' a a). Indeed, its put' component should have type [a] -> PutM a, i.e., it would serialize one element of the list. Replicating such an action would only serialize the same element n times. Sequencing n values of type GetPut' [a] a is also undesirable, because the put component of each one accesses an element the list to be serialized independently of all others, which cumulates to a complexity that is quadratic in n.

We can define a variant of replicateM by explicit recursion or by breaking abstraction using the GetPut' constructor. The definition is in any case quite ad-hoc to our application.

Explicit recursion

-- The implementation turned out to be generalizable.
replicateGetPut :: Int -> GetPut_ a -> GetPut_ [a]
replicateGetPut = replicatePA

replicatePA
  :: (Profunctor f, Applicative (f [b]))
  => Int -> f b a -> f [b] [a]
replicatePA n _ | n <= 0 = pure []
replicatePA n a  =
  (:)
    <$> lmap head a
    <*> lmap tail (replicatePA (n-1) a)

The main body of this function is an applicative definition. The notation used there is in fact quite useful for records; here we can consider a non-empty list as a record too, this should give an idea of how this notation can be used for larger records:

data [] a
  = (:) { head :: a, tail :: [] a }
  | []

Breaking abstraction

-- Also generalizable.
replicateGetPut_ :: Int -> GetPut_ a -> GetPut_ [a]
replicateGetPut_ = replicateGetPut_'

replicateGetPut_' :: Int -> GetPut' b a -> GetPut' [b] [a]
replicateGetPut_' n a = GetPut'
  { get' = replicateM n (get' a)
  , put' = \bs ->
      if length bs == n
      then traverse (put' a) bs
      else fail "Incorrect length."
  }

One benefit of this non-recursive definition might be that inlining it is more likely to trigger optimizations.

Traverse

Similarly, traverse cannot be used alone here.

traverse
  :: (Traversable t, Applicative f)
  => (c -> f a) -> t c -> f (t a)

A variant specific to GetPut needs to be defined.

traverseGetPut :: (c -> GetPut_ a) -> [c] -> GetPut_ [a]
traverseGetPut = traversePA

traversePA
  :: (Profunctor f, Applicative (f [b]))
  => (c -> f b a)
  -> [c]
  -> f [b] [a]
traversePA f [] = pure []
traversePA f (c : cs) =
  (:)
    <$> lmap head (f c)
    <*> lmap tail (traversePA f cs)

traverseGetPut_ :: (c -> GetPut_ a) -> [c] -> GetPut_ [a]
traverseGetPut_ = traverseGetPut_'

traverseGetPut_' :: (c -> GetPut' b a) -> [c] -> GetPut' [b] [a]
traverseGetPut_' f cs = GetPut'
  { get' = traverse (get' . f) cs
  , put' = \bs ->
      if length bs == n
      then traverse (\(c, b) -> put' (f c) b) (zip cs bs)
      else fail "Incorrect length."
  } where
    n = length cs

Pattern matching and the usage of zip prevent us to traverse any Traversable structure in GetPut'.

Open issues

Either, pattern matching, case analysis

-- Alternative Get.
eitherGetPut
  :: GetPut' bl al -> GetPut' br ar
  -> GetPut' (Either bl br) (Either al ar)
eitherGetPut l r = GetPut'
  { get' = Left <$> get' l <|> Right <$> get' r
  , put' = either (fmap Left . put' l) (fmap Right . put' r)
  }

Maybe use prisms.

Nested structures

Consider the concatenation of a bytestring (prefixed by its length as above) and an integer to encode (Raw, Int32), and let us write this monadically. ¹

rawAndInt32 :: GetPut_ (Raw, Int32)
rawAndInt32 = do
  n <- lmap (BS.length . fst) int32AsInt64'
  raw <- lmap fst (byteString' n)
  int <- lmap snd int32'
  return (raw, int)

The issue is that fst is written twice. With more complex accessors, this duplication is inefficient as the same data is accessed twice. A better definition would nest the part corresponding to the Raw component.

rawAndInt32' :: GetPut_ (Raw, Int32)
rawAndInt32' = do
  raw <- lmap fst $ do
    n <- lmap BS.length int32AsInt64'
    byteString' n
  int <- lmap snd int32'
  return (raw, int)

However, if the second component Int32 depended on n (for example, replace int32' with some f n), that transformation would not be possible, as it pulls n into a local scope. Some boilerplate is necessary to reexpose it.

rawAndInt32'' :: GetPut_ (Raw, Int32)
rawAndInt32'' = do
  (n, raw) <- lmap fst $ do
    n <- lmap BS.length int32AsInt64'
    raw <- byteString' n
    return (n, raw)
  int <- lmap snd (f n)
  return (raw, int)
  where
    f n = (...)

I wonder whether a more powerful Monad-like structure could achieve the syntactic simplicity of the first one, with the efficiency of the last one.

A better example would have the second element depend on the first.↩︎