# hs-to-coq and `Data.Sequence`

In the past few weeks I’ve made some improvements to
*hs-to-coq*.
In particular, I wanted to verify the `Data.Sequence`

module from the
*containers* library. I’ve managed to translate most of the module to Coq
so I can start proving stuff.

In this post, I will present some of the changes made in *hs-to-coq* to be able
to translate `Data.Sequence`

.

*hs-to-coq* had already been used to verify `Data.Set`

and `Data.IntSet`

,
and their map analogues, which are the most commonly used modules of the
*containers* library.^{1}
The main feature distinguishing `Data.Sequence`

from those is polymorphic
recursion. There were a couple of smaller issues to solve beyond that, and some
usability improvements made in the process.

## Polymorphic recursion

As its name implies, `Data.Sequence`

offers a data structure to represent
sequences. The type `Seq a`

has a meaning similar to the type of lists `[a]`

,
but `Seq a`

supports faster operations such as indexing and concatenation
(logarithmic time instead of linear time). The implementation is actually in
`Data.Sequence.Internal`

, while `Data.Sequence`

reexports from it.

The type `Seq`

is a thin wrapper around the type `FingerTree`

which is where the fun happens.
`FingerTree`

is what one might call an *irregular recursive type*.
In the type declaration of `FingerTree`

,
the recursive occurrence of the `FingerTree`

type constructor is applied
to an argument which is not the variable which appears in the left-hand side
of the definition. The right-hand side of the type declaration mentions
`FingerTree (Node a)`

, rather than `FingerTree a`

itself:

```
-- An irregular type. (Definitions of Digit and Node omitted.)
data FingerTree a
= EmptyT
| Single a
| Deep Int (Digit a) (FingerTree (Node a)) (Digit a)
newtype Elem a = Elem a
newtype Seq a = Seq (FingerTree (Elem a))
```

*Regular recursive types*^{2} are much more common. For example, the type of lists,
`List a`

below, is indeed defined in terms of the same `List a`

as it appears on the
left-hand side:

```
-- A regular type
data List a = Nil | Cons a (List a)
```

*hs-to-coq* has no trouble translating irregular recursive types such as
`FingerTree`

; do the naive thing and it just works.
Problems start once we look at functions involving them.
For example, consider a naive recursive size function, `sizeFT`

:

```
sizeFT :: FingerTree a -> Int
sizeFT EmptyT = 0
sizeFT (Single _) = 1
sizeFT (Deep _ l m r) = sizeDigit l + sizeFT m + sizeDigit r
-- This is wrong.
```

We want to count the number of `a`

in a given `FingerTree a`

, but the function
above is wrong. In the recursive call, `m`

has type `FingerTree (Node a)`

, so
we are counting the number of `Node a`

in the subtree `m`

, when we should
actually count the number of `a`

in every `Node a`

, and sum them up.
The function above actually counts the sum of all “digits” in a `FingerTree`

,
which isn’t a meaningful quantity when trees are viewed as sequences.

While it may seem roundabout, probably the most straightforward way to fix this
function is to first define `foldMap`

:^{3}

```
foldMapFT :: Monoid m => (a -> m) -> FingerTree a -> m
foldMapFT _ EmptyT = mempty
foldMapFT f (Single x) = f x
foldMapFT f (Deep _ l m r) = foldMap f l <> foldMapFT (foldMap f) m <> foldMap f r
sizeFT :: FingerTree a -> Int
sizeFT = getSum . foldMapFT (\_ -> Sum 1) -- Data.Monoid.Sum
```

What makes `foldMapFT`

unusual (and also `sizeFT`

even though its behavior is
unexpected) is that its recursive occurrence has a different type than its
signature. On the left-hand side, `foldMapFT`

is applied to `f :: a -> m`

;
in its body on the right-hand side, it is applied to `foldMap f :: Node a -> m`

.
This is what it means for `foldMapFT`

to be *polymorphic recursive*: its own
definition relies on the polymorphism of `foldMapFT`

in order to specialize it
to a different type than its type parameter `a`

.

In Haskell, type parameters are often implicit; a lot of details are inferred, so we don’t think about them. In Coq, type parameters are plain function parameters. Whenever we write a lambda, if it is supposed to be polymorphic, it will take one or more extra arguments. And now, because of polymorphic recursion, it matters where type parameters are introduced relative to the fixpoint operator.

```
(* A polymorphic recursive foldMapFT *)
fix foldMapFT (a : Type) (m : Type) (_ : Monoid m) (f : a -> m) (t : FingerTree a) : m :=
...
(* Here, foldMapFT : forall a m `(Monoid m), (a -> m) -> FingerTree a -> m *)
(* A non-polymorphic recursive foldMapFT, won't typecheck *)
fun (a : Type) (m : Type) (_ : Monoid m) =>
fix foldMapFT (f : a -> m) (t : FingerTree a) : m :=
...
(* Here, foldMapFT : (a -> m) -> FingerTree a -> m *)
```

In the body of the first function, `foldMapFT`

is polymorphic.
In the body of the second function, `foldMapFT`

is not polymorphic.

As you might have guessed, *hs-to-coq* picked the wrong version.
I created an edit to make the other choice:

```
polyrec foldMapFT
# Make foldMapFT polymorphic recursive
```

The funny thing is that *hs-to-coq* internally goes out of its way to factor
out the type parameters of recursive definitions, thus preventing polymorphic
recursion.
This new edit simply skips that step. One could consider just removing that
code path, but I didn’t want that change to affect existing code. My gut
feeling is that it might still be useful. It’s unlikely that there is one
single rule that will work for translating all definitions to Coq, so “hey it
works” is good enough for now, and things will improve as more counterexamples
show up.

## Decreasing arguments

In Coq, functions are total. To define a recursive function, one must provide
a *termination annotation* justifying that the function terminates.
There are a couple of variants, but the general idea is that some quantity must
“decrease” at every recursive call (and it cannot decrease indefinitely). The
most basic annotation (`struct`

) names one of the arguments as “the decreasing
argument”.

*hs-to-coq* already allowed more advanced annotations to be specified as edits,
but not this most basic variant—until I implemented it. It can be inferred in
simple situations, but at some point it is still necessary to make it explicit.

When we write a recursive function, we refer to its decreasing argument by its
name, but what really matters is its position in the list of arguments.
For example, here is a recursive function `f`

with two arguments `x`

and `y`

:

`fix f x y {struct y} := ...`

The annotation `{struct y}`

indicates that `y`

, the second argument of `f`

, is
the “decreasing argument”. The function is well-defined only if all occurrences
of `f`

in its body are applied to a second argument which is “smaller” than `y`

in a certain sense.
Otherwise the compiler throws an error.

That the argument is *named* is a problem when it comes to *hs-to-coq*: in
Haskell, some arguments don’t have names because we immediately pattern-match on
them. When translated to Coq, all arguments are given generated names, and they
are renamed/decomposed in the body of every function.

```
-- A recursive function whose second argument is decreasing,
-- [] or (x : xs) depending on the branch, but there is no variable to refer to it.
map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x : xs) = f x : map f xs
```

*hs-to-coq* now allows specifying the decreasing argument by its position in the
Haskell definition, i.e., ignoring type parameters.
To implement that feature, we have to be a little careful since type parameters
in Coq are parameters like any other, so they shift the positions of arguments.
That turned out to be a negligible concern because, in the code of *hs-to-coq*,
type parameters are kept separate from “value” parameters until a very late
phase.

```
termination f {struct 2}
# The second argument of f is decreasing
```

Another potential solution is to fix the name generation to be more predictable.
The arguments of top-level functions are numbered sequentially `arg_1__`

,
`arg_2__`

, etc., which may be fine, but local functions just keep counting from
wherever that left off (going up to `arg_38__`

in one case). Maybe they should
also start counting from 1.

More complex termination annotations than `struct`

involve arbitrary terms
mentioning those variables. For those, there is currently no workaround, one
must use those fragile names to refer to a function’s arguments.

I initially expected that some functions in `Data.Sequence`

would have to be
shown terminating based on the size of a tree as a decreasing measure, which
involves more sophisticated techniques than justifications based on depth.
In fact, only one function needs such sophistication (`thin`

, an
internal function used by `liftA2`

).
As mentioned earlier, the “size” of a `FingerTree`

is actually a little tricky
to formalize, and that makes it even harder to use as part of such
a termination annotation.
Surprisingly, the naive and “wrong” version of `sizeFT`

shown earlier also
works as a simpler decreasing measure for this function.

## Missing pieces, unsupported features

With the above two changes, *hs-to-coq* is now able to process quite
a satisfactory fragment of `Data.Sequence.Internal`

. A few parts are not
handled yet; they require either whole new features or more invasive edits than
I have experience with at the moment.

### Fancy mutual recursion

There remains another issue with the `thin`

function we just mentioned:
it is mutually recursive with another function.
*hs-to-coq* currently does not support the combination of mutually recursive
functions with termination annotations other than the basic one (`struct`

).

### Pattern synonyms

At the very beginning, *hs-to-coq* simply refused to process `Data.Sequence`

because *hs-to-coq* doesn’t handle pattern synonyms.
Now it at least skips pattern synonyms with a warning instead of failing.
One still has to manually add edits to ignore declarations that use pattern
synonyms, since it’s not too easy to tell whether that’s the case without
a more involved analysis than is currently done.

### Partial functions

The remaining bits are partial functions, internally use partial functions,
or are defined by recursion on `Int`

and I haven’t looked into how to do it
yet.

## Soft improvements

Some changes that aren’t strictly necessary to get the job done, but made my life a little easier.

### Order of declarations

In Haskell, declarations can be written in any order (except when Template Haskell is involved) and they can refer to each other just fine.

In Coq, declarations must be ordered because of the restrictions on recursion.
Type classes further complicate this story because of their implicitness:
we cannot know whether an instance is used in an expression without type
checking, and *hs-to-coq* currently stops at renaming.

For now, we have a “best guess” implementation using a “stable topological
sort”, trying to preserve an *a priori* order as much as possible, putting
instances before top-level values, and otherwise ordering value declarations as
they appear in the Haskell source.
Of course that doesn’t always work, so there are edits to create artificial
dependencies between declarations.

It took me a while to notice something wrong with the implementation:
independent definitions were sorted in reverse order, which is the opposite of
what a “stable sort” should do. The sort algorithm itself was fine: the obvious
dependencies were satisfied. And you expect to have things to fix by hand
because of the underspecified nature of the problem at that point. So any
single discrepancy was easily dismissed as “just what the algorithm does”. But
after getting annoyed enough that nothing was where I expected it to be, I went
to investigate. The culprit was GHC^{4}: renaming produces a list of declarations
in reverse order! This is usually not a problem since the order of declarations
should not matter in Haskell^{5}, but in our case we have to sort the declarations
in source order before applying the stable topological sort.
That ensures that the order in our Coq output is similar to the order in the
Haskell input.

### Module aliases

In edits files, identifiers must be fully qualified. This prevents ambiguities since edits don’t belong to any one module.

Module names can get quite long. It was tedious to repeat `Data.Sequence.Internal`

over and over.
There was already an edit to *rename* a module, but that changes the name of
the file itself and affects other modules using that module.
I added a new edit to *abbreviate* a module, without those side effects.
In fact, that edit only affects the edits file it is in. The parser expands
the abbreviation on the fly whenever it encounters an identifier, and after the
parser is done, the abbreviation is completely forgotten.

```
module alias Seq Data.Sequence.Internal
# "Seq" is now an abbreviation of "Data.Sequence.Internal"
```

Ready, Set, Verify!, IFCP 2018.↩︎

I don’t know whether

*irregular*/*regular*is conventional terminology, but my intuition to justify those names is that they generalize regular expressions. A regular recursive type defines a set of trees which can be recognized by a finite state machine (a*tree automaton*; Tree Automata, Techniques and Applications is a comprehensive book on the topic).↩︎Link to source which looks a bit different for performance reasons.↩︎

Tested with GHC 8.4↩︎

And the AST is annotated with source locations so we don’t get lost.↩︎