hs-to-coq and `Data.Sequence`

Posted on November 23, 2020 — 2200 words (11 minutes)

In the past few weeks I’ve made some improvements to hs-to-coq. In particular, I wanted to verify the Data.Sequence module from the containers library. I’ve managed to translate most of the module to Coq so I can start proving stuff.

In this post, I will present some of the changes made in hs-to-coq to be able to translate Data.Sequence.

hs-to-coq had already been used to verify Data.Set and Data.IntSet, and their map analogues, which are the most commonly used modules of the containers library.¹ The main feature distinguishing Data.Sequence from those is polymorphic recursion. There were a couple of smaller issues to solve beyond that, and some usability improvements made in the process.

Polymorphic recursion

As its name implies, Data.Sequence offers a data structure to represent sequences. The type Seq a has a meaning similar to the type of lists [a], but Seq a supports faster operations such as indexing and concatenation (logarithmic time instead of linear time). The implementation is actually in Data.Sequence.Internal, while Data.Sequence reexports from it.

The type Seq is a thin wrapper around the type FingerTree which is where the fun happens. FingerTree is what one might call an irregular recursive type. In the type declaration of FingerTree, the recursive occurrence of the FingerTree type constructor is applied to an argument which is not the variable which appears in the left-hand side of the definition. The right-hand side of the type declaration mentions FingerTree (Node a), rather than FingerTree a itself:

-- An irregular type. (Definitions of Digit and Node omitted.)
data FingerTree a
  = EmptyT
  | Single a
  | Deep Int (Digit a) (FingerTree (Node a)) (Digit a)

newtype Elem a = Elem a
newtype Seq a = Seq (FingerTree (Elem a))

Regular recursive types² are much more common. For example, the type of lists, List a below, is indeed defined in terms of the same List a as it appears on the left-hand side:

-- A regular type
data List a = Nil | Cons a (List a)

hs-to-coq has no trouble translating irregular recursive types such as FingerTree; do the naive thing and it just works. Problems start once we look at functions involving them. For example, consider a naive recursive size function, sizeFT:

sizeFT :: FingerTree a -> Int
sizeFT EmptyT = 0
sizeFT (Single _) = 1
sizeFT (Deep _ l m r) = sizeDigit l + sizeFT m + sizeDigit r
-- This is wrong.

We want to count the number of a in a given FingerTree a, but the function above is wrong. In the recursive call, m has type FingerTree (Node a), so we are counting the number of Node a in the subtree m, when we should actually count the number of a in every Node a, and sum them up. The function above actually counts the sum of all “digits” in a FingerTree, which isn’t a meaningful quantity when trees are viewed as sequences.

While it may seem roundabout, probably the most straightforward way to fix this function is to first define foldMap:³

foldMapFT :: Monoid m => (a -> m) -> FingerTree a -> m
foldMapFT _ EmptyT = mempty
foldMapFT f (Single x) = f x
foldMapFT f (Deep _ l m r) = foldMap f l <> foldMapFT (foldMap f) m <> foldMap f r

sizeFT :: FingerTree a -> Int
sizeFT = getSum . foldMapFT (\_ -> Sum 1)  -- Data.Monoid.Sum

What makes foldMapFT unusual (and also sizeFT even though its behavior is unexpected) is that its recursive occurrence has a different type than its signature. On the left-hand side, foldMapFT is applied to f :: a -> m; in its body on the right-hand side, it is applied to foldMap f :: Node a -> m. This is what it means for foldMapFT to be polymorphic recursive: its own definition relies on the polymorphism of foldMapFT in order to specialize it to a different type than its type parameter a.

In Haskell, type parameters are often implicit; a lot of details are inferred, so we don’t think about them. In Coq, type parameters are plain function parameters. Whenever we write a lambda, if it is supposed to be polymorphic, it will take one or more extra arguments. And now, because of polymorphic recursion, it matters where type parameters are introduced relative to the fixpoint operator.

(* A polymorphic recursive foldMapFT *)
fix foldMapFT (a : Type) (m : Type) (_ : Monoid m) (f : a -> m) (t : FingerTree a) : m :=
  ...
  (* Here, foldMapFT : forall a m `(Monoid m), (a -> m) -> FingerTree a -> m *)

(* A non-polymorphic recursive foldMapFT, won't typecheck *)
fun (a : Type) (m : Type) (_ : Monoid m) =>
  fix foldMapFT (f : a -> m) (t : FingerTree a) : m :=
    ...
    (* Here, foldMapFT : (a -> m) -> FingerTree a -> m *)

In the body of the first function, foldMapFT is polymorphic. In the body of the second function, foldMapFT is not polymorphic.

As you might have guessed, hs-to-coq picked the wrong version. I created an edit to make the other choice:

polyrec foldMapFT
# Make foldMapFT polymorphic recursive

The funny thing is that hs-to-coq internally goes out of its way to factor out the type parameters of recursive definitions, thus preventing polymorphic recursion. This new edit simply skips that step. One could consider just removing that code path, but I didn’t want that change to affect existing code. My gut feeling is that it might still be useful. It’s unlikely that there is one single rule that will work for translating all definitions to Coq, so “hey it works” is good enough for now, and things will improve as more counterexamples show up.

Decreasing arguments

In Coq, functions are total. To define a recursive function, one must provide a termination annotation justifying that the function terminates. There are a couple of variants, but the general idea is that some quantity must “decrease” at every recursive call (and it cannot decrease indefinitely). The most basic annotation (struct) names one of the arguments as “the decreasing argument”.

hs-to-coq already allowed more advanced annotations to be specified as edits, but not this most basic variant—until I implemented it. It can be inferred in simple situations, but at some point it is still necessary to make it explicit.

When we write a recursive function, we refer to its decreasing argument by its name, but what really matters is its position in the list of arguments. For example, here is a recursive function f with two arguments x and y:

fix f x y {struct y} := ...

The annotation {struct y} indicates that y, the second argument of f, is the “decreasing argument”. The function is well-defined only if all occurrences of f in its body are applied to a second argument which is “smaller” than y in a certain sense. Otherwise the compiler throws an error.

That the argument is named is a problem when it comes to hs-to-coq: in Haskell, some arguments don’t have names because we immediately pattern-match on them. When translated to Coq, all arguments are given generated names, and they are renamed/decomposed in the body of every function.

-- A recursive function whose second argument is decreasing,
-- [] or (x : xs) depending on the branch, but there is no variable to refer to it.
map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x : xs) = f x : map f xs

hs-to-coq now allows specifying the decreasing argument by its position in the Haskell definition, i.e., ignoring type parameters. To implement that feature, we have to be a little careful since type parameters in Coq are parameters like any other, so they shift the positions of arguments. That turned out to be a negligible concern because, in the code of hs-to-coq, type parameters are kept separate from “value” parameters until a very late phase.

termination f {struct 2}
# The second argument of f is decreasing

Another potential solution is to fix the name generation to be more predictable. The arguments of top-level functions are numbered sequentially arg_1__, arg_2__, etc., which may be fine, but local functions just keep counting from wherever that left off (going up to arg_38__ in one case). Maybe they should also start counting from 1.

More complex termination annotations than struct involve arbitrary terms mentioning those variables. For those, there is currently no workaround, one must use those fragile names to refer to a function’s arguments.

I initially expected that some functions in Data.Sequence would have to be shown terminating based on the size of a tree as a decreasing measure, which involves more sophisticated techniques than justifications based on depth. In fact, only one function needs such sophistication (thin, an internal function used by liftA2). As mentioned earlier, the “size” of a FingerTree is actually a little tricky to formalize, and that makes it even harder to use as part of such a termination annotation. Surprisingly, the naive and “wrong” version of sizeFT shown earlier also works as a simpler decreasing measure for this function.

Missing pieces, unsupported features

With the above two changes, hs-to-coq is now able to process quite a satisfactory fragment of Data.Sequence.Internal. A few parts are not handled yet; they require either whole new features or more invasive edits than I have experience with at the moment.

Fancy mutual recursion

There remains another issue with the thin function we just mentioned: it is mutually recursive with another function. hs-to-coq currently does not support the combination of mutually recursive functions with termination annotations other than the basic one (struct).

Pattern synonyms

At the very beginning, hs-to-coq simply refused to process Data.Sequence because hs-to-coq doesn’t handle pattern synonyms. Now it at least skips pattern synonyms with a warning instead of failing. One still has to manually add edits to ignore declarations that use pattern synonyms, since it’s not too easy to tell whether that’s the case without a more involved analysis than is currently done.

Partial functions

The remaining bits are partial functions, internally use partial functions, or are defined by recursion on Int and I haven’t looked into how to do it yet.

Soft improvements

Some changes that aren’t strictly necessary to get the job done, but made my life a little easier.

Order of declarations

In Haskell, declarations can be written in any order (except when Template Haskell is involved) and they can refer to each other just fine.

In Coq, declarations must be ordered because of the restrictions on recursion. Type classes further complicate this story because of their implicitness: we cannot know whether an instance is used in an expression without type checking, and hs-to-coq currently stops at renaming.

For now, we have a “best guess” implementation using a “stable topological sort”, trying to preserve an a priori order as much as possible, putting instances before top-level values, and otherwise ordering value declarations as they appear in the Haskell source. Of course that doesn’t always work, so there are edits to create artificial dependencies between declarations.

It took me a while to notice something wrong with the implementation: independent definitions were sorted in reverse order, which is the opposite of what a “stable sort” should do. The sort algorithm itself was fine: the obvious dependencies were satisfied. And you expect to have things to fix by hand because of the underspecified nature of the problem at that point. So any single discrepancy was easily dismissed as “just what the algorithm does”. But after getting annoyed enough that nothing was where I expected it to be, I went to investigate. The culprit was GHC⁴: renaming produces a list of declarations in reverse order! This is usually not a problem since the order of declarations should not matter in Haskell⁵, but in our case we have to sort the declarations in source order before applying the stable topological sort. That ensures that the order in our Coq output is similar to the order in the Haskell input.

Module aliases

In edits files, identifiers must be fully qualified. This prevents ambiguities since edits don’t belong to any one module.

Module names can get quite long. It was tedious to repeat Data.Sequence.Internal over and over. There was already an edit to rename a module, but that changes the name of the file itself and affects other modules using that module. I added a new edit to abbreviate a module, without those side effects. In fact, that edit only affects the edits file it is in. The parser expands the abbreviation on the fly whenever it encounters an identifier, and after the parser is done, the abbreviation is completely forgotten.

module alias Seq Data.Sequence.Internal
# "Seq" is now an abbreviation of "Data.Sequence.Internal"

Ready, Set, Verify!, IFCP 2018.↩︎
I don’t know whether irregular/regular is conventional terminology, but my intuition to justify those names is that they generalize regular expressions. A regular recursive type defines a set of trees which can be recognized by a finite state machine (a tree automaton; Tree Automata, Techniques and Applications is a comprehensive book on the topic).↩︎
Link to source which looks a bit different for performance reasons.↩︎
Tested with GHC 8.4↩︎
And the AST is annotated with source locations so we don’t get lost.↩︎

hs-to-coq and Data.Sequence