Programming Turing machines with regexes

Posted on June 18, 2024 — 4300 words (21 minutes)

Everybody knows that regular expressions are not Turing-complete. That won’t stop me from doing this.

In a Turing machine, there is a tape and there is a program. What is the program in a Turing machine? ^{drumroll 🥁…} It is a finite-state machine, which is equivalent to a regular expression!

Just for a silly pun, I’m going to introduce this programming language in an absurd allegory about T-rexes (“Turing regular expressions”, or just “Turing expressions”).

Tales of T-rexes
The separation of program and tape
Finite-state machines: the next 700 programming languages
Completely free from Turing-completeness
Bonus track: Brainfuck

Tales of T-rexes

T-rexes are mysterious creatures who speak in tales (Turing expressions) about one T-rex—usually the speaker. Tales are constructed from symbols for the operations of a Turing machine, and the standard regex combinators.

> and < are the simple tales of the T-rex taking a single step to the left or right;
0! and 1! tell of the T-rex “writing” a 0 or 1;
0? and 1? tell of the T-rex “observing” a 0 or 1;
e₁ ... eₙ is a tale made up of a sequence of n tales, and there is an empty tale in the case n = 0 (“nothing happened”);
(e)* says that the tale e happened an arbitrary number of times, possibly zero;
(e₁|e₂) says that one of the tales happened, e₁ or e₂.

As it is a foreign language from a fantasy world, the description above shouldn’t be taken too literally. For instance, it can be difficult to imagine a T-rex holding a pen, much less writing with it. In truth, the actions that the tales 0! and 1! describe are varied, and “writing” is only the closest approximation among the crude words of humans.

Iteration (e)* and choice (e₁|e₂) make T-rex tales nondeterministic: different sequences of events may be valid interpretations of the same tale. The observations 0? and 1? enable us to prune the tree of possibilities. T-rex communication may be convoluted sometimes, but at least they mean to convey coherent series of events.

Enough exposition. Let’s meet T-rexes!

Two T-rexes greet us, introducing themselves as Alan and Alonzo. They invite us for a chat in their home in Jurassic park.

The tale of Alonzo

While Alan serves tea, Alonzo shows us a mysterious drawing. T-rex imagery is quite simplistic, owing to their poor vision and clumsy hands. We can sort of recognize Alonzo on the left, next to a row of circles and lines:

🦖
0001011000

Then, Alonzo tells us the following tale: “(>)*1?0!>1?0!”.

Noticing our puzzled faces, Alan fetches a small machine from the garage. It is a machine to interpret T-rex tales in a somewhat visual rendition. Alan demonstrates how to transcribe the tale that Alonzo told us together with his drawing into the machine. Here is the result:

Alan's machine

Program:

(>)*1?0!>1?0!

Input:

0001011000

Output:

0001010000

Press the Run button to see the machine render the tale. It happens in the blink of an eye; T-rexes are really fast! We will break it down step by step in what follows.

(You can also edit the ~~program~~ tale and the ~~input~~ drawing in these examples and see what happens then.)

Alonzo’s drawing is the scene where the tale takes place. We ask Alan and Alonzo what “0” and “1” represent, but we are not fluent enough in T-rex to understand their apparently nuanced answer. We have no other choice than to make abstraction of it.

🦖          ← Alonzo
0001011000  ← the world

Alonzo first said “(>)*”: he walked toward the right. As is customary in T-rex discourse, this tale leaves a lot up to interpretation. There are many possible guesses of how many steps he actually took. He could even have walked out of the picture!

But then the tale goes “1?”: Alonzo observed a 1, whatever that means, right where he stopped. That narrows the possibilities down to three:

Afterwards, Alonzo tells us that “0!” happened, which we think of abstractly as “writing” 0 where Alonzo is standing.

Before "0!"
   🦖
0001011000

After "0!"
   🦖
0000011000

Each of the three possibilities from earlier after writing 0:

Alonzo then made one step to the right, and observed another 1 (“>1?”). Only the second possibility above is consistent with that subsequent observation. Finally, he writes 0 again.

      🦖
0001000000

And we can see that the outcome matches the machine output above.

After puzzling over it for a while, we start to make sense of Alonzo’s tale “(>)*1?0!>1?0!”, and imagine this rough translation: “Funny story. I walked to the right, and I stopped in front of a 1, isn’t that right Alan? I was feeling hungry. So I ate it. I left nothing! I was still hungry, I am a dinosaur after all, so I took a step to the right, only to find another 1. Aren’t I lucky? I ate it too. It was super tasty!”

We have a good laugh. Alan and Alonzo share a few more tales. More tea? Of course. At their insistence, we try telling some of our own tales, to varied success. It’s getting late. Thank you for your hospitality. The end.

More examples

Exercise for the interested reader: implement the following operations using Turing expressions. Here’s a free machine to experiment with:

Alan's empty machine

Program:

Input:

Output:

Preamble: extra features and clarifications

For convenience, other digits can also be used in programs (i.e., 2?, 2!, 3?, 3!, up to 9? and 9! are allowed). The symbol 2 will be used to mark the end of a binary input in the exercises that follow.

Turing expressions are nondeterministic, but the machine only searches for the first valid execution. The search is biased as follows: (e₁|e₂) first tries e₁ and then, if that fails, e₂; (e)* is equivalent to (|e(e)*).¹

The tape extends infinitely to the left and to the right, initialized with zeroes outside of the input. The machine aborts if the tape head (🦖) walks out of the range [-100, 100].² 0 is the initial position of the tape head and where the input starts. The machine prints the first 10 symbols starting at position 0 as the output.

Whitespace is ignored. A # starts a comment up to the end of the line. What kind of person comments a regex?

Determinism

Although Turing expressions are nondeterministic in general, we obtain a deterministic subset of Turing expressions by requiring the branching combinators to be guarded. Don’t allow unrestricted iterations (e)* and choices (e₁|e₂), only use while loops (1?e)*0?³ and conditional statements (0?e₁|1?e₂).

Not

Flip the bits. 0 becomes 1, 1 becomes 0.

Examples:

Input:  0100110112
Output: 1011001002

Input:  1111111112
Output: 0000000002

Solution

Alan's machine

Program:

((0?1!|1?0!)>)*2?

Input:

0100110112

Output:

1011001002

Binary increment

Input: binary representation of a natural number n.
Output: binary representation of (n + 1).

Cheatsheet of binary representations:

Examples:

Input:  0010000000
Output: 1010000000

Input:  1110000000
Output: 0001000000

No input delimiters for this exercise.

Solution

Alan's machine

Program:

(1?0!>)*0?1!

Input:

1110000000

Output:

0001000000

Left shift

Move bits to the left.

Examples:

Input:  0100110112
Output: 1001101102

Input:  1111111112
Output: 1111111102

Solution

Alan's machine

Program:

(2?|(0!>0?|1!>1?)*(0!>2?))

Input:

0100110112

Output:

1001101112

Thanks to Nathanaëlle Courant for this nice solution!

Right shift

Move bits to the right.

Examples:

Input:  0100110112
Output: 0010011012

Input:  1111111112
Output: 0111111112

Solution

Alan's machine

Program:

((0?|1?)(0?>)*(2?|1?0!>(1?>)*(2?|0?1!>)))*2?

Input:

0100110112

Output:

0010011012

Thanks to Nathanaëlle Courant for this nice solution!

Cumulative xor

Each bit of the output is the xor of the input bits to the left of it.

Examples:

Input:  0100100012
Output: 0111000012

Input:  1111111112
Output: 1010101012

Solution

Alan's machine

Program:

((0?>)*(1?>((0?1!>)*1?0!|2?)|2?))*2?

Input:

0100100012

Output:

0111000012

Unary subtraction

Input: Two unary numbers x and y, separated by a single 0.
Output: The difference (x - y).

Example: evaluate (5 - 3).

Input:  1111101110
Output: 1100000000

Feel free to add a 2 to delimit the input. I only decided to allow symbols other than 0 and 1 after finishing this exercise.

Solution

Alan's machine

Program:

(1?>)*0?<(1?>(0?>)*1?0!>(1?<(0?<)*1?0!<|0?(0?<)*1?0!))*0?<(1?<)*0?>

Input:

1111101110

Output:

1100000000

Example: evaluate (3 - 5).

Input:  1110111110
Output: 0000000110

In my solution, the result -2 is represented by two 1 placed in the location of the second argument rather than the first. You may use a different encoding.

Solution (bis)

Alan's machine

Program:

(1?>)*0?<(1?>(0?>)*1?0!>(1?<(0?<)*1?0!<|0?(0?<)*1?0!))*0?<(1?<)*0?>

Input:

1110111110

Output:

0000000110

Commented program

The lack of delimiters in my version of the problem makes this a bit tricky.

        # Example: evaluate (5 - 3).
        # TAPE: 111110111
        # HEAD: ^
(1?>)*0?
        # TAPE: 111110111
        # HEAD:      ^
        #
        # The following line checks whether
        # the second argument is 0,
        # in which case we will skip the loop.
>(0?<|1?<<)
        # Otherwise we move the head on the last 1
        # of the first argument.
        # TAPE: 111110111
        # HEAD:     ^
        #
        # BEGIN LOOP
        # Loop invariant: the difference between
        # the two numbers on tape is constant.
(1?
        # Go to the first 1 of the second argument.
>(0?>)*1?
        # During the first iteration,
        # the tape looks like this:
        # TAPE: 111110111
        # HEAD:       ^
        #
        # Erase 1 to 0 and move to the right.
 0!>
        # TAPE: 111110011
        # HEAD:        ^
        #
        # Check whether there remains
        # at least one 1 to the right.
        # BEGIN IF
 (1?
        # There is at least one 1 on the right.
        # Move back into the first argument.
  <(0?<)*1?
        # TAPE: 111110011
        # HEAD:     ^
        #
        # Erase 1 to 0. Move left.
  0!<
        # TAPE: 111100011
        # HEAD:    ^
        #
        # If the 1s of the first argument ran out
        # at this point (which would mean
        # first argument < second argument),
        # we will BREAK out of the loop (then terminate),
        # otherwise, CONTINUE, back to the top of loop
 |0?
        # ELSE (second branch of the IF from three lines ago)
        # The 1s of the second argument ran out
        # (which means first argument >= second argument)
        # Tape when we reach this point (in the last iteration):
        # TAPE: 1110000000
        # HEAD:          ^
        #
        # Move back into the first argument.
  (0?<)*1?
        # TAPE: 111000000
        # HEAD:   ^
        #
        # Erase 1 to 0.
  0!
        # TAPE: 110000000
        # HEAD:   ^
        #
        # Reading a 0 will break out of the loop.
        # BREAK
 )
        # END IF
)*0?
        # END LOOP

The separation of program and tape

Until this point, there may remain misgivings about whether this is actually “regular expressions”. The syntax is the same, but is it really the same semantics? This section spells out a precise alternative definition of Turing machines with a clear place for the standard semantics of regular expressions (as regular languages).

Instead of applying regular expressions directly to an input string, we are using them to describe interactions between the program and the tape of a Turing machine. Then the regular expression might as well be the program.

The mechanics of Turing machines are defined traditionally via a transition relation between states. A Turing machine state is a pair (q, t) of a program state q ∈ Q (where Q is the set of states of a finite-state machine) and a tape state t ∈ 2^ℤ × ℤ (the bits on the tape and the position of the read-write head).

That “small-step” formalization of Turing machines is too monolithic for our present purpose of revealing the regular languages hidden inside Turing machines. The issue is that the communication between the program and the tape is implicit in the transition between states as program-tape pairs (q, t). We will take a more modular approach using trace semantics: the program and the tape each give rise to traces of interactions which make explicit the communication between those two components.

The standard semantics of regular expressions

The raison d’être of regular expressions is to recognize sequences of symbols, also known as strings, lists, or words. Here, we will refer to them as traces. Regular expressions are conventionally interpreted as sets of traces, reading the iteration * and choice | combinators are operations on sets.

Let A be a set of symbols; in our case A = {<, >, 0!, 1!, 0?, 1?} but the following definition works with any A. The trace semantics of a regular expression e over the alphabet A is defined inductively:

An atomic expression e ∈ A contains a single trace which is just that symbol. Trace(e) = {e} if e ∈ A
A concatenation of expressions e₁ … eₙ contains concatenations of traces of every eᵢ. Trace(e₁ … eₙ) = { t₁ … t_n ∣ ∀i, t_i ∈ Trace(eᵢ) }
An iteration (e)* contains all concatenations of traces t₁ … t_n such that each subtrace t_i is a trace of that same e. Trace((e)*) = { t₁ … t_n ∣ ∀i, t_i ∈ Trace(e) }
A choice (e₁|e₂) contains the union of traces of e₁ and e₂. Trace((e₁|e₂)) = Trace(e₁) ∪ Trace(e₂)

Equivalently, a trace semantics can be viewed as a relation between program and trace. We write e ⊢ t, pronounced “e recognizes t”, as an abbreviation of t ∈ Trace(e). This is also for uniformity with the notation in the next section.

A core result of automata theory is that the sets of traces definable by regular expressions are the same as those definable by finite-state machines. That led to our remark that Turing machines might as well be Turing regular expressions.

The Turing machine memory model

In the semantics of regular expressions above, the meaning of the symbols (<, >, etc.) is trivial: a symbol in a regular expression denotes itself as a singleton trace. In this section, we will give these symbols their natural meaning as “operations on a tape”. The tape is the memory model of Turing machines. Memory models are better known in the context of concurrent programming languages, as they answer the question of how to resolve concurrent writes and reads.

The tape carries a sequence of symbols extending infinitely in both directions. A head on the tape reads one symbol at a time, and can move left or right, one symbol at a time. Addresses on the tape are integers, elements of ℤ. A tape state is a pair (m, i) ∈ 2^ℤ × ℤ: the memory contents is m ∈ 2^ℤ (note 2^ℤ = ℤ → {0, 1}) and the position of the head is i ∈ ℤ. The behavior of the tape is defined as a ternary relation pronounced “(m, i) steps to (m′, i′) with trace t”, written:

(m, i) ⇝ (m′, i′) ⊢ t

It is defined by the following rules. We step left and right by decrementing and incrementing the head position i.

(m, i) ⇝ (m, i − 1) ⊢ < (m, i) ⇝ (m, i + 1) ⊢ >

Writing operations use the notation m[i ↦ v] for updating the value of the tape m at address i with v.

(m, i) ⇝ (m[i ↦ 0], i) ⊢ !0 (m, i) ⇝ (m[i ↦ 1], i) ⊢ !1

Observations, or assertions, step only when a side condition is satisfied. Otherwise, the tape is stuck, and that triggers backtracking in the search for a valid trace.

(m, i) ⇝ (m, i) ⊢ ?0 if m(i) = 0 (m, i) ⇝ (m, i) ⊢ ?1 if m(i) = 1

We close this relation by reflexivity (indexed by the empty trace ϵ) and transitivity (indexed by the concatenation of traces).

(m, i) ⇝ (m, i) ⊢ ϵ (m, i) ⇝ (m′, i′) ⊢ t and (m′, i′) ⇝ (m″, i″) ⊢ t′ ⇔ (m, i) ⇝ (m″, i″) ⊢ t t′

Turing regular expressions

We now connect programs and tapes together through the trace. A Turing regular expression e and an initial tape (m, 0) step to a final tape (m′, i′), written

e, (m, 0) ⇝ (m′, i′)

if there exists a trace t recognized by both the program and the tape:

e ⊢ t (m, 0) ⇝ (m′, i′) ⊢ t

We can then consider classes of functions computable by Turing expressions via an encoding of inputs and outputs on the tape. Let encode : ℕ → 2^ℤ be an encoding of natural numbers as tapes. A Turing expression e computes a function f : ℕ → ℕ if, for all n, there is exactly one final tape (m′, i′) such that

e, (encode(n), 0) ⇝ (m′, i′)

and that unique tape encodes f(n):

m′ = encode(f(n))

Et voilà. That’s how we can program Turing machines with regular expressions.

Finite-state machines: the next 700 programming languages

Finite-state machines appear obviously in Turing machines, but you can similarly view many programming languages in terms of finite-state machines by reducing the state to just the program counter: “where you are currently in the source program” can only take finitely many values in a finite program. From that point of view, all other components of the abstract machine of your favorite programming language—including the values of local variables—belong to the “memory” that the program counter interacts with. Why would we do this? For glory of course. So we can say that most programming languages are glorified regular expressions.

To be fair, there are exceptions to this idea: cellular automata and homoiconic languages (i.e., with the ability to quote and unquote code at will) are those I can think of. At most there is a boring construction where the finite-state machine writes the source program to memory then runs a general interpreter on it.

Completely free from Turing-completeness

The theory of formal languages and automata has a ready-made answer about the expressiveness of regular expressions: regular expressions denote regular languages, which belong to a lower level of expressiveness than recursively enumerable languages in the Chomsky hierarchy.

What I want to point out is that theory can only ever study “expressiveness” in a narrow sense. Real expressiveness is fundamentally open-ended: the only limit is your imagination. Any mathematical definition of “expressiveness” must place road blocks so that meaningful impossibility theorems can be proved. The danger is to forget about those road blocks when extrapolating mathematical theorems into general claims about the usefulness of a programming language.⁴

The expressiveness of formal languages is a delicate idea in that there are well-established mathematical concepts and theorems about it, but the rigor of mathematics hides a significant formalization gap between how a theory measures “expressiveness” and the informal open-ended question of “what can we do with this?”.

“Regular expressions are not Turing-complete” might literally be a theorem in some textbook; it doesn’t stop regular expressions from also being a feasible programming language for Turing machines as demonstrated in this post. Leaving you to come to terms with your own understanding of this paradox, a closing thought: at the end of the day, science is no slave to mathematics, we do mathematics in service of science.

Bonus track: Brainfuck

Turing regular expressions look similar to Brainfuck. Let’s extend the primitives of Turing expressions to be able to compile Brainfuck.

The loop operator [...] in Brainfuck can be written as (0~...)*0?, with a new operation 0~ to observe a value not equal to zero.

With + and - (increment and decrement modulo 256) as additional operations supported by our regular expressions, a Brainfuck program is compiled to an extended Turing expression simply by replacing [ and ] textually with (0~ and )*0?.⁵

Interestingly, translating Brainfuck to extended Turing expressions does not use |, yet Brainfuck is Turing-complete: while loops seem to make conditional expressions redundant. (Are Turing expressions without choice (e₁|e₂) (i.e., with only <, >, 0!, 1!, 0?, 1?, and (e)*) also Turing-complete?)

The machine implemented within this post supports those new constructs: +, -, 0~, 1~ (also 2~ to 9~, just because; but not more, just because), and the brackets [ and ]. You can write code in Brainfuck, and it will be desugared and interpreted as an extended Turing expression. You can also directly write an extended Turing expression.

The input can now be a comma-separated list prefixed by a comma (to allow multi-digit numbers). Example: ,1,1,2,3,5,8,13. Trailing zeroes in the output will not be printed for clarity.

Brainfuck is a high-level programming language compared to Turing expressions. Being able to increment and decrement numbers makes programming so much less tedious than explicitly manipulating unary or binary numbers in Turing machines.

Small examples

The idiom [-] zeroes out a number.

Alan's machine

Program:

[-]

Input:

,42

Output:

,0

Add two numbers.

Alan's machine

Program:

>[-<+>]

Input:

,42,57

Output:

,99

Nondeterministic Brainfuck

Extending Brainfuck with star (e)* and choice (e₁|e₂) equips the language with nondeterminism.

One thing we can do using nondeterminism is to define the inverse of a function simply by guessing the output y, applying the function to it f(y), then checking that it matches the input x, in which case y = f⁻¹(x). It’s not efficient, but it’s a general implementation of inverses which may have its uses in writing formal specifications.

Here’s a roundabout implementation of subtraction (x − y): guess the answer (call it z), add one of the operands to it (z + y), and compare the result with the other operand (if z + y = x then z = x − y).

Alan's machine

Program:

[->>>+<<<](+>>+<<)*>[->+<]>[->-<]>0?

Input:

,10,3

Output:

,7

Commented program

# Using four consecutive cells, named A, B, C, D
# Expected result: value of (A - B) placed in A

# D ← A
# A ← 0
[->>>+<<<]

# A ← GUESS
# C ← A
(+>>+<<)*

# C ← B + C
# B ← 0
>[->+<]

# D ← D - C
# C ← 0
>[->-<]

# Assert(D == 0)
>0?

Note that this is the opposite of most real-world regex engines: * is usually eager (equivalent to (e(e)*|) rather than (|e(e)*)). The lazy variant is usually written *? and you could add it to the syntax of Turing regexes if you want.↩︎
It’s very easy to accidentally write an infinite loop; this is a half-assed safeguard to catch some of them.↩︎
Extra care must be taken when more than two distinct symbols may be encountered on the tape.↩︎
Another hot take in the same vein is that the simply-typed lambda calculus—the simplest total functional programming language—is Turing-complete: you can encode Turing machines/general recursive functions/your favorite Turing-complete gadget by controlling nontermination with fuel, for a concrete example. Another idea is that a language where functions terminate can easily be extended with nontermination or recursion as an explicit effect.

More generally, Kleene’s normal form theorem says that if you can “express” primitive recursion, then you can “express” general recursion. Some might view this theorem as a counterargument, pointing to a boundary between “Turing-completeness” (can “express” general recursion) and “weak Turing-completeness” (can “express” primitive recursion) which can be made precise. While I recognize that there is a rich theory behind these concepts, I rather view Kleene’s normal form theorem as an argument why such a distinction is too subtle to be relevant to expressiveness in a broad sense of what we can and cannot do using a programming language.↩︎
We might even say that Brainfuck can be compiled to regular expressions using regular expressions. The regular expressions to do those substitutions are trivial though. Using sed: sed 's/\[/(0~/g;s/\]/)\*0?/g', where the two actual regular expressions are \[ and \].↩︎

Programming Turing machines with regexes

Table of contents

Tales of T-rexes

The tale of Alonzo

Alan's machine

More examples

Alan's empty machine (JavaScript is disabled)

Preamble: extra features and clarifications

Determinism

Not

Alan's machine

Binary increment

Alan's machine

Left shift

Alan's machine

Right shift

Alan's machine

Cumulative xor

Alan's machine

Unary subtraction

Alan's machine

Alan's machine

The separation of program and tape

The standard semantics of regular expressions

The Turing machine memory model

Turing regular expressions

Finite-state machines: the next 700 programming languages

Completely free from Turing-completeness

Bonus track: Brainfuck

Small examples

Alan's machine

Alan's machine

Nondeterministic Brainfuck

Alan's machine

Alan's machine

Alan's empty machine