Adding Tail Call Optimization to A Lisp Written in Go
Later: Practices for Software Projects
Earlier: Tests by Example in Clojure and Common Lisp
The last few days have been devoted to improving l1, the homegrown lisp I wrote about earlier this year. A number of changes have landed in the last week:
- Implemented let, defn, and sugar for quote;
- Figured out basic REPL integration with Emacs;
- Added numeric comparison operators;
- Reviewed my Minimum Viable Repo checklist for this project;
- Fixed four bugs.
I also implemented the bulk of the automated tests in the language itself. This was a decisive step forward in both ease of creating new tests and confidence that the language was approaching something usable.
The work I'm happiest with, though, because it taught me the most, was implementing tail call optimization (TCO) in the language, which the rest of this post will be about.
Motivation
The need for some form of TCO became clear as I started to write more small programs in l1.
Perhaps the simplest example is one that sums all the natural numbers up to $n$:
(defn sum-to-acc (n acc)
(cond ((zero? n) acc)
(t (sum-to-acc (- n 1) (+ n acc)))))
(defn sum-to (n)
(sum-to-acc n 0))
Calling sum-to
for small $n$ worked fine:
(sum-to 100)
;;=>
5050
However, larger $n$ blew up spectacularly:
(sum-to (* 1000 1000))
;;=>
runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0x14020500360 stack=[0x14020500000, 0x14040500000]
fatal error: stack overflow
runtime stack:
runtime.throw({0x10289aa2b?, 0x10294ddc0?})
/opt/homebrew/Cellar/go/1.18.3/libexec/src/runtime/panic.go:992 +0x50
runtime.newstack()
/opt/homebrew/Cellar/go/1.18.3/libexec/src/runtime/stack.go:1101 +0x46c
runtime.morestack()
/opt/homebrew/Cellar/go/1.18.3/libexec/src/runtime/asm_arm64.s:314 +0x70
goroutine 1 [running]:
strings.(*Reader).ReadByte(0x1401c2820e0?)
/opt/homebrew/Cellar/go/1.18.3/libexec/src/strings/reader.go:66 +0x98 fp=0x14020500360 sp=0x14020500360 pc=0x102883348
math/big.nat.scan({0x0, 0x1401c2820e0?, 0x0}, {0x1028dc7c8, 0x1401c2820e0}, 0xa, 0x0)
/opt/homebrew/Cellar/go/1.18.3/libexec/src/math/big/natconv.go:126 +0x80 fp=0x14020500430 sp=0x14020500360 pc=0x10288b1e0
;; ...
This happens, of course, because sum-to-acc
calls itself a million times,
each time storing a copy of its local bindings on the stack, which
eventually consumes all the space on the stack.
Getting simple recursive functions like this to work for large $n$ is
especially important because l1
doesn't have loops (yet)!
The Optimization
The solution is hinted at already in my test case. Note that I did
not write sum-to
as a single recursive function, as follows:
(defn sum-to-notail (n)
(cond ((zero? n) 0)
(t (+ n (sum-to-notail (- n 1))))))
While this function looks slightly simpler, it is harder for a
compiler or interpreter to optimize. The difference is that, whereas
sum-to-notail
does some work after calling itself (by adding n
to
the result), sum-to-acc
calls itself from the tail position; that is,
the function returns immediately after calling itself.
People have long realized that function calls from the tail position
can be replaced by updating the return address and then jumping
directly to the new function without adding new information to the
stack. This is something I had heard about for years and used in
various "functional" languages, without ever really implementing
myself (and therefore fully understanding). It's an easy thing to take for
granted without knowing anything about how it's actually implemented
under the hood. The failure of sum-to-acc
and similar recursive
functions, described above, meant I would have to learn.
Two very different blog posts were helpful to me in pointing the way
forward: Adding tail call optimization to a Lisp interpreter in Ruby,
and How Tail Call Optimization Works. The posts focus on very
different languages (Ruby vs. C / assembler), but they each revolve
around what are effectively GOTO
statements. I'm old enough to
remember BASIC and the pernicious GOTO
statement leading to
"spaghetti code." I doubt I've ever used a GOTO
statement in
production code; its use in modern programming languages fell out of
favor in the aftermath of Dijkstra's famous Go To Statement Considered
Harmful paper. But the ability to transfer control to another part of
your program without invoking a function call is key to the
optimization.
The Approach
Since the strategy is general, let's lose all the parentheses for a
moment and rewrite sum-to-acc
in language-agnostic pseudo-code:
function sum-to-acc(n sum)
if n == 0, then return sum
return sum-to-acc(n - 1, n + sum)
In most languages (without TCO), when this function is called, the
values of n
and sum
, as well as the return address, will be put on
the stack, whose evolution looks something like the following:1
first invocation: [n=5, sum=0, ret=sum-to:...]
second invocation: [n=4, sum=5, ret=sum-to-acc:...]
[n=5, sum=0, ret=sum-to:...]
third invocation: [n=3, sum=9, ret=sum-to-acc:...]
[n=4, sum=5, ret=sum-to-acc:...]
[n=5, sum=0, ret=sum-to:...]
fourth invocation: [n=2, sum=12, ret=sum-to-acc:...]
[n=3, sum=9, ret=sum-to-acc:...]
[n=4, sum=5, ret=sum-to-acc:...]
[n=5, sum=0, ret=sum-to:...]
fifth invocation: [n=1, sum=14, ret=sum-to-acc:...]
[n=2, sum=12, ret=sum-to-acc:...]
[n=3, sum=9, ret=sum-to-acc:...]
[n=4, sum=5, ret=sum-to-acc:...]
[n=5, sum=0, ret=sum-to:...]
sixth invocation: [n=0, sum=15, ret=sum-to-acc:...]
[n=1, sum=14, ret=sum-to-acc:...]
[n=2, sum=12, ret=sum-to-acc:...]
[n=3, sum=9, ret=sum-to-acc:...]
[n=4, sum=5, ret=sum-to-acc:...]
[n=5, sum=0, ret=sum-to:...]
At the sixth invocation, our terminating condition is reached, and 15 is returned, with all the pending stack frames popped off the stack.
With TCO, the implementation looks more like the following:
function sum-to-acc(n sum)
TOP:
if n == 0, then return sum
sum = sum + n
n = n - 1
GOTO TOP
As a result, the evolution of the stack looks as follows:
first invocation: [n=5, sum=0, ret=sum-to:...]
second invocation: [n=4, sum=5, ret=sum-to:...]
third invocation: [n=3, sum=9, ret=sum-to:...]
fourth invocation: [n=2, sum=12, ret=sum-to:...]
fifth invocation: [n=1, sum=14, ret=sum-to:...]
sixth invocation: [n=0, sum=15, ret=sum-to:...]
All those extra stack frames are gone: recursion has turned into a form of iteration.
Implementing TCO, then, has two ingredients:
- Replace the values of the current arguments with their new values directly;
- Jump straight to the next call of the function without adding to the stack.
This low-level, imperative optimization makes high-level, functional, recursive implementations efficient.
Implementation
In thinking about the implementation for l1, I was pleased to learn
that Go actually has a goto
statement. However, my implementation
was poorly set up to use it.
Early in the implementation of l1, I noticed that each data type
(numbers, atoms, and lists) had its own evaluation rules, so it made
sense to make use of Go's features supporting polymorphism, namely
interfaces and receivers. I had a Sexpr
interface which looked like the following:
type Sexpr interface {
    String() string
    Eval(*env) (Sexpr, error) // <--------
    Equal(Sexpr) bool
}
Numbers and atoms, for example, had fairly simple Eval
implementations. For example,
func (a Atom) Eval(e *env) (Sexpr, error) {
    if a.s == "t" {
        return a, nil
    }
    ret, ok := e.Lookup(a.s)
    if ok {
        return ret, nil
    }
    ret, ok = builtins[a.s]
    if ok {
        return ret, nil
    }
    return nil, fmt.Errorf("unknown symbol: %s", a.s)
}
And, of course, numbers eval to themselves:
func (n Number) Eval(e *env) (Sexpr, error) {
    return n, nil
}
Lists, as you would expect, were more complicated – evaluating a list
expression needs to handle special forms,2
user-defined functions, and built-in functions. Following the classic
Structure and Interpretation of Computer Programs, I separated the
core logic for function application into separate Eval
and Apply
phases. And to prevent the Eval
for lists from getting too large, I
broke out the evaluation rules for different cases (e.g. for let
and
cond
special forms and for function application) into their own
functions.
In other words, I had evaluation logic spread over ten functions in
five files. Sadly, the need to jump back to the beginning of an
evaluation rather than recursively calling Eval
again meant that
several of those nicely broken out functions had to be brought
together into a single function, because goto
does not support
jumping from one function to another. (C has setjmp
and longjmp,
which effectively do this, but I would want to upgrade my IQ by a
few points before applying them in this situation.)
There were actually three cases where I was performing an evaluation
step right before returning, and the goto
pattern could be used:
- When evaluating code in the tail position of a user-defined function;
- When evaluating code in the last expression in a let block;
- When evaluating code in the chosen branch of a cond clause.
I wound up with code that looks like the following. Several
steps are indicated only with comments. Note the tiny, easy-to-miss
top:
label at the very beginning:
// lisp.go
//
func eval(expr Sexpr, e *env) (Sexpr, error) {
top:
switch t := expr.(type) {
case Atom:
return evAtom(t, e)
case Number:
return expr, nil
// ...
case *ConsCell:
if t == Nil {
return Nil, nil
}
// special forms:
if carAtom, ok := t.car.(Atom); ok {
switch {
case carAtom.s == "quote":
return t.cdr.(*ConsCell).car, nil
case carAtom.s == "cond":
pairList := t.cdr.(*ConsCell)
if pairList == Nil {
return Nil, nil
}
for {
if pairList == Nil {
return Nil, nil
}
pair := pairList.car.(*ConsCell)
ev, err := eval(pair.car, e)
if err != nil {
return nil, err
}
if ev == Nil {
pairList = pairList.cdr.(*ConsCell)
continue
}
expr = pair.cdr.(*ConsCell).car
goto top
}
// ...
The code so far shows the evaluation for atoms, numbers, and cond
forms. cond does not introduce any new bindings, but when the
does not introduce any new bindings, but when the
first truthy condition is encountered, it evaluates the next argument
as its final act. So the code above simply replaces the expression to
be evaluated, expr
, with the expression from the matching clause,
and then restarts the evaluation via goto
, without the overhead of a
separate function call.
The code for let
is somewhat similar:
case carAtom.s == "let":
    args := t.cdr.(*ConsCell)
    if args == Nil {
        return nil, fmt.Errorf("let requires a binding list")
    }
    // ... code to set up let bindings ...
    body := args.cdr.(*ConsCell)
    var ret Sexpr = Nil
    for {
        var err error
        if body == Nil {
            return ret, nil
        }
        // Implement TCO for `let`:
        if body.cdr == Nil {
            expr = body.car
            e = &newEnv
            goto top
        }
        ret, err = eval(body.car, &newEnv)
        if err != nil {
            return nil, err
        }
        body = body.cdr.(*ConsCell)
    }
The for loop invokes a new eval for each expression in the body of
the let, except for the last one: when the last expression is
reached (the cdr is Nil), the final eval is done by jumping to the
beginning of the function, after updating the environment to include
the new bindings.
The last use of this pattern is in function invocation proper, which looks similar:
// (... code to set up new environment based on passed arguments ...)
var ret Sexpr = Nil
for {
    if lambda.body == Nil {
        return ret, nil
    }
    // TCO:
    if lambda.body.cdr == Nil {
        expr = lambda.body.car
        e = &newEnv
        goto top
    }
    ret, err = eval(lambda.body.car, &newEnv)
    if err != nil {
        return nil, err
    }
    lambda.body = lambda.body.cdr.(*ConsCell)
}
I've skipped various parts of eval that aren't relevant to TCO – if
you're interested, you can check out the code yourself.
To be clear, what we are optimizing is all tail calls, not just recursive ones – though the recursive ones were the primary objective due to the stack overflows reported above.
The end result is that sum-to can now complete for large values of $n$:
(sum-to (* 1000 1000))
;;=>
500000500000
Incidentally, a variant of our test case failed before I added the
TCO for let shown above; this now works, as well:
(defn sum-to-acc-with-let (n acc)
(let ((_ 1))
(cond ((zero? n) acc)
(t (sum-to-acc-with-let (- n 1) (+ n acc))))))
(defn sum-to-with-let (n) (sum-to-acc-with-let n 0))
(sum-to-with-let (* 1000 1000))
;;=>
500000500000
Conclusion
Getting tail-call optimization to work was very satisfying… though
the eval
implementation is certainly more complex than before. (Ah,
optimization!)
To ensure TCO continues to work, variants of sum-to
with and without let
are run
on every build, along with a few other short example programs.
After implementing TCO in my own code, I can appreciate and understand the optimization better when I see it in the wild. I fully expect to use the pattern again when implementing future lisps (yes, I hope there will be more).
1. Note that this is a somewhat abstract representation: the details are language-specific. The ret=sum-to:... notation means that when the function returns, control will pass back to where it left off inside the sum-to function.
2. A special form is one that does not follow the normal evaluation rule for functions – it may evaluate its arguments once, many times, or not at all. (I am glossing over macros for the time being; l1 does not have them yet.)