To The Metal... Compiling Your Own Language(s)

code lisp forth compilers llvm



Like many programmers, I have programming hobbies. One of these is implementing new languages. My most recent language project, l1, was a Lisp dialect whose primary data types are symbols and arbitrarily-large integers.

I've been happy with l1, but it is interpreted; since I last worked on it actively (2022), I've been wondering about the best way to generate compiled, standalone executable programs, written in l1 or any other language.

The Problem In General

Execution models for programming languages take three basic approaches, listed in increasing order of speed:

  1. Tree-walking interpreter: Programs are read and parsed into ASTs in memory, then executed step-by-step by an interpreter. This is the approach l1 uses.
  2. Bytecode VM: Programs are compiled into a sort of abstract machine language, simpler than the physical processor's, and executed by a virtual machine (VM). Java and Python work this way.
  3. Machine code generation: The code is directly compiled into machine language and executed on the user's hardware. C and C++ programs work this way.

Languages using Option 2 often add just-in-time compilation to machine code for extra performance. Option 3 is typically the fastest, but it is sometimes skipped in introductory compiler classes and tutorials. For example, Robert Nystrom's excellent Crafting Interpreters book devotes its first half to implementing a tree-walking interpreter in Java and its second half to a compiler and bytecode VM written in C, with minimal coverage of how to target physical hardware. And the (also excellent) class on compiler writing that I took from David Beazley, in its first incarnation, stopped at the point of generating so-called intermediate representation (IR) output (though students in the current iteration of the class do compile to native code, using LLVM).

Compiling to machine code is tricky because real CPUs are inherently complex: intricate, cumbersome, and unintuitive if you're primarily accustomed to high-level languages. Additionally, there are numerous significant variants to consider (e.g., CPU/GPU, ARM/Intel, 32-bit/64-bit architectures).

But targeting machine code rather than interpreters or bytecode VMs is appealing, not just because it is an interesting challenge, but also because the resulting artifacts are small, stand-alone, and typically very fast. While running Python, Ruby, or Java programs requires the appropriate infrastructure to be in place on the target machine at all times, Go, Rust, and C programs (among others) benefit from targeting the physical hardware: their programs tend to be smaller, and can be deployed to identical computers simply by copying the executable file, without needing to deploy an interpreter, extra libraries, etc. to the target machine(s).

Small Is Beautiful

As a programmer who came up during the dawn of personal computers, I have some nostalgia for an era when programs or even entire operating systems fit on a few-hundred-kB floppy disk. Much existing software feels bloated to me, though some widespread tools are still lean and fast. For illustration purposes, here are the on-disk sizes of some of the venerable command-line Unix programs I use on a daily basis (this is on macOS):

Program   Size (kB)
wc        100
cat       116
df        116
more      360

These were chosen more or less at random from my bash history and are representative of old-school Unix utilities. For comparison, iMovie on my Mac is 2.8 GB, several thousand times larger than the largest of these. Of course, the comparison is somewhat ludicrous - iMovie does many amazing things… but I use all the above programs hundreds or thousands of times more often than I do iMovie, so it's good that they are compact and run quickly. In a time of increasingly bloated software stacks, I find myself especially drawn to simple tools with small footprints.

An Approach

If targeting physical hardware is hard, what tools can we use to make the job easier?

I recently started learning about LLVM, a modular set of compiler tools which "can be used to develop a frontend for any programming language and a backend for any instruction set architecture" (Wikipedia). LLVM has been used heavily in the Rust toolchain and in Apple's developer tools.

The "modular" adjective is critical here: LLVM is separated into front-end, back-end and optimizing parts thanks to a shared "intermediate representation" (IR) – a sort of portable assembly language which represents simple computation steps in a machine-independent but low-level manner.

The LLVM IR takes some getting used to but, with a little practice, is reasonably easy to read and, more importantly, to generate.

As an example, consider the following simple C program, three.c, which stores the number 3 in a variable and uses it as its exit code. We will use clang, the LLVM ecosystem's C/C++/Objective-C/… compiler:

$ cat three.c
int x = 3;

int main() {
  return x;
}
$ clang three.c -o three
$ ./three; echo $?
3

One can easily view, and possibly even understand, the assembler output for such a simple program:

$ clang -O3 -S three.c -o three.s
$ cat -n three.s
     1		.section	__TEXT,__text,regular,pure_instructions
     2		.build_version macos, 14, 0	sdk_version 14, 4
     3		.globl	_main                           ; -- Begin function main
     4		.p2align	2
     5	_main:                                  ; @main
     6		.cfi_startproc
     7	; %bb.0:
     8	Lloh0:
     9		adrp	x8, _x@PAGE
    10	Lloh1:
    11		ldr	w0, [x8, _x@PAGEOFF]
    12		ret
    13		.loh AdrpLdr	Lloh0, Lloh1
    14		.cfi_endproc
    15	                                        ; -- End function
    16		.section	__DATA,__data
    17		.globl	_x                              ; @x
    18		.p2align	2, 0x0
    19	_x:
    20		.long	3                               ; 0x3
    21
    22	.subsections_via_symbols

In comparison, here is the LLVM IR for the same program:

$ clang -S -emit-llvm three.c -o three.ll
$ cat -n three.ll
     1	; ModuleID = 'three.c'
     2	source_filename = "three.c"
     3	target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
     4	target triple = "arm64-apple-macosx14.0.0"
     5
     6	@x = global i32 3, align 4
     7
     8	; Function Attrs: noinline nounwind optnone ssp uwtable(sync)
     9	define i32 @main() #0 {
    10	  %1 = alloca i32, align 4
    11	  store i32 0, ptr %1, align 4
    12	  %2 = load i32, ptr @x, align 4
    13	  ret i32 %2
    14	}
    15
    16	attributes #0 = { noinline nounwind optnone ssp ;; .... very long list...
    17
    18	!llvm.module.flags = !{!0, !1, !2, !3, !4}
    19	!llvm.ident = !{!5}
    20
    21	!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 14, i32 4]}
    22	!1 = !{i32 1, !"wchar_size", i32 4}
    23	!2 = !{i32 8, !"PIC Level", i32 2}
    24	!3 = !{i32 7, !"uwtable", i32 1}
    25	!4 = !{i32 7, !"frame-pointer", i32 1}
    26	!5 = !{!"Apple clang version 15.0.0 (clang-1500.3.9.4)"}

There is a fair amount of stuff here, but a lot of it looks suspiciously like metadata we don't really care about for our experiments going forward. The, uh, main region of interest is lines 9–14 – notice that the function definition itself looks a little more readable than the assembly-language version, but slightly lower-level than the original C program.

You can turn the IR into a runnable program:

$ clang -O3 three.ll -o three
$ ./three; echo $?
3

The approach I explore here is to generate LLVM IR "by fair means or foul." Here, let's just edit our IR down to something more minimal and see how it goes. I suspect the store of 0 in "register" %1 is gratuitous, so let's try to remove it, along with all the metadata:

$ cat 3.ll
target triple = "x86_64-apple-macosx14.0.0"

@x = global i32 3, align 4

define i32 @main() {
  %1 = load i32, ptr @x, align 4
  ret i32 %1
}
$ clang -O3 3.ll -o 3
$ ./3; echo $?
3

This is frankly not much more complicated than the C code, and it shows a helpful strategy at work:

Step 1: To understand how to accomplish something in LLVM IR, write the corresponding C program and use clang to generate the IR, being alert for possible "extra stuff" like we saw in the example.

Step 2: Try to generate, and test, working programs from the IR you write or adapt, making adjustments as desired.

There is another, optional step as well:

Step 3: Use, or write, language "bindings" to drive LLVM IR generation from the language of your choice. This is the step we will consider next.

Enter Babashka

While one can write LLVM IR directly, as we have seen, we are interested in compiling other languages (possibly higher-level ones of our own invention), so we will want to generate IR somehow. For this project I chose Babashka, an implementation of the Clojure programming language I have found ideal for small projects where both start-up speed and expressiveness are important.

(I assume some familiarity with Lisp and Clojure in this post; for those just getting started, Clojure for the Brave and True is a good introduction.)

The repo https://github.com/eigenhombre/llbb contains the source files discussed here. The bulk of the code in this repo is in llir.bb, a source file which provides Clojure equivalents for common LLVM idioms. Some of these are trivial translations:

(def m1-target "arm64-apple-macosx14.0.0")

(defn target [t] (format "target triple = \"%s\"" t))
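Evaluated together, these produce the first line of a module:

(comment
  (target m1-target)
  ;;=>
  "target triple = \"arm64-apple-macosx14.0.0\"")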

… whereas other expressions leverage the power of Clojure to a greater degree. For example, this section defines translations used to represent arithmetic operations:

(defn arithm [op typ a b]
  (format "%s %s %s, %s"
          (name? op)
          (name? typ)
          (sigil a)
          (sigil b)))

(defn add [typ a b] (arithm :add typ a b))
(defn sub [typ a b] (arithm :sub typ a b))
(defn mul [typ a b] (arithm :mul typ a b))
(defn div [typ a b] (arithm :sdiv typ a b))

(comment
  (div :i32 :a :b)
  ;;=>
  "sdiv i32 %a, %b"

  (add :i8 :x 1)
  ;;=>
  "add i8 %x, 1")

You can see this approach at work in the following representation of the C program discussed earlier (with the global variable renamed from x to i):

(module
 (assign-global :i :i32 3)
 (def-fn :i32 :main []
   (assign :retval (load :i32 :i))
   (ret :i32 :retval)))

which evaluates to

target triple = "arm64-apple-macosx14.0.0"

@i = global i32 3

define i32 @main() nounwind {
  %retval = load i32, i32* @i, align 4
  ret i32 %retval
}

I use here a slightly different but equivalent pointer syntax for the load expression (i32* rather than the opaque ptr) than that output by clang in the example above.

Two very small helper functions allow me to test out small programs quickly:

(require '[babashka.process :as sh]
         '[clojure.string :as str])

(defn sh
  "
  Use `bash` to run command(s) `s`, capturing both stdout/stderr
  as a concatenated string.  Throw an exception if the exit code
  is nonzero.
  "
  [s]
  (let [{:keys [out err]}
        (sh/shell {:out :string, :err :string}
                  (format "bash -c '%s'" s))]
    (str/join "\n" (remove empty? [out err]))))

and

(require '[babashka.fs :as fs])

(defn compile-to
  "
  Save IR `body` to a temporary file and compile it, writing the
  resulting binary to `progname` in the current working directory.
  "
  [progname body]
  (let [ll-file
        (str (fs/create-temp-file {:prefix "llbb-", :suffix ".ll"}))]
    (spit ll-file body)
    (sh (format "clang -O3 %s -o %s"
                ll-file
                progname))))
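As a quick sanity check of the shell helper (assuming bash is available):

(comment
  (sh "echo hello")
  ;;=> "hello\n")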

These two together allow me to test small programs quickly at the REPL. Some examples follow, each obtained by compiling an equivalent C program, generating and inspecting the LLVM IR, and translating it into new Clojure bindings as cleanly as possible.

Minimum viable program: just return zero:

(compile-to "smallest-prog"
            (module
             (def-fn :i32 :main []
               (ret :i32 0))))

(sh "./smallest-prog; echo -n $?")
;;=>
"0"

Argument count: return, as the exit code, the number of arguments, including the program name:

;; Argument counting: return number of arguments as an exit code:
(compile-to "argcount"
            (module
             (def-fn :i32 :main [[:i32 :arg0]
                                 [:ptr :arg1_unused]]
               (assign :retptr (alloca :i32))
               (store :i32 :arg0 :ptr :retptr)
               (assign :retval (load :i32 :retptr))
               (ret :i32 :retval))))

(sh "./argcount; echo -n $?") ;;=> "1"
(sh "./argcount 1 2 3; echo -n $?") ;;=> "4"

Hello, world:

(let [msg "Hello, World."
      n (inc (count msg))] ;; Includes string terminator
  (compile-to "hello"
              (module
               (external-fn :i32 :puts :i8*)
               (def-global-const-str :message msg)
               (def-fn :i32 :main []
                 (assign :as_ptr
                         (gep (fixedarray n :i8)
                              (star (fixedarray n :i8))
                              (sigil :message)
                              [:i64 0]
                              [:i64 0]))
                 (call :i32 :puts [:i8* :as_ptr])
                 (ret :i32 0)))))

(sh "./hello") ;;=> "Hello, World.\n"

This is the first program one typically writes in a new programming language. Note that we use here two new idioms (external-fn and call) to declare and invoke an external function from the C standard library.
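The gep ("getelementptr") incantation computes a pointer to the first character of the 14-byte string constant (13 characters plus the terminator). As a rough sketch (the exact spelling depends on llir.bb; note that globals are prefixed with @ and locals with % in LLVM IR), the assignment should evaluate to something like:

%as_ptr = getelementptr [14 x i8], [14 x i8]* @message, i64 0, i64 0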

Let's see how big the resulting program is:

(sh "ls -l hello")
;;=>
"-rwxr-xr-x  1 jacobsen  staff  33432 Aug 14 21:09 hello\n"

At this point I want to pause to revisit one of the points of this exercise, which is to produce small programs. Here are the rough executable sizes for a "Hello, World" program in various languages that I use frequently:

Language      Size     Relative Size
Common Lisp   38 MB    1151
Clojure       3.4 MB   103
Go            1.9 MB   58
C             33 kB    1
LLVM IR       33 kB    1

I threw Clojure in there for comparison even though, unlike the other examples, the resulting überjar also requires a Java bytecode VM in order to run. The programs generated from C and from LLVM IR are the same size; this is not surprising, given that I used the C program to guide my writing and translation of the LLVM IR.

Building A Compiling Calculator

We are now ready to implement a "compiler" for something approaching a useful language, namely, a greatly reduced subset of Forth. Forth is a stack-based language created in the 1970s and still in use today, especially for small embedded systems.

LLVM will handle the parts commonly known as "compiler backend" tasks, and Babashka will provide our "frontend," namely breaking the text into tokens and parsing them. This task is made easy for us because Forth is syntactically quite simple and Babashka is relatively powerful. Here are the language rules we will adopt:

  • Program tokens are separated by whitespace.
  • Non-numeric tokens are operators.
  • Only integer operands are allowed.
  • Comments begin with \ (a single backslash).

Forth expressions typically place the arguments first and the operator last (so-called "reverse Polish notation"). Here is an example program which does some math and prints the result (in the Clojure string below, each \\ is an escaped single backslash):

(def example "

2 2 +  \\ 4
5 *    \\ multiply by five to get 20
2 /    \\ divide by 2 -> 10
-1 +   \\ add -1 -> 9
8 -    \\ subtract 8 -> 1
.      \\ prints '1'

")

The code in forth.bb implements the parser, whose goal is to consume the raw program text and generate an abstract syntax tree (in our case, just a list) of operations to translate into IR:

(defn strip-comments
  "
  Remove parts of lines beginning with backslash
  "
  [s]
  (str/replace s #"(?sm)^(.*?)\\.*?$" "$1"))

(defn tokenize
  "
  Split `s` on any kind of whitespace
  "
  [s]
  (remove empty? (str/split s #"\s+")))
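A quick check of these first two stages (the doubled backslash below is just Clojure string escaping for a single \):

(comment
  (->> "2 2 +  \\ four"
       strip-comments
       tokenize)
  ;;=> ("2" "2" "+"))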

(defrecord node
    [typ val] ;; A node has a type and a value
  Object
  (toString [this]
    (format "[%s %s]" (:typ this) (:val this))))

;; Allowed operations
(def opmap {"+" :add
            "-" :sub
            "/" :div
            "*" :mul
            "." :dot
            "drop" :drop})

(defn ast
  "
  Convert a list of tokens into an \"abstract syntax tree\",
  which in our Forth is just a list of type/value pairs.
  "
  [tokens]
  (for [t tokens
        :let [op (get opmap t)]]
    (cond
      ;; Integers (possibly negative)
      (re-matches #"^\-?\d+$" t)
      (node. :num (Integer. t))

      ;; Operations
      op (node. :op op)

      :else (node. :invalid :invalid))))

Running this on our example,

(->> example
     strip-comments
     tokenize
     ast
     (map str))

;;=>
("[:num 2]"
 "[:num 2]"
 "[:op :add]"
 "[:num 5]"
 "[:op :mul]"
 "[:num 2]"
 "[:op :div]"
 "[:num -1]"
 "[:op :add]"
 "[:num 8]"
 "[:op :sub]"
 "[:op :dot]")

The remainder of forth.bb essentially just implements the needed operators, the required stack, and the reference to the printf C library function. It is a bit too lengthy to go through here in its entirety, so I will share one example where Babashka helps:

(defn def-arithmetic-op [nam op-fn]
  (def-fn :void nam []
    (assign :sp (call :i32 :get_stack_cnt))
    (if-lt :i32 :sp 2
           (els)  ;; NOP - not enough on stack
           (els
            (assign :value2
                    (call :i32 :pop))
            (assign :value1
                    (call :i32 :pop))
            (assign :result (op-fn :i32 :value1 :value2))
            (call :void :push [:i32 :result])))
    (ret :void)))

This generates LLVM IR which does the following, in pseudocode:

  • get current stack position; ensure at least two entries (else return)
  • pop the operands off the stack
  • apply the arithmetic operator to the operands
  • put the result on the stack

Implementing the four arithmetic operators is then as simple as invoking

;; ...
(def-arithmetic-op :mul mul)
(def-arithmetic-op :add add)
(def-arithmetic-op :sub sub)
(def-arithmetic-op :div div)
;; ...

when generating the IR.

Aside from the general-purpose LLVM IR code (llir.bb), the Forth implementation is under two hundred lines, including the invocation of clang that compiles the temporary IR file into the runnable program. Here's an example compilation session:

$ cat example.fs

       \ initial state          stack: []
3      \ put 3 on stack.        stack: [3]
99     \ put 99 on stack.       stack: [3, 99]
drop   \ discard top item.      stack: [3]
drop   \ discard top item.      stack: []
2 2    \ put 2 on stack, twice: stack: [2, 2]
+      \ 2 + 2 = 4.             stack: [4]
5 *    \ multiply 4 * 5.        stack: [20]
2 /    \ divide by 2.           stack: [10]
-1 +   \ add -1                 stack: [9]
8 -    \ subtract 8 -> 1        stack: [1]
.      \ prints '1'             stack: [1]
drop   \ removes 1.             stack: []

$ ./forth.bb example.fs
$ ./example
1

The resulting program is fast, as expected…

$ time ./example
1

real	0m0.007s
user	0m0.002s
sys	0m0.003s

… and small:

$ ls -l ./example
-rwxr-xr-x  1 jacobsen  staff  8952 Aug 16 09:52 ./example

Let's review what we've done: we have implemented a small subset of Forth, writing a compiler front-end in Babashka/Clojure to translate source programs into LLVM IR, and using clang to turn the IR into binaries. The resulting programs are small and fast.

Sensible next steps would be to implement more of Forth's stack operators, and maybe start to implement the : (colon) operator, Forth's mechanism for defining new symbols and functions.

Lisp

Instead, let's implement a different variant of our arithmetic calculating language, using Lisp syntax (S-expressions). Consider our first Forth example:

2 2 +
5 *
2 /
-1 +
8 -
.

In Lisp, this looks like:

$ cat example.lisp
(print (- (+ -1
             (/ (* 5
                   (+ 2 2))
                2))
          8))

Rather than coming last, as they did before ("postfix" notation), our operators now come first ("prefix" notation). The order of operations is determined by parentheses, rather than by a stack as in our Forth implementation.

Here Babashka helps us tremendously, because such parenthesized prefix expressions are valid EDN data, which the Clojure function clojure.edn/read-string can parse for us. But we need to convert the resulting nested list into the "SSA" (static single assignment) form LLVM understands. This is relatively straightforward with a recursive function which expands the leaves of the tree and stores the results as intermediate values:

(defn to-ssa [expr bindings]
  (if (not (coll? expr))
    expr
    (let [[op & args] expr
          result (gensym "r")
          args (doall
                (for [arg args]
                  (if-not (coll? arg)
                    arg
                    (to-ssa arg bindings))))]
      (swap! bindings conj (concat [result op] args))
      result)))

(defn convert-to-ssa [expr]
  (let [bindings (atom [])]
    (to-ssa expr bindings)
    @bindings))
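A sanity check on a tiny expression (the gensym'd register names will vary from run to run):

(comment
  (convert-to-ssa '(+ 1 (* 2 3)))
  ;;=> e.g.
  [(r101 * 2 3)
   (r100 + 1 r101)])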

We use gensym here to get a unique variable name for each assignment, and doall to force the evaluation of the lazy for expansion of the argument terms. The result:

(->> "example.lisp"
     slurp
     (edn/read-string)
     convert-to-ssa)
;;=>
[(r623 + 2 2)
 (r622 * 5 r623)
 (r621 / r622 2)
 (r620 + -1 r621)
 (r619 - r620 8)
 (r618 print r619)]

The next step will be to actually write out the corresponding LLVM IR. The rest of lisp.bb is satisfyingly compact. The operators (we have five, but more could easily be added) are just a map of symbols to tiny bits of LLVM IR:

(def ops
  {'* #(mul :i32 %1 %2)
   '+ #(add :i32 %1 %2)
   '/ #(div :i32 %1 %2)
   '- #(sub :i32 %1 %2)
   'print
   #(call "i32 (i8*, ...)"
          :printf
          [:i8* :as_ptr]
          [:i32 (sigil %1)])})
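Each entry takes arguments that have already been reduced to literals or register names and emits a single IR instruction; for example:

(comment
  ((ops '+) :a :b)
  ;;=> "add i32 %a, %b"

  ((ops '*) :r623 5)
  ;;=> "mul i32 %r623, 5")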

Similar to our Forth implementation, but even more compact: after a brief setup for printf, the main Babashka function generates the series of SSA instructions:

(defn main [[path]]
  (when path
    (let [assignments (->> path
                           slurp
                           edn/read-string
                           convert-to-ssa)
          outfile (->> path
                       fs/file-name
                       fs/split-ext
                       first)
          ir (module
              (external-fn :i32 :printf :i8*, :...)
              (def-global-const-str :fmt_str "%d\n")
              (def-fn :i32 :main []
                (assign :as_ptr
                        (gep (fixedarray 4 :i8)
                             (star (fixedarray 4 :i8))
                             (sigil :fmt_str)
                             [:i64 0]
                             [:i64 0]))

                ;; Interpolate SSA instructions / operator invocations:
                (apply els
                       (for [[reg op & args] assignments
                             :let [op-fn (ops op)]]
                         (if-not op-fn
                           (throw (ex-info "bad operator" {:op op}))
                           (assign reg (apply op-fn args)))))
                (ret :i32 0)))]
      (compile-to outfile ir))))

(main *command-line-args*)

Putting these parts together (see lisp.bb on GitHub), we have:

$ ./lisp.bb example.lisp
$ ./example
1

It, too, is small and fast:

$ time ./example
1

real	0m0.006s
user	0m0.001s
sys	0m0.003s
$ ls -al ./example
-rwxr-xr-x  1 jacobsen  staff  33432 Aug 16 20:52 ./example

To say this is a "working Lisp compiler" at this point would be grandiose (we still need functions, lists and other collection types, eval, macros, …), but we have developed an excellent foundation to build upon.

To summarize, the strategy we have taken is as follows:

  1. Use a high-level language (in our case, Babashka/Clojure) to parse input and translate it into LLVM IR;
  2. When needed, write small C programs and compile them with clang to see what the equivalent IR should look like;
  3. Compile the IR to small, fast binaries using clang.

Alternatives and Future Directions

I should note that C itself has long been used as an intermediate language, and we could have used it here instead of LLVM IR; I don't have a strong sense of the tradeoffs involved yet, but wanted to take the opportunity to learn more about LLVM for this project.

LLVM is interesting to me because of the modularity of its toolchain; it also provides a JIT compiler which allows one to build and execute code at run-time. We didn't investigate tooling for that here (it needs deeper LLVM language bindings than the homegrown Babashka code I used), but it could provide a way to do run-time compilation similar to what SBCL (a Common Lisp implementation which can compile functions at run-time) does.

Here are some directions I'm considering going forward:

  • Try interfacing with external libraries, e.g. a bignum library;
  • Implement more Forth functionality;
  • Implement more Lisp, possibly including a significant subset of l1;
  • Try a JIT-based approach, possibly using Rust as the host language.

Conclusion

Whenever possible, I want to make small, fast programs, and I like playing with and creating small programming languages. LLVM provides a fascinating set of tools and techniques for doing so, and using Babashka to make small front-ends for IR generation turns out to be surprisingly effective, at least for simple languages.

