Frederik Gram Kortegaard / Your First Compiler
Our lexer works great. It takes (2 + 3) * 4 and gives us a neat list of tokens: [LParen, Number(2), Plus, Number(3), RParen, Mul, Number(4), Eof].
But look at that list. When you see *, what does it multiply? The tokens don't say. When you see +, does it happen before or after the multiplication? The list doesn't tell you. The tokens are just sitting there in a line - no indication of what connects to what.
We need to capture that structure. We need trees.
All programs have structure. A function call contains arguments. An addition operation has two sides. An if statement has a condition and a body. This structure is hierarchical - it's a tree, not a list.
Let's look at a simple expression: 2 + 3
As tokens:
[Number(2), Plus(+), Number(3)]
As an AST:
  +
 / \
2   3
The + sits at the top with two children. That structure means "apply + to these two things."
Consider 2 + 3 * 4. You know from math that multiplication happens before addition. But how do we represent that?
  +
 / \
2   *
   / \
  3   4
Notice what's nested: the * is under the +. That's how the tree shows order - what's deeper happens first. You read bottom-up: multiply 3 and 4 (that's 12), then add 2 to that result (14). The nesting captures the precedence.
Here's that same AST represented as nested Python objects:
BinaryOp(
    op="+",
    left=Number(2),
    right=BinaryOp(
        op="*",
        left=Number(3),
        right=Number(4)
    )
)
That's it. An AST is nested objects. The Python code above? That's the tree. Each BinaryOp contains other expressions. The nesting creates the tree structure.
Why a tree? Because programming languages are inherently nested. Expressions contain expressions. Parentheses change grouping. Functions take arguments that are themselves expressions.
Compare 2 + 3 * 4 with (2 + 3) * 4:
  (2 + 3) * 4            2 + 3 * 4

       *                     +
      / \                   / \
     +   4                 2   *
    / \                       / \
   2   3                     3   4
Same tokens, different structure, different meaning. The tree captures this.
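To see that concretely, here's a small runnable sketch. The Number and BinaryOp dataclasses are throwaway stand-ins for this example (the real YFC node classes, defined later in abstract.py, wrap lexer Tokens instead), and the evaluator just walks each tree bottom-up:

from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def evaluate(node):
    """Children first, then the parent's operator - bottom-up, like reading the tree."""
    if isinstance(node, Number):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    return left + right if node.op == "+" else left * right

# 2 + 3 * 4: the * is nested under the +
no_parens = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))

# (2 + 3) * 4: the + is nested under the *
with_parens = BinaryOp("*", BinaryOp("+", Number(2), Number(3)), Number(4))

print(evaluate(no_parens))    # 14
print(evaluate(with_parens))  # 20

Same numbers, same operators - but the two shapes evaluate to different results.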
You might wonder why "abstract." It means we keep what matters for meaning and throw away syntax noise.
Consider parentheses in (2 + 3) * 4. They force grouping - do the addition first, then multiply. But once we've built the tree, the grouping is already in the structure. The addition is nested under the multiplication. The parentheses did their job during parsing. We don't need to remember them.
We keep: operations and how they nest, values (numbers, names), and location (for error messages).
We throw away: whitespace, parentheses (the tree structure already shows grouping), comments, semicolons, syntax markers.
The AST represents "what the code means", not "exactly how it was typed".
You might hear the term "parse tree" (or "concrete syntax tree"). A parse tree keeps everything - every parenthesis, every semicolon, every piece of syntax. It's a direct representation of how the grammar rules were applied.
An AST strips that down to just the meaning. For (2 + 3), the parse tree would have nodes for the left paren, the expression, and the right paren. The AST just has the addition node with its two children.
For compilers, ASTs are what you want. They're simpler to work with and carry exactly the information you need. Parse trees are useful for tools like code formatters that need to preserve the original syntax.
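If you want to picture the difference, here's a tiny illustration using plain tuples. The shapes and names are made up just for this comparison - neither is how YFC actually represents anything:

# Parse tree (concrete syntax tree) for "(2 + 3)": one entry per token,
# parentheses and all.
parse_tree = (
    "expr",
    [
        ("lparen", "("),
        ("expr", [("number", "2"), ("plus", "+"), ("number", "3")]),
        ("rparen", ")"),
    ],
)

# AST for the same input: just the addition and its two children.
ast = ("add", ("number", 2), ("number", 3))

print(parse_tree)
print(ast)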
An AST gives you a data structure you can actually work with. You can't easily annotate a string of source code, but you can attach type information to AST nodes. You can't transform source code without parsing it, but you can walk an AST and rewrite parts of it.
Some practical benefits: you can annotate nodes with type information, you can transform one part of the tree without touching the rest (see the sketch below), and because each node holds its tokens, error messages can point at the exact spot in the source.
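Here's a rough sketch of that "transform the tree" idea - walking a tree and rewriting part of it. It again uses throwaway dataclasses rather than YFC's real node classes, and folds any subtree whose operands are both literal numbers:

from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def fold_constants(node):
    """Rewrite bottom-up: a constant subtree collapses into a single Number node."""
    if isinstance(node, Number):
        return node
    left = fold_constants(node.left)
    right = fold_constants(node.right)
    if isinstance(left, Number) and isinstance(right, Number):
        value = left.value + right.value if node.op == "+" else left.value * right.value
        return Number(value)
    return BinaryOp(node.op, left, right)

# 2 + 3 * 4 folds all the way down to 14, because every operand is a literal
tree = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))
print(fold_constants(tree))   # Number(value=14)

Try doing that rewrite directly on a string of source code and you'll see why compilers work on trees instead.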
Parsing is the process of taking a flat sequence of tokens and building a tree from them. The parser reads tokens one by one and figures out how they relate to each other - what's nested inside what, what operates on what.
This is the second major phase of a compiler. The lexer gives us tokens. The parser gives us a tree. Everything after that works on the tree.
There are many ways to build a parser. Here are the common ones:
Recursive Descent: Write a function for each grammar rule. Functions call each other recursively. Simple, no external tools, easy to understand and debug.
Parser Generators (yacc, bison, ANTLR): Write a formal grammar, generate parser code automatically. Powerful for complex languages, but adds tooling and makes debugging harder.
Pratt Parsing (Precedence Climbing): A particularly elegant way to parse expressions with operators. Uses "binding power" to handle precedence. We'll use this for our expression parser.
Parser Combinators: Build parsers by composing small parsers together. Each combinator handles one piece, and you snap them together like building blocks. Popular in functional languages (Haskell's Parsec, Rust's nom).
For YFC, we're using recursive descent with precedence climbing. Recursive descent for the overall structure (functions, if-expressions), and precedence climbing for expressions with operators. It's the approach most hand-written compilers use because it's straightforward and gives you full control.
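We'll build the real parser in the next part, but here's a minimal preview of what precedence climbing looks like, just to demystify "binding power." This sketch is not YFC's parser: it works on plain (kind, value) tuples instead of the lexer's Token objects, and it builds nested tuples instead of AST nodes:

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}    # higher number = binds tighter

def parse_expression(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] into a nested tuple, respecting operator precedence."""
    kind, value = tokens[pos]
    if kind == "number":
        left, pos = ("number", value), pos + 1
    elif kind == "lparen":
        left, pos = parse_expression(tokens, pos + 1)   # whatever is inside the parens
        pos += 1                                        # skip the ')'
    else:
        raise SyntaxError(f"unexpected token {kind!r}")

    # Keep absorbing operators as long as they bind at least as tightly as min_bp.
    while pos < len(tokens) and tokens[pos][0] == "op":
        op = tokens[pos][1]
        bp = BINDING_POWER[op]
        if bp < min_bp:
            break
        right, pos = parse_expression(tokens, pos + 1, bp + 1)
        left = ("binary", op, left, right)              # deeper nesting = evaluated first

    return left, pos

# 2 + 3 * 4: the * ends up nested inside the +, exactly like the trees above
tokens = [("number", 2), ("op", "+"), ("number", 3), ("op", "*"), ("number", 4)]
tree, _ = parse_expression(tokens)
print(tree)   # ('binary', '+', ('number', 2), ('binary', '*', ('number', 3), ('number', 4)))

Don't worry if this doesn't fully click yet - the next part walks through it step by step, on real tokens and real AST nodes.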
For YFC, we'll create a file called abstract.py and define AST nodes as Python classes:
from lexer import Token, TokenType


class Expression:
    """Base class for all expression nodes."""
    pass


class NumberExpr(Expression):
    """Represents a numeric literal like 42 or 3.14"""
    def __init__(self, value: Token):
        self.value = value


class BinaryOp(Expression):
    """Represents a binary operation like 2 + 3"""
    def __init__(self, left: Expression, op: Token, right: Expression):
        self.left = left
        self.op = op
        self.right = right
Let's test our AST design by manually building a tree for 2 + 3 * 4:
if __name__ == "__main__":
    # Manually build AST for: 2 + (3 * 4)
    ast = BinaryOp(
        left=NumberExpr(Token(TokenType.Number, "2")),
        op=Token(TokenType.Plus, "+"),
        right=BinaryOp(
            left=NumberExpr(Token(TokenType.Number, "3")),
            op=Token(TokenType.Mul, "*"),
            right=NumberExpr(Token(TokenType.Number, "4"))
        )
    )

    print("AST for: 2 + 3 * 4")
    print(f"Root operation: {ast.op.value}")
    print(f"  Left: {ast.left.value.value}")
    print(f"  Right operation: {ast.right.op.value}")
    print(f"    Left: {ast.right.left.value.value}")
    print(f"    Right: {ast.right.right.value.value}")
Run it:
$ python abstract.py
AST for: 2 + 3 * 4
Root operation: +
  Left: 2
  Right operation: *
    Left: 3
    Right: 4
See how the multiplication is nested under the addition? That's the precedence captured in the tree structure. We built this manually, but in the next part, the parser will build these trees automatically from tokens.
Now you understand what ASTs are and why we need them. Programs are trees, not lists. The tree structure captures relationships, precedence, and nesting. The AST is what we'll actually work with throughout the rest of the compiler.
In the next part, we'll build the parser. We'll write code that reads tokens and constructs an AST, handling operator precedence, parentheses, and nested expressions along the way.
Next: Building the Parser | Previous: Lexical Analysis
Code Repository: All accompanying code for this tutorial series is available on GitHub.