Frederik Gram Kortegaard / Your First Compiler
Our lexer works great. It takes (2 + 3) * 4 and gives us a neat list of tokens: [LParen, Number(2), Plus, Number(3), RParen, Mul, Number(4), Eof].
But look at that list. When you see *, what does it multiply? The tokens don't say. When you see +, does it happen before or after the multiplication? The list doesn't tell you. The tokens are just sitting there in a line - no indication of what connects to what.
We need to capture that structure. We need trees.
All programs have structure. A function call contains arguments. An addition operation has two sides. An if statement has a condition and a body. This structure is hierarchical - it's a tree, not a list.
Let's look at a simple expression: 2 + 3
As tokens:
[Number(2), Plus(+), Number(3)]
As an AST:
  +
 / \
2   3
The + sits at the top with two children. That structure means "apply + to these two things."
Consider 2 + 3 * 4. You know from math that multiplication happens before addition. But how do we represent that?
  +
 / \
2   *
   / \
  3   4
Notice what's nested: the * is under the +. That's how the tree shows order - what's deeper happens first. You read bottom-up: multiply 3 and 4 (that's 12), then add 2 to that result (14). The nesting captures the precedence.
Here's that same AST represented as nested Python objects:
BinaryOp(
    op="+",
    left=Number(2),
    right=BinaryOp(
        op="*",
        left=Number(3),
        right=Number(4)
    )
)
That's it. An AST is nested objects. The Python code above? That's the tree. Each BinaryOp contains other expressions. The nesting creates the tree structure.
Why a tree? Because programming languages are inherently nested. Expressions contain expressions. Parentheses change grouping. Functions take arguments that are themselves expressions.
Compare 2 + 3 * 4 with (2 + 3) * 4:
  (2 + 3) * 4            2 + 3 * 4

       *                     +
      / \                   / \
     +   4                 2   *
    / \                       / \
   2   3                     3   4
Same tokens, different structure, different meaning. The tree captures this.
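To see that concretely, here's a small runnable sketch. The Number and BinaryOp dataclasses are throwaway stand-ins for this example (the real YFC node classes, defined later in abstract.py, wrap lexer Tokens instead), and the evaluator just walks each tree bottom-up:

from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def evaluate(node):
    """Children first, then the parent's operator - bottom-up, like reading the tree."""
    if isinstance(node, Number):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    return left + right if node.op == "+" else left * right

# 2 + 3 * 4: the * is nested under the +
no_parens = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))

# (2 + 3) * 4: the + is nested under the *
with_parens = BinaryOp("*", BinaryOp("+", Number(2), Number(3)), Number(4))

print(evaluate(no_parens))    # 14
print(evaluate(with_parens))  # 20

Same numbers, same operators - but the two shapes evaluate to different results.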
You might wonder why "abstract." It means we keep what matters for meaning and throw away syntax noise.
Consider parentheses in (2 + 3) * 4. They force grouping - do the addition first, then multiply. But once we've built the tree, the grouping is already in the structure. The addition is nested under the multiplication. The parentheses did their job during parsing. We don't need to remember them.
We keep: operations and how they nest, values (numbers, names), and location (for error messages).
We throw away: whitespace, parentheses (the tree structure already shows grouping), comments, semicolons, syntax markers.
The AST represents "what the code means", not "exactly how it was typed".
You might hear the term "parse tree" (or "concrete syntax tree"). A parse tree keeps everything - every parenthesis, every semicolon, every piece of syntax. It's a direct representation of how the grammar rules were applied.
An AST strips that down to just the meaning. For (2 + 3), the parse tree would have nodes for the left paren, the expression, and the right paren. The AST just has the addition node with its two children.
For compilers, ASTs are what you want. They're simpler to work with and carry exactly the information you need. Parse trees are useful for tools like code formatters that need to preserve the original syntax.
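If you want to picture the difference, here's a tiny illustration using plain tuples. The shapes and names are made up just for this comparison - neither is how YFC actually represents anything:

# Parse tree (concrete syntax tree) for "(2 + 3)": one entry per token,
# parentheses and all.
parse_tree = (
    "expr",
    [
        ("lparen", "("),
        ("expr", [("number", "2"), ("plus", "+"), ("number", "3")]),
        ("rparen", ")"),
    ],
)

# AST for the same input: just the addition and its two children.
ast = ("add", ("number", 2), ("number", 3))

print(parse_tree)
print(ast)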
An AST gives you a data structure you can actually work with. You can't easily annotate a string of source code, but you can attach type information to AST nodes. You can't transform source code without parsing it, but you can walk an AST and rewrite parts of it.
Some practical benefits: you can annotate nodes with type information, you can transform one part of the tree without touching the rest (see the sketch below), and because each node holds its tokens, error messages can point at the exact spot in the source.
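Here's a rough sketch of that "transform the tree" idea - walking a tree and rewriting part of it. It again uses throwaway dataclasses rather than YFC's real node classes, and folds any subtree whose operands are both literal numbers:

from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def fold_constants(node):
    """Rewrite bottom-up: a constant subtree collapses into a single Number node."""
    if isinstance(node, Number):
        return node
    left = fold_constants(node.left)
    right = fold_constants(node.right)
    if isinstance(left, Number) and isinstance(right, Number):
        value = left.value + right.value if node.op == "+" else left.value * right.value
        return Number(value)
    return BinaryOp(node.op, left, right)

# 2 + 3 * 4 folds all the way down to 14, because every operand is a literal
tree = BinaryOp("+", Number(2), BinaryOp("*", Number(3), Number(4)))
print(fold_constants(tree))   # Number(value=14)

Try doing that rewrite directly on a string of source code and you'll see why compilers work on trees instead.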
Parsing is the process of taking a flat sequence of tokens and building a tree from them. The parser reads tokens one by one and figures out how they relate to each other - what's nested inside what, what operates on what.
This is the second major phase of a compiler. The lexer gives us tokens. The parser gives us a tree. Everything after that works on the tree.
There are many ways to build a parser. Here are the common ones:
Recursive Descent: Write a function for each grammar rule. Functions call each other recursively. Simple, no external tools, easy to understand and debug.
Parser Generators (yacc, bison, ANTLR): Write a formal grammar, generate parser code automatically. Powerful for complex languages, but adds tooling and makes debugging harder.
Pratt Parsing (Precedence Climbing): A particularly elegant way to parse expressions with operators. Uses "binding power" to handle precedence. We'll use this for our expression parser.
Parser Combinators: Build parsers by composing small parsers together. Each combinator handles one piece, and you snap them together like building blocks. Popular in functional languages (Haskell's Parsec, Rust's nom).
For YFC, we're using recursive descent with precedence climbing. Recursive descent for the overall structure (functions, if-expressions), and precedence climbing for expressions with operators. It's the approach most hand-written compilers use because it's straightforward and gives you full control.
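We'll build the real parser in the next part, but here's a minimal preview of what precedence climbing looks like, just to demystify "binding power." This sketch is not YFC's parser: it works on plain (kind, value) tuples instead of the lexer's Token objects, and it builds nested tuples instead of AST nodes:

BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}    # higher number = binds tighter

def parse_expression(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] into a nested tuple, respecting operator precedence."""
    kind, value = tokens[pos]
    if kind == "number":
        left, pos = ("number", value), pos + 1
    elif kind == "lparen":
        left, pos = parse_expression(tokens, pos + 1)   # whatever is inside the parens
        pos += 1                                        # skip the ')'
    else:
        raise SyntaxError(f"unexpected token {kind!r}")

    # Keep absorbing operators as long as they bind at least as tightly as min_bp.
    while pos < len(tokens) and tokens[pos][0] == "op":
        op = tokens[pos][1]
        bp = BINDING_POWER[op]
        if bp < min_bp:
            break
        right, pos = parse_expression(tokens, pos + 1, bp + 1)
        left = ("binary", op, left, right)              # deeper nesting = evaluated first

    return left, pos

# 2 + 3 * 4: the * ends up nested inside the +, exactly like the trees above
tokens = [("number", 2), ("op", "+"), ("number", 3), ("op", "*"), ("number", 4)]
tree, _ = parse_expression(tokens)
print(tree)   # ('binary', '+', ('number', 2), ('binary', '*', ('number', 3), ('number', 4)))

Don't worry if this doesn't fully click yet - the next part walks through it step by step, on real tokens and real AST nodes.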
For YFC, we'll create a file called abstract.py and define AST nodes as Python classes:
from lexer import Token, TokenType


class Expression:
    """Base class for all expression nodes."""
    pass


class NumberExpr(Expression):
    """Represents a numeric literal like 42 or 3.14"""
    def __init__(self, value: Token):
        self.value = value


class BinaryOp(Expression):
    """Represents a binary operation like 2 + 3"""
    def __init__(self, left: Expression, op: Token, right: Expression):
        self.left = left
        self.op = op
        self.right = right
Let's test our AST design by manually building a tree for 2 + 3 * 4:
if __name__ == "__main__":
    # Manually build AST for: 2 + (3 * 4)
    ast = BinaryOp(
        left=NumberExpr(Token(TokenType.Number, "2")),
        op=Token(TokenType.Plus, "+"),
        right=BinaryOp(
            left=NumberExpr(Token(TokenType.Number, "3")),
            op=Token(TokenType.Mul, "*"),
            right=NumberExpr(Token(TokenType.Number, "4"))
        )
    )

    print("AST for: 2 + 3 * 4")
    print(f"Root operation: {ast.op.value}")
    print(f"  Left: {ast.left.value.value}")
    print(f"  Right operation: {ast.right.op.value}")
    print(f"    Left: {ast.right.left.value.value}")
    print(f"    Right: {ast.right.right.value.value}")
Run it:
$ python abstract.py
AST for: 2 + 3 * 4
Root operation: +
  Left: 2
  Right operation: *
    Left: 3
    Right: 4
See how the multiplication is nested under the addition? That's the precedence captured in the tree structure. We built this manually, but in the next part, the parser will build these trees automatically from tokens.
Now you understand what ASTs are and why we need them. Programs are trees, not lists. The tree structure captures relationships, precedence, and nesting. The AST is what we'll actually work with throughout the rest of the compiler.
In the next part, we'll build the parser. We'll write code that reads tokens and constructs an AST, handling operator precedence, parentheses, and nested expressions along the way.
Next: Building the Parser | Previous: Lexical Analysis
Code Repository: All accompanying code for this tutorial series is available on GitHub.