Frederik Gram Kortegaard / Your First Compiler
We have tokens from the lexer. We have AST node definitions. Now we need to connect them - turn that flat list of tokens into a tree structure.
Let's build a parser.
Just like with the lexer, we'll start simple. For now, we'll handle:
+, -, *, /)That's enough to parse expressions like (2 + 3) * 4. Once this works, adding functions and other features is straightforward.
Let's create a file called parser.py. A parser needs to track where it is in the token list:
from lexer import Token, TokenType
from abstract import Expression, NumberExpr, BinaryOp
class ParserError(Exception):
pass
class Parser:
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.cursor = 0
@property
def current(self) -> Token:
"""Get the current token."""
if 0 <= self.cursor < len(self.tokens):
return self.tokens[self.cursor]
return self.tokens[-1] # EOF
def consume(self) -> Token:
"""Consume and return the current token."""
if 0 <= self.cursor < len(self.tokens):
token = self.tokens[self.cursor]
else:
token = self.tokens[-1]
self.cursor += 1
return token
When you see 2 + 3, there are two things being added: 2 and 3. Parser terminology calls these "primary" expressions - the things operators operate on.
def parse_primary(self) -> Expression:
"""Parse a primary expression."""
match self.current._type:
# Grammar: '(' expression ')'
case TokenType.LParen:
self.consume()
expr = self.parse_expression()
if self.current._type != TokenType.RParen:
raise ParserError(f"Expected ')' but got {self.current._type}")
self.consume()
return expr
# Grammar: NUMBER
case TokenType.Number:
token = self.consume()
return NumberExpr(token)
case _:
raise ParserError(f"Unexpected token: {self.current._type}")
When we see 2 + 3 * 4, we need to build the right tree. If we naively parse left to right, we'd get:
*
/ \
+ 4
/ \
2 3
That's (2 + 3) * 4 = 20. Wrong. We need:
+
/ \
2 *
/ \
3 4
That's 2 + (3 * 4) = 14. Correct. Multiplication binds tighter than addition, so it needs to be deeper in the tree. We need to encode that in our parser.
We assign each operator a number - higher number means higher precedence:
# Operator precedence - higher number = higher precedence
PRECEDENCE_MAP = {
TokenType.Plus: 20,
TokenType.Minus: 20,
TokenType.Mul: 40,
TokenType.Div: 40,
}
Why 20 and 40 instead of 1 and 2? Because it gives us room to add operators in between later. When we add comparison operators like < and >, we can slot them in at 10 without renumbering everything.
The algorithm is called "precedence climbing". Here's the core idea: after parsing the immediate right side of an operator, we check what's coming next. If the next operator has higher precedence, we don't return yet - we keep extending the right side.
def parse_binary_op(self, left: Expression, op: Token) -> Expression:
"""Parse a binary operation using precedence climbing."""
# Get the precedence of the current operator
current_precedence = PRECEDENCE_MAP.get(op._type, 0)
# Parse the immediate right-hand side
right = self.parse_primary()
# Look ahead: if next operator has higher precedence, extend the right side
while PRECEDENCE_MAP.get(self.current._type, 0) > current_precedence:
next_op = self.consume()
right = self.parse_binary_op(right, next_op)
# Build the binary operation node
return BinaryOp(left, op, right)
Let's walk through 2 + 3 * 4 step by step to see how this works:
parse_expression() calls parse_primary(), which consumes 2 and returns NumberExpr(2).+. It's in the precedence map, so we consume it and call parse_binary_op(NumberExpr(2), Plus).parse_binary_op: current_precedence is 20 (for +). We call parse_primary(), which consumes 3 and returns NumberExpr(3). So right = NumberExpr(3).* with precedence 40. Is 40 > 20? Yes! So we don't return yet.* and recursively call parse_binary_op(NumberExpr(3), Mul).current_precedence is 40. We parse 4 as the right side. Next token is Eof with precedence 0. Is 0 > 40? No. So we return BinaryOp(NumberExpr(3), Mul, NumberExpr(4)).right is now BinaryOp(3, *, 4). No more higher-precedence operators, so we return BinaryOp(NumberExpr(2), Plus, BinaryOp(3, *, 4)).The multiplication ended up nested deeper in the tree - exactly what we wanted. The key insight is that the while loop only continues if the next operator binds tighter than the current one.
def parse_expression(self) -> Expression:
"""Parse an expression."""
left = self.parse_primary()
while PRECEDENCE_MAP.get(self.current._type):
op = self.consume()
left = self.parse_binary_op(left, op)
return left
if __name__ == "__main__":
from lexer import Lexer
# Test operator precedence: 2 + 3 * 4
source = "2 + 3 * 4"
lexer = Lexer(source)
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse_expression()
print(f"Expression: {source}")
print(f"Root: {ast.op.value}")
print(f" Left: {ast.left.value.value}")
print(f" Right: BinaryOp {ast.right.op.value}")
print(f" Left: {ast.right.left.value.value}")
print(f" Right: {ast.right.right.value.value}")
Run it:
$ python parser.py
Expression: 2 + 3 * 4
Root: +
Left: 2
Right: BinaryOp *
Left: 3
Right: 4
Perfect! The multiplication is nested under the addition, which means it gets evaluated first. Our precedence climbing algorithm is doing its job.
We now have a working parser that handles numbers, binary operations with proper precedence, and parentheses for grouping. It even reports errors when it sees unexpected tokens.
The precedence climbing algorithm scales nicely too. When we add comparison operators like < and > later, we just drop them in the precedence table with a lower number.
We've got a working parser for arithmetic expressions. The foundation is solid. Next up, we'll extend it to handle the full YFC language - function definitions, function calls, conditionals, and more.
Next: Extending the Language | Previous: Abstract Syntax Trees
Code Repository: All accompanying code for this tutorial series is available on GitHub.