Frederik Gram Kortegaard / Your First Compiler

Part 5: Building the Parser

We have tokens from the lexer. We have AST node definitions. Now we need to connect them - turn that flat list of tokens into a tree structure.

Let's build a parser.

Starting Simple

Just like with the lexer, we'll start simple. For now, we'll handle:

That's enough to parse expressions like (2 + 3) * 4. Once this works, adding functions and other features is straightforward.

The Parser Structure

Let's create a file called parser.py. A parser needs to track where it is in the token list:

from lexer import Token, TokenType
from abstract import Expression, NumberExpr, BinaryOp

class ParserError(Exception):
    pass

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.cursor = 0

    @property
    def current(self) -> Token:
        """Get the current token."""
        if 0 <= self.cursor < len(self.tokens):
            return self.tokens[self.cursor]
        return self.tokens[-1]  # EOF

    def consume(self) -> Token:
        """Consume and return the current token."""
        if 0 <= self.cursor < len(self.tokens):
            token = self.tokens[self.cursor]
        else:
            token = self.tokens[-1]
        self.cursor += 1
        return token

Parsing Primary Expressions

When you see 2 + 3, there are two things being added: 2 and 3. Parser terminology calls these "primary" expressions - the things operators operate on.

def parse_primary(self) -> Expression:
    """Parse a primary expression."""
    match self.current._type:
        # Grammar: '(' expression ')'
        case TokenType.LParen:
            self.consume()
            expr = self.parse_expression()
            if self.current._type != TokenType.RParen:
                raise ParserError(f"Expected ')' but got {self.current._type}")
            self.consume()
            return expr

        # Grammar: NUMBER
        case TokenType.Number:
            token = self.consume()
            return NumberExpr(token)

        case _:
            raise ParserError(f"Unexpected token: {self.current._type}")

The Precedence Problem

When we see 2 + 3 * 4, we need to build the right tree. If we naively parse left to right, we'd get:

      *
     / \
    +   4
   / \
  2   3

That's (2 + 3) * 4 = 20. Wrong. We need:

      +
     / \
    2   *
       / \
      3   4

That's 2 + (3 * 4) = 14. Correct. Multiplication binds tighter than addition, so it needs to be deeper in the tree. We need to encode that in our parser.

Operator Precedence Table

We assign each operator a number - higher number means higher precedence:

# Operator precedence - higher number = higher precedence
PRECEDENCE_MAP = {
    TokenType.Plus: 20,
    TokenType.Minus: 20,
    TokenType.Mul: 40,
    TokenType.Div: 40,
}

Why 20 and 40 instead of 1 and 2? Because it gives us room to add operators in between later. When we add comparison operators like < and >, we can slot them in at 10 without renumbering everything.

Precedence Climbing

The algorithm is called "precedence climbing". Here's the core idea: after parsing the immediate right side of an operator, we check what's coming next. If the next operator has higher precedence, we don't return yet - we keep extending the right side.

def parse_binary_op(self, left: Expression, op: Token) -> Expression:
    """Parse a binary operation using precedence climbing."""
    # Get the precedence of the current operator
    current_precedence = PRECEDENCE_MAP.get(op._type, 0)

    # Parse the immediate right-hand side
    right = self.parse_primary()

    # Look ahead: if next operator has higher precedence, extend the right side
    while PRECEDENCE_MAP.get(self.current._type, 0) > current_precedence:
        next_op = self.consume()
        right = self.parse_binary_op(right, next_op)

    # Build the binary operation node
    return BinaryOp(left, op, right)

Let's walk through 2 + 3 * 4 step by step to see how this works:

  1. parse_expression() calls parse_primary(), which consumes 2 and returns NumberExpr(2).
  2. The current token is +. It's in the precedence map, so we consume it and call parse_binary_op(NumberExpr(2), Plus).
  3. Inside parse_binary_op: current_precedence is 20 (for +). We call parse_primary(), which consumes 3 and returns NumberExpr(3). So right = NumberExpr(3).
  4. Now we look ahead. The next token is * with precedence 40. Is 40 > 20? Yes! So we don't return yet.
  5. We consume * and recursively call parse_binary_op(NumberExpr(3), Mul).
  6. In the recursive call: current_precedence is 40. We parse 4 as the right side. Next token is Eof with precedence 0. Is 0 > 40? No. So we return BinaryOp(NumberExpr(3), Mul, NumberExpr(4)).
  7. Back in the outer call, right is now BinaryOp(3, *, 4). No more higher-precedence operators, so we return BinaryOp(NumberExpr(2), Plus, BinaryOp(3, *, 4)).

The multiplication ended up nested deeper in the tree - exactly what we wanted. The key insight is that the while loop only continues if the next operator binds tighter than the current one.

The Entry Point

def parse_expression(self) -> Expression:
    """Parse an expression."""
    left = self.parse_primary()

    while PRECEDENCE_MAP.get(self.current._type):
        op = self.consume()
        left = self.parse_binary_op(left, op)

    return left

Testing It

if __name__ == "__main__":
    from lexer import Lexer

    # Test operator precedence: 2 + 3 * 4
    source = "2 + 3 * 4"
    lexer = Lexer(source)
    tokens = lexer.lex()
    parser = Parser(tokens)
    ast = parser.parse_expression()

    print(f"Expression: {source}")
    print(f"Root: {ast.op.value}")
    print(f"  Left: {ast.left.value.value}")
    print(f"  Right: BinaryOp {ast.right.op.value}")
    print(f"    Left: {ast.right.left.value.value}")
    print(f"    Right: {ast.right.right.value.value}")

Run it:

$ python parser.py
Expression: 2 + 3 * 4
Root: +
  Left: 2
  Right: BinaryOp *
    Left: 3
    Right: 4

Perfect! The multiplication is nested under the addition, which means it gets evaluated first. Our precedence climbing algorithm is doing its job.

What We've Built

We now have a working parser that handles numbers, binary operations with proper precedence, and parentheses for grouping. It even reports errors when it sees unexpected tokens.

The precedence climbing algorithm scales nicely too. When we add comparison operators like < and > later, we just drop them in the precedence table with a lower number.

What's Next

We've got a working parser for arithmetic expressions. The foundation is solid. Next up, we'll extend it to handle the full YFC language - function definitions, function calls, conditionals, and more.


Next: Extending the Language | Previous: Abstract Syntax Trees

Code Repository: All accompanying code for this tutorial series is available on GitHub.