Frederik Gram Kortegaard / Your First Compiler

Part 6: Extending the Language

We have a parser that handles arithmetic expressions. Numbers, operators, parentheses. It works. But we're not building a calculator - we're building a compiler for a real programming language.

Time to add the rest. Functions, variables, conditionals. Everything we need to write actual programs.

Adding Identifiers

The first thing we need is identifiers - names for variables and functions. When you write factorial(n), both factorial and n are identifiers.

An identifier is a sequence of letters, digits, and underscores. But it can't start with a digit.
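That rule can be written as a regular expression. A minimal standalone sketch, just to check the rule (not part of the lexer itself):

```python
import re

# The identifier rule as a regex: a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_identifier(word: str) -> bool:
    return IDENT_RE.fullmatch(word) is not None
```

So is_identifier("factorial") and is_identifier("_tmp1") are True, while is_identifier("2fast") is False because it starts with a digit.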

Lexing identifiers:

class TokenType(Enum):
    # ... existing tokens ...
    Identifier = auto()  # New!

In the lexer's main loop, add identifier recognition:

# Recognize identifiers
if self.current.isalpha() or self.current == '_':
    start = self.cursor
    while self.current.isalnum() or self.current == '_':
        self.increment()

    word = self.source[start:self.cursor]
    self.push_token(TokenType.Identifier, word)
    continue

Parsing identifiers:

On the AST side, we need a new node type:

class IdentifierExpr(Expression):
    """Represents a variable reference like x or n"""
    def __init__(self, token: Token):
        self.token = token

And in parse_primary():

case TokenType.Identifier:
    token = self.consume()
    return IdentifierExpr(token)

Keywords

We need keywords: func, if, else, extern. But keywords look just like identifiers to the lexer. The solution: lex everything that looks like an identifier, then check if it's a keyword.

KEYWORDS = {
    "func": TokenType.Func,
    "if": TokenType.If,
    "else": TokenType.Else,
    "extern": TokenType.Extern,
}

Update the identifier lexing code to check this table:

if self.current.isalpha() or self.current == '_':
    start = self.cursor
    while self.current.isalnum() or self.current == '_':
        self.increment()

    word = self.source[start:self.cursor]

    # Check if it's a keyword
    token_type = KEYWORDS.get(word, TokenType.Identifier)
    self.push_token(token_type, word)
    continue
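The lookup-with-default does all the work. A quick standalone check of the idea, using strings in place of the real TokenType members:

```python
# Sketch: keyword table with Identifier as the fallback.
KEYWORDS = {"func": "Func", "if": "If", "else": "Else", "extern": "Extern"}

def classify(word: str) -> str:
    # dict.get returns the default when the word isn't in the table
    return KEYWORDS.get(word, "Identifier")

classify("if")    # "If"
classify("iffy")  # "Identifier" - only exact matches are keywords
```

Note that "iffy" is still an identifier: the lexer scans the whole word first, so a keyword prefix never splits a longer name.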

More Operators

A real language needs more than arithmetic. We need comparison operators (<, >, ==), a colon to mark the start of a function body, and a comma to separate arguments.

Most of these are single-character tokens, so they go in the existing lookup table:

single_char_tokens = {
    '(': TokenType.LParen,
    ')': TokenType.RParen,
    '+': TokenType.Plus,
    '-': TokenType.Minus,
    '*': TokenType.Mul,
    '/': TokenType.Div,
    ':': TokenType.Colon,
    ',': TokenType.Comma,
    '<': TokenType.LessThan,
    '>': TokenType.GreaterThan,
}

But == is tricky - it's two characters. When we see =, we need to look ahead to check if the next character is also =:

if self.current == '=':
    self.increment()
    if self.current == '=':
        self.push_token(TokenType.Equal, "==")
        self.increment()
        continue
    raise Exception("Unexpected character: '='. Did you mean '=='?")

This is a common pattern in lexers - lookahead for multi-character tokens.
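The general idea is sometimes called "maximal munch": always try the longest operator first. A hypothetical standalone sketch of the pattern (the token names are illustrative strings, not the real TokenType):

```python
# Two-character operators are checked before one-character ones.
TWO_CHAR = {'==': 'Equal'}
ONE_CHAR = {'<': 'LessThan', '>': 'GreaterThan', '+': 'Plus'}

def scan_operator(src: str, i: int):
    """Return (token_type, next_index), preferring the longest match."""
    if src[i:i + 2] in TWO_CHAR:
        return TWO_CHAR[src[i:i + 2]], i + 2
    if src[i] in ONE_CHAR:
        return ONE_CHAR[src[i]], i + 1
    raise SyntaxError(f"Unexpected character: {src[i]!r}")

scan_operator("==", 0)  # ('Equal', 2)
scan_operator("<", 0)   # ('LessThan', 1)
```

A lone '=' falls through both tables and raises, matching the lexer's behavior above.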

Comments

Comments start with # and run to the end of the line:

# Skip comments
if self.current == '#':
    while self.current != '\n' and self.current != '\0':
        self.increment()
    continue

Simple. When we see #, eat everything until a newline or end of file.
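Stated outside the lexer, the rule is: everything from # to the end of the line is ignored. A hypothetical helper, just to illustrate the rule on a single line of source:

```python
def strip_comment(line: str) -> str:
    """Drop everything from '#' to the end of the line."""
    i = line.find('#')
    return line if i == -1 else line[:i]

strip_comment("n * factorial(n - 1)  # recurse")  # "n * factorial(n - 1)  "
```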

Updating the Precedence Map

With new operators, we need to update the precedence table. Comparison operators should have lower precedence than arithmetic:

PRECEDENCE_MAP = {
    TokenType.Equal: 10,
    TokenType.LessThan: 10,
    TokenType.GreaterThan: 10,
    TokenType.Plus: 20,
    TokenType.Minus: 20,
    TokenType.Mul: 40,
    TokenType.Div: 40,
}

This means 2 + 3 < 10 parses as (2 + 3) < 10, which is what you'd expect. The gaps between numbers (10, 20, 40) let us slot in new operators later without renumbering.
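To see the table in action, here is a minimal standalone precedence-climbing sketch over a flat token list. String operators stand in for the real tokens; the actual parser does the same thing with TokenTypes and AST nodes:

```python
# Same precedence values as PRECEDENCE_MAP, keyed by operator string.
PREC = {'==': 10, '<': 10, '>': 10, '+': 20, '-': 20, '*': 40, '/': 40}

def parse(tokens, pos=0, min_prec=0):
    """Return (ast, next_pos) for a token list like [2, '+', 3, '<', 10]."""
    left, pos = tokens[pos], pos + 1
    while pos < len(tokens) and PREC.get(tokens[pos], -1) >= min_prec:
        op = tokens[pos]
        # Parse the right-hand side with a higher minimum precedence,
        # so tighter-binding operators group first.
        right, pos = parse(tokens, pos + 1, PREC[op] + 1)
        left = (op, left, right)
    return left, pos

ast, _ = parse([2, '+', 3, '<', 10])
# ast == ('<', ('+', 2, 3), 10): the '+' binds tighter than '<'
```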

Function Calls

Now let's parse function calls like factorial(5).

When we see an identifier, we don't immediately know what it is. Is factorial a variable, or is it a function call? We need to look ahead. If the next token is (, it's a function call.

class CallExpr(Expression):
    def __init__(self, callee: IdentifierExpr, arguments: list[Expression]):
        self.callee = callee
        self.arguments = arguments

In parse_primary(), update the identifier case:

case TokenType.Identifier:
    token = self.consume()
    ident = IdentifierExpr(token)

    # Just an identifier
    if self.current._type != TokenType.LParen:
        return ident

    # It's a function call
    self.consume()  # eat '('
    arguments: list[Expression] = []

    while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
        arguments.append(self.parse_expression())
        if self.current._type == TokenType.Comma:
            self.consume()

    self.consume_assert(TokenType.RParen, "Expected ')' after function arguments")
    return CallExpr(ident, arguments)
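Notice that the loop never has to track nesting: the recursive call to parse_expression consumes any commas inside nested calls like f(g(1, 2), 3) before control returns here. To see why nesting matters, here is a hypothetical flat-splitting helper that has to track parenthesis depth explicitly, which the recursive approach avoids:

```python
def split_args(tokens):
    """Split a flat token list on top-level commas only."""
    args, current, depth = [], [], 0
    for t in tokens:
        if t == '(':
            depth += 1
        elif t == ')':
            depth -= 1
        if t == ',' and depth == 0:
            args.append(current)
            current = []
        else:
            current.append(t)
    if current:
        args.append(current)
    return args

split_args(['1', ',', 'f', '(', '2', ',', '3', ')'])
# [['1'], ['f', '(', '2', ',', '3', ')']] - the inner comma stays put
```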

If Expressions

In YFC, if is an expression that evaluates to either the then-branch or the else-branch:

if n < 2
    1
else
    n * factorial(n - 1)

Add the IfExpr node:

class IfExpr(Expression):
    def __init__(
        self,
        condition: Expression,
        then_expr: Expression,
        else_expr: Expression | None,
    ):
        self.condition = condition
        self.then_expr = then_expr
        self.else_expr = else_expr  # None if no else clause

Add the if-expression case to parse_primary():

case TokenType.If:
    self.consume()  # eat 'if'

    condition = self.parse_expression()
    then_expr = self.parse_expression()
    else_expr = None

    if self.current._type == TokenType.Else:
        self.consume()
        else_expr = self.parse_expression()

    return IfExpr(condition, then_expr, else_expr)

Spans

As we add more features, error reporting becomes critical. We need to know not just that something went wrong, but where. A span tracks the start and end location of an AST node in the source code.

@dataclass
class Span:
    start: Location
    end: Location

We add a span field to every AST node. When we construct a node, we record where it started and ended in the source. Later, when the compiler encounters an error (type mismatch, undefined variable, etc.), it can point to the exact range of characters that caused the problem.

For example, a NumberExpr gets its span from the token's location. A BinaryOp spans from the start of the left operand to the end of the right operand. An IfExpr spans from the if keyword to the end of the else branch.
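A minimal sketch of how spans combine, assuming Location carries a line and column (the fields of the tutorial's actual Location may differ):

```python
from dataclasses import dataclass

@dataclass
class Location:
    line: int
    column: int

@dataclass
class Span:
    start: Location
    end: Location

# A binary op spans from its left operand's start to its right operand's end.
def merge(left: Span, right: Span) -> Span:
    return Span(left.start, right.end)

lhs = Span(Location(1, 0), Location(1, 1))   # span of '2' in '2 + 30'
rhs = Span(Location(1, 4), Location(1, 6))   # span of '30'
merge(lhs, rhs)  # Span(start=Location(1, 0), end=Location(1, 6))
```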

Function Definitions

Now for functions themselves. Functions aren't expressions - they're named chunks of code. They sit at the top level of a program.

class Function:
    def __init__(
        self,
        span: Span,
        name: str,
        parameters: list[Token],
        body: Expression | None,
    ):
        self.span = span
        self.name = name
        self.parameters = parameters
        self.body = body  # None for extern functions

Functions have the grammar: func name(params) : body

def parse_function(self, is_extern: bool = False) -> Function:
    """Parse a function definition."""
    start = self.consume().location  # eat 'func'

    name = self.consume_assert(TokenType.Identifier, "Expected function name")
    self.consume_assert(TokenType.LParen, "Expected '(' after function name")

    # Parse parameters
    parameters: list[Token] = []
    while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
        param = self.consume_assert(TokenType.Identifier, "Expected parameter name")
        parameters.append(param)
        if self.current._type == TokenType.Comma:
            self.consume()

    self.consume_assert(TokenType.RParen, "Expected ')' after parameters")
    end = self.consume_assert(TokenType.Colon, "Expected ':' before function body").location

    # Parse body (None for extern functions)
    body = None if is_extern else self.parse_expression()

    return Function(
        Span(start, body.span.end if body is not None else end),
        name.value,
        parameters,
        body,
    )

A Note on Extern Functions

Notice that parse_function takes an is_extern parameter. Extern functions are declared but not defined - they have no body. The syntax is: extern func putchard(char) :

Why do we need these? Because our language needs to call outside code. If you want to print output, you need a function like putchard that's implemented in C or assembly. The extern declaration tells the compiler "this function exists somewhere else - trust me, just call it."

When is_extern is true, we skip parsing the body and set it to None. The compiler will later expect to find this function at link time.

Top-Level Parsing

We need a method that parses an entire program. A program is a sequence of function definitions and top-level expressions:

def parse(self) -> list[Function]:
    """Parse the entire program."""
    functions = []

    while self.current._type != TokenType.Eof:
        if self.current._type == TokenType.Func:
            functions.append(self.parse_function())
        elif self.current._type == TokenType.Extern:
            self.consume()  # eat 'extern'
            functions.append(self.parse_function(is_extern=True))
        else:
            # Top-level expression - wrap it in an anonymous function
            expr = self.parse_expression()
            func = Function(expr.span, f"__anon_{len(functions)}", [], expr)
            functions.append(func)

    return functions

The interesting bit is what happens with top-level expressions. If you just write factorial(5) at the top level, it's not inside any function. We wrap it in an anonymous function (named __anon_0, __anon_1, etc.) so the rest of the compiler can treat everything uniformly as functions.

Testing the Full Parser

if __name__ == "__main__":
    from lexer import Lexer

    source = """
    # Test the full parser
    func factorial(n) :
        if n < 2
            1
        else
            n * factorial(n - 1)

    func fib(n) :
        if n < 2
            n
        else
            fib(n - 1) + fib(n - 2)

    # Top-level expression
    factorial(5)
    """

    lexer = Lexer(source)
    tokens = lexer.lex()
    parser = Parser(tokens)
    functions = parser.parse()

    print("\nParsed functions:")
    for func in functions:
        print(f"\nFunction: {func.name}")
        print(f"  Parameters: {[p.value for p in func.parameters]}")
        print(f"  Has body: {func.body is not None}")

Run it:

$ python parser.py

Parsed functions:

Function: factorial
  Parameters: ['n']
  Has body: True

Function: fib
  Parameters: ['n']
  Has body: True

Function: __anon_2
  Parameters: []
  Has body: True

Three functions parsed. factorial and fib are the named functions we defined. __anon_2 is the top-level expression factorial(5) wrapped in an anonymous function. The index is 2 because it's the third item (after factorial at index 0 and fib at index 1).

Understanding the Compiler Structure

We've now built the entire frontend of our compiler. The frontend is responsible for understanding the source code - turning text into a structured representation the rest of the compiler can work with.

Here's where things stand: the lexer turns source text into a stream of tokens, and the parser turns those tokens into an AST. That pipeline is the frontend.

The backend is everything that comes after: taking the AST and turning it into something that runs. That means code generation (emitting assembly or machine code), optimization passes, register allocation, and linking.

The clean separation between frontend and backend is important. The frontend knows about YFC syntax but nothing about x86. The backend knows about x86 but nothing about YFC syntax. The AST is the interface between them.

What We've Built

We started with arithmetic. Now we can parse identifiers, keywords, comparison operators, comments, function calls, if expressions, function definitions (including extern declarations), and top-level expressions.

The frontend is complete. We can take YFC source code and turn it into an AST.


Next: Taking Stock | Previous: Building the Parser

Code Repository: All accompanying code for this tutorial series is available on GitHub.