Frederik Gram Kortegaard / Your First Compiler
We have a parser that handles arithmetic expressions. Numbers, operators, parentheses. It works. But we're not building a calculator - we're building a compiler for a real programming language.
Time to add the rest. Functions, variables, conditionals. Everything we need to write actual programs.
The first thing we need is identifiers - names for variables and functions. When you write factorial(n), both factorial and n are identifiers.
An identifier is a sequence of letters, digits, and underscores. But it can't start with a digit.
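As a sanity check, the same rule can be written as a regular expression. This is not part of the lexer, just a way to confirm the rule. Note one subtlety: the lexer's isalpha()/isalnum() checks also accept non-ASCII letters, while the pattern below is restricted to ASCII for clarity:

```python
import re

# [A-Za-z_] first, then any number of [A-Za-z0-9_]
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_identifier(s: str) -> bool:
    return IDENT_RE.match(s) is not None

print(is_identifier("factorial"))  # True
print(is_identifier("_tmp1"))      # True
print(is_identifier("2fast"))      # False - starts with a digit
```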
Lexing identifiers:
class TokenType(Enum):
    # ... existing tokens ...
    Identifier = auto()  # New!
In the lexer's main loop, add identifier recognition:
# Recognize identifiers
if self.current.isalpha() or self.current == '_':
    start = self.cursor
    while self.current.isalnum() or self.current == '_':
        self.increment()
    word = self.source[start:self.cursor]
    self.push_token(TokenType.Identifier, word)
    continue
Parsing identifiers:
On the AST side, we need a new node type:
class IdentifierExpr(Expression):
    """Represents a variable reference like x or n"""

    def __init__(self, token: Token):
        self.token = token
And in parse_primary():
case TokenType.Identifier:
    token = self.consume()
    return IdentifierExpr(token)
We need keywords: func, if, else, extern. But keywords look just like identifiers to the lexer. The solution: lex everything that looks like an identifier, then check if it's a keyword.
KEYWORDS = {
    "func": TokenType.Func,
    "if": TokenType.If,
    "else": TokenType.Else,
    "extern": TokenType.Extern,
}
Update the identifier lexing code to check this table:
if self.current.isalpha() or self.current == '_':
    start = self.cursor
    while self.current.isalnum() or self.current == '_':
        self.increment()
    word = self.source[start:self.cursor]
    # Check if it's a keyword
    token_type = KEYWORDS.get(word, TokenType.Identifier)
    self.push_token(token_type, word)
    continue
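The dict lookup with a default does all the work. It's worth seeing the fallback in isolation; the TokenType enum below is a minimal stand-in for the real one:

```python
from enum import Enum, auto

class TokenType(Enum):
    Identifier = auto()
    Func = auto()
    If = auto()
    Else = auto()
    Extern = auto()

KEYWORDS = {
    "func": TokenType.Func,
    "if": TokenType.If,
    "else": TokenType.Else,
    "extern": TokenType.Extern,
}

# Exact matches become keywords; everything else stays an identifier.
# Because the lexer consumes the whole word first, "iffy" can never
# lex as the keyword 'if' followed by the identifier 'fy'.
for word in ["func", "if", "iffy", "extern_helper"]:
    print(word, "->", KEYWORDS.get(word, TokenType.Identifier).name)
```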
A real language needs more than arithmetic. We need:
- : (colon) - for function body syntax: func foo(x) : x + 1
- , (comma) - for separating parameters and arguments
- < and > - comparison operators
- == - equality comparison

Most of these are single-character tokens, so they go in the existing lookup table:
single_char_tokens = {
    '(': TokenType.LParen,
    ')': TokenType.RParen,
    '+': TokenType.Plus,
    '-': TokenType.Minus,
    '*': TokenType.Mul,
    '/': TokenType.Div,
    ':': TokenType.Colon,
    ',': TokenType.Comma,
    '<': TokenType.LessThan,
    '>': TokenType.GreaterThan,
}
But == is tricky - it's two characters. When we see =, we need to look ahead to check if the next character is also =:
if self.current == '=':
    self.increment()
    if self.current == '=':
        self.push_token(TokenType.Equal, "==")
        self.increment()
        continue
    raise Exception("Unexpected character: '='. Did you mean '=='?")
This is a common pattern in lexers - lookahead for multi-character tokens.
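The same peek-one-character idea generalizes: try the longest operator first, then fall back. YFC only needs ==, but as a standalone sketch of the pattern (the extra operators <=, >=, != are illustrative, not part of YFC):

```python
def lex_operator(source: str, i: int) -> tuple[str, int]:
    """Return (token_text, next_index) for the operator at position i.

    Tries two-character operators before single characters,
    so the longest match always wins.
    """
    TWO_CHAR = {"==", "<=", ">=", "!="}
    if source[i:i + 2] in TWO_CHAR:
        return source[i:i + 2], i + 2
    return source[i], i + 1

print(lex_operator("== 3", 0))  # ('==', 2)
print(lex_operator("< 3", 0))   # ('<', 1)
```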
Comments start with # and run to the end of the line:
# Skip comments
if self.current == '#':
    while self.current != '\n' and self.current != '\0':
        self.increment()
    continue
Simple. When we see #, eat everything until a newline or end of file.
With new operators, we need to update the precedence table. Comparison operators should have lower precedence than arithmetic:
PRECEDENCE_MAP = {
    TokenType.Equal: 10,
    TokenType.LessThan: 10,
    TokenType.GreaterThan: 10,
    TokenType.Plus: 20,
    TokenType.Minus: 20,
    TokenType.Mul: 40,
    TokenType.Div: 40,
}
This means 2 + 3 < 10 parses as (2 + 3) < 10, which is what you'd expect. The gaps between numbers (10, 20, 40) let us slot in new operators later without renumbering.
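A quick way to convince yourself of the grouping is to compare precedences directly. Plain operator strings stand in for the TokenType keys here, purely for brevity:

```python
# Mirror of the table above, keyed by operator text.
PRECEDENCE = {"==": 10, "<": 10, ">": 10, "+": 20, "-": 20, "*": 40, "/": 40}

def binds_tighter(a: str, b: str) -> bool:
    """True if operator a grabs its operands before operator b does."""
    return PRECEDENCE[a] > PRECEDENCE[b]

print(binds_tighter("+", "<"))  # True: 2 + 3 < 10 groups as (2 + 3) < 10
print(binds_tighter("*", "+"))  # True: 2 + 3 * 4 groups as 2 + (3 * 4)
print(binds_tighter("<", "+"))  # False: comparisons bind loosest
```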
Now let's parse function calls like factorial(5).
When we see an identifier, we don't immediately know what it is. Is factorial a variable, or is it a function call? We need to look ahead. If the next token is (, it's a function call.
class CallExpr(Expression):
    def __init__(self, callee: IdentifierExpr, arguments: list[Expression]):
        self.callee = callee
        self.arguments = arguments
In parse_primary(), update the identifier case:
case TokenType.Identifier:
    token = self.consume()
    ident = IdentifierExpr(token)

    # Just an identifier
    if self.current._type != TokenType.LParen:
        return ident

    # It's a function call
    self.consume()  # eat '('
    arguments: list[Expression] = []
    while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
        arguments.append(self.parse_expression())
        if self.current._type == TokenType.Comma:
            self.consume()
    self.consume_assert(TokenType.RParen, "Expected ')' after function arguments")
    return CallExpr(ident, arguments)
In YFC, if is an expression that evaluates to either the then-branch or the else-branch:
if n < 2
    1
else
    n * factorial(n - 1)
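This makes if behave like Python's conditional expression rather than its if statement: the whole construct produces a value. A rough Python analogue of the body above:

```python
def factorial(n: int) -> int:
    # YFC:  if n < 2  1  else  n * factorial(n - 1)
    return 1 if n < 2 else n * factorial(n - 1)

print(factorial(5))  # 120
```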
Add the IfExpr node:
class IfExpr(Expression):
    def __init__(
        self,
        condition: Expression,
        then_expr: Expression,
        else_expr: Expression | None,
    ):
        self.condition = condition
        self.then_expr = then_expr
        self.else_expr = else_expr  # None if no else clause
Add the if-expression case to parse_primary():
case TokenType.If:
    self.consume()  # eat 'if'
    condition = self.parse_expression()
    then_expr = self.parse_expression()
    else_expr = None
    if self.current._type == TokenType.Else:
        self.consume()
        else_expr = self.parse_expression()
    return IfExpr(condition, then_expr, else_expr)
As we add more features, error reporting becomes critical. We need to know not just that something went wrong, but where. A span tracks the start and end location of an AST node in the source code.
@dataclass
class Span:
    start: Location
    end: Location
We add a span field to every AST node. When we construct a node, we record where it started and ended in the source. Later, when the compiler encounters an error (type mismatch, undefined variable, etc.), it can point to the exact range of characters that caused the problem.
For example, a NumberExpr gets its span from the token's location. A BinaryOp spans from the start of the left operand to the end of the right operand. An IfExpr spans from the if keyword to the end of the else branch.
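Combining child spans into a parent span is a one-liner once Span exists. The Location fields below are an assumption (a simple line/column pair); the real lexer may carry more, such as a byte offset or filename:

```python
from dataclasses import dataclass

@dataclass
class Location:
    line: int
    column: int

@dataclass
class Span:
    start: Location
    end: Location

def merge(left: Span, right: Span) -> Span:
    """Span covering both children, e.g. a BinaryOp's full extent."""
    return Span(left.start, right.end)

# '2 + 3' on line 1: the BinaryOp covers columns 0 through 5.
lhs = Span(Location(1, 0), Location(1, 1))  # the '2'
rhs = Span(Location(1, 4), Location(1, 5))  # the '3'
print(merge(lhs, rhs))
```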
Now for functions themselves. Functions aren't expressions - they're named chunks of code. They sit at the top level of a program.
class Function:
    def __init__(
        self,
        span: Span,
        name: str,
        parameters: list[Token],
        body: Expression | None,
    ):
        self.span = span
        self.name = name
        self.parameters = parameters
        self.body = body  # None for extern functions
Functions have the grammar: func name(params) : body
def parse_function(self, is_extern: bool = False) -> Function:
    """Parse a function definition."""
    start = self.consume().location  # eat 'func'
    name = self.consume_assert(TokenType.Identifier, "Expected function name")
    self.consume_assert(TokenType.LParen, "Expected '(' after function name")

    # Parse parameters
    parameters: list[Token] = []
    while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
        param = self.consume_assert(TokenType.Identifier, "Expected parameter name")
        parameters.append(param)
        if self.current._type == TokenType.Comma:
            self.consume()
    self.consume_assert(TokenType.RParen, "Expected ')' after parameters")
    end = self.consume_assert(TokenType.Colon, "Expected ':' before function body").location

    # Parse body (None for extern functions)
    body = None if is_extern else self.parse_expression()
    return Function(
        Span(start, body.span.end if body is not None else end),
        name.value,
        parameters,
        body,
    )
Notice that parse_function takes an is_extern parameter. Extern functions are declared but not defined - they have no body. The syntax is: extern func putchard(char) :
Why do we need these? Because our language needs to call outside code. If you want to print output, you need a function like putchard that's implemented in C or assembly. The extern declaration tells the compiler "this function exists somewhere else - trust me, just call it."
When is_extern is true, we skip parsing the body and set it to None. The compiler will later expect to find this function at link time.
We need a method that parses an entire program. A program is a sequence of function definitions and top-level expressions:
def parse(self) -> list[Function]:
    """Parse the entire program."""
    functions = []
    while self.current._type != TokenType.Eof:
        if self.current._type == TokenType.Func:
            functions.append(self.parse_function())
        elif self.current._type == TokenType.Extern:
            self.consume()  # eat 'extern'
            functions.append(self.parse_function(is_extern=True))
        else:
            # Top-level expression - wrap it in an anonymous function
            expr = self.parse_expression()
            func = Function(expr.span, f"__anon_{len(functions)}", [], expr)
            functions.append(func)
    return functions
The interesting bit is what happens with top-level expressions. If you just write factorial(5) at the top level, it's not inside any function. We wrap it in an anonymous function (named __anon_0, __anon_1, etc.) so the rest of the compiler can treat everything uniformly as functions.
if __name__ == "__main__":
    from lexer import Lexer

    source = """
    # Test the full parser
    func factorial(n) :
        if n < 2
            1
        else
            n * factorial(n - 1)

    func fib(n) :
        if n < 2
            n
        else
            fib(n - 1) + fib(n - 2)

    # Top-level expression
    factorial(5)
    """

    lexer = Lexer(source)
    tokens = lexer.lex()
    parser = Parser(tokens)
    functions = parser.parse()

    print("\nParsed functions:")
    for func in functions:
        print(f"\nFunction: {func.name}")
        print(f" Parameters: {[p.value for p in func.parameters]}")
        print(f" Has body: {func.body is not None}")
Run it:
$ python parser.py

Parsed functions:

Function: factorial
 Parameters: ['n']
 Has body: True

Function: fib
 Parameters: ['n']
 Has body: True

Function: __anon_2
 Parameters: []
 Has body: True
Three functions parsed. factorial and fib are the named functions we defined. __anon_2 is the top-level expression factorial(5) wrapped in an anonymous function. The index is 2 because it's the third item (after factorial at index 0 and fib at index 1).
We've now built the entire frontend of our compiler. The frontend is responsible for understanding the source code - turning text into a structured representation the rest of the compiler can work with.
Here's where things stand: the frontend - lexer, parser, and AST - is done. The backend is everything that comes after: taking the AST and turning it into something that runs. That means code generation (emitting assembly or machine code), optimization passes, register allocation, and linking.
The clean separation between frontend and backend is important. The frontend knows about YFC syntax but nothing about x86. The backend knows about x86 but nothing about YFC syntax. The AST is the interface between them.
We started with arithmetic. Now we can parse:
- Keywords: func, if, else, extern
- Arithmetic operators (+ - * /) and comparisons (< > ==)
- Comments: # to end of line
- Function calls like factorial(5), with multiple arguments
- If-expressions: if n < 2 1 else n * 2
- Function definitions: func factorial(n) : ...
- Extern declarations: extern func putchard(char) :

The frontend is complete. We can take YFC source code and turn it into an AST.
Code Repository: All accompanying code for this tutorial series is available on GitHub.