Frederik Gram Kortegaard / Your First Compiler
The first step in any compiler is lexical analysis - breaking source code into tokens. When you see 2 + 3, you immediately recognize three separate things: a number, a plus sign, and another number. Your brain does this automatically. But a compiler has to explicitly walk through each character and figure out what's meaningful.
Let's say we want to tokenize this expression: 2 + 3
We need to walk through the string character by character. When we see 2, that's a number. The space doesn't mean anything. The + is an operator. Another space. Then 3 is another number.
So we end up with three tokens: NUMBER(2), PLUS(+), NUMBER(3).
That's it. That's what a lexer does - it turns a string of characters into a list of meaningful tokens.
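In code terms, the goal of this chapter is an interface roughly like this (just a preview of the Lexer class we'll build step by step below):

tokens = Lexer("2 + 3").lex()
# roughly: NUMBER(2), PLUS(+), NUMBER(3), plus an end-of-input token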
Before we start building, let's clarify what these terms actually mean. In compiler theory, a token has two parts:
Token type (sometimes called a tag): The category of the symbol. Examples: NUMBER, IDENTIFIER, PLUS, LPAREN. This tells you what kind of thing you're looking at.
Lexeme (sometimes called the value): The actual substring from the source code that produced the token. Examples: "42", "+", "foo". This is the literal text.
When we say NUMBER(2), we mean a token with type NUMBER and lexeme "2". The type tells the parser how to use it; the lexeme tells us what it actually was in the source. In 2 + 3, for instance, both numbers produce tokens of the same type, NUMBER, but with different lexemes: "2" and "3".
Before building the entire lexer, let's discuss how implementation will work in our compiler. We'll add features one by one: although YFC will eventually have if-else control structures, for-loops, and more, the best way to avoid getting overwhelmed when building your first compiler is to take it one step at a time.
With that in mind, we'll build a lexer that handles basic arithmetic: numbers, +, -, *, /, and parentheses. That's enough to tokenize expressions like (2 + 3) * 4. Once we have this working, we can add more features later.
Let's start coding. Create a new file called lexer.py - this will be the first real piece of our compiler.
First, we need a way to represent tokens. A token needs a type (is it a number? an operator?) and a value (what number? which operator?).
from enum import Enum, auto
from dataclasses import dataclass


class TokenType(Enum):
    Number = auto()
    Plus = auto()
    Minus = auto()
    Mul = auto()
    Div = auto()
    LParen = auto()
    RParen = auto()
    Eof = auto()


@dataclass
class Token:
    _type: TokenType
    value: str
We use an Enum for token types because it gives us type safety and clear naming. For the Token itself, we use a dataclass because it's just pure data - no behavior, just values we want to store.
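Because Token is a dataclass, we also get a readable repr and field-by-field equality for free, which is handy for debugging and tests. A quick sketch, assuming the definitions above:

token = Token(TokenType.Number, "42")
print(token)                                   # Token(_type=<TokenType.Number: 1>, value='42')
print(token == Token(TokenType.Number, "42"))  # True - dataclasses compare field by field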
Now let's build the lexer itself. The core idea is simple: maintain a position in the source string, look at the current character, and decide what to do with it.
class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.tokens: list[Token] = []
        self.cursor = 0
Before we start lexing, let's create some utility methods to make our life easier:
    @property
    def current(self) -> str:
        """Return the current character or EOF marker."""
        if 0 <= self.cursor < len(self.source):
            return self.source[self.cursor]
        return "\0"

    def increment(self):
        """Move to the next character."""
        self.cursor += 1

    def push_token(self, token_type: TokenType, value: str):
        """Add a token to our list."""
        self.tokens.append(Token(token_type, value))
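A quick sanity check of how these helpers behave, assuming the class so far (the "\0" return value is just a sentinel meaning "we've run out of input"):

lexer = Lexer("2+3")
print(lexer.current)          # prints: 2
lexer.increment()
print(lexer.current)          # prints: +
lexer.increment()
lexer.increment()
print(lexer.current == "\0")  # prints: True - past the end, current returns the EOF marker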
Now for the actual lexing. The strategy is simple: loop through characters one by one, figure out what each character starts (a number? an operator?), and handle it accordingly.
    def lex(self) -> list[Token]:
        while self.current != '\0':
            # Skip whitespace
            if self.current.isspace():
                self.increment()
                continue

            # Recognize numbers
            if self.current.isdigit():
                start = self.cursor
                while self.current.isdigit():
                    self.increment()
                self.push_token(TokenType.Number, self.source[start:self.cursor])
                continue

            # Recognize operators and parentheses
            single_char_tokens = {
                '(': TokenType.LParen,
                ')': TokenType.RParen,
                '+': TokenType.Plus,
                '-': TokenType.Minus,
                '*': TokenType.Mul,
                '/': TokenType.Div,
            }

            if token_type := single_char_tokens.get(self.current):
                self.push_token(token_type, self.current)
                self.increment()
                continue

            raise Exception(f"Unexpected character: '{self.current}'")

        # Add EOF token
        self.push_token(TokenType.Eof, "")
        return self.tokens
Let's try it out:
if __name__ == "__main__":
source = "(2 + 3) * 4"
lexer = Lexer(source)
tokens = lexer.lex()
print(f"Tokenizing: {source}\n")
for token in tokens:
print(f"{token._type.name}({token.value})")
Run it:
$ python lexer.py
Tokenizing: (2 + 3) * 4

LParen(()
Number(2)
Plus(+)
Number(3)
RParen())
Mul(*)
Number(4)
Eof()
Perfect. We just built a working lexer in about 50 lines of code.
Before we move on, let's add something important: location tracking. Right now, if we hit an unexpected character, we just say "Unexpected character: 'x'". But in a real program with hundreds of lines, that's useless. We need to know where the error is - which line and column.
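To see the problem, here's what happens today when the lexer meets a character it doesn't recognize (the stray $ is just an illustration):

Lexer("2 $ 3").lex()
# Exception: Unexpected character: '$'
# ...but nothing tells us *where* in the source it happened.

Let's fix that by attaching a small Location record to every token.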
@dataclass
class Location:
    row: int
    column: int


@dataclass
class Token:
    location: Location
    _type: TokenType
    value: str
Now we need to track row and column as we lex. The trick is in increment(): before moving to the next character, we check whether the character we're leaving is a newline:
    def increment(self):
        """Move to the next character."""
        if self.current == '\n':
            self.row += 1
            self.column = 0
        else:
            self.column += 1
        self.cursor += 1
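Two small changes are implied but not shown above: __init__ has to initialize self.row and self.column, and push_token has to attach a Location to each Token. A minimal sketch of how that could look - the 1-based row / 0-based column convention is an assumption, and it stamps the position at the moment the token is pushed (for numbers, that's just past the last digit; recording the start instead is a nice exercise):

    def __init__(self, source: str):
        self.source = source
        self.tokens: list[Token] = []
        self.cursor = 0
        self.row = 1     # assumption: rows start at 1...
        self.column = 0  # ...and columns at 0, matching the reset in increment()

    def push_token(self, token_type: TokenType, value: str):
        """Add a token to our list, stamped with the lexer's current position."""
        self.tokens.append(Token(Location(self.row, self.column), token_type, value))

With row and column available, the "Unexpected character" error in lex() can also report where it happened, for example by including self.row and self.column in the message.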
This is a simple lexer, but it's real. It handles the core concepts you'll find in every lexer: walking the source character by character, skipping whitespace, grouping characters into multi-character tokens like numbers, mapping single characters to operator tokens, reporting errors with their location, and marking the end of input with an EOF token.
Our lexer can break expressions into tokens with proper error reporting, but it doesn't understand what they mean. It doesn't know that * should happen before +, or that parentheses group things together.
That's the parser's job. The parser takes this flat list of tokens and builds a tree structure that represents the actual meaning of the expression - an Abstract Syntax Tree (AST).
We'll build that next.
Next: Abstract Syntax Trees | Previous: What Is a Compiler?
Code Repository: All accompanying code for this tutorial series is available on GitHub.