
Part 3: Lexical Analysis

The first step in any compiler is lexical analysis - breaking source code into tokens. When you see 2 + 3, you immediately recognize three separate things: a number, a plus sign, and another number. Your brain does this automatically. But a compiler has to explicitly walk through each character and figure out what's meaningful.

Let's say we want to tokenize this expression: 2 + 3

We need to walk through the string character by character. When we see 2, that's a number. The space doesn't mean anything. The + is an operator. Another space. Then 3 is another number.

So we end up with three tokens: NUMBER(2), PLUS(+), NUMBER(3).

That's it. That's what a lexer does - it turns a string of characters into a list of meaningful tokens.
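Here is that walk in miniature - a throwaway sketch of the idea, not the lexer we'll actually build below:

# A throwaway sketch: classify the characters of "2 + 3" one at a time.
for ch in "2 + 3":
    if ch.isspace():
        continue             # spaces carry no meaning
    elif ch.isdigit():
        print("NUMBER", ch)  # a digit is a number
    elif ch == "+":
        print("PLUS", ch)    # the operator

Running this prints NUMBER 2, PLUS +, NUMBER 3 - exactly the three tokens above.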

Tokens and Lexemes

Before we start building, let's clarify what these terms actually mean. In compiler theory, a token has two parts:

Token type (sometimes called a tag): The category of the symbol. Examples: NUMBER, IDENTIFIER, PLUS, LPAREN. This tells you what kind of thing you're looking at.

Lexeme (sometimes called the value): The actual substring from the source code that produced the token. Examples: "42", "+", "foo". This is the literal text.

When we say NUMBER(2), we're saying: this is a token with type NUMBER and lexeme "2". The type tells the parser how to use it. The lexeme tells us what it actually was in the source.
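To make the distinction concrete, here is the token stream for (2 + 3) * 4 written as plain (type, lexeme) pairs - an informal sketch, before we define real classes below:

# (type, lexeme) pairs for "(2 + 3) * 4" - informal; the real
# TokenType and Token classes are defined later in this part.
tokens = [
    ("LPAREN", "("), ("NUMBER", "2"), ("PLUS", "+"),
    ("NUMBER", "3"), ("RPAREN", ")"), ("MUL", "*"),
    ("NUMBER", "4"),
]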

Starting Simple

Before building the entire lexer, we should discuss how implementation will work in our compiler. We'll implement features one by one: although YFC will eventually have if-else control structures, for-loops, and more, the best way to avoid getting overwhelmed when building your first compiler is to take it one step at a time.

With that in mind, we'll build a lexer that handles basic arithmetic: numbers, +, -, *, /, and parentheses. That's enough to tokenize expressions like (2 + 3) * 4. Once we have this working, we can add more features later.

Building the Lexer

Let's start coding. Create a new file called lexer.py - this will be the first real piece of our compiler.

Defining Tokens

First, we need a way to represent tokens. A token needs a type (is it a number? an operator?) and a value (what number? which operator?).

from enum import Enum, auto
from dataclasses import dataclass

class TokenType(Enum):
    Number = auto()   # integer literals like 42
    Plus = auto()     # +
    Minus = auto()    # -
    Mul = auto()      # *
    Div = auto()      # /
    LParen = auto()   # (
    RParen = auto()   # )
    Eof = auto()      # end of input

@dataclass
class Token:
    _type: TokenType  # the token type: what kind of thing this is
    value: str        # the lexeme: the literal text from the source

We use an Enum for token types because it gives us type safety and clear naming. For the Token itself, we use a dataclass because it's pure data - no behavior, just values we want to store.
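As a quick sanity check - not part of the lexer itself - you can construct a token by hand and let the dataclass's generated repr print it:

tok = Token(TokenType.Number, "42")
print(tok)  # Token(_type=<TokenType.Number: 1>, value='42')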

The Lexer Structure

Now let's build the lexer itself. The core idea is simple: maintain a position in the source string, look at the current character, and decide what to do with it.

class Lexer:
    def __init__(self, source: str):
        self.source = source           # the text we're tokenizing
        self.tokens: list[Token] = []  # tokens produced so far
        self.cursor = 0                # index of the current character

Helper Methods

Before we start lexing, let's create some utility methods to make our life easier:

@property
def current(self) -> str:
    """Return the current character or EOF marker."""
    if 0 <= self.cursor < len(self.source):
        return self.source[self.cursor]
    return "\0"

def increment(self):
    """Move to the next character."""
    self.cursor += 1

def push_token(self, token_type: TokenType, value: str):
    """Add a token to our list."""
    self.tokens.append(Token(token_type, value))
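The "\0" sentinel is what lets the rest of the lexer avoid bounds checks: stepping past the end of the source just yields the sentinel instead of raising an IndexError. Once these methods live inside Lexer:

lexer = Lexer("ab")
print(repr(lexer.current))  # 'a'
lexer.increment()
lexer.increment()
print(repr(lexer.current))  # '\x00' - past the end, we get the sentinel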

The Main Loop

Now for the actual lexing. The strategy is simple: loop through characters one by one, figure out what each character starts (a number? an operator?), and handle it accordingly.

def lex(self) -> list[Token]:
    while self.current != '\0':
        # Skip whitespace
        if self.current.isspace():
            self.increment()
            continue

        # Recognize numbers
        if self.current.isdigit():
            start = self.cursor
            while self.current.isdigit():
                self.increment()

            self.push_token(TokenType.Number, self.source[start:self.cursor])
            continue

        # Recognize operators and parentheses
        single_char_tokens = {
            '(': TokenType.LParen,
            ')': TokenType.RParen,
            '+': TokenType.Plus,
            '-': TokenType.Minus,
            '*': TokenType.Mul,
            '/': TokenType.Div,
        }

        if token_type := single_char_tokens.get(self.current):
            self.push_token(token_type, self.current)
            self.increment()
            continue

        raise Exception(f"Unexpected character: '{self.current}'")

    # Add EOF token
    self.push_token(TokenType.Eof, "")
    return self.tokens

Testing It

Let's try it out:

if __name__ == "__main__":
    source = "(2 + 3) * 4"
    lexer = Lexer(source)
    tokens = lexer.lex()

    print(f"Tokenizing: {source}\n")
    for token in tokens:
        print(f"{token._type.name}({token.value})")

Run it:

$ python lexer.py
Tokenizing: (2 + 3) * 4

LParen(()
Number(2)
Plus(+)
Number(3)
RParen())
Mul(*)
Number(4)
Eof()

Perfect. We just built a working lexer in about 50 lines of code.

Adding Debug Information

Before we move on, let's add something important: location tracking. Right now, if we hit an unexpected character, we just say "Unexpected character: 'x'". But in a real program with hundreds of lines, that's useless. We need to know where the error is - which line and column.

@dataclass
class Location:
    row: int
    column: int

@dataclass
class Token:
    location: Location
    _type: TokenType
    value: str

Now we need to track row and column as we lex. The trick is in increment(). When we move to the next character, we need to check if it's a newline:

def increment(self):
    """Move to the next character."""
    if self.current == '\n':
        self.row += 1
        self.column = 0
    else:
        self.column += 1
    self.cursor += 1
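Two supporting changes are easy to miss, and the snippet above assumes both: __init__ has to initialize row and column, and push_token has to stamp each token with a Location. One way to wire that up (a sketch - whether rows start at 0 or 1 is a matter of taste):

def __init__(self, source: str):
    self.source = source
    self.tokens: list[Token] = []
    self.cursor = 0
    self.row = 1     # humans count lines from 1
    self.column = 0

def push_token(self, token_type: TokenType, value: str):
    """Add a token stamped with the current location."""
    # Note: for multi-character tokens like numbers this records the
    # position just past the lexeme; capturing the start position
    # before consuming the digits is a nice refinement.
    self.tokens.append(Token(Location(self.row, self.column), token_type, value))

With this in place, the unexpected-character error in lex() can report the row and column instead of leaving you to hunt for the offending character.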

What We've Built

This is a simple lexer, but it's real. It handles the core concepts you'll find in every lexer: walking the source one character at a time, skipping whitespace, grouping multi-character lexemes like numbers, mapping single characters to operator tokens, marking the end of input with an EOF token, and reporting errors with their location.

What's Next

Our lexer can break expressions into tokens with proper error reporting, but it doesn't understand what they mean. It doesn't know that * should happen before +, or that parentheses group things together.

That's the parser's job. The parser takes this flat list of tokens and builds a tree structure that represents the actual meaning of the expression - an Abstract Syntax Tree (AST).

We'll build that next.


Next: Abstract Syntax Trees | Previous: What Is a Compiler?

Code Repository: All accompanying code for this tutorial series is available on GitHub.