2. Lexical Analysis (The Scanner)
The first step in any compiler or interpreter is scanning. The scanner takes in raw source code as a long string of characters and groups it into a series of chunks we call tokens. These are the meaningful "words" and "punctuation" that make up the language's grammar.
By the end of this chapter, we will have a full-featured, fast scanner. We will wire up the scanner immediately so you can test your code as you add each new feature.
1. Lexemes and Tokens
Here is a line of Snek code:
var language = "snek";
Here, var is the keyword for declaring a variable. That three-character sequence means something. But if we yank three letters out of the middle of language, like gua, those do not mean anything on their own.
Our job is to scan through the characters and group them together into the smallest sequences that still represent something. Each of these blobs of characters is called a lexeme. In that example line of code, the lexemes are: var, language, =, "snek", and ;.
When we take the lexeme and bundle it together with other useful data (like what type of lexeme it is, and what line it appeared on), the result is a Token.
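To make this concrete, here is a rough preview of the data our scanner will eventually produce for that line, using plain tuples as a stand-in for the Token class we build in a moment (the type names match the enum we define next):

```python
# (type, lexeme, literal, line) — tuples standing in for the real Token class.
tokens = [
    ("VAR",        "var",      None,   1),
    ("IDENTIFIER", "language", None,   1),
    ("EQUAL",      "=",        None,   1),
    ("STRING",     '"snek"',   "snek", 1),
    ("SEMICOLON",  ";",        None,   1),
    ("EOF",        "",         None,   1),
]
```

Note that the STRING token carries both the raw lexeme (quotes included) and the evaluated literal value (quotes stripped). The final EOF token marks the end of input.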
Token Types
We need to categorize our tokens so the parser knows what it is looking at without doing slow string comparisons. We will use a Python Enum. Create a new file named token_type.py:
from enum import Enum, auto

class TokenType(Enum):
    # Single-character tokens.
    LEFT_PAREN = auto()
    RIGHT_PAREN = auto()
    LEFT_BRACE = auto()
    RIGHT_BRACE = auto()
    COMMA = auto()
    DOT = auto()
    MINUS = auto()
    PLUS = auto()
    SEMICOLON = auto()
    SLASH = auto()
    STAR = auto()

    # One or two character tokens.
    BANG = auto()
    BANG_EQUAL = auto()
    EQUAL = auto()
    EQUAL_EQUAL = auto()
    GREATER = auto()
    GREATER_EQUAL = auto()
    LESS = auto()
    LESS_EQUAL = auto()

    # Literals.
    IDENTIFIER = auto()
    STRING = auto()
    NUMBER = auto()

    # Keywords.
    AND = auto()
    CLASS = auto()
    ELSE = auto()
    FALSE = auto()
    FUN = auto()
    FOR = auto()
    IF = auto()
    NIL = auto()
    OR = auto()
    PRINT = auto()
    RETURN = auto()
    SUPER = auto()
    THIS = auto()
    TRUE = auto()
    VAR = auto()
    WHILE = auto()

    EOF = auto()
Python Toolbox: Enum and auto()
We use Python's built-in Enum to represent our token types. This is much safer than using raw strings or integers because it prevents typos (e.g., typing "IDENTIFER" instead of "IDENTIFIER"). The auto() function automatically assigns a unique integer value to each member, saving us from having to number them 1, 2, 3, etc. by hand.
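As a quick illustration of how Enum and auto() behave, here is a throwaway Color enum (not part of Snek):

```python
from enum import Enum, auto

class Color(Enum):
    RED = auto()
    GREEN = auto()
    BLUE = auto()

# auto() numbers members starting from 1, in declaration order.
print(Color.RED.value, Color.BLUE.value)  # → 1 3

# Each member also knows its own name, which our Token.__str__ will use.
print(Color.RED.name)  # → RED
```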
The Token Class
Now we bundle the type, the raw lexeme, the evaluated literal value (for strings and numbers), and the line number into a single object. Create token_class.py:
from token_type import TokenType

class Token:
    def __init__(self, type: TokenType, lexeme: str, literal: object, line: int):
        self.type = type
        self.lexeme = lexeme
        self.literal = literal
        self.line = line

    def __str__(self):
        return f"{self.type.name} {self.lexeme} {self.literal}"
Checkpoint: Mental Tracing
Write out the exact sequence of TokenTypes that should be generated by this Snek code: print "hello" >= 5;
PRINT, STRING, GREATER_EQUAL, NUMBER, SEMICOLON, EOF
2. The Scanner Core
The core of the scanner is a loop. Starting at the first character of the source code, the scanner figures out what lexeme the character belongs to, and consumes it. It keeps doing that until it reaches the end of the input.
Create scanner.py:
from token_type import TokenType
from token_class import Token
import error

class Scanner:
    def __init__(self, source):
        self.source = source
        self.tokens = []
        self.start = 0    # First character in the lexeme being scanned
        self.current = 0  # Character currently being considered
        self.line = 1     # What source line `current` is on

    def scan_tokens(self) -> list[Token]:
        while not self.is_at_end():
            # We are at the beginning of the next lexeme.
            self.start = self.current
            self.scan_token()

        self.tokens.append(Token(TokenType.EOF, "", None, self.line))
        return self.tokens

    def is_at_end(self) -> bool:
        return self.current >= len(self.source)
Recognizing Single-Character Lexemes
In each turn of the loop, we scan a single token. We consume the next character and pick a token type for it. Add these methods to your Scanner class:
def scan_token(self):
    c = self.advance()
    if c == '(': self.add_token(TokenType.LEFT_PAREN)
    elif c == ')': self.add_token(TokenType.RIGHT_PAREN)
    elif c == '{': self.add_token(TokenType.LEFT_BRACE)
    elif c == '}': self.add_token(TokenType.RIGHT_BRACE)
    elif c == ',': self.add_token(TokenType.COMMA)
    elif c == '.': self.add_token(TokenType.DOT)
    elif c == '-': self.add_token(TokenType.MINUS)
    elif c == '+': self.add_token(TokenType.PLUS)
    elif c == ';': self.add_token(TokenType.SEMICOLON)
    elif c == '*': self.add_token(TokenType.STAR)
    # Ignore whitespace.
    elif c in [' ', '\r', '\t']:
        pass
    elif c == '\n':
        self.line += 1
    else:
        error.error(self.line, "Unexpected character.")

def advance(self) -> str:
    """Consumes the next character and returns it."""
    c = self.source[self.current]
    self.current += 1
    return c

def add_token(self, type: TokenType, literal=None):
    """Grabs the text of the current lexeme and creates a token."""
    text = self.source[self.start : self.current]
    self.tokens.append(Token(type, text, literal, self.line))
Python Toolbox: String Slicing
In add_token, we extract the lexeme using Python's slice notation: self.source[self.start : self.current]. In Python, slicing is inclusive of the first index and exclusive of the second. This works perfectly for our scanner because self.current always points to the character just after the end of the current lexeme.
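A tiny demonstration of those slice semantics, using hypothetical start and current positions:

```python
source = "var language"
start, current = 0, 3  # current sits just past the 'r' in "var"

# The slice includes index 0 ('v') but excludes index 3 (the space).
print(source[start:current])  # → var
```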
Notice the handling of whitespace and newlines. Because they don't call add_token(), the lexeme is safely discarded. Also note the else clause at the end. If we encounter a character Snek does not use (like @), we report an error, but we keep scanning. This gives our users a better experience by detecting as many errors as possible in one go, avoiding "syntax error Whac-A-Mole."
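If you do not have the error module from the previous chapter handy, a minimal stand-in looks something like this. The exact message format and the had_error flag name here are assumptions; use whatever your earlier chapter defined:

```python
# error.py — hypothetical minimal stand-in; match your earlier chapter's version.
had_error = False

def error(line: int, message: str):
    """Report the problem but let the scanner keep going."""
    global had_error
    print(f"[line {line}] Error: {message}")
    had_error = True

error(1, "Unexpected character.")  # → [line 1] Error: Unexpected character.
```

The key design point is that reporting an error does not raise an exception: the scanner's loop keeps running, so one bad character does not hide the rest of the errors in the file.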
3. Wiring it up (First Test Run)
We have enough logic to process simple symbols and ignore spaces. Instead of writing the rest of the scanner blindly, let's wire it into our main program so we can test it as we build.
Open snek.py. First, add the import at the very top of the file:
import sys
import error
from scanner import Scanner
Then, modify the run method to use it:
@staticmethod
def run(source):
    scanner = Scanner(source)
    tokens = scanner.scan_tokens()

    # Print the tokens to verify our scanner works.
    for token in tokens:
        print(token)
Test it out: Run your interpreter (python snek.py) to start the REPL. Type + - ( ) ; (with spaces) and press Enter. You should see a list of generated tokens printed to the screen, and the spaces should be completely ignored. Type @ to see your error handler catch the unexpected character and continue.
4. Lookahead and Operators
We have single-character lexemes working, but what about !? Is it just a BANG, or the start of a BANG_EQUAL (!=)? We need to look at the second character.
Add these clauses to the if/elif chain in your scan_token() method:
elif c == '!':
    self.add_token(TokenType.BANG_EQUAL if self.match('=') else TokenType.BANG)
elif c == '=':
    self.add_token(TokenType.EQUAL_EQUAL if self.match('=') else TokenType.EQUAL)
elif c == '<':
    self.add_token(TokenType.LESS_EQUAL if self.match('=') else TokenType.LESS)
elif c == '>':
    self.add_token(TokenType.GREATER_EQUAL if self.match('=') else TokenType.GREATER)
This uses a new helper, match(). Add it to your Scanner class. It is like a conditional advance(). We only consume the current character if it is what we are looking for.
def match(self, expected: str) -> bool:
    if self.is_at_end(): return False
    if self.source[self.current] != expected: return False

    self.current += 1
    return True
Test it out: Run the REPL and type != == <= >. Verify that the scanner correctly groups the two-character operators together instead of treating them as separate symbols.
Comments
The / character needs special handling because comments begin with a slash too (//). Add this to your scan_token() method:
elif c == '/':
    if self.match('/'):
        # A comment goes until the end of the line.
        while self.peek() != '\n' and not self.is_at_end():
            self.advance()
    else:
        self.add_token(TokenType.SLASH)
We use peek() (lookahead without consuming) so we do not accidentally consume the newline at the end of the comment. We want the next turn of the loop to see the newline so it triggers our existing elif c == '\n': logic to increment self.line.
Add peek() to your class:
def peek(self) -> str:
    if self.is_at_end(): return '\0'
    return self.source[self.current]
Scanner Mechanics: advance() vs. peek() vs. match()
It is crucial to understand the difference between these three methods, as they are the "eyes and hands" of our scanner:
advance(): Looks at the current character, consumes it (moves the pointer forward), and returns it.
peek(): Looks at the current character and returns it, but does not consume it. The pointer stays put. This is known as lookahead.
match(expected): A combination of both. It peeks at the character; if it matches what we expect, it consumes it and returns True. If not, it leaves it alone and returns False.
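You can watch all three behaviors on a standalone cursor class, a sketch of just the pointer-moving parts of our Scanner (not the real thing):

```python
class Cursor:
    """Minimal sketch of the scanner's three lookahead helpers."""
    def __init__(self, source):
        self.source = source
        self.current = 0

    def advance(self):
        # Consume: return the current character and move forward.
        c = self.source[self.current]
        self.current += 1
        return c

    def peek(self):
        # Look without consuming; '\0' signals end of input.
        if self.current >= len(self.source):
            return '\0'
        return self.source[self.current]

    def match(self, expected):
        # Conditional consume: only move forward on a hit.
        if self.peek() != expected:
            return False
        self.current += 1
        return True

cur = Cursor("!=")
print(cur.advance())     # → !      (pointer moves to index 1)
print(cur.peek())        # → =      (pointer stays at index 1)
print(cur.match('='))    # → True   (pointer moves to index 2)
```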
Test it out: Run the REPL and type + // this is a comment. You should see the PLUS token and the EOF token, with the comment entirely ignored.
Checkpoint: Theory
Our scanner throws away spaces, tabs, and newlines because they aren't needed by the parser. In what popular programming language is this not true, and why does that language's scanner have to be more complex?
Python! Python uses indentation (spaces/tabs) and newlines to define block structure, rather than curly braces {} and semicolons ;. A Python scanner must actively track indentation levels and emit special INDENT and DEDENT tokens so the parser knows when blocks begin and end.
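You can see this in Python's own standard-library scanner, the tokenize module, which emits those indentation tokens explicitly:

```python
import io
import tokenize

src = "if x:\n    y = 1\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]

# Python's scanner turns the indentation itself into INDENT/DEDENT tokens.
print("INDENT" in names, "DEDENT" in names)  # → True True
```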
5. Literals
Now we handle longer lexemes that represent exact values.
Strings
Strings always begin with ". Add this case to scan_token():
elif c == '"':
    self.string()
And implement the string() method:
def string(self):
    while self.peek() != '"' and not self.is_at_end():
        if self.peek() == '\n':
            self.line += 1
        self.advance()

    if self.is_at_end():
        error.error(self.line, "Unterminated string.")
        return

    # The closing ".
    self.advance()

    # Trim the surrounding quotes and save the literal value.
    value = self.source[self.start + 1 : self.current - 1]
    self.add_token(TokenType.STRING, value)
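The quote-trimming slice is worth tracing once by hand. Assuming start and current have the values they would hold just after the closing quote is consumed:

```python
source = '"snek"'
start, current = 0, len(source)  # current is just past the closing quote

# start + 1 skips the opening quote; current - 1 excludes the closing one.
value = source[start + 1 : current - 1]
print(value)  # → snek
```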
Numbers
We look for any digit to start a number. Add this elif case to scan_token():
elif c.isdigit():
    self.number()
Then implement number() and a new peek_next() helper:
def number(self):
    while self.peek().isdigit():
        self.advance()

    # Look for a fractional part.
    if self.peek() == '.' and self.peek_next().isdigit():
        # Consume the "."
        self.advance()

        while self.peek().isdigit():
            self.advance()

    # Parse the string into a Python float for the literal value.
    self.add_token(TokenType.NUMBER, float(self.source[self.start : self.current]))

def peek_next(self) -> str:
    if self.current + 1 >= len(self.source): return '\0'
    return self.source[self.current + 1]
Checkpoint: Break-it
In our number() method, we check for a fractional part using if self.peek() == '.' and self.peek_next().isdigit():. What would happen if we simplified this to just use if self.match('.'):? Give an example of a Snek expression that would scan incorrectly.
If we used if self.match('.'):, the scanner would eagerly consume the dot character without first checking if a digit actually follows it.
Because match() consumes the character upon a successful match, the dot would become part of the current number lexeme. If a user wrote a method call on a number, such as 123.sqrt(), the scanner would consume 123. as a single number literal, leaving sqrt() as a separate identifier. We would lose the . token entirely, which the parser needs to understand the method call. We must use peek() and peek_next() to look ahead and ensure the dot is actually part of a fraction before we commit to consuming it.
Test it out: Run the REPL and type "hello world" 123.45. You should see the literal values successfully parsed and printed alongside the tokens.
6. Identifiers and Keywords
Finally, we handle variable names and reserved words. Add this logic to the end of your scan_token() method, before the final else error case:
elif c.isalpha() or c == '_':
    self.identifier()
To avoid returning TokenType.IDENTIFIER for a reserved keyword like class or while, we define a dictionary mapping strings to TokenTypes. Add this directly inside the Scanner class definition (above __init__):
class Scanner:
    keywords = {
        "and": TokenType.AND,
        "class": TokenType.CLASS,
        "else": TokenType.ELSE,
        "false": TokenType.FALSE,
        "for": TokenType.FOR,
        "fun": TokenType.FUN,
        "if": TokenType.IF,
        "nil": TokenType.NIL,
        "or": TokenType.OR,
        "print": TokenType.PRINT,
        "return": TokenType.RETURN,
        "super": TokenType.SUPER,
        "this": TokenType.THIS,
        "true": TokenType.TRUE,
        "var": TokenType.VAR,
        "while": TokenType.WHILE,
    }

    def __init__(self, source):
        # ...
Then we scan the identifier and check the dictionary.
def identifier(self):
    while self.peek().isalnum() or self.peek() == '_':
        self.advance()

    text = self.source[self.start : self.current]
    type = self.keywords.get(text, TokenType.IDENTIFIER)
    self.add_token(type)
Python Toolbox: Dictionary .get() with Defaults
In identifier(), we use self.keywords.get(text, TokenType.IDENTIFIER). The .get() method looks up a key in a dictionary. If the key (text) exists, it returns the associated value (the specific keyword token type). The magic is the second argument: if the key is not found, it returns the default value we provided (TokenType.IDENTIFIER) instead of raising a KeyError.
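Here is the behavior in isolation, on a toy two-entry table standing in for the real keywords dictionary:

```python
keywords = {"var": "VAR", "while": "WHILE"}  # toy stand-in for the real table

# Known key: the keyword's specific token type comes back.
print(keywords.get("var", "IDENTIFIER"))     # → VAR

# Unknown key: the default wins, and no KeyError is raised.
print(keywords.get("orchid", "IDENTIFIER"))  # → IDENTIFIER
```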
Maximal Munch
Why could we not match keywords the same way we matched <=, checking for individual characters? Consider a variable named orchid. The scanner would see or and immediately emit an OR keyword token!
The rule is called maximal munch. When two rules can match a chunk of code, whichever one matches the longest sequence of characters wins. We assume any lexeme starting with a letter is an identifier, consume all connected alphanumeric characters, and then check if the entire lexeme matches a keyword.
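The principle can be sketched as a standalone function (a simplified stand-in for identifier(), with a two-entry keyword table for illustration):

```python
def scan_word(source: str, start: int):
    """Sketch of maximal munch: consume the whole word, THEN check for keywords."""
    current = start
    while current < len(source) and (source[current].isalnum() or source[current] == "_"):
        current += 1
    text = source[start:current]
    keywords = {"or": "OR", "var": "VAR"}  # toy table
    return keywords.get(text, "IDENTIFIER"), text

# "orchid" swallows the whole word before the keyword check, so `or` never wins.
print(scan_word("orchid", 0))  # → ('IDENTIFIER', 'orchid')

# A bare "or" really is the keyword.
print(scan_word("or b", 0))    # → ('OR', 'or')
```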
Test it out: Run your REPL one last time and type var orchid = 10; while true. Verify that var, while, and true are parsed as keywords, but orchid is parsed as an identifier.
Your scanner is complete!
7. Challenge
Add Block Comments
Add support to Snek's scanner for C-style /* ... */ block comments. Make sure you handle newlines inside the block correctly so that your line counter remains accurate.
Implementation Hint
You will need to modify the / case in scan_token. If self.match('*') is true, you enter a while loop. You will loop until you peek() a * AND peek_next() a /. Do not forget to consume the closing */ after the loop! Also, handle the case where the file ends before the comment is closed by reporting an error.
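If you get stuck, here is one possible shape for that consuming loop, as a standalone sketch rather than a drop-in Scanner method (try it yourself before peeking):

```python
def skip_block_comment(source: str, current: int, line: int):
    """Sketch: consume a '/* ... */' body starting just after the opening '/*'.
    Returns (new_current, new_line, terminated)."""
    while current + 1 < len(source) and not (source[current] == "*" and source[current + 1] == "/"):
        if source[current] == "\n":
            line += 1  # keep the line counter accurate inside the comment
        current += 1

    if current + 1 >= len(source):
        return current, line, False  # hit EOF before the closing */

    return current + 2, line, True   # consume the closing */

print(skip_block_comment("hi */ rest", 0, 1))  # → (5, 1, True)
```

In the real Scanner you would report "Unterminated block comment." via error.error when the terminated flag comes back False.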