Pass 4 — Code generation¶
Entry point: codegen::GenerateTarget (src/codegen/mod.rs) — a
clap-derived enum with one variant per backend, each wrapping that
backend’s Args struct (empty for most, carrying flags like
--package / --namespace for Java and C#). The CLI parses the
target subcommand into a GenerateTarget and calls .emit(&st),
which dispatches to the chosen backend’s emit function. Backend
implementations live under src/codegen/<target>.rs.
Inputs¶
A backend is a pure function:
pub fn emit(st: &StateTable, args: &Args) -> Vec<EmittedFile>
Every backend reads the same StateTable. An EmittedFile is
just a relative path plus UTF-8 contents — backends never touch the
filesystem themselves; the CLI handles that.
The shape each backend produces¶
All backends produce the same logical artifacts, encoded in the target language’s idioms:
TokenKind enum. One variant per entry in
state_table.tokens, plusEOF = 0. Representation per language:Rust / TypeScript / Python (PyO3 wraps Rust):
u16.Go / C:
uint16/uint16_t.C#:
ushort.Java:
int(Java has no unsigned 16-bit primitive — values stay in the0..0xFFFFrange).
Lex failures are not a TokenKind variant; they ride the optional / sentinel convention chosen per language.
Lex-failure representation. Languages with a free structural nullable carry the kind as that:
Rust:
Token<TK> { kind: Option<TK>, … }.TypeScript:
Token<TK> { kind: TK | null, … }.Python:
Optional[int]on the PyO3Tokenwrapper.
Unsigned-int backends use the in-band sentinel value
0xFFFFfor “no DFA pattern matched”:Go:
Token { Kind uint16, … }; the runtime exposesparsunart.ErrorKind = 0xFFFF.Java:
Tokencarriesint kind(read viatoken.kind()); the runtime exposesLexer.ERROR_KIND = 0xFFFF.C#:
Token(ushort Kind, …); the runtime exposesLexer.ErrorKind = 0xFFFF.C:
{ uint16_t kind; … }; the runtime definesPARSUNA_LEX_ERROR = 0xFFFFand the public header re-exports it as{upper}_LEX_ERROR. FIRST and SYNC arrays both terminate withSENTINEL = 0xFFFF— same value as the lex-failure kind, but the two never appear in the same comparison context (sentinels are array terminators only, the kind only appears on aToken).
No grammar-declared kind id collides with
0xFFFF(or with0, which is reserved for EOF). The user-facing analyzer no longer reserves the nameERROR— a grammar can declare a token calledERRORif it likes; the lex-failure sentinel lives outside theTokenKindenum in every backend.RuleKind enum. One variant per entry in
state_table.rule_kinds, representationu16(or the language’s nearest unsigned half-word).Name-lookup helpers.
tokenKindName/ruleKindName— useful for diagnostics and test output.Compiled lexer DFA. Not a packed table — the DFA is emitted as code. Each state becomes one arm of a
match/switchthat reads the current byte and dispatches on it; the whole machine is wrapped in alongest_matchfunction exposed through the runtime’sDfaMatcher(or equivalent) interface. Byte ranges sharing the same target collapse into single arms (so an[a-z]+token compiles toSome(&(b'a'..=b'z'))rather than 26 individual byte arms). When the grammar declares lexer modes (@mode(name)pre-annotations), one matcher is emitted per mode and the publiclongest_matchdispatches on the active mode id. See The compiled lexer DFA below for the shape and rationale.SYNC intern pool. Every entry from
state_table.sync_setsbecomes a named constant (SYNC_N). States refer to them by index, so rules whose FOLLOW sets intern to the same set share a single table entry.FIRST intern pool — filtered. Only the entries whose
has_referencesflag is set get emitted asFIRST_Nconstants. For LL(1) grammars that’s empty:Tail::Star/Tail::Optcodegen atk = 1inlines the FIRST set into amatcharm pattern instead of consulting the pool, andTail::Dispatchbuilds its decision tree at lowering time. For LL(k > 1) only the entries actually referenced bymatches_first(FIRST_N)calls are emitted.Entry-state constants. One
ENTRY_<RULE>per public rule.State-dispatch function. A single function — in most backends a large
switchon the current state id — that runs one state body of the parser and either yields the event that body produced or signals “pure transition step” (the runtime’s pull loop calls back in for the next body in that case). This is thestepcallback the runtime calls; see The state-dispatch function below.Per-token mode-action callback (
apply_actions/ApplyActions/ etc.). Amatchover kinds firing the declared-> push(name)/-> popactions on the lexer’s mode stack in source order. Tokens with no actions fall through with no effect; grammars that declare no modes get an empty body that monomorphization can erase.parse_<rule> entry points. One convenience constructor per public rule, wrapping the runtime’s
Parser::newwith the correct entry-state constant and a configured lexer.A small amount of backend-specific plumbing. Package declarations, import blocks, module wrappers, and whatever else the target language needs. Each backend keeps this to the minimum.
The state-dispatch function: how ops become code¶
The core translation every backend performs is from a state’s
Body to statements in the target language. Each case arm
emits the body’s instrs list in order, then the tail. The
runtime’s Cursor (or the generated equivalent — see
The Cursor handle below) is the only handle the dispatch function
talks to the runtime through.
Instr translation:
Enter(kind)/Exit(kind)return cursor.enter(RuleKind::X)/return cursor.exit(RuleKind::X). These are emit ops — they end the dispatch arm, returning the built event to the runtime.Expect { kind, token_name, sync }return cursor.expect(TokenKind::X, &SYNC_N, "expected X"). The runtime handles the match / recover / retry protocol (see The event model);expectitself returns the produced event (aTokenon hit, or anErrorplus armed recovery on miss).PushRet(s)cursor.push_ret(s)— does not end the arm; it’s a side effect followed by the next instr or the tail.
Tail translation:
Jump(s)cursor.set_state(s); return None— pure transition step. The runtime’s pull loop callsstepagain in the same iteration.Retcursor.set_state(cursor.ret()); return None— pop the next return address (orTERMINATEDif the stack is empty) and signal “pure transition step”.Star { first, body, cont, head }/Opt { first, body, cont }A predicated branch. The lookahead test is the same in both; the match path differs:
Starmatch:cursor.push_ret(head); <run body>. Theheadpush is what makes the body return back to the loop-condition state for the next iteration.Optmatch: depends oncont.
contisSome(state)for the original push-and-jump shape orNoneafter tail-call elimination:Some(s)—Optmatch emitscursor.push_ret(s); <body>; both ops’ miss path emitscursor.set_state(s).None—Optmatch emits the body inline (no push, the body’s trailingRetreturns to our caller); both ops’ miss path emitscursor.set_state(cursor.ret()).
Each sub-body is itself a
Body— its instrs run inline, and its tail is translated the same way as any other tail.The form of the lookahead test depends on
state_table.k:k = 1— the FIRST set is inlined into amatcharm pattern (one alternative per token kind it accepts). NoFIRST_Nconstant is emitted or referenced.k > 1— emitscursor.matches_first(FIRST_N), the only place the FIRST intern pool is consulted at runtime.
Dispatch { tree, sync, cont }A nested
switch. EachDispatchTree::Switchbecomes one level ofswitch (cursor.look(depth).kind); eachLeafbecomes an action whose form depends oncont:Some(s)—Armleaves emitcursor.push_ret(s); <arm body>;Fallthroughemitscursor.set_state(s);Erroremitscur = s; recover_to(sync)plus a returned error event.None—Armleaves emit the arm body inline (no push);Fallthroughemitscur = ret();Errorarms recovery and then emitscur = ret().
The trie structure is what lets a k-token lookahead compile to k nested switches rather than a linear scan.
Dispatchconsumes its FIRST sets at lowering time when the tree is built; the emitted switch arms carry concrete token kinds, so noFIRST_Nreference appears at runtime.
Because the ops are already linear and the state ids are dense
integers (optimize compacts them to a 1..N range), most
backends emit the whole parser as a single switch in a single
function.
The Cursor handle¶
Every runtime exposes a Cursor (or equivalent — *Cursor in
Go, a thin wrapper around Parser in TS/C#/Java) that’s the only
handle generated dispatch talks to the runtime through. It re-exports
just the operations dispatch needs — lookahead access, return-stack
push / pop, event builders (enter / exit / consume),
expect, and recovery arming. External code can’t construct one
(the constructor is internal to the pull loop), so the parser’s
internal state stays sealed: the only way to drive a parse is through
the iterator’s next / Next.
This is uniform across backends: Rust’s parsuna_rt::Cursor,
Go’s *Cursor (a type Cursor Parser alias for a free cast),
the Java Cursor class, the C# Cursor class, the TS Cursor
binding, and the C Parser * (which doubles as cursor and parser
since the C runtime is single-TU). The visible parser type — the
thing you iterate — has no dispatch hooks on it, only the
next_event / Next / next() iterator method.
The compiled lexer DFA¶
The lexer DFA is compiled to code, not to data. Each backend
walks state_table.modes and emits a longest_match function
shaped like:
fn longest_match(buf, start, mode) -> DfaMatch:
match mode:
0 => longest_match_mode_0(buf, start)
1 => longest_match_mode_1(buf, start)
…
fn longest_match_mode_0(buf, start) -> DfaMatch:
pos = start
best_len = 0
best_kind = None
state = <dfa.start>
loop:
match state:
1 => match buf[pos]:
'a'..='z' => { pos += 1; state = 2;
best_len = pos - start;
best_kind = Some(TokenKind.Ident); }
_ => break
2 => {
// Self-loop prologue: scan past every byte that
// would loop the state back to itself in one go,
// before the per-byte switch.
while buf[pos] in {'a'..='z', '0'..='9', '_'}:
pos += 1
best_len = pos - start
best_kind = Some(TokenKind.Ident)
match buf[pos]:
...
}
...
return DfaMatch { len: best_len, kind: best_kind }
One outer switch arm per DFA state, one inner switch arm
per byte target. Whenever a transition lands on an accept state,
the surrounding code records (best_len, best_kind) so the
classic longest-match rule falls out of the structure.
For grammars without @mode(name) pre-annotations the outer
match mode collapses to a single arm (or is elided entirely);
backends that can compile to a function pointer typically expose
longest_match_mode_0 directly.
Self-loop prologue¶
For states whose self_loop_ranges (set by lowering) is non-empty
— the typical lexer hot-path states like whitespace runs, identifier
bodies, comment / string contents — backends emit a bulk scan
prologue ahead of the regular per-byte switch. The prologue advances
pos past every byte in the self-loop set in one tight loop that
the host language’s optimizer typically autovectorizes (Rust’s
LLVM, V8’s Irregexp via sticky regex, .NET’s
MemoryExtensions.IndexOfAnyExcept, Go’s bytes.IndexFunc,
HotSpot’s loop vectorizer). The byte that breaks the prologue then
falls through to the regular dispatch — non-self-loop transitions
are unchanged.
Byte-range collapsing¶
A naive emission would produce 256 arms per DFA state. Instead the
DFA compiler in lowering::lexer_dfa returns its states with
transitions already grouped: each DfaState carries its
arms: Vec<ByteArm>, where bytes sharing a target collapse into
one arm and contiguous bytes within an arm collapse into ranges.
So IDENT = ('a'..'z' | '_')+; compiles to two arms per state —
one for b'a'..=b'z', one for b'_' — not 27. Every backend
works from the same pre-grouped shape; no per-backend collapsing
logic is needed.
Why code instead of tables¶
Constant folding by the host compiler. The byte-range patterns and state transitions are visible to LLVM, javac, the Go SSA pass, and friends; they compile to jump tables, branch tables, or inlined comparisons depending on the target. A data-driven walk over a packed transition array is opaque to those optimizers.
Dense binaries for sparse DFAs. Most DFA states have only a handful of live transitions; emitting just those arms is much smaller than reserving 256 entries per state.
Locality. The parser dispatch is already one big
switch; the DFA matcher is the same shape in the same file, so a reader sees the whole machine as code rather than as opaque tables threaded through a runtime helper.
Each backend supplies its compiled matcher to the runtime through
a DfaMatcher trait (Rust), interface (Go, Java, C#), or function
type (TypeScript). C is single-translation-unit: the runtime header
forward-declares static void longest_match_impl(...) and the
generated .c file (which #includes the header) provides the
definition, so every reference resolves within the same TU. The
runtime calls longest_match(buf, pos, mode) once per token;
everything else about the lexer (input buffering, position tracking,
EOF handling, mode-stack maintenance) stays generic in the runtime.
The buf argument is a byte view in every backend except
TypeScript, which receives a Latin-1-decoded string so the matcher
can read bytes via buf.charCodeAt(pos). The TS runtime keeps
the original Uint8Array alongside it for spans, advance(),
and UTF-8 token-text decoding; the Latin-1 string exists purely to
make the inner DFA loop a one-byte-string indexing pattern that
V8’s JIT specializes hard.
Differences between backends¶
The public surface is the same everywhere. The per-backend variation is limited to:
File layout. Rust and C# emit a single file; Java emits one file per public type plus a
Grammarclass with the DFA tables; C emits a header plus implementation; Go emits a single package file with helper types re-exported from the runtime.Build file. Python wraps the generated Rust in a
pyo3extension module and emits aCargo.tomlpluspyproject.tomlsomaturincan build it. The other backends assume an existing build system in the consumer project.Runtime dependency. Rust, Python, Go, Java, C#, and TypeScript import a pre-built runtime crate / package (
parsuna-rt,parsuna.dev/parsuna-rt-go, etc.). C inlines the runtime intoparsuna_rt.hso a generated parser is self-contained.Skip-token policy. Most runtimes expose two configs — emit-skips (default, surfacing
WS/COMMENTasTokenevents) or drop-skips (the parser silently consumes them) — chosen at parser construction. In Rust this is a compile-timeParserConfig(zero-sized type parameter); in Go / Java / C# / TypeScript it’s a runtimeOptionsstruct passed to the constructor. Either way, the structural event stream is unchanged.
The TypeScript, Go, Java, and C# backends also differ in how they
encode the Event tagged union — an idiomatic discriminated union
in TypeScript, a struct with an EventTag field in Go, a sealed
class hierarchy in Java, a record hierarchy in C# — but the
semantics are the same in each case. Each runtime additionally
distinguishes a Garbage event (a token consumed by error
recovery) from a normal Token event so consumers can tell
recovered-past tokens from real ones without tracking recovery
state externally.
The tree-sitter backend¶
File: src/tree_sitter.rs.
Parsuna can also emit a grammar.js for tree-sitter, via the
parsuna <grammar> tree-sitter subcommand. This is not a full
backend — it does not use the StateTable at all, only the
AnalyzedGrammar — because tree-sitter has its own parser
generator. The emitted file is a declarative transliteration of the
parsuna grammar into tree-sitter’s combinator DSL, intended for
editor tooling (syntax highlighting, code folding). It does not
share the pull-parser runtime.
Putting it all together¶
The CLI’s generate subcommand is therefore tiny: parse the
grammar, analyze it, lower it, look up the backend by name, call
(backend.emit)(&state_table), and write each EmittedFile to
disk under the output directory. The passes documented in this
section are everything the generator does.