Pass 4 — Code generation¶
Entry point: codegen::GenerateTarget (src/codegen/mod.rs) — a
clap-derived enum with one variant per backend, each wrapping that
backend’s Args struct (empty for most, carrying flags like
--package / --namespace for Java and C#). The CLI parses the
target subcommand into a GenerateTarget and calls .emit(&st),
which dispatches to the chosen backend’s emit function. Backend
implementations live under src/codegen/<target>.rs.
Inputs¶
A backend is a pure function:
pub fn emit(st: &StateTable, args: &Args) -> Vec<EmittedFile>
Every backend reads the same StateTable. An EmittedFile is
just a relative path plus UTF-8 contents — backends never touch the
filesystem themselves; the CLI handles that.
The shape each backend produces¶
All backends produce the same logical artifacts, encoded in the target language’s idioms:
TokenKind enum. One variant per entry in
state_table.tokens, plusEOF = 0. Representation per language:Rust / TypeScript / Python (PyO3 wraps Rust):
u16.Go / C:
uint16/uint16_t.C#:
ushort.Java:
int(Java has no unsigned 16-bit primitive — values stay in the0..0xFFFFrange).
Lex failures are not a TokenKind variant; they ride the optional / sentinel convention chosen per language.
Lex-failure representation. Languages with a free structural nullable carry the kind as that:
Rust:
Token<TK> { kind: Option<TK>, … }.TypeScript:
Token<TK> { kind: TK | null, … }.Python:
Optional[int]on the PyO3Tokenwrapper.
Unsigned-int backends use the in-band sentinel value
0xFFFFfor “no DFA pattern matched”:Go:
Token { Kind uint16, … }; the runtime exposesparsunart.ErrorKind = 0xFFFF.Java:
Token { int kind, … }; the runtime exposesLexer.ERROR_KIND = 0xFFFF.C#:
Token(ushort Kind, …); the runtime exposesLexer.ErrorKind = 0xFFFF.C:
{ uint16_t kind; … }; the runtime definesPARSUNA_LEX_ERROR = 0xFFFFand the public header re-exports it as{upper}_LEX_ERROR. FIRST and SYNC arrays both terminate withSENTINEL = 0xFFFF— same value as the lex-failure kind, but the two never appear in the same comparison context (sentinels are array terminators only, the kind only appears on aToken).
No grammar-declared kind id collides with
0xFFFF(or with0, which is reserved for EOF). The user-facing analyzer no longer reserves the nameERROR— a grammar can declare a token calledERRORif it likes; the lex-failure sentinel lives outside theTokenKindenum in every backend.RuleKind enum. One variant per entry in
state_table.rule_kinds, representationu16(or the language’s nearest unsigned half-word).Name-lookup helpers.
tokenKindName/ruleKindName— useful for diagnostics and test output.Compiled lexer DFA. Not a packed table — the DFA is emitted as code. Each state becomes one arm of a
match/switchthat reads the current byte and dispatches on it; the whole machine is wrapped in alongest_matchfunction exposed through the runtime’sDfaMatcher(or equivalent) interface. Byte ranges sharing the same target collapse into single arms (so an[a-z]+token compiles toSome(&(b'a'..=b'z'))rather than 26 individual byte arms). See The compiled lexer DFA below for the shape and rationale.SYNC intern pool. Every entry from
state_table.sync_setsbecomes a named constant (SYNC_N). States refer to them by index, so rules whose FOLLOW sets intern to the same set share a single table entry.FIRST intern pool — filtered. Only the entries whose
has_referencesflag is set get emitted asFIRST_Nconstants. For LL(1) grammars that’s empty:Op::Star/Op::Optcodegen atk = 1inlines the FIRST set into amatcharm pattern instead of consulting the pool, andOp::Dispatchbuilds its decision tree at lowering time. For LL(k > 1) only the entries actually referenced bymatches_first(FIRST_N)calls are emitted.Entry-state constants. One
ENTRY_<RULE>per public rule.State-dispatch function. A single function — in most backends a large
switchon the current state id — that runs one state of the parser and either consumes an event or transfers control.parse_<rule> entry points. One convenience constructor per public rule, wrapping the runtime’s
Parser::newwith the correct entry-state constant and a configured lexer.A small amount of backend-specific plumbing. Package declarations, import blocks, module wrappers, and whatever else the target language needs. Each backend keeps this to the minimum.
The state-dispatch function: how ops become code¶
The core translation every backend performs is from Op values to
statements in the target language. The mapping is direct:
Enter(kind)/Exit(kind)A call into the runtime:
parser.enter(RuleKind::X)/parser.exit(RuleKind::X). The runtime emits the structural event and positions it at the current parse cursor.Expect { kind, token_name, sync }parser.expect(TokenKind::X, SYNC_N, "expected X"). The runtime handles the match/recover/retry protocol (see The event model).PushRet(s)parser.push_ret(s).Jump(s)Set the current state to
sand break out of the dispatch iteration (or, in languages where it’s simpler, fall through to the next case by numeric adjacency).Retparser.state = parser.ret().Star { first, body, cont, head }/Opt { first, body, cont }A predicated branch. The lookahead test is the same in both ops; the match-path push and the miss-path transfer differ slightly:
Starmatch path:parser.push_ret(head); cur = body. Theheadpush is what makes the body return back to the loop-condition state for the next iteration.Optmatch path: depends oncont.
contisSome(state)for the original push-and-jump shape orNoneafter tail-call elimination — the codegen pattern-matches:Some(s)—Optmatch emitsparser.push_ret(s); cur = body; both ops’ miss path emitscur = s.None—Optmatch emitscur = body(no push, the body’s trailingRetreturns to our caller); both ops’ miss path emitscur = parser.ret().
The form of the lookahead test depends on
state_table.k:k = 1— the FIRST set is inlined into amatcharm pattern (one alternative per token kind it accepts). NoFIRST_Nconstant is emitted or referenced.k > 1— emitsparser.matches_first(FIRST_N), the only place the FIRST intern pool is consulted at runtime.
Dispatch { tree, sync, cont }A nested
switch. EachDispatchTree::Switchbecomes one level ofswitch (parser.look(depth).kind); eachLeafbecomes an action whose form depends oncont:Some(s)—Armleaves emitparser.push_ret(s); cur = arm_body;Fallthroughemitscur = s;Erroremitscur = s; error+recover(sync).None—Armleaves emitcur = arm_body(no push);Fallthroughemitscur = parser.ret();Errorruns the recovery and then emitscur = parser.ret().
The trie structure is what lets a k-token lookahead compile to k nested switches rather than a linear scan.
Dispatchconsumes its FIRST sets at lowering time when the tree is built; the emitted switch arms carry concrete token kinds, so noFIRST_Nreference appears at runtime.
Because the ops are already linear and the state ids are dense
integers, most backends can emit the whole parser as a single
switch in a single function.
The compiled lexer DFA¶
The lexer DFA is compiled to code, not to data. Each backend
walks state_table.lexer_dfa.states and emits a longest_match
function shaped like:
fn longest_match(buf, start) -> DfaMatch:
pos = start
best_len = 0
best_kind = None
state = <dfa.start>
loop:
match state:
1 => match buf[pos]:
'a'..='z' => { pos += 1; state = 2;
best_len = pos - start;
best_kind = Some(TokenKind.Ident); }
_ => break
2 => {
// Self-loop prologue: scan past every byte that
// would loop the state back to itself in one go,
// before the per-byte switch.
while buf[pos] in {'a'..='z', '0'..='9', '_'}:
pos += 1
best_len = pos - start
best_kind = Some(TokenKind.Ident)
match buf[pos]:
...
}
...
return DfaMatch { len: best_len, kind: best_kind }
One outer switch arm per DFA state, one inner switch arm
per byte target. Whenever a transition lands on an accept state,
the surrounding code records (best_len, best_kind) so the
classic longest-match rule falls out of the structure.
Self-loop prologue¶
For states whose self_loop field (set by lowering) is non-empty
— the typical lexer hot-path states like whitespace runs, identifier
bodies, comment / string contents — backends emit a bulk scan
prologue ahead of the regular per-byte switch. The prologue advances
pos past every byte in the self-loop set in one tight loop that
the host language’s optimizer typically autovectorizes (Rust’s
LLVM, V8’s Irregexp via sticky regex, .NET’s
MemoryExtensions.IndexOfAnyExcept, Go’s bytes.IndexFunc,
HotSpot’s loop vectorizer). The byte that breaks the prologue then
falls through to the regular dispatch — non-self-loop transitions
are unchanged.
Byte-range collapsing¶
A naive emission would produce 256 arms per DFA state. Instead the
DFA compiler in lowering::lexer_dfa returns its states with
transitions already grouped: each DfaState carries its
arms: Vec<ByteArm>, where bytes sharing a target collapse into
one arm and contiguous bytes within an arm collapse into ranges.
So IDENT = ('a'..'z' | '_')+; compiles to two arms per state —
one for b'a'..=b'z', one for b'_' — not 27. Every backend
works from the same pre-grouped shape; no per-backend collapsing
logic is needed.
Why code instead of tables¶
Constant folding by the host compiler. The byte-range patterns and state transitions are visible to LLVM, javac, the Go SSA pass, and friends; they compile to jump tables, branch tables, or inlined comparisons depending on the target. A data-driven walk over a packed transition array is opaque to those optimizers.
Dense binaries for sparse DFAs. Most DFA states have only a handful of live transitions; emitting just those arms is much smaller than reserving 256 entries per state.
Locality. The parser dispatch is already one big
switch; the DFA matcher is the same shape in the same file, so a reader sees the whole machine as code rather than as opaque tables threaded through a runtime helper.
Each backend supplies its compiled matcher to the runtime through
a DfaMatcher trait (Rust), interface (Go, Java, C#), or function
type (TypeScript). C is single-translation-unit: the runtime header
forward-declares static void longest_match_impl(...) and the
generated .c file (which #includes the header) provides the
definition, so every reference resolves within the same TU. The
runtime calls longest_match(buf, pos) once per token; everything
else about the lexer (input buffering, position tracking, EOF
handling) stays generic in the runtime crate.
The buf argument is a byte view in every backend except
TypeScript, which receives a Latin-1-decoded string so the matcher
can read bytes via buf.charCodeAt(pos). The TS runtime keeps
the original Uint8Array alongside it for spans, advance(),
and UTF-8 token-text decoding; the Latin-1 string exists purely to
make the inner DFA loop a one-byte-string indexing pattern that
V8’s JIT specializes hard.
Differences between backends¶
The public surface is the same everywhere. The per-backend variation is limited to:
File layout. Rust and C# emit a single file; Java emits one file per public type plus a
Grammarclass with the DFA tables; C emits a header plus implementation; Go emits a single package file with helper types re-exported from the runtime.Build file. Python wraps the generated Rust in a
pyo3extension module and emits aCargo.tomlpluspyproject.tomlsomaturincan build it. The other backends assume an existing build system in the consumer project.Runtime dependency. Rust, Python, Go, Java, C#, and TypeScript import a pre-built runtime crate/package (
parsuna-rt,parsuna.dev/parsuna-rt-go, etc.). C inlines the runtime intoparsuna_rt.hso a generated parser is self-contained.
The TypeScript, Go, Java, and C# backends also differ in how they
encode the Event tagged union — an idiomatic discriminated union
in TypeScript, a struct with an EventTag field in Go, a sealed
class hierarchy in Java, a record hierarchy in C# — but the
semantics are the same in each case.
The tree-sitter backend¶
File: src/tree_sitter.rs.
Parsuna can also emit a grammar.js for tree-sitter, via the
parsuna <grammar> tree-sitter subcommand. This is not a full
backend — it does not use the StateTable at all, only the
AnalyzedGrammar — because tree-sitter has its own parser
generator. The emitted file is a declarative transliteration of the
parsuna grammar into tree-sitter’s combinator DSL, intended for
editor tooling (syntax highlighting, code folding). It does not
share the pull-parser runtime.
Putting it all together¶
The CLI’s generate subcommand is therefore tiny: parse the
grammar, analyze it, lower it, look up the backend by name, call
(backend.emit)(&state_table), and write each EmittedFile to
disk under the output directory. The passes documented in this
section are everything the generator does.