Pass 4 — Code generation

Entry point: codegen::GenerateTarget (src/codegen/mod.rs) — a clap-derived enum with one variant per backend, each wrapping that backend’s Args struct (empty for most, carrying flags like --package / --namespace for Java and C#). The CLI parses the target subcommand into a GenerateTarget and calls .emit(&st), which dispatches to the chosen backend’s emit function. Backend implementations live under src/codegen/<target>.rs.

Inputs

A backend is a pure function:

pub fn emit(st: &StateTable, args: &Args) -> Vec<EmittedFile>

Every backend reads the same StateTable. An EmittedFile is just a relative path plus UTF-8 contents — backends never touch the filesystem themselves; the CLI handles that.

The shape each backend produces

All backends produce the same logical artifacts, encoded in the target language’s idioms:

  1. TokenKind enum. One variant per entry in state_table.tokens, plus EOF = 0. Representation per language:

    • Rust / TypeScript / Python (PyO3 wraps Rust): u16.

    • Go / C: uint16 / uint16_t.

    • C#: ushort.

    • Java: int (Java has no unsigned 16-bit primitive — values stay in the 0..0xFFFF range).

    Lex failures are not a TokenKind variant; they ride the optional / sentinel convention chosen per language.

  2. Lex-failure representation. Languages with a free structural nullable carry the kind as that:

    • Rust: Token<TK> { kind: Option<TK>, }.

    • TypeScript: Token<TK> { kind: TK | null, }.

    • Python: Optional[int] on the PyO3 Token wrapper.

    Unsigned-int backends use the in-band sentinel value 0xFFFF for “no DFA pattern matched”:

    • Go: Token { Kind uint16, }; the runtime exposes parsunart.ErrorKind = 0xFFFF.

    • Java: Token { int kind, }; the runtime exposes Lexer.ERROR_KIND = 0xFFFF.

    • C#: Token(ushort Kind, …); the runtime exposes Lexer.ErrorKind = 0xFFFF.

    • C: { uint16_t kind; }; the runtime defines PARSUNA_LEX_ERROR = 0xFFFF and the public header re-exports it as {upper}_LEX_ERROR. FIRST and SYNC arrays both terminate with SENTINEL = 0xFFFF — same value as the lex-failure kind, but the two never appear in the same comparison context (sentinels are array terminators only, the kind only appears on a Token).

    No grammar-declared kind id collides with 0xFFFF (or with 0, which is reserved for EOF). The user-facing analyzer no longer reserves the name ERROR — a grammar can declare a token called ERROR if it likes; the lex-failure sentinel lives outside the TokenKind enum in every backend.

  3. RuleKind enum. One variant per entry in state_table.rule_kinds, representation u16 (or the language’s nearest unsigned half-word).

  4. Name-lookup helpers. tokenKindName / ruleKindName — useful for diagnostics and test output.

  5. Compiled lexer DFA. Not a packed table — the DFA is emitted as code. Each state becomes one arm of a match/switch that reads the current byte and dispatches on it; the whole machine is wrapped in a longest_match function exposed through the runtime’s DfaMatcher (or equivalent) interface. Byte ranges sharing the same target collapse into single arms (so an [a-z]+ token compiles to Some(&(b'a'..=b'z')) rather than 26 individual byte arms). See The compiled lexer DFA below for the shape and rationale.

  6. SYNC intern pool. Every entry from state_table.sync_sets becomes a named constant (SYNC_N). States refer to them by index, so rules whose FOLLOW sets intern to the same set share a single table entry.

  7. FIRST intern pool — filtered. Only the entries whose has_references flag is set get emitted as FIRST_N constants. For LL(1) grammars that’s empty: Op::Star / Op::Opt codegen at k = 1 inlines the FIRST set into a match arm pattern instead of consulting the pool, and Op::Dispatch builds its decision tree at lowering time. For LL(k > 1) only the entries actually referenced by matches_first(FIRST_N) calls are emitted.

  8. Entry-state constants. One ENTRY_<RULE> per public rule.

  9. State-dispatch function. A single function — in most backends a large switch on the current state id — that runs one state of the parser and either consumes an event or transfers control.

  10. parse_<rule> entry points. One convenience constructor per public rule, wrapping the runtime’s Parser::new with the correct entry-state constant and a configured lexer.

  11. A small amount of backend-specific plumbing. Package declarations, import blocks, module wrappers, and whatever else the target language needs. Each backend keeps this to the minimum.

The state-dispatch function: how ops become code

The core translation every backend performs is from Op values to statements in the target language. The mapping is direct:

Enter(kind) / Exit(kind)

A call into the runtime: parser.enter(RuleKind::X) / parser.exit(RuleKind::X). The runtime emits the structural event and positions it at the current parse cursor.

Expect { kind, token_name, sync }

parser.expect(TokenKind::X, SYNC_N, "expected X"). The runtime handles the match/recover/retry protocol (see The event model).

PushRet(s)

parser.push_ret(s).

Jump(s)

Set the current state to s and break out of the dispatch iteration (or, in languages where it’s simpler, fall through to the next case by numeric adjacency).

Ret

parser.state = parser.ret().

Star { first, body, cont, head } / Opt { first, body, cont }

A predicated branch. The lookahead test is the same in both ops; the match-path push and the miss-path transfer differ slightly:

  • Star match path: parser.push_ret(head); cur = body. The head push is what makes the body return back to the loop-condition state for the next iteration.

  • Opt match path: depends on cont.

cont is Some(state) for the original push-and-jump shape or None after tail-call elimination — the codegen pattern-matches:

  • Some(s)Opt match emits parser.push_ret(s); cur = body; both ops’ miss path emits cur = s.

  • NoneOpt match emits cur = body (no push, the body’s trailing Ret returns to our caller); both ops’ miss path emits cur = parser.ret().

The form of the lookahead test depends on state_table.k:

  • k = 1 — the FIRST set is inlined into a match arm pattern (one alternative per token kind it accepts). No FIRST_N constant is emitted or referenced.

  • k > 1 — emits parser.matches_first(FIRST_N), the only place the FIRST intern pool is consulted at runtime.

Dispatch { tree, sync, cont }

A nested switch. Each DispatchTree::Switch becomes one level of switch (parser.look(depth).kind); each Leaf becomes an action whose form depends on cont:

  • Some(s)Arm leaves emit parser.push_ret(s); cur = arm_body; Fallthrough emits cur = s; Error emits cur = s; error+recover(sync).

  • NoneArm leaves emit cur = arm_body (no push); Fallthrough emits cur = parser.ret(); Error runs the recovery and then emits cur = parser.ret().

The trie structure is what lets a k-token lookahead compile to k nested switches rather than a linear scan. Dispatch consumes its FIRST sets at lowering time when the tree is built; the emitted switch arms carry concrete token kinds, so no FIRST_N reference appears at runtime.

Because the ops are already linear and the state ids are dense integers, most backends can emit the whole parser as a single switch in a single function.

The compiled lexer DFA

The lexer DFA is compiled to code, not to data. Each backend walks state_table.lexer_dfa.states and emits a longest_match function shaped like:

fn longest_match(buf, start) -> DfaMatch:
    pos        = start
    best_len   = 0
    best_kind  = None
    state      = <dfa.start>
    loop:
        match state:
            1 => match buf[pos]:
                'a'..='z' => { pos += 1; state = 2;
                               best_len = pos - start;
                               best_kind = Some(TokenKind.Ident); }
                _         => break
            2 => {
                // Self-loop prologue: scan past every byte that
                // would loop the state back to itself in one go,
                // before the per-byte switch.
                while buf[pos] in {'a'..='z', '0'..='9', '_'}:
                    pos += 1
                best_len = pos - start
                best_kind = Some(TokenKind.Ident)
                match buf[pos]:
                    ...
            }
            ...
    return DfaMatch { len: best_len, kind: best_kind }

One outer switch arm per DFA state, one inner switch arm per byte target. Whenever a transition lands on an accept state, the surrounding code records (best_len, best_kind) so the classic longest-match rule falls out of the structure.

Self-loop prologue

For states whose self_loop field (set by lowering) is non-empty — the typical lexer hot-path states like whitespace runs, identifier bodies, comment / string contents — backends emit a bulk scan prologue ahead of the regular per-byte switch. The prologue advances pos past every byte in the self-loop set in one tight loop that the host language’s optimizer typically autovectorizes (Rust’s LLVM, V8’s Irregexp via sticky regex, .NET’s MemoryExtensions.IndexOfAnyExcept, Go’s bytes.IndexFunc, HotSpot’s loop vectorizer). The byte that breaks the prologue then falls through to the regular dispatch — non-self-loop transitions are unchanged.

Byte-range collapsing

A naive emission would produce 256 arms per DFA state. Instead the DFA compiler in lowering::lexer_dfa returns its states with transitions already grouped: each DfaState carries its arms: Vec<ByteArm>, where bytes sharing a target collapse into one arm and contiguous bytes within an arm collapse into ranges. So IDENT = ('a'..'z' | '_')+; compiles to two arms per state — one for b'a'..=b'z', one for b'_' — not 27. Every backend works from the same pre-grouped shape; no per-backend collapsing logic is needed.

Why code instead of tables

  • Constant folding by the host compiler. The byte-range patterns and state transitions are visible to LLVM, javac, the Go SSA pass, and friends; they compile to jump tables, branch tables, or inlined comparisons depending on the target. A data-driven walk over a packed transition array is opaque to those optimizers.

  • Dense binaries for sparse DFAs. Most DFA states have only a handful of live transitions; emitting just those arms is much smaller than reserving 256 entries per state.

  • Locality. The parser dispatch is already one big switch; the DFA matcher is the same shape in the same file, so a reader sees the whole machine as code rather than as opaque tables threaded through a runtime helper.

Each backend supplies its compiled matcher to the runtime through a DfaMatcher trait (Rust), interface (Go, Java, C#), or function type (TypeScript). C is single-translation-unit: the runtime header forward-declares static void longest_match_impl(...) and the generated .c file (which #includes the header) provides the definition, so every reference resolves within the same TU. The runtime calls longest_match(buf, pos) once per token; everything else about the lexer (input buffering, position tracking, EOF handling) stays generic in the runtime crate.

The buf argument is a byte view in every backend except TypeScript, which receives a Latin-1-decoded string so the matcher can read bytes via buf.charCodeAt(pos). The TS runtime keeps the original Uint8Array alongside it for spans, advance(), and UTF-8 token-text decoding; the Latin-1 string exists purely to make the inner DFA loop a one-byte-string indexing pattern that V8’s JIT specializes hard.

Shared helpers

File: src/codegen/common.rs.

Language-neutral helpers shared across backends: naming conversions (pascal, screaming_snake), string-literal escaping (escape_string, escape_string_bmp), and an Expr walker (visit). A backend’s job is mostly deciding what layout to produce; common exists so two backends producing the same name convention do so consistently.

Differences between backends

The public surface is the same everywhere. The per-backend variation is limited to:

  • File layout. Rust and C# emit a single file; Java emits one file per public type plus a Grammar class with the DFA tables; C emits a header plus implementation; Go emits a single package file with helper types re-exported from the runtime.

  • Build file. Python wraps the generated Rust in a pyo3 extension module and emits a Cargo.toml plus pyproject.toml so maturin can build it. The other backends assume an existing build system in the consumer project.

  • Runtime dependency. Rust, Python, Go, Java, C#, and TypeScript import a pre-built runtime crate/package (parsuna-rt, parsuna.dev/parsuna-rt-go, etc.). C inlines the runtime into parsuna_rt.h so a generated parser is self-contained.

The TypeScript, Go, Java, and C# backends also differ in how they encode the Event tagged union — an idiomatic discriminated union in TypeScript, a struct with an EventTag field in Go, a sealed class hierarchy in Java, a record hierarchy in C# — but the semantics are the same in each case.

The tree-sitter backend

File: src/tree_sitter.rs.

Parsuna can also emit a grammar.js for tree-sitter, via the parsuna <grammar> tree-sitter subcommand. This is not a full backend — it does not use the StateTable at all, only the AnalyzedGrammar — because tree-sitter has its own parser generator. The emitted file is a declarative transliteration of the parsuna grammar into tree-sitter’s combinator DSL, intended for editor tooling (syntax highlighting, code folding). It does not share the pull-parser runtime.

Putting it all together

The CLI’s generate subcommand is therefore tiny: parse the grammar, analyze it, lower it, look up the backend by name, call (backend.emit)(&state_table), and write each EmittedFile to disk under the output directory. The passes documented in this section are everything the generator does.