Pass 1 — Grammar parsing
Entry point: grammar::parse_grammar (src/grammar/parser.rs).
Output: a Grammar value (src/grammar/ir.rs).
Input
The source text of a .parsuna file, as UTF-8 bytes. Nothing else
is consulted — there are no includes, no external definitions, no
search path.
Bootstrap
The grammar-file parser is itself generated by parsuna. The file
src/grammar/parsuna.parsuna describes the syntax of a parsuna
grammar; running the generator over it produces
src/grammar/generated.rs, a pull-parser over the grammar DSL.
parser.rs then consumes that pull parser’s events and builds the
Grammar IR. Parsuna bootstraps itself.
A practical consequence is that the first pass doubles as a worked example of
consuming an event stream. parser.rs reads Enter/Exit
pairs to recognize rule-shaped blocks, pulls tokens out of them with
a small Reader abstraction, and accumulates errors as it goes.
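The Enter/Exit pairing described above can be sketched as a small recursive reader. This is a minimal illustration, not the code in parser.rs: the Event enum, the Block type, the read_block function, and the rule names used in the example are all hypothetical stand-ins for whatever the generated pull parser actually exposes.

```rust
/// Hypothetical event shape; the real generated parser's types may differ.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Enter(&'static str),         // start of a rule-shaped block
    Exit(&'static str),          // end of the matching block
    Token(&'static str, String), // (token kind, source text)
}

/// A block reconstructed from one Enter/Exit pair.
#[derive(Debug, PartialEq)]
struct Block {
    rule: &'static str,
    tokens: Vec<String>,
    children: Vec<Block>,
}

/// Consume events up to the Exit that closes an already-consumed Enter(rule).
/// Problems are pushed onto `errors` instead of aborting the read.
fn read_block<I: Iterator<Item = Event>>(
    rule: &'static str,
    events: &mut I,
    errors: &mut Vec<String>,
) -> Block {
    let mut block = Block { rule, tokens: Vec::new(), children: Vec::new() };
    loop {
        match events.next() {
            // A nested Enter opens a child block; recurse until its Exit.
            Some(Event::Enter(child)) => {
                block.children.push(read_block(child, events, errors));
            }
            Some(Event::Token(_, text)) => block.tokens.push(text),
            Some(Event::Exit(name)) => {
                if name != rule {
                    errors.push(format!("expected Exit({}), found Exit({})", rule, name));
                }
                return block;
            }
            None => {
                errors.push(format!("event stream ended inside {}", rule));
                return block;
            }
        }
    }
}
```

The caller consumes the opening Enter itself and then hands the iterator to read_block, so each level of recursion owns exactly one Enter/Exit pair.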
The Grammar IR
The parse produces a flat Grammar:
name: String. A label used by later phases for file and package naming. The parser leaves this empty; the CLI fills it from the input file’s stem (or from --name when given).
tokens: IndexMap<String, TokenDef>. Every token declaration, keyed by name. IndexMap preserves insertion (= source) order on iteration, which the lexer DFA relies on for tie-breaking and the lowering pass relies on for RuleKind id assignment. Each TokenDef records the name, the body (TokenPattern), the skip flag (from a -> skip action), the is_fragment flag (_-prefix), the optional lexer mode (from an @mode(name) pre-annotation), the resolved mode_actions list (from -> push(mode) / -> pop actions), and a source span.
rules: IndexMap<String, RuleDef>. Every rule declaration. Each RuleDef records the name, the body (Expr), the is_fragment flag, and a source span.
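As a rough picture of the fields listed above, the IR could be laid out as follows. This is a sketch under several assumptions: the standard library has no insertion-ordered map, so OrderedMap here is a Vec of pairs standing in for IndexMap; Span is assumed to be a pair of byte offsets; TokenPattern and Expr are left as opaque placeholders; and declaration_ids is a hypothetical helper that only illustrates why source-order iteration matters.

```rust
// Opaque placeholders: the two tree types are described separately.
#[derive(Debug, Default)]
struct TokenPattern;
#[derive(Debug, Default)]
struct Expr;

/// Assumed span representation (byte offsets); the real type may differ.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

#[derive(Debug, Clone, PartialEq)]
enum ModeAction {
    Push(String),
    Pop,
}

#[derive(Debug, Default)]
struct TokenDef {
    name: String,
    body: TokenPattern,
    skip: bool,                    // set by a `-> skip` action
    is_fragment: bool,             // `_`-prefixed name
    mode: Option<String>,          // set by an `@mode(name)` pre-annotation
    mode_actions: Vec<ModeAction>, // `-> push(mode)` / `-> pop`, in source order
    span: Span,
}

#[derive(Debug, Default)]
struct RuleDef {
    name: String,
    body: Expr,
    is_fragment: bool,
    span: Span,
}

/// Stand-in for IndexMap<String, V>: a Vec of pairs also iterates in
/// insertion order, but it does not offer keyed lookup.
type OrderedMap<V> = Vec<(String, V)>;

#[derive(Debug, Default)]
struct Grammar {
    name: String, // left empty by the parser, filled in by the CLI
    tokens: OrderedMap<TokenDef>,
    rules: OrderedMap<RuleDef>,
}

/// Hypothetical helper: ids derived from iteration position are stable
/// only when iteration follows declaration order.
fn declaration_ids(g: &Grammar) -> Vec<(usize, &str)> {
    g.tokens
        .iter()
        .enumerate()
        .map(|(i, (name, _))| (i, name.as_str()))
        .collect()
}
```

With a hash-ordered map the positions returned by declaration_ids could change from run to run, which is the property the ordered map is there to rule out.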
TokenPattern is a regular expression tree over characters:
Empty, Literal(String), Class(CharClass), Ref(String),
Seq, Alt, Opt, Star, Plus. Expr is the
corresponding LL expression tree over tokens and rules: Empty,
Token(name), Rule(name), and the same combinators. Two trees,
one shape.
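To make the "two trees, one shape" point concrete, here is one way the two enums might be written. The variant names come from the list above; the payload shapes of the combinators (Vec for Seq and Alt, Box for Opt, Star and Plus), the fields of CharClass, and the two walker functions are assumptions made for this sketch.

```rust
/// Placeholder for the character-class payload; its real fields are not
/// described in this section, so this shape is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct CharClass {
    ranges: Vec<(char, char)>,
}

/// Regular-expression tree over characters.
#[derive(Debug, Clone, PartialEq)]
enum TokenPattern {
    Empty,
    Literal(String),
    Class(CharClass),
    Ref(String),
    Seq(Vec<TokenPattern>),
    Alt(Vec<TokenPattern>),
    Opt(Box<TokenPattern>),
    Star(Box<TokenPattern>),
    Plus(Box<TokenPattern>),
}

/// LL expression tree over tokens and rules, with the same combinators.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Empty,
    Token(String),
    Rule(String),
    Seq(Vec<Expr>),
    Alt(Vec<Expr>),
    Opt(Box<Expr>),
    Star(Box<Expr>),
    Plus(Box<Expr>),
}

/// Collect the names a token pattern refers to, left to right.
fn pattern_refs(p: &TokenPattern, out: &mut Vec<String>) {
    match p {
        TokenPattern::Ref(name) => out.push(name.clone()),
        TokenPattern::Seq(items) | TokenPattern::Alt(items) => {
            for item in items {
                pattern_refs(item, out);
            }
        }
        TokenPattern::Opt(inner) | TokenPattern::Star(inner) | TokenPattern::Plus(inner) => {
            pattern_refs(inner, out)
        }
        TokenPattern::Empty | TokenPattern::Literal(_) | TokenPattern::Class(_) => {}
    }
}

/// The same walk over a rule expression: only the leaf cases differ.
fn expr_names(e: &Expr, out: &mut Vec<String>) {
    match e {
        Expr::Token(name) | Expr::Rule(name) => out.push(name.clone()),
        Expr::Seq(items) | Expr::Alt(items) => {
            for item in items {
                expr_names(item, out);
            }
        }
        Expr::Opt(inner) | Expr::Star(inner) | Expr::Plus(inner) => expr_names(inner, out),
        Expr::Empty => {}
    }
}
```

The two walkers share their combinator arms line for line, which is what makes it practical to keep the trees separate without duplicating much logic.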
Where the distinction matters
The parser decides whether a body is a token pattern or a rule
expression from the case of the first letter of the declaration’s
name (see The grammar language). It uses different descent
functions (read_pattern_* vs. read_*) so that:
Character atoms ('a', .., ., !) and string literals are accepted only on the token side. Using one inside a rule body produces a pointed error like “string literal atoms are only valid inside token declarations”.
Identifiers in a rule body with an uppercase initial become Expr::Token(name); with a lowercase initial, Expr::Rule(name). Identifiers in a token body are always TokenPattern::Ref(name) and are resolved later.
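The case-based split can be sketched in a few lines. All function names here are hypothetical, the enums are cut down to the variants needed, and one detail is an assumption: the sketch skips a leading _ (the fragment marker) before testing the case of the first letter, which this section does not spell out.

```rust
#[derive(Debug, PartialEq)]
enum DeclKind {
    Token,
    Rule,
}

/// Classify a name by the case of its first letter.
/// Assumption: a leading `_` fragment marker is skipped first.
fn decl_kind(name: &str) -> Option<DeclKind> {
    let first = name.trim_start_matches('_').chars().next()?;
    if first.is_uppercase() {
        Some(DeclKind::Token)
    } else if first.is_lowercase() {
        Some(DeclKind::Rule)
    } else {
        None
    }
}

// Reduced versions of the two trees, with only the leaves used below.
#[derive(Debug, PartialEq)]
enum Expr {
    Token(String),
    Rule(String),
}

#[derive(Debug, PartialEq)]
enum TokenPattern {
    Ref(String),
    Literal(String),
}

/// In a rule body, the identifier's case picks the Expr variant.
fn ident_in_rule_body(name: &str) -> Option<Expr> {
    match decl_kind(name)? {
        DeclKind::Token => Some(Expr::Token(name.to_string())),
        DeclKind::Rule => Some(Expr::Rule(name.to_string())),
    }
}

/// In a token body, every identifier is an unresolved reference.
fn ident_in_token_body(name: &str) -> TokenPattern {
    TokenPattern::Ref(name.to_string())
}

/// String literals are accepted only on the token side.
fn string_literal(text: &str, in_token_body: bool) -> Result<TokenPattern, String> {
    if in_token_body {
        Ok(TokenPattern::Literal(text.to_string()))
    } else {
        Err("string literal atoms are only valid inside token declarations".to_string())
    }
}
```

Keeping the two identifier functions separate mirrors the read_pattern_* / read_* split: the caller already knows which side it is on, so no runtime flag is needed for identifiers.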
Actions and pre-annotations
A declaration may carry a trailing -> action[, action...] block
and a leading @kind(arg) pre-annotation. Both are parsed here
and lowered into TokenDef fields:
-> skip sets skip = true. Rejected on rules and on fragments (_NAME).
-> push(mode) / -> pop append a ModeAction::Push / ModeAction::Pop to mode_actions. Mode-stack actions are kept in source order so combinations like -> pop, push(b) (swap top) and -> push(a), push(b) (push two) round-trip cleanly. Combining skip with any mode action is rejected.
@mode(name) before a token declaration sets mode = Some(name). It applies to the very next declaration only: it is a per-token attribute, not a scope, so @mode(tag) repeats once per token bound to that mode. Rejected on rules.
Unknown action names (anything other than skip / push /
pop) and unknown pre-annotation kinds (anything other than
mode) are recorded as errors and parsing continues.
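The action rules above translate fairly directly into a lowering loop. The sketch below is illustrative only: the Action and Lowered types, the lower_actions function, and the wording of every error message are hypothetical. It enforces only the rejections stated in this section (skip on rules, skip on fragments, skip combined with mode actions, unknown names).

```rust
#[derive(Debug, Clone, PartialEq)]
enum ModeAction {
    Push(String),
    Pop,
}

/// Hypothetical parsed form of one entry in a `-> a, b, ...` block.
struct Action<'a> {
    name: &'a str,
    arg: Option<&'a str>,
}

/// The two TokenDef fields that actions feed into.
#[derive(Debug, Default, PartialEq)]
struct Lowered {
    skip: bool,
    mode_actions: Vec<ModeAction>,
}

/// Lower an action list, recording problems and continuing past them.
fn lower_actions(
    actions: &[Action<'_>],
    is_rule: bool,
    is_fragment: bool,
    errors: &mut Vec<String>,
) -> Lowered {
    let mut out = Lowered::default();
    for a in actions {
        match (a.name, a.arg) {
            ("skip", None) => {
                if is_rule {
                    errors.push("`skip` is not valid on a rule".to_string());
                } else if is_fragment {
                    errors.push("`skip` is not valid on a fragment".to_string());
                } else {
                    out.skip = true;
                }
            }
            // Mode actions are appended in source order.
            ("push", Some(mode)) => out.mode_actions.push(ModeAction::Push(mode.to_string())),
            ("pop", None) => out.mode_actions.push(ModeAction::Pop),
            // Known name, wrong argument shape.
            ("skip", _) | ("push", _) | ("pop", _) => {
                errors.push(format!("malformed arguments for action `{}`", a.name));
            }
            (other, _) => errors.push(format!("unknown action `{}`", other)),
        }
    }
    if out.skip && !out.mode_actions.is_empty() {
        errors.push("`skip` cannot be combined with mode actions".to_string());
    }
    out
}
```

Because mode_actions is a plain Vec filled in iteration order, -> pop, push(b) and -> push(b), pop lower to different values, which is what lets the swap-top idiom survive a round trip.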
Error collection
parse_grammar returns Result<Grammar, Vec<Error>>. It does
not stop at the first problem; instead it accumulates diagnostics
and keeps parsing. This produces the characteristic “ten errors at
once” experience: a malformed grammar still parses to a mostly-shaped
IR, and the user sees every syntactic issue in a single run.
The Reader abstraction handles the book-keeping. It holds the
current lookahead event, exposes peek/advance/expect_*
helpers, and transparently drops WS and COMMENT tokens so
callers never have to think about trivia. The Reader also silently
swallows the Garbage events that the bootstrap parser emits during
recovery, when the underlying runtime skips past unexpected input
(see The event model). The per-token errors that accompany them
already say what went wrong, so re-reporting them would just be
noise.
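A Reader with this behaviour can be approximated as a one-event lookahead wrapper around the event iterator. Everything below is a hypothetical sketch: the Event enum, the token kind strings other than WS and COMMENT, and the exact helper signatures are assumptions, and the real Reader may be organised differently.

```rust
/// Hypothetical event shape, including the recovery Garbage variant.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Enter(&'static str),
    Exit(&'static str),
    Token { kind: &'static str, text: String },
    Garbage(String),
}

struct Reader<I: Iterator<Item = Event>> {
    events: I,
    lookahead: Option<Event>,
}

impl<I: Iterator<Item = Event>> Reader<I> {
    fn new(events: I) -> Self {
        let mut reader = Reader { events, lookahead: None };
        reader.advance(); // prime the lookahead slot
        reader
    }

    /// Trivia tokens and recovery garbage never reach the caller.
    fn is_noise(ev: &Event) -> bool {
        match ev {
            Event::Garbage(_) => true,
            Event::Token { kind, .. } => *kind == "WS" || *kind == "COMMENT",
            _ => false,
        }
    }

    fn peek(&self) -> Option<&Event> {
        self.lookahead.as_ref()
    }

    /// Return the current lookahead and load the next significant event.
    fn advance(&mut self) -> Option<Event> {
        let next = self.events.by_ref().find(|ev| !Self::is_noise(ev));
        std::mem::replace(&mut self.lookahead, next)
    }

    /// Consume a token of the given kind, or record an error and stay put.
    fn expect_token(&mut self, kind: &'static str, errors: &mut Vec<String>) -> Option<String> {
        let matches = matches!(self.peek(), Some(Event::Token { kind: k, .. }) if *k == kind);
        if matches {
            match self.advance() {
                Some(Event::Token { text, .. }) => Some(text),
                _ => None,
            }
        } else {
            errors.push(format!("expected {}, found {:?}", kind, self.peek()));
            None
        }
    }
}
```

Filtering inside advance keeps the noise check in one place, so every peek and expect helper sees only significant events.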
Post-conditions
After parsing, the Grammar is syntactically well-formed
but semantically unchecked. It may still contain:
References to undefined tokens or rules.
Left-recursive rules.
Token reference cycles.
Names that collide with reserved runtime identifiers (EOF).
Mode actions that refer to mode names no token actually declares.
Duplicate declarations are caught here, at parse time, because the
IR’s name-keyed IndexMap would silently dedupe them otherwise.
Everything else listed above is left for the next pass,
Pass 2 — Analysis, to catch.
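The duplicate check, together with the accumulate-then-return error style described under Error collection, might look roughly like this. It is a sketch with assumed names throughout: the Error struct, its fields, the collect_decls function, and the (name, line) input shape are hypothetical, and keeping the first declaration while reporting later ones is a choice made for the example rather than documented behaviour.

```rust
use std::collections::HashMap;

/// Hypothetical diagnostic type; the real Error carries a source span.
#[derive(Debug, PartialEq)]
struct Error {
    message: String,
    line: usize,
}

/// Register declarations given as (name, line) pairs, reporting every
/// duplicate instead of letting a keyed insert overwrite the earlier one.
fn collect_decls(decls: &[(&str, usize)]) -> Result<Vec<String>, Vec<Error>> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut order = Vec::new();
    let mut errors = Vec::new();

    for &(name, line) in decls {
        if let Some(&first) = seen.get(name) {
            errors.push(Error {
                message: format!(
                    "duplicate declaration `{}` (first declared on line {})",
                    name, first
                ),
                line,
            });
            continue; // keep going so later problems are reported too
        }
        seen.insert(name, line);
        order.push(name.to_string());
    }

    // Succeed only if nothing was recorded; otherwise return every error.
    if errors.is_empty() {
        Ok(order)
    } else {
        Err(errors)
    }
}
```

Checking before inserting is the important part: a plain keyed insert would replace the earlier entry without complaint, and the duplicate would never surface as a diagnostic.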