Pass 1 — Grammar parsing¶
Entry point: grammar::parse_grammar (src/grammar/parser.rs).
Output: a Grammar value (src/grammar/ir.rs).
Input¶
The source text of a .parsuna file, as UTF-8 bytes. Nothing else
is consulted — there are no includes, no external definitions, no
search path.
Bootstrap¶
The grammar-file parser is itself generated by parsuna. The file
src/grammar/parsuna.parsuna describes the syntax of a parsuna
grammar; running the generator over it produces
src/grammar/generated.rs, a pull-parser over the grammar DSL.
parser.rs then consumes that pull parser’s events and builds the
Grammar IR. Parsuna bootstraps itself.
The implication is that the first pass is a worked example of
consuming an event stream. parser.rs reads Enter/Exit
pairs to recognise rule-shaped blocks, pulls tokens out of them with
a small Reader abstraction, and accumulates errors as it goes.
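The event-driven style can be sketched as follows. This is a minimal, hypothetical model of the generated pull parser: the `Node` and `Event` names here are illustrative stand-ins, not the real definitions from src/grammar/generated.rs.

```rust
// Hypothetical event types; the real generated parser in
// src/grammar/generated.rs has its own (richer) definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Node {
    RuleDecl,
}

#[derive(Debug, Clone, PartialEq)]
enum Event {
    Enter(Node),
    Token(String), // simplified: just the token's text
    Exit(Node),
}

/// Collect the name of each rule-shaped block: the first token
/// inside every Enter(RuleDecl)..Exit(RuleDecl) pair.
fn rule_names(events: &[Event]) -> Vec<String> {
    let mut names = Vec::new();
    let mut want_name = false;
    for ev in events {
        match ev {
            Event::Enter(Node::RuleDecl) => want_name = true,
            Event::Token(text) if want_name => {
                names.push(text.clone());
                want_name = false; // only the name token matters here
            }
            Event::Exit(Node::RuleDecl) => want_name = false,
            _ => {}
        }
    }
    names
}

fn main() {
    let events = vec![
        Event::Enter(Node::RuleDecl),
        Event::Token("expr".to_string()),
        Event::Exit(Node::RuleDecl),
    ];
    println!("rules: {:?}", rule_names(&events));
}
```

The real parser.rs does the same kind of state tracking, but over the full set of node kinds and with a Reader doing the lookahead (described below).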
The Grammar IR¶
The parse produces a flat Grammar:
- name: String — a label used by later phases for file and package naming. The parser leaves this empty; the CLI fills it from the input file’s stem (or from --name when given).
- tokens: Vec<TokenDef> — every token declaration in source order. Each TokenDef records the name, the body (TokenPattern), the skip flag (? prefix), the is_fragment flag (_ prefix), and a source span.
- rules: Vec<RuleDef> — every rule declaration in source order. Each RuleDef records the name, the body (Expr), the is_fragment flag, and a source span.
TokenPattern is a regular expression tree over characters:
Empty, Literal(String), Class(CharClass), Ref(String),
Seq, Alt, Opt, Star, Plus. Expr is the
corresponding LL expression tree over tokens and rules: Empty,
Token(name), Rule(name), and the same combinators. Two trees,
one shape.
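A plausible shape for the two trees, shown side by side. This is a sketch of what src/grammar/ir.rs describes, not a copy of it; the real `Class` variant holds a CharClass value, and the derive attributes and field details may differ.

```rust
// Sketch of the token-side regular-expression tree. `Class` takes a
// String here as a stand-in for the real CharClass type.
#[derive(Debug, Clone, PartialEq)]
enum TokenPattern {
    Empty,
    Literal(String),
    Class(String),
    Ref(String),
    Seq(Vec<TokenPattern>),
    Alt(Vec<TokenPattern>),
    Opt(Box<TokenPattern>),
    Star(Box<TokenPattern>),
    Plus(Box<TokenPattern>),
}

// Sketch of the rule-side LL expression tree: same combinators,
// but the atoms are token and rule references.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Empty,
    Token(String),
    Rule(String),
    Seq(Vec<Expr>),
    Alt(Vec<Expr>),
    Opt(Box<Expr>),
    Star(Box<Expr>),
    Plus(Box<Expr>),
}

/// Build the body of a hypothetical rule `expr: term (PLUS term)*`.
fn example_rule_body() -> Expr {
    Expr::Seq(vec![
        Expr::Rule("term".to_string()),
        Expr::Star(Box::new(Expr::Seq(vec![
            Expr::Token("PLUS".to_string()),
            Expr::Rule("term".to_string()),
        ]))),
    ])
}

fn main() {
    println!("{:?}", example_rule_body());
}
```

The "two trees, one shape" symmetry is visible directly: swap the atom variants and the combinators line up one for one.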
Where the distinction matters¶
The parser decides whether a body is a token pattern or a rule
expression from the case of the first letter of the declaration’s
name (see The grammar language). It uses different descent
functions (read_pattern_* vs. read_*) so that:
- Character atoms ('a', .., ., !) and string literals are accepted only on the token side. Using one inside a rule body produces a pointed error like “string literal atoms are only valid inside token declarations”.
- Identifiers in a rule body with an uppercase initial become Expr::Token(name); with a lowercase initial, Expr::Rule(name). Identifiers in a token body are always TokenPattern::Ref(name) and are resolved later.
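The case-based dispatch for identifiers in a rule body reduces to a few lines. This sketch uses a hypothetical `classify` helper; in the real parser the same decision is made inline in the read_* descent functions.

```rust
// Hypothetical result type: which Expr atom an identifier becomes.
#[derive(Debug, PartialEq)]
enum RuleAtom {
    Token(String), // uppercase initial, e.g. IDENT
    Rule(String),  // lowercase initial, e.g. expr
}

/// Decide from the first letter whether an identifier in a rule body
/// refers to a token or to another rule.
fn classify(ident: &str) -> RuleAtom {
    let uppercase_initial = ident
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_uppercase());
    if uppercase_initial {
        RuleAtom::Token(ident.to_string())
    } else {
        RuleAtom::Rule(ident.to_string())
    }
}

fn main() {
    println!("{:?}", classify("IDENT")); // token reference
    println!("{:?}", classify("expr"));  // rule reference
}
```

No such decision is needed inside a token body: every identifier there is a TokenPattern::Ref and resolution is deferred to a later pass.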
Error collection¶
parse_grammar returns Result<Grammar, Vec<Error>>. It does
not stop at the first problem; instead it accumulates diagnostics
and keeps parsing. This produces the characteristic “ten errors at
once” experience: a malformed grammar still parses to a mostly-shaped
IR, and the user sees every syntactic issue in a single run.
The Reader abstraction handles the book-keeping. It holds the
current lookahead event, exposes peek/advance/expect_*
helpers, and transparently drops WS and COMMENT tokens so
callers never have to think about trivia.
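The Reader's two jobs described above (trivia-skipping lookahead and non-fatal error recording) can be sketched together. Everything here is illustrative: the token kinds, method names, and String diagnostics are hypothetical simplifications of the real Reader in parser.rs, which works over parse events and carries spans.

```rust
// Hypothetical token kinds; the real Reader consumes parser events.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Ws,
    Comment,
    Ident,
    Colon,
}

struct Reader {
    tokens: Vec<Kind>,
    pos: usize,
    errors: Vec<String>, // accumulated diagnostics, never fatal
}

impl Reader {
    fn new(tokens: Vec<Kind>) -> Self {
        Reader { tokens, pos: 0, errors: Vec::new() }
    }

    /// Current non-trivia token: WS and COMMENT are dropped
    /// transparently so callers never see them.
    fn peek(&mut self) -> Option<Kind> {
        while let Some(&k) = self.tokens.get(self.pos) {
            if k == Kind::Ws || k == Kind::Comment {
                self.pos += 1;
            } else {
                return Some(k);
            }
        }
        None
    }

    /// Record a diagnostic and keep going instead of bailing out:
    /// this is what makes "ten errors at once" possible.
    fn expect(&mut self, want: Kind) -> bool {
        match self.peek() {
            Some(k) if k == want => {
                self.pos += 1;
                true
            }
            other => {
                self.errors
                    .push(format!("expected {:?}, found {:?}", want, other));
                false
            }
        }
    }
}

fn main() {
    let mut r = Reader::new(vec![Kind::Ws, Kind::Ident, Kind::Comment, Kind::Colon]);
    assert!(r.expect(Kind::Ident)); // trivia before it is skipped
    assert!(r.expect(Kind::Colon)); // trivia between tokens too
    assert!(!r.expect(Kind::Ident)); // failure is recorded, not thrown
    println!("{} error(s): {:?}", r.errors.len(), r.errors);
}
```

Because `expect` returns a flag instead of aborting, the calling descent functions can note the problem, resynchronise, and continue parsing the rest of the grammar.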
Post-conditions¶
After parsing, the Grammar is syntactically well-formed
but semantically unchecked. It may still contain:
References to undefined tokens or rules.
Left-recursive rules.
Token reference cycles.
Duplicate declarations.
Names that collide with runtime sentinels (EOF, ERROR).
The next pass, Pass 2 — Analysis, is what catches those.