Pass 1 — Grammar parsing

Entry point: grammar::parse_grammar (src/grammar/parser.rs). Output: a Grammar value (src/grammar/ir.rs).

Input

The source text of a .parsuna file, as UTF-8 bytes. Nothing else is consulted — there are no includes, no external definitions, no search path.

Bootstrap

The grammar-file parser is itself generated by parsuna. The file src/grammar/parsuna.parsuna describes the syntax of a parsuna grammar; running the generator over it produces src/grammar/generated.rs, a pull-parser over the grammar DSL. parser.rs then consumes that pull parser’s events and builds the Grammar IR. Parsuna bootstraps itself.

One consequence is that the first pass doubles as a worked example of consuming an event stream. parser.rs reads Enter/Exit pairs to recognise rule-shaped blocks, pulls tokens out of them with a small Reader abstraction, and accumulates errors as it goes.
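The event-driven style can be sketched as follows. This is an illustrative stand-in, not parsuna's actual event API: the Event type, the "rule" and "IDENT" kind strings, and the rule_names helper are all assumptions made for the example.

```rust
// Hypothetical sketch of an Enter/Exit event stream and a consumer
// that recognises rule-shaped blocks, in the style the text describes.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Enter(&'static str),         // open a node, e.g. "rule"
    Token(&'static str, String), // a leaf token: kind + text
    Exit(&'static str),          // close the matching node
}

/// Walk the stream; whenever an Enter("rule") appears, take the first
/// IDENT token inside that block as the rule's name.
fn rule_names(events: &[Event]) -> Vec<String> {
    let mut names = Vec::new();
    let mut iter = events.iter();
    while let Some(ev) = iter.next() {
        if let Event::Enter("rule") = ev {
            // scan forward to the first identifier inside this block
            while let Some(inner) = iter.next() {
                match inner {
                    Event::Token("IDENT", text) => {
                        names.push(text.clone());
                        break;
                    }
                    Event::Exit("rule") => break,
                    _ => {}
                }
            }
        }
    }
    names
}
```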

The Grammar IR

The parse produces a flat Grammar:

  • name: String — a label used by later phases for file and package naming. The parser leaves this empty; the CLI fills it from the input file’s stem (or from --name when given).

  • tokens: Vec<TokenDef> — every token declaration in source order. Each TokenDef records the name, the body (TokenPattern), the skip flag (? prefix), the is_fragment flag (_ prefix), and a source span.

  • rules: Vec<RuleDef> — every rule declaration in source order. Each RuleDef records the name, the body (Expr), the is_fragment flag, and a source span.

TokenPattern is a regular expression tree over characters: Empty, Literal(String), Class(CharClass), Ref(String), Seq, Alt, Opt, Star, Plus. Expr is the corresponding LL expression tree over tokens and rules: Empty, Token(name), Rule(name), and the same combinators. Two trees, one shape.
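Put together, the IR described above can be sketched in Rust. The real definitions live in src/grammar/ir.rs; here the derives, the boxing choices, the Span stand-in, and the use of a plain String for CharClass are guesses made for illustration.

```rust
// Sketch of the Grammar IR as the text describes it; details are guessed.
#[derive(Debug)]
struct Span { start: usize, end: usize } // stand-in for the real span type

#[derive(Debug, Clone, PartialEq)]
enum TokenPattern {
    Empty,
    Literal(String),
    Class(String), // stand-in for the real CharClass type
    Ref(String),   // resolved in a later pass
    Seq(Vec<TokenPattern>),
    Alt(Vec<TokenPattern>),
    Opt(Box<TokenPattern>),
    Star(Box<TokenPattern>),
    Plus(Box<TokenPattern>),
}

// The same shape over tokens and rules instead of characters.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Empty,
    Token(String),
    Rule(String),
    Seq(Vec<Expr>),
    Alt(Vec<Expr>),
    Opt(Box<Expr>),
    Star(Box<Expr>),
    Plus(Box<Expr>),
}

#[derive(Debug)]
struct TokenDef {
    name: String,
    body: TokenPattern,
    skip: bool,        // '?' prefix
    is_fragment: bool, // '_' prefix
    span: Span,
}

#[derive(Debug)]
struct RuleDef {
    name: String,
    body: Expr,
    is_fragment: bool,
    span: Span,
}

#[derive(Debug)]
struct Grammar {
    name: String, // left empty by the parser; the CLI fills it in
    tokens: Vec<TokenDef>,
    rules: Vec<RuleDef>,
}
```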

Where the distinction matters

The parser decides whether a body is a token pattern or a rule expression from the case of the first letter of the declaration’s name (see The grammar language). It uses different descent functions (read_pattern_* vs. read_*) so that:

  • Character atoms ('a', .., ., !) and string literals are accepted only on the token side. Using one inside a rule body produces a pointed error like “string literal atoms are only valid inside token declarations”.

  • Identifiers in a rule body with an uppercase initial become Expr::Token(name); with a lowercase initial, Expr::Rule(name). Identifiers in a token body are always TokenPattern::Ref(name) and are resolved later.
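The case rule for identifiers in a rule body can be sketched like this, with a pared-down stand-in for the real Expr type and a hypothetical classify_ident helper:

```rust
// Sketch: uppercase initial -> token reference, lowercase -> rule reference.
#[derive(Debug, PartialEq)]
enum Expr {
    Token(String),
    Rule(String),
}

fn classify_ident(name: &str) -> Expr {
    let first = name.chars().next().expect("identifiers are non-empty");
    if first.is_uppercase() {
        Expr::Token(name.to_string())
    } else {
        Expr::Rule(name.to_string())
    }
}
```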

Error collection

parse_grammar returns Result<Grammar, Vec<Error>>. It does not stop at the first problem; instead it accumulates diagnostics and keeps parsing. This produces the characteristic “ten errors at once” experience: a malformed grammar still parses to a mostly-shaped IR, and the user sees every syntactic issue in a single run.

The Reader abstraction handles the book-keeping. It holds the current lookahead event, exposes peek/advance/expect_* helpers, and transparently drops WS and COMMENT tokens so callers never have to think about trivia.
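A minimal Reader sketch over a flat token list; the real one works over parser events, and the names and shapes here are assumptions. The point is the one-token lookahead buffer that is refilled past WS and COMMENT, so callers never observe trivia.

```rust
// Sketch of a trivia-skipping Reader with peek/advance/expect helpers.
struct Reader<'a> {
    tokens: std::slice::Iter<'a, (&'a str, &'a str)>, // (kind, text)
    lookahead: Option<(&'a str, &'a str)>,
}

impl<'a> Reader<'a> {
    fn new(tokens: &'a [(&'a str, &'a str)]) -> Self {
        let mut r = Reader { tokens: tokens.iter(), lookahead: None };
        r.bump(); // prime the lookahead, skipping leading trivia
        r
    }

    /// Pull the next non-trivia token into `lookahead`.
    fn bump(&mut self) {
        self.lookahead = self
            .tokens
            .by_ref()
            .copied()
            .find(|(kind, _)| *kind != "WS" && *kind != "COMMENT");
    }

    fn peek(&self) -> Option<(&'a str, &'a str)> {
        self.lookahead
    }

    fn advance(&mut self) -> Option<(&'a str, &'a str)> {
        let cur = self.lookahead;
        self.bump();
        cur
    }

    /// Consume a token of the expected kind, or report an error.
    fn expect(&mut self, kind: &str) -> Result<&'a str, String> {
        match self.peek() {
            Some((k, text)) if k == kind => {
                self.advance();
                Ok(text)
            }
            other => Err(format!("expected {kind}, found {other:?}")),
        }
    }
}
```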

Post-conditions

After parsing, the Grammar is syntactically well-formed but semantically unchecked. It may still contain:

  • References to undefined tokens or rules.

  • Left-recursive rules.

  • Token reference cycles.

  • Duplicate declarations.

  • Names that collide with runtime sentinels (EOF, ERROR).

The next pass, Pass 2 — Analysis, is what catches those.