Pass 1 — Grammar parsing

Entry point: grammar::parse_grammar (src/grammar/parser.rs). Output: a Grammar value (src/grammar/ir.rs).

Input

The source text of a .parsuna file, as UTF-8 bytes. Nothing else is consulted — there are no includes, no external definitions, no search path.

Bootstrap

The grammar-file parser is itself generated by parsuna. The file src/grammar/parsuna.parsuna describes the syntax of a parsuna grammar; running the generator over it produces src/grammar/generated.rs, a pull parser over the grammar DSL. parser.rs then consumes that pull parser’s events and builds the Grammar IR. In other words, parsuna bootstraps itself.

As a result, the first pass doubles as a worked example of consuming an event stream. parser.rs reads Enter/Exit pairs to recognize rule-shaped blocks, pulls tokens out of them with a small Reader abstraction, and accumulates errors as it goes.
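The Enter/Exit pairing described above can be illustrated with a minimal sketch. The Event type, its payloads, and the blocks_named helper are hypothetical names invented for this example; the events actually produced by generated.rs may be shaped differently.

```rust
/// Hypothetical event type for illustration only; not the generated API.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Enter(&'static str),         // start of a rule-shaped block
    Exit(&'static str),          // end of that block
    Token(&'static str, String), // (kind, text)
}

/// Collects the token texts of every block named `rule`.
/// Blocks nested inside a matched block are folded into the outer one.
pub fn blocks_named(events: &[Event], rule: &str) -> Vec<Vec<String>> {
    let mut out = Vec::new();
    let mut depth = 0usize; // nesting depth while inside a matched block
    let mut current: Option<Vec<String>> = None;
    for ev in events {
        match ev {
            Event::Enter(name) => {
                if current.is_some() {
                    depth += 1;
                } else if *name == rule {
                    current = Some(Vec::new());
                    depth = 1;
                }
            }
            Event::Exit(_) => {
                if current.is_some() {
                    depth -= 1;
                    if depth == 0 {
                        // The matching Exit closes the block we opened.
                        out.push(current.take().unwrap());
                    }
                }
            }
            Event::Token(_, text) => {
                if let Some(buf) = current.as_mut() {
                    buf.push(text.clone());
                }
            }
        }
    }
    out
}
```

Tracking depth rather than comparing Exit names keeps the sketch correct even when a matched block contains nested blocks of its own.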

The Grammar IR

The parse produces a flat Grammar:

  • name: String — a label used by later phases for file and package naming. The parser leaves this empty; the CLI fills it from the input file’s stem (or from --name when given).

  • tokens: IndexMap<String, TokenDef> — every token declaration, keyed by name. IndexMap preserves insertion (= source) order on iteration, which the lexer DFA relies on for tie-breaking and the lowering pass relies on for RuleKind id assignment. Each TokenDef records the name, the body (TokenPattern), the skip flag (from a -> skip action), the is_fragment flag (_-prefix), the optional lexer mode (from an @mode(name) pre-annotation), the resolved mode_actions list (from -> push(mode) / -> pop actions), and a source span.

  • rules: IndexMap<String, RuleDef> — every rule declaration. Each RuleDef records the name, the body (Expr), the is_fragment flag, and a source span.

TokenPattern is a regular expression tree over characters: Empty, Literal(String), Class(CharClass), Ref(String), Seq, Alt, Opt, Star, Plus. Expr is the corresponding LL expression tree over tokens and rules: Empty, Token(name), Rule(name), and the same combinators. Two trees, one shape.
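To make the "two trees, one shape" point concrete, here is a rough sketch of how the two enums might be declared. The variant names come from the text above; the payload shapes (Vec for Seq/Alt, Box for the unary combinators), the CharClass fields, and the expr_refs helper are assumptions made for this example, not the definitions in src/grammar/ir.rs.

```rust
/// Assumed stand-in: the real CharClass is not described in this section.
#[derive(Debug, Clone, PartialEq)]
pub struct CharClass {
    pub ranges: Vec<(char, char)>,
}

/// Regular expression tree over characters (payload shapes are assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum TokenPattern {
    Empty,
    Literal(String),
    Class(CharClass),
    Ref(String),
    Seq(Vec<TokenPattern>),
    Alt(Vec<TokenPattern>),
    Opt(Box<TokenPattern>),
    Star(Box<TokenPattern>),
    Plus(Box<TokenPattern>),
}

/// LL expression tree over tokens and rules: same combinators, different leaves.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Empty,
    Token(String),
    Rule(String),
    Seq(Vec<Expr>),
    Alt(Vec<Expr>),
    Opt(Box<Expr>),
    Star(Box<Expr>),
    Plus(Box<Expr>),
}

/// Hypothetical helper: appends every token or rule name referenced by `e`,
/// in left-to-right order.
pub fn expr_refs(e: &Expr, out: &mut Vec<String>) {
    match e {
        Expr::Empty => {}
        Expr::Token(n) | Expr::Rule(n) => out.push(n.clone()),
        Expr::Seq(items) | Expr::Alt(items) => {
            for item in items {
                expr_refs(item, out);
            }
        }
        Expr::Opt(inner) | Expr::Star(inner) | Expr::Plus(inner) => expr_refs(inner, out),
    }
}
```

Because the combinators line up one to one, a traversal written for one tree usually ports to the other by swapping only the leaf cases.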

Where the distinction matters

The parser decides whether a body is a token pattern or a rule expression from the case of the first letter of the declaration’s name (see The grammar language). It uses different descent functions (read_pattern_* vs. read_*) so that:

  • Character atoms ('a', .., ., !) and string literals are accepted only on the token side. Using one inside a rule body produces a pointed error like “string literal atoms are only valid inside token declarations”.

  • Identifiers in a rule body with an uppercase initial become Expr::Token(name); with a lowercase initial, Expr::Rule(name). Identifiers in a token body are always TokenPattern::Ref(name) and are resolved later.
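The case-based split can be sketched in a few lines. DeclKind, decl_kind, and is_fragment are hypothetical names, and skipping leading underscores before inspecting the first letter is an assumption about how fragment names such as _NAME are classified; the actual logic in parser.rs may differ.

```rust
#[derive(Debug, PartialEq)]
pub enum DeclKind {
    Token,
    Rule,
}

/// Assumption: a leading underscore marks a fragment.
pub fn is_fragment(name: &str) -> bool {
    name.starts_with('_')
}

/// Classifies a name by the case of its first non-underscore character.
/// Returns None when there is no cased letter to decide on.
pub fn decl_kind(name: &str) -> Option<DeclKind> {
    let first = name.chars().find(|c| *c != '_')?;
    if first.is_uppercase() {
        Some(DeclKind::Token)
    } else if first.is_lowercase() {
        Some(DeclKind::Rule)
    } else {
        None
    }
}
```

The same test would serve both for picking the descent function of a declaration and for turning an identifier inside a rule body into a token or rule reference.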

Actions and pre-annotations

A declaration may carry a trailing -> action[, action...] block and a leading @kind(arg) pre-annotation. Both are parsed here and lowered into TokenDef fields:

  • -> skip sets skip = true. Rejected on rules and on fragments (_NAME).

  • -> push(mode) / -> pop append a ModeAction::Push / ModeAction::Pop to mode_actions. Mode-stack actions are kept in source order so combinations like -> pop, push(b) (swap top) and -> push(a), push(b) (push two) round-trip cleanly. Combining skip with any mode action is rejected.

  • @mode(name) before a token declaration sets mode = Some(name). It applies to the very next declaration only: it is a per-token attribute, not a scope, so @mode(tag) must be repeated once for each token bound to that mode. Rejected on rules.

Unknown action names (anything other than skip / push / pop) and unknown pre-annotation kinds (anything other than mode) are recorded as errors and parsing continues.
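The action rules above can be sketched as a small lowering function. Only ModeAction::Push and ModeAction::Pop are named in the text; the Action enum, the Lowered struct, the lower_actions signature, and the error wording are all hypothetical. Whether push/pop are themselves rejected on rules is not stated here, so the sketch leaves them unchecked.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ModeAction {
    Push(String),
    Pop,
}

/// Hypothetical parsed form of one entry in a `-> action, action` block.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Skip,
    Push(String),
    Pop,
    Unknown(String),
}

/// Hypothetical subset of the TokenDef fields that actions feed into.
#[derive(Debug, Default, PartialEq)]
pub struct Lowered {
    pub skip: bool,
    pub mode_actions: Vec<ModeAction>,
}

/// Lowers actions in source order, recording problems instead of stopping.
pub fn lower_actions(
    actions: &[Action],
    is_token: bool,
    is_fragment: bool,
    errors: &mut Vec<String>,
) -> Lowered {
    let mut out = Lowered::default();
    for action in actions {
        match action {
            Action::Skip => {
                if !is_token || is_fragment {
                    errors.push("`skip` is not allowed on rules or fragments".to_string());
                } else {
                    out.skip = true;
                }
            }
            // Source order is preserved so `pop, push(b)` stays a swap.
            Action::Push(mode) => out.mode_actions.push(ModeAction::Push(mode.clone())),
            Action::Pop => out.mode_actions.push(ModeAction::Pop),
            Action::Unknown(name) => errors.push(format!("unknown action `{name}`")),
        }
    }
    if out.skip && !out.mode_actions.is_empty() {
        errors.push("`skip` cannot be combined with mode actions".to_string());
    }
    out
}
```

Every error path pushes to the shared list and carries on, which matches the "record and continue" behavior described for unknown action names.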

Error collection

parse_grammar returns Result<Grammar, Vec<Error>>. It does not stop at the first problem; instead it accumulates diagnostics and keeps parsing. The result is the characteristic “ten errors at once” experience: a malformed grammar is still parsed into a mostly complete IR, and the user sees every syntactic issue in a single run.
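The accumulate-then-decide pattern behind that return type looks roughly like the following. This is a toy, not parse_grammar: the `name = body;` line format is invented for the example and is not parsuna syntax, and the Error fields and parse_decls name are hypothetical.

```rust
/// Hypothetical diagnostic; the real Error type is not described here.
#[derive(Debug, PartialEq)]
pub struct Error {
    pub line: usize,
    pub message: String,
}

/// Toy parser: every non-empty line must look like `name = body;`.
/// All bad lines are reported, not just the first one.
pub fn parse_decls(src: &str) -> Result<Vec<(String, String)>, Vec<Error>> {
    let mut decls = Vec::new();
    let mut errors = Vec::new();
    for (idx, raw) in src.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        let Some(stmt) = line.strip_suffix(';') else {
            errors.push(Error { line: idx + 1, message: "missing `;`".to_string() });
            continue; // keep going so later lines are still checked
        };
        match stmt.split_once('=') {
            Some((name, body)) if !name.trim().is_empty() => {
                decls.push((name.trim().to_string(), body.trim().to_string()));
            }
            _ => errors.push(Error {
                line: idx + 1,
                message: "expected `name = body`".to_string(),
            }),
        }
    }
    // Success only if nothing at all was reported.
    if errors.is_empty() { Ok(decls) } else { Err(errors) }
}
```

The decision between Ok and Err is made once, at the end, which is what lets a single run surface every diagnostic.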

The Reader abstraction handles the book-keeping. It holds the current lookahead event, exposes peek/advance/expect_* helpers, and transparently drops WS and COMMENT tokens so callers never have to think about trivia. Recovery Garbage events from the bootstrap parser are silently swallowed by the Reader. The underlying runtime emits these when it skips past unexpected input (see The event model), and the per-token errors that accompany them already say what went wrong, so re-reporting them would just be noise.
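A one-event-lookahead reader with trivia filtering might look like the sketch below. The peek/advance/expect_* names and the WS, COMMENT, and Garbage kinds come from the text; the Event layout, the generic iterator parameter, and the expect_token signature are assumptions made for illustration and may not match parser.rs.

```rust
/// Hypothetical event shape; the generated parser's events may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Enter(&'static str),
    Exit(&'static str),
    Token { kind: &'static str, text: String },
    Garbage(String),
}

pub struct Reader<I: Iterator<Item = Event>> {
    inner: I,
    lookahead: Option<Event>,
}

impl<I: Iterator<Item = Event>> Reader<I> {
    pub fn new(inner: I) -> Self {
        let mut reader = Reader { inner, lookahead: None };
        reader.lookahead = reader.next_significant();
        reader
    }

    /// Pulls events until one is neither trivia nor recovery garbage.
    fn next_significant(&mut self) -> Option<Event> {
        loop {
            match self.inner.next()? {
                Event::Garbage(_) => continue,
                Event::Token { kind, .. } if kind == "WS" || kind == "COMMENT" => continue,
                ev => return Some(ev),
            }
        }
    }

    pub fn peek(&self) -> Option<&Event> {
        self.lookahead.as_ref()
    }

    /// Returns the current lookahead and refills it.
    pub fn advance(&mut self) -> Option<Event> {
        let next = self.next_significant();
        std::mem::replace(&mut self.lookahead, next)
    }

    /// Consumes a token of kind `want`, or reports what was found instead.
    pub fn expect_token(&mut self, want: &str) -> Result<String, String> {
        match self.peek() {
            Some(Event::Token { kind, text }) if *kind == want => {
                let text = text.clone();
                self.advance();
                Ok(text)
            }
            other => Err(format!("expected {want}, found {other:?}")),
        }
    }
}
```

Filtering inside next_significant means peek never exposes trivia, so the descent functions can be written as if whitespace and comments did not exist.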

Post-conditions

After parsing, the Grammar is syntactically well-formed but semantically unchecked. It may still contain:

  • References to undefined tokens or rules.

  • Left-recursive rules.

  • Token reference cycles.

  • Names that collide with reserved runtime identifiers (EOF).

  • Mode actions that refer to mode names no token actually declares.

Duplicate declarations are caught here, at parse time, because the IR’s name-keyed IndexMap would silently dedupe them otherwise. Everything else listed above is left for the next pass, Pass 2 — Analysis, to catch.
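The parse-time duplicate check can be sketched with a small order-preserving table. DeclTable is a std-only stand-in for the IndexMap used by the real IR, and keeping the first definition when a duplicate appears is an assumption; the text does not say which definition survives.

```rust
use std::collections::HashSet;

/// Order-preserving name table; a std-only stand-in for IndexMap<String, T>.
pub struct DeclTable<T> {
    entries: Vec<(String, T)>,
    seen: HashSet<String>,
}

impl<T> DeclTable<T> {
    pub fn new() -> Self {
        DeclTable { entries: Vec::new(), seen: HashSet::new() }
    }

    /// Adds a declaration. A repeated name is reported and dropped
    /// (assumed policy: the first definition wins).
    pub fn declare(&mut self, name: &str, def: T, errors: &mut Vec<String>) {
        // HashSet::insert returns false when the name was already present.
        if !self.seen.insert(name.to_string()) {
            errors.push(format!("duplicate declaration of `{name}`"));
            return;
        }
        self.entries.push((name.to_string(), def));
    }

    /// Names in source order, as later passes would iterate them.
    pub fn names(&self) -> Vec<&str> {
        self.entries.iter().map(|(n, _)| n.as_str()).collect()
    }
}
```

Checking before insertion is what turns a silent overwrite into a diagnostic, while the Vec keeps the source order that later passes depend on.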