Pass 1 — Grammar parsing
========================

Entry point: ``grammar::parse_grammar`` (``src/grammar/parser.rs``). Output: a
``Grammar`` value (``src/grammar/ir.rs``).

Input
-----

The source text of a ``.parsuna`` file, as UTF-8 bytes. Nothing else is
consulted — there are no includes, no external definitions, no search path.

Bootstrap
---------

The grammar-file parser is itself generated by parsuna. The file
``src/grammar/parsuna.parsuna`` describes the syntax of a parsuna grammar;
running the generator over it produces ``src/grammar/generated.rs``, a
pull-parser over the grammar DSL. ``parser.rs`` then consumes that pull
parser's events and builds the ``Grammar`` IR. Parsuna bootstraps itself.

The implication is that the first pass is a worked example of consuming an
event stream. ``parser.rs`` reads ``Enter``/``Exit`` pairs to recognise
rule-shaped blocks, pulls tokens out of them with a small ``Reader``
abstraction, and accumulates errors as it goes.

The Grammar IR
--------------

The parse produces a flat ``Grammar``:

* ``name: String`` — a label used by later phases for file and package
  naming. The parser leaves this empty; the CLI fills it from the input
  file's stem (or from ``--name`` when given).
* ``tokens: Vec<TokenDef>`` — every token declaration in source order. Each
  ``TokenDef`` records the name, the body (``TokenPattern``), the ``skip``
  flag (``?`` prefix), the ``is_fragment`` flag (``_`` prefix), and a source
  span.
* ``rules: Vec<RuleDef>`` — every rule declaration in source order. Each
  ``RuleDef`` records the name, the body (``Expr``), the ``is_fragment``
  flag, and a source span.

``TokenPattern`` is a regular-expression tree over characters: ``Empty``,
``Literal(String)``, ``Class(CharClass)``, ``Ref(String)``, ``Seq``, ``Alt``,
``Opt``, ``Star``, ``Plus``. ``Expr`` is the corresponding LL expression tree
over tokens and rules: ``Empty``, ``Token(name)``, ``Rule(name)``, and the
same combinators. Two trees, one shape.
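The two mirrored trees can be sketched as plain Rust enums. This is an
illustrative sketch, not parsuna's actual definitions: ``CharClass`` is
simplified to a list of inclusive character ranges, and the exact arities of
the combinators are assumptions.

```rust
// Sketch of the two mirrored IR trees. Variant names follow the text above;
// the payload shapes are guesses for illustration.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum TokenPattern {
    Empty,
    Literal(String),
    Class(Vec<(char, char)>), // simplified stand-in for CharClass
    Ref(String),
    Seq(Vec<TokenPattern>),
    Alt(Vec<TokenPattern>),
    Opt(Box<TokenPattern>),
    Star(Box<TokenPattern>),
    Plus(Box<TokenPattern>),
}

#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Expr {
    Empty,
    Token(String), // uppercase-initial identifier in a rule body
    Rule(String),  // lowercase-initial identifier in a rule body
    Seq(Vec<Expr>),
    Alt(Vec<Expr>),
    Opt(Box<Expr>),
    Star(Box<Expr>),
    Plus(Box<Expr>),
}

fn main() {
    // A token body like `[a-z] [a-z0-9]*` on the character side...
    let ident = TokenPattern::Seq(vec![
        TokenPattern::Class(vec![('a', 'z')]),
        TokenPattern::Star(Box::new(TokenPattern::Class(vec![
            ('a', 'z'),
            ('0', '9'),
        ]))),
    ]);
    // ...and a rule body like `item (COMMA item)*` on the token side:
    // same combinators, different atoms.
    let list = Expr::Seq(vec![
        Expr::Rule("item".into()),
        Expr::Star(Box::new(Expr::Seq(vec![
            Expr::Token("COMMA".into()),
            Expr::Rule("item".into()),
        ]))),
    ]);
    println!("{ident:?}\n{list:?}");
}
```

The point of the shared shape is that later passes can walk either tree with
the same traversal logic, switching only on the atom variants.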
Where the distinction matters
-----------------------------

The parser decides whether a body is a token pattern or a rule expression
from the case of the first letter of the declaration's name (see
:doc:`../grammar_language`). It uses different descent functions
(``read_pattern_*`` vs. ``read_*``) so that:

* Character atoms (``'a'``, ``..``, ``.``, ``!``) and string literals are
  accepted only on the token side. Using one inside a rule body produces a
  pointed error like "string literal atoms are only valid inside token
  declarations".
* Identifiers in a rule body with an uppercase initial become
  ``Expr::Token(name)``; with a lowercase initial, ``Expr::Rule(name)``.
  Identifiers in a token body are always ``TokenPattern::Ref(name)`` and are
  resolved later.

Error collection
----------------

``parse_grammar`` returns ``Result<Grammar, Vec<Diagnostic>>``. It does not
stop at the first problem; instead it accumulates diagnostics and keeps
parsing. This produces the characteristic "ten errors at once" experience: a
malformed grammar still parses to a mostly-shaped IR, and the user sees every
syntactic issue in a single run.

The ``Reader`` abstraction handles the book-keeping. It holds the current
lookahead event, exposes ``peek``/``advance``/``expect_*`` helpers, and
transparently drops ``WS`` and ``COMMENT`` tokens so callers never have to
think about trivia.

Post-conditions
---------------

After parsing, the ``Grammar`` is **syntactically** well-formed but
**semantically** unchecked. It may still contain:

* References to undefined tokens or rules.
* Left-recursive rules.
* Token reference cycles.
* Duplicate declarations.
* Names that collide with runtime sentinels (``EOF``, ``ERROR``).

The next pass, :doc:`analyze`, is what catches those.
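The accumulate-and-keep-going error style described under *Error collection*
can be sketched in miniature. ``Diagnostic`` and the toy validity check here
are illustrative stand-ins, not parsuna's actual types.

```rust
// Minimal sketch of error accumulation: record each problem, keep parsing,
// and only fail at the end if anything was recorded.
#[derive(Debug)]
struct Diagnostic {
    msg: String,
}

fn parse_all(decls: &[&str]) -> Result<Vec<String>, Vec<Diagnostic>> {
    let mut out = Vec::new();
    let mut errs = Vec::new();
    for d in decls {
        // A "declaration" is valid here iff it starts with a letter
        // (a stand-in for real syntactic checks).
        if d.chars().next().map_or(false, |c| c.is_alphabetic()) {
            out.push(d.to_string());
        } else {
            // Record the problem and continue instead of bailing out,
            // so one run surfaces every issue.
            errs.push(Diagnostic {
                msg: format!("bad declaration: {d:?}"),
            });
        }
    }
    if errs.is_empty() { Ok(out) } else { Err(errs) }
}

fn main() {
    match parse_all(&["expr", "1bad", "term", "?also_bad"]) {
        Ok(names) => println!("ok: {names:?}"),
        Err(errs) => {
            for e in &errs {
                println!("error: {}", e.msg);
            }
        }
    }
}
```

The trade-off is that everything after an error runs against a
partially-shaped result, which is why the later analysis pass still has to
tolerate holes.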
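The trivia-dropping lookahead that the ``Reader`` abstraction provides can be
sketched as follows. The token kinds and one-slot lookahead are assumptions
for illustration; parsuna's ``Reader`` works over generated parser events,
not a bare token iterator.

```rust
// Sketch of a one-token-lookahead reader that transparently skips trivia,
// so callers' peek/advance never see WS or COMMENT.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Ident,
    Ws,
    Comment,
    Eof,
}

struct Reader<I: Iterator<Item = Tok>> {
    iter: I,
    lookahead: Option<Tok>,
}

impl<I: Iterator<Item = Tok>> Reader<I> {
    fn new(iter: I) -> Self {
        let mut r = Reader { iter, lookahead: None };
        r.fill();
        r
    }

    // Pull the next non-trivia token into the lookahead slot.
    fn fill(&mut self) {
        self.lookahead = self
            .iter
            .by_ref()
            .find(|t| !matches!(t, Tok::Ws | Tok::Comment));
    }

    // Exhausted input reads as a synthetic EOF token.
    fn peek(&self) -> Tok {
        self.lookahead.unwrap_or(Tok::Eof)
    }

    fn advance(&mut self) -> Tok {
        let t = self.peek();
        self.fill();
        t
    }
}

fn main() {
    let toks = [Tok::Ws, Tok::Ident, Tok::Comment, Tok::Ident, Tok::Ws];
    let mut r = Reader::new(toks.into_iter());
    assert_eq!(r.advance(), Tok::Ident); // leading WS dropped
    assert_eq!(r.advance(), Tok::Ident); // COMMENT dropped in between
    assert_eq!(r.peek(), Tok::Eof);      // trailing WS dropped
    println!("trivia skipped");
}
```

Centralising the skip in ``fill`` is what lets every ``expect_*`` helper be
written as if the token stream contained no trivia at all.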