Pass 2 — Analysis ================= Entry point: ``analysis::analyze`` (``src/analysis/mod.rs``). Output: ``AnalysisOutcome``, containing an optional ``AnalyzedGrammar`` and a flat bag of diagnostics produced along the way. Analysis runs in three sub-passes: structural validation, post-validation lints, and iterative FIRST/FOLLOW computation. The first two only emit diagnostics; the third additionally produces the FIRST/FOLLOW tables that later passes consume. Diagnostics ----------- Every check pushes into a single ``Vec``. A ``Diagnostic`` carries a severity (``Error`` or ``Warning``), a message, and a source span. Errors stop the build; warnings are surfaced but the build still succeeds unless the caller passes ``--warnings=errors``, which promotes every warning at print time. Parser-level errors (``parsuna_rt::Error`` values produced by the generated grammar parser) are converted into error-severity diagnostics at the boundary; analysis itself only ever pushes ``Diagnostic`` values directly. Sub-pass 2a — Structural validation ----------------------------------- File: ``src/analysis/validate.rs``. Before any set computation, the grammar is checked for structural problems. These checks do not need FIRST/FOLLOW; catching them first means the later, more expensive fixed-point computations never run on a malformed input. Checks, in order: * **Reserved token names.** ``EOF`` and ``ERROR`` (case-insensitive) are rejected — they collide with runtime sentinels. * **Duplicates.** Token names and rule names are checked for duplicates independently. * **Undefined references.** Every ``Expr::Token(n)`` must name a declared token; every ``Expr::Rule(n)`` must name a declared rule; every ``TokenPattern::Ref(n)`` must name a declared token. Rules that reference fragment tokens by name are rejected — fragments are only usable inside other token patterns. * **Token reference cycles.** ``A`` referencing ``B`` referencing ``A`` would loop forever at lowering time. A DFS over the token-ref graph reports the cycle path and aborts. * **Left recursion.** For each rule, compute the set of rules that can appear at its first-position (skipping rules whose body is guaranteed nullable — ``Empty``, ``Opt``, ``Star``). If the rule can reach itself, it is left-recursive. The check is purely structural; an accurate FIRST-based check comes essentially for free in sub-pass 2c but reporting here gives a better error. * **Non-empty public surface.** A grammar with no non-fragment rules produces an empty public API; it is rejected. If any error-severity diagnostics land in the bag here, analysis returns immediately without proceeding to lints or FIRST/FOLLOW — the later passes assume references resolve and there are no cycles, so running them on a structurally broken grammar would be unsafe. Sub-pass 2b — Post-validation lints ----------------------------------- Files: ``src/analysis/lints.rs`` and ``src/analysis/shadow.rs``. Once validation has cleared, four additional checks look for grammars that *would* compile but shouldn't. Each picks its own severity. * **Empty-match tokens.** A token whose pattern can match the empty string (``T = 'a'?;``, ``T = 'x'*;``, ``T = "";``) would let the lexer accept at length zero — either an infinite loop or a stream of zero-length tokens. Detected by a recursive nullability walk over the resolved pattern; reported as **error**. * **Unused fragments.** Fragment tokens (``_HEX``) and fragment rules (``_postfix``) that no other declaration references are dead code. Reachability is computed by BFS from the live seeds — non-fragment tokens, non-fragment rules, and skip tokens — over the reference graph. Reported as **warning**, since the grammar still produces a working parser. * **Non-productive rules.** A rule with no derivation that terminates in tokens (``r = R r+;``, mutually-recursive cycles with no terminal escape) would never let the parser accept any input. Detected by a least-fixed-point over rules that reduce to terminals or to other productive rules; reported as **error**. * **Literal-token shadowing.** A ``Literal``-pattern token (e.g. ``IF = "if";``) declared *after* a more general token whose pattern also accepts that literal (e.g. ``IDENT = ('a'..'z')+;``) is unreachable: the lexer breaks ties by smallest token id (= declaration order), so the earlier token always wins. Detected by running a small NFA-style pattern matcher over each candidate shadower; reported as **error**. The lint pass and the shadow pass each push into the same diagnostic bag. After both have run, if any error-severity diagnostics are present analysis returns; warnings alone do not stop the build. Sub-pass 2c — Iterative FIRST/FOLLOW at the smallest viable k ------------------------------------------------------------- File: ``src/analysis/first_follow.rs``. The entry point used by ``analyze`` is ``solve_lookahead``; FIRST/FOLLOW computation, conflict detection, and the iterative deepening loop all live together in this module. The driver algorithm is **iterative deepening on k**: .. parsed-literal:: k = 0 loop: k += 1 (nullable, first) = compute_first(g, k) follow_k = compute_follow_k(g, first, nullable, k) conflicts = detect_conflicts(g, nullable, first, follow_k, k) if conflicts empty: commit (nullable, first, k); break if conflict count has not decreased for STUCK_LIMIT rounds: give up — grammar is not LL(k) for any finite k ``STUCK_LIMIT`` is a small constant (three at the time of writing) that protects against genuinely ambiguous grammars without giving up too early on ones that slowly converge. ``compute_first`` is a standard least-fixed-point computation lifted to FIRST(k): instead of a set of single tokens, each rule's FIRST is a set of length-≤-k **lookahead sequences**. ``nullable`` is the classic Boolean fixed point ("does the rule derive ε?"). The ``concat_k`` helper, used pervasively, concatenates two FIRST-sets with truncation to length ``k``. ``compute_follow_k`` is the FIRST(k)-style follow: for each rule ``A``, the set of length-≤-k sequences that can legally appear after an occurrence of ``A``. End-of-input is represented by the placeholder token ``$EOF``. Conflict detection ------------------ For every ``Alt`` node, ``detect_conflicts`` computes each arm's **prediction set** — its FIRST extended by the tail that follows the alternation — and then checks pairs of arms for prefix overlap. If arm *i* predicts ``[a b]`` and arm *j* predicts ``[a]``, an input starting with ``a c`` is fine for *j* but ambiguous for any ``a b …``, so the pair is flagged. The walker is structural: for a ``Seq(xs)`` node it folds the tail right-to-left so each child is checked with the correct forward- looking context; for ``Star(x)`` / ``Plus(x)`` the tail concatenates ``FIRST(x)*`` so an iteration's lookahead includes the possibility of another iteration; for ``Opt(x)`` the tail flows through unchanged. When conflicts remain at the final ``k`` and the number of conflict sequences has not dropped for several rounds, the grammar is reported as "not LL(k) for any finite k", each remaining conflict listing the rule, the arm indices, and up to three sample ambiguous prefixes. The AnalyzedGrammar ------------------- When analysis succeeds it produces: * ``grammar`` — the original ``Grammar``, verbatim. Downstream passes reach through it for the raw token and rule definitions. * ``first: BTreeMap`` — FIRST(k) per rule, deduplicated. Sequences are stored as ``Vec`` of token names (not ids — those come later). * ``follow: BTreeMap`` — classic single-token FOLLOW per rule. Used later as a recovery synchronization base. * ``follow_k: BTreeMap`` — FIRST(k)-style follow. Used to compute accurate prediction sets for nullable arms at lowering time. * ``nullable: BTreeMap`` — per-rule nullability. * ``k: usize`` — the smallest ``k`` for which no alternative is ambiguous. This is the LL(k) of the whole grammar; the value flows through to the runtime as a compile-time constant that sizes the parser's lookahead ring. Why both ``follow`` and ``follow_k`` ------------------------------------ They serve different purposes: * ``follow`` is used by lowering to compute **SYNC sets**, the single-token sets the parser recovers to on an ``Expect`` mismatch. Recovery needs to stop at the earliest familiar token, which means a one-token horizon. * ``follow_k`` is used to extend per-arm prediction sets at dispatch sites, so nullable arms can still be chosen when the next ``k`` tokens match the surrounding context. Output ------ ``AnalysisOutcome { grammar: Option, diagnostics: Vec }``. The grammar is ``Some`` iff no diagnostic in ``diagnostics`` has ``Severity::Error``; warnings can accompany a successful outcome. ``AnalysisOutcome::has_errors`` is the standard discriminator. On success the grammar is handed to :doc:`lower`. Either way, the caller iterates ``diagnostics`` for printing — under ``--warnings=errors`` warnings are re-rendered with the ``error`` label and the build fails even if ``grammar`` was produced.