Pass 2 — Analysis¶
Entry point: analysis::analyze (src/analysis/mod.rs).
Output: AnalysisOutcome, containing an optional AnalyzedGrammar
and a flat bag of diagnostics produced along the way.
Analysis runs in three sub-passes: structural validation, post-validation lints, and iterative FIRST/FOLLOW computation. The first two only emit diagnostics; the third additionally produces the FIRST/FOLLOW tables that later passes consume.
Diagnostics¶
Every check pushes into a single Vec<Diagnostic>. A
Diagnostic carries a severity (Error or Warning), a
message, and a source span. Errors stop the build; warnings are
surfaced but the build still succeeds unless the caller passes
--warnings=errors, which promotes every warning at print time.
Parser-level errors (parsuna_rt::Error values produced by the
generated grammar parser) are converted into error-severity diagnostics
at the boundary; analysis itself only ever pushes Diagnostic values
directly.
Sub-pass 2a — Structural validation¶
File: src/analysis/validate.rs.
Before any set computation, the grammar is checked for structural problems. These checks do not need FIRST/FOLLOW; catching them first means the later, more expensive fixed-point computations never run on a malformed input.
Checks, in order:
Reserved token names.
EOFandERROR(case-insensitive) are rejected — they collide with runtime sentinels.Duplicates. Token names and rule names are checked for duplicates independently.
Undefined references. Every
Expr::Token(n)must name a declared token; everyExpr::Rule(n)must name a declared rule; everyTokenPattern::Ref(n)must name a declared token. Rules that reference fragment tokens by name are rejected — fragments are only usable inside other token patterns.Token reference cycles.
AreferencingBreferencingAwould loop forever at lowering time. A DFS over the token-ref graph reports the cycle path and aborts.Left recursion. For each rule, compute the set of rules that can appear at its first-position (skipping rules whose body is guaranteed nullable —
Empty,Opt,Star). If the rule can reach itself, it is left-recursive. The check is purely structural; an accurate FIRST-based check comes essentially for free in sub-pass 2c but reporting here gives a better error.Non-empty public surface. A grammar with no non-fragment rules produces an empty public API; it is rejected.
If any error-severity diagnostics land in the bag here, analysis returns immediately without proceeding to lints or FIRST/FOLLOW — the later passes assume references resolve and there are no cycles, so running them on a structurally broken grammar would be unsafe.
Sub-pass 2b — Post-validation lints¶
Files: src/analysis/lints.rs and src/analysis/shadow.rs.
Once validation has cleared, four additional checks look for grammars that would compile but shouldn’t. Each picks its own severity.
Empty-match tokens. A token whose pattern can match the empty string (
T = 'a'?;,T = 'x'*;,T = "";) would let the lexer accept at length zero — either an infinite loop or a stream of zero-length tokens. Detected by a recursive nullability walk over the resolved pattern; reported as error.Unused fragments. Fragment tokens (
_HEX) and fragment rules (_postfix) that no other declaration references are dead code. Reachability is computed by BFS from the live seeds — non-fragment tokens, non-fragment rules, and skip tokens — over the reference graph. Reported as warning, since the grammar still produces a working parser.Non-productive rules. A rule with no derivation that terminates in tokens (
r = R r+;, mutually-recursive cycles with no terminal escape) would never let the parser accept any input. Detected by a least-fixed-point over rules that reduce to terminals or to other productive rules; reported as error.Literal-token shadowing. A
Literal-pattern token (e.g.IF = "if";) declared after a more general token whose pattern also accepts that literal (e.g.IDENT = ('a'..'z')+;) is unreachable: the lexer breaks ties by smallest token id (= declaration order), so the earlier token always wins. Detected by running a small NFA-style pattern matcher over each candidate shadower; reported as error.
The lint pass and the shadow pass each push into the same diagnostic bag. After both have run, if any error-severity diagnostics are present analysis returns; warnings alone do not stop the build.
Sub-pass 2c — Iterative FIRST/FOLLOW at the smallest viable k¶
File: src/analysis/first_follow.rs. The entry point used by
analyze is solve_lookahead; FIRST/FOLLOW computation,
conflict detection, and the iterative deepening loop all live
together in this module.
The driver algorithm is iterative deepening on k:
k = 0
loop:
k += 1
(nullable, first) = compute_first(g, k)
follow_k = compute_follow_k(g, first, nullable, k)
conflicts = detect_conflicts(g, nullable, first, follow_k, k)
if conflicts empty:
commit (nullable, first, k); break
if conflict count has not decreased for STUCK_LIMIT rounds:
give up — grammar is not LL(k) for any finite k
STUCK_LIMIT is a small constant (three at the time of writing)
that protects against genuinely ambiguous grammars without giving up
too early on ones that slowly converge.
compute_first is a standard least-fixed-point computation lifted
to FIRST(k): instead of a set of single tokens, each rule’s FIRST is
a set of length-≤-k lookahead sequences. nullable is the
classic Boolean fixed point (“does the rule derive ε?”). The
concat_k helper, used pervasively, concatenates two FIRST-sets
with truncation to length k.
compute_follow_k is the FIRST(k)-style follow: for each rule
A, the set of length-≤-k sequences that can legally appear after
an occurrence of A. End-of-input is represented by the placeholder
token $EOF.
Conflict detection¶
For every Alt node, detect_conflicts computes each arm’s
prediction set — its FIRST extended by the tail that follows the
alternation — and then checks pairs of arms for prefix overlap. If
arm i predicts [a b] and arm j predicts [a], an input
starting with a c is fine for j but ambiguous for any
a b …, so the pair is flagged.
The walker is structural: for a Seq(xs) node it folds the tail
right-to-left so each child is checked with the correct forward-
looking context; for Star(x) / Plus(x) the tail concatenates
FIRST(x)* so an iteration’s lookahead includes the possibility
of another iteration; for Opt(x) the tail flows through
unchanged.
When conflicts remain at the final k and the number of conflict
sequences has not dropped for several rounds, the grammar is
reported as “not LL(k) for any finite k”, each remaining conflict
listing the rule, the arm indices, and up to three sample ambiguous
prefixes.
The AnalyzedGrammar¶
When analysis succeeds it produces:
grammar— the originalGrammar, verbatim. Downstream passes reach through it for the raw token and rule definitions.first: BTreeMap<String, FirstSet>— FIRST(k) per rule, deduplicated. Sequences are stored asVec<String>of token names (not ids — those come later).follow: BTreeMap<String, FollowSet>— classic single-token FOLLOW per rule. Used later as a recovery synchronization base.follow_k: BTreeMap<String, FirstSet>— FIRST(k)-style follow. Used to compute accurate prediction sets for nullable arms at lowering time.nullable: BTreeMap<String, bool>— per-rule nullability.k: usize— the smallestkfor which no alternative is ambiguous. This is the LL(k) of the whole grammar; the value flows through to the runtime as a compile-time constant that sizes the parser’s lookahead ring.
Why both follow and follow_k¶
They serve different purposes:
followis used by lowering to compute SYNC sets, the single-token sets the parser recovers to on anExpectmismatch. Recovery needs to stop at the earliest familiar token, which means a one-token horizon.follow_kis used to extend per-arm prediction sets at dispatch sites, so nullable arms can still be chosen when the nextktokens match the surrounding context.
Output¶
AnalysisOutcome { grammar: Option<AnalyzedGrammar>, diagnostics: Vec<Diagnostic> }.
The grammar is Some iff no diagnostic in diagnostics has
Severity::Error; warnings can accompany a successful outcome.
AnalysisOutcome::has_errors is the standard discriminator.
On success the grammar is handed to Pass 3 — Lowering. Either way, the
caller iterates diagnostics for printing — under
--warnings=errors warnings are re-rendered with the error
label and the build fails even if grammar was produced.