The grammar language
====================

A parsuna grammar is a sequence of **declarations** in a ``.parsuna`` file.
Each declaration defines either a **token** (matched by a regular pattern
over characters) or a **rule** (an LL expression over tokens and other
rules). Whitespace and ``//`` line comments between declarations are
ignored.

Declarations
------------

Every declaration has the form::

   [?] name = body ;

Whether the declaration is a token or a rule is determined by the case of
the first letter of ``name`` (skipping a leading ``_``):

* An **uppercase** initial makes it a token: ``IDENT``, ``STRING``,
  ``_HEX_DIGIT``.
* A **lowercase** initial makes it a rule: ``expr``, ``statement``,
  ``_parenthesized``.

Two optional prefixes modify the meaning:

* ``?`` marks a **skip token**. The lexer still matches it, but the parser
  drops it from the structural event stream. Skip tokens are still
  delivered to consumers as events; they just appear outside any
  ``Enter``/``Exit`` scope, interleaved with structure in source order.
  Only tokens can be skip tokens; ``?`` on a rule is an error.
* ``_`` marks a **fragment**. Fragment tokens can be referenced from other
  token bodies but are not themselves produced at runtime; they are inlined
  into their callers before the lexer DFA is built. Fragment rules are
  inlined the same way: a fragment rule emits no ``Enter``/``Exit`` event
  and is not part of the public parser API.

The two markers cannot be combined: although a combination would only be
meaningful on a token, both ``?_`` and ``_?`` are rejected.

Token patterns
--------------

A token body is a regular expression over characters. The atoms are:

* ``"abc"`` — a string literal, matches exactly those bytes.
* ``'a'`` — a character literal, matches one codepoint.
* ``'a'..'z'`` — a character range, matches any codepoint in the inclusive
  range.
* ``.`` — matches any codepoint.
* ``!x`` — the negation of character atom ``x`` (or of a parenthesized
  list of atoms separated by ``|``).
  For example, ``!('"' | '\n')`` matches any codepoint that is neither
  ``"`` nor a newline.
* ``NAME`` — a reference to another token (usually a fragment).

Operators, from tightest to loosest binding:

* ``x?`` — zero or one ``x``.
* ``x*`` — zero or more ``x``.
* ``x+`` — one or more ``x``.
* Juxtaposition ``x y`` — concatenation.
* ``x | y`` — alternation.

Parentheses group. Escapes inside string and character literals follow a
small fixed set: ``\n``, ``\r``, ``\t``, ``\0``, ``\\``, ``\'``, ``\"``,
and ``\u{HHHH}`` for arbitrary Unicode codepoints (1–6 hex digits).

Example: declaring identifier, integer, and whitespace tokens::

   IDENT = ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' | '0'..'9')*;
   INT   = ('0'..'9')+;
   ?WS   = (' ' | '\t' | '\r' | '\n')+;

The names ``EOF`` and ``ERROR`` are reserved; the runtime emits them as
sentinels for end-of-input and no-match respectively.

Rule expressions
----------------

A rule body is an LL expression over tokens and rules. The atoms are:

* ``NAME``, starting with an uppercase letter — consume one token of that
  kind.
* ``name``, starting with a lowercase letter — recursively parse that rule.

Operators mirror the token language:

* ``x?``, ``x*``, ``x+`` — repetition.
* Juxtaposition ``x y`` — concatenation.
* ``x | y`` — alternation; the analyzer picks the arm using up to ``k``
  tokens of lookahead.

Parentheses group. String literals (``"..."``) and character atoms
(``'a'``, ``.``, ``!``) are **not** valid inside a rule; rules refer to
tokens by name only. This keeps tokenization decisions in one place, so
the lexer stays a single DFA.

Example: a fragment of a JSON grammar, where ``value`` is the start rule
and ``member`` is factored out::

   value  = object | array | string | number | bool | null;
   object = LBRACE (member (COMMA member)*)? RBRACE;
   array  = LBRACK (value (COMMA value)*)? RBRACK;
   member = key COLON value;
   key    = STRING;
   string = STRING;
   number = NUMBER;
   bool   = TRUE | FALSE;
   null   = NULL;

What the grammar cannot express
-------------------------------

Some constructs are deliberately rejected:

* **Left recursion.** ``expr = expr PLUS term | term`` is not a valid
  parsuna grammar. Rewrite with repetition::

     expr = term (PLUS term)*;

  Left recursion is detected as a structural check and reported before any
  analysis runs.
* **Ambiguity.** If two alternatives share a common prefix that no finite
  ``k`` can distinguish, the grammar is rejected with a conflict report
  naming the ambiguous prefix. The analyzer tries increasing values of
  ``k`` until either all conflicts vanish or the conflict count stops
  dropping for several rounds in a row.
* **Context-dependent lexing.** Every token matches the same way wherever
  it appears; there is no way to enable or disable a token based on parser
  state. Factor your grammar so that a single DFA can disambiguate tokens
  by their longest match.

Fragment rules and tokens
-------------------------

Fragments let you name a sub-pattern without it showing up in the output.
They are useful in two ways:

* **Readability.** Break a long token pattern into named pieces::

     _DIGIT = '0'..'9';
     _FRAC  = "." _DIGIT+;
     _EXP   = ('e' | 'E') ('+' | '-')? _DIGIT+;
     NUMBER = '-'? _DIGIT+ _FRAC? _EXP?;

  The fragments are inlined into ``NUMBER`` before the lexer DFA is built;
  they are not themselves token kinds at runtime.
* **Structure without noise.** Fragment rules factor common rule bodies
  without adding a nesting level to the event stream: a consumer that
  walks ``Enter``/``Exit`` events sees exactly the hierarchy your
  non-fragment rules describe.

Naming conventions and reserved names
-------------------------------------

* The first non-fragment rule declared in the file is the **default start
  rule**. Every non-fragment rule also becomes a public entry point in the
  generated parser, so the start choice is only a default; consumers can
  begin parsing from any public rule.
* A grammar must declare at least one non-fragment rule; otherwise nothing
  would be emitted.
* Token and rule name spaces are separate, but parsuna uses the case of
  the first letter of a name to decide which one you meant, so you cannot
  declare a token and a rule whose names differ only in case.
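
Putting the pieces together, here is a small end-to-end sketch: a
calculator grammar that combines skip tokens, fragments, and the
left-recursion rewrite shown above. The particular names (``expr``,
``PLUS``, and so on) are chosen for this example, not mandated by
parsuna::

   // Tokens. _DIGIT is a fragment: it is inlined into NUMBER and never
   // emitted. ?WS is matched by the lexer but dropped from the
   // structural event stream.
   _DIGIT = '0'..'9';
   NUMBER = _DIGIT+;
   PLUS   = "+";
   STAR   = "*";
   LPAREN = "(";
   RPAREN = ")";
   ?WS    = (' ' | '\t' | '\r' | '\n')+;

   // Rules. expr is declared first, so it is the default start rule.
   // Precedence is encoded by nesting (expr loosest, atom tightest)
   // instead of left recursion, which would be rejected.
   expr = term (PLUS term)*;
   term = atom (STAR atom)*;
   atom = NUMBER | LPAREN expr RPAREN;

Each alternation here is decidable with one token of lookahead
(``NUMBER`` vs. ``LPAREN``), so no ambiguity conflict is reported.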