The grammar language
====================

A parsuna grammar is a sequence of **declarations** in a ``.parsuna`` file.
Each declaration defines either a **token** (matched by a regular pattern
over characters) or a **rule** (an LL expression over tokens and other
rules). Whitespace and ``//`` line comments between declarations are
ignored.

Declarations
------------

Every declaration has the form::

   [?] name = body ;

Whether the declaration is a token or a rule is determined by the case of
the first letter of ``name`` (skipping a leading ``_``):

* An **uppercase** initial makes it a token: ``IDENT``, ``STRING``,
  ``_HEX_DIGIT``.
* A **lowercase** initial makes it a rule: ``expr``, ``statement``,
  ``_parenthesized``.

Two optional prefixes modify the meaning:

* ``?`` marks a **skip token**. The lexer still matches it, but the parser
  drops it from the structural event stream. Skip tokens are still
  delivered to consumers as events; they just appear outside any
  ``Enter``/``Exit`` scope, interleaved with structure in source order.
  Only tokens can be skip tokens; ``?`` on a rule is an error.
* ``_`` marks a **fragment**. Fragment tokens can be referenced from other
  token bodies but are not themselves produced at runtime; they are inlined
  into their callers before the lexer DFA is built. Fragment rules are
  inlined the same way: a fragment rule emits no ``Enter``/``Exit`` event
  and is not part of the public parser API.

The two markers cannot be combined: although a combination would only be
meaningful on a token, both ``?_`` and ``_?`` are rejected.

Token patterns
--------------

A token body is a regular expression over characters. The atoms are:

* ``"abc"`` — a string literal, matches exactly those bytes.
* ``'a'`` — a character literal, matches one codepoint.
* ``'a'..'z'`` — a character range, matches any codepoint in the inclusive
  range.
* ``.`` — matches any codepoint.
* ``!x`` — the negation of character atom ``x`` (or of a parenthesized
  list of atoms separated by ``|``).
  For example, ``!('"' | '\n')`` matches any codepoint that is neither
  ``"`` nor a newline.
* ``NAME`` — a reference to another token (usually a fragment).

Operators, from tightest to loosest binding:

* ``x?`` — zero or one ``x``.
* ``x*`` — zero or more ``x``.
* ``x+`` — one or more ``x``.
* Juxtaposition ``x y`` — concatenation.
* ``x | y`` — alternation.

Parentheses group. Escapes inside string and character literals follow a
small fixed set: ``\n``, ``\r``, ``\t``, ``\0``, ``\\``, ``\'``, ``\"``,
and ``\u{HHHH}`` for arbitrary Unicode codepoints (1–6 hex digits).

Example: declaring identifier, integer, and whitespace tokens::

   IDENT = ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' | '0'..'9')*;
   INT   = ('0'..'9')+;
   ?WS   = (' ' | '\t' | '\r' | '\n')+;

The names ``EOF`` and ``ERROR`` are reserved; the runtime emits them as
sentinels for end-of-input and no-match respectively.

Rule expressions
----------------

A rule body is an LL expression over tokens and rules. The atoms are:

* ``NAME``, starting with an uppercase letter — consume one token of that
  kind.
* ``name``, starting with a lowercase letter — recursively parse that rule.

Operators mirror the token language:

* ``x?``, ``x*``, ``x+`` — repetition.
* Juxtaposition ``x y`` — concatenation.
* ``x | y`` — alternation; the analyzer picks the arm using up to ``k``
  tokens of lookahead.

Parentheses group. String literals (``"..."``) and character atoms
(``'a'``, ``.``, ``!``) are **not** valid inside a rule; rules refer to
tokens by name only. This keeps tokenization decisions in one place, so
the lexer stays a single DFA.

Example: a fragment of a JSON grammar, where ``value`` is the start rule
and ``member`` is factored out::

   value  = object | array | string | number | bool | null;
   object = LBRACE (member (COMMA member)*)? RBRACE;
   array  = LBRACK (value (COMMA value)*)? RBRACK;
   member = key COLON value;
   key    = STRING;
   string = STRING;
   number = NUMBER;
   bool   = TRUE | FALSE;
   null   = NULL;

What the grammar cannot express
-------------------------------

Some constructs are deliberately rejected:

* **Left recursion.** ``expr = expr PLUS term | term`` is not a valid
  parsuna grammar. Rewrite with repetition::

     expr = term (PLUS term)*;

  Left recursion is detected as a structural check and reported before any
  analysis runs.
* **Ambiguity.** If two alternatives share a common prefix that no finite
  ``k`` can distinguish, the grammar is rejected with a conflict report
  naming the ambiguous prefix. The analyzer tries increasing values of
  ``k`` until either all conflicts vanish or the conflict count stops
  dropping for several rounds in a row.
* **Context-dependent lexing.** Every token matches the same way wherever
  it appears; there is no way to enable or disable a token based on parser
  state. Factor your grammar so that a single DFA can disambiguate tokens
  by their longest match.

Fragment rules and tokens
-------------------------

Fragments let you name a sub-pattern without it showing up in the output.
They are useful in two ways:

* **Readability.** Break a long token pattern into named pieces::

     _DIGIT = '0'..'9';
     _FRAC  = "." _DIGIT+;
     _EXP   = ('e' | 'E') ('+' | '-')? _DIGIT+;
     NUMBER = '-'? _DIGIT+ _FRAC? _EXP?;

  The fragments are inlined into ``NUMBER`` before the lexer DFA is built;
  they are not themselves token kinds at runtime.
* **Structure without noise.** Fragment rules factor common rule bodies
  without adding a nesting level to the event stream: a consumer that
  walks ``Enter``/``Exit`` events sees exactly the hierarchy your
  non-fragment rules describe.

Naming conventions and reserved names
-------------------------------------

* The first non-fragment rule declared in the file is the **default start
  rule**. Every non-fragment rule also becomes a public entry point in the
  generated parser, so the start choice is only a default; consumers can
  begin parsing from any public rule.
* A grammar must declare at least one non-fragment rule; otherwise nothing
  would be emitted.
* Token and rule name spaces are separate, but parsuna uses the case of
  the first letter of a name to decide which one you meant, so you cannot
  declare a token and a rule whose names differ only in case.
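
Putting the pieces together, here is a small end-to-end sketch: a
calculator grammar that combines skip tokens, fragments, and the
left-recursion rewrite shown above. The particular names (``expr``,
``PLUS``, and so on) are chosen for this example, not mandated by
parsuna::

   // Tokens. _DIGIT is a fragment: it is inlined into NUMBER and never
   // emitted. ?WS is matched by the lexer but dropped from the
   // structural event stream.
   _DIGIT = '0'..'9';
   NUMBER = _DIGIT+;
   PLUS   = "+";
   STAR   = "*";
   LPAREN = "(";
   RPAREN = ")";
   ?WS    = (' ' | '\t' | '\r' | '\n')+;

   // Rules. expr is declared first, so it is the default start rule.
   // Precedence is encoded by nesting (expr loosest, atom tightest)
   // instead of left recursion, which would be rejected.
   expr = term (PLUS term)*;
   term = atom (STAR atom)*;
   atom = NUMBER | LPAREN expr RPAREN;

Each alternation here is decidable with one token of lookahead
(``NUMBER`` vs. ``LPAREN``), so no ambiguity conflict is reported.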