The grammar language¶
A parsuna grammar is a sequence of declarations in a .parsuna
file. Each declaration defines either a token (matched by a regular
pattern over characters) or a rule (an LL expression over tokens
and other rules). Whitespace and // line comments between
declarations are ignored.
Declarations¶
Every declaration has the form:
[?] name = body ;
Whether the declaration is a token or a rule is determined by the case
of the first letter of name (skipping a leading _):
- An uppercase initial makes it a token: `IDENT`, `STRING`, `_HEX_DIGIT`.
- A lowercase initial makes it a rule: `expr`, `statement`, `_parenthesized`.
Two optional prefixes modify the meaning:
- `?` marks a skip token. The lexer still matches it, but the parser drops it from the structural event stream. Skip tokens are still delivered to consumers as events — they just appear outside any `Enter`/`Exit` scope, interleaved with structure in source order. Only tokens can be skip tokens; `?` on a rule is an error.
- `_` marks a fragment. Fragment tokens can be referenced from other token bodies but are not themselves produced at runtime; they are inlined into their callers before the lexer DFA is built. Fragment rules are inlined the same way — a fragment rule emits no `Enter`/`Exit` events and is not part of the public parser API.
The two markers cannot be combined: both `?_` and `_?` are rejected, since a fragment is never produced at runtime and so has nothing to skip.
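For example, a grammar might pair a skip token for whitespace with a fragment that is only referenced from another token (`COLOR` and `_HEX_DIGIT` are illustrative names, not part of parsuna itself):

```
?WS = (' ' | '\t' | '\r' | '\n')+;
_HEX_DIGIT = '0'..'9' | 'a'..'f' | 'A'..'F';
COLOR = "#" _HEX_DIGIT _HEX_DIGIT _HEX_DIGIT;
```

The lexer still matches `WS`, but its events stay outside every `Enter`/`Exit` scope, while `_HEX_DIGIT` is inlined into `COLOR` and never appears as a token kind at runtime.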
Token patterns¶
A token body is a regular expression over characters. The atoms are:
- `"abc"` — a string literal, matches exactly those bytes.
- `'a'` — a character literal, matches one codepoint.
- `'a'..'z'` — a character range, matches any codepoint in the inclusive range.
- `.` — matches any codepoint.
- `!x` — the negation of character atom `x` (or of a parenthesized list of atoms separated by `|`). For example, `!('"' | '\n')` matches any codepoint that is neither `"` nor a newline.
- `NAME` — a reference to another token (usually a fragment).
Operators, from tightest to loosest binding:
- `x?` — zero or one `x`.
- `x*` — zero or more `x`.
- `x+` — one or more `x`.
- Juxtaposition `x y` — concatenation.
- `x | y` — alternation.
Parentheses group.
Escapes inside string and character literals follow a small fixed set: `\n`, `\r`, `\t`, `\0`, `\\`, `\'`, `\"`, and `\u{HHHH}` for arbitrary Unicode codepoints (1–6 hex digits).
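As a sketch combining the atoms and escapes above, a double-quoted string token that allows backslash escapes might be written:

```
STRING = '"' (!('"' | '\\' | '\n') | '\\' .)* '"';
```

The negated group matches any codepoint that is not a quote, a backslash, or a newline, and `'\\' .` consumes a backslash followed by any single codepoint.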
Example: declaring identifier, integer, and whitespace tokens:
IDENT = ('A'..'Z' | 'a'..'z' | '_') ('A'..'Z' | 'a'..'z' | '_' | '0'..'9')*;
INT = ('0'..'9')+;
?WS = (' ' | '\t' | '\r' | '\n')+;
The names EOF and ERROR are reserved; the runtime emits them as
sentinels for end-of-input and no-match respectively.
Rule expressions¶
A rule body is an LL expression over tokens and rules. The atoms are:
- `NAME`, where `NAME` starts with an uppercase letter — consume one token of that kind.
- `name`, where `name` starts with a lowercase letter — recursively parse that rule.
Operators mirror the token language:
- `x?`, `x*`, `x+` — repetition.
- Juxtaposition `x y` — concatenation.
- `x | y` — alternation; the analyzer picks the arm using up to `k` tokens of lookahead.
Parentheses group.
String literals (`"..."`) and character atoms (`'a'`, `.`, `!`) are not valid inside a rule — rules refer to tokens by name only. This keeps tokenization decisions in one place, so the lexer stays a single DFA.
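For instance, where another tool might allow a literal `"+"` directly inside a rule, parsuna expects a named token (a minimal sketch reusing the `INT` token declared earlier):

```
PLUS = "+";
sum = INT (PLUS INT)*;
```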
Example: a fragment of a JSON grammar, where value is the start
rule and member is factored out:
value = object | array | string | number | bool | null;
object = LBRACE (member (COMMA member)*)? RBRACE;
array = LBRACK (value (COMMA value )*)? RBRACK;
member = key COLON value;
key = STRING;
string = STRING;
number = NUMBER;
bool = TRUE | FALSE;
null = NULL;
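The rule fragment above assumes matching token declarations. A sketch of what they might look like (these exact bodies are illustrative, not part of the original grammar):

```
LBRACE = "{";
RBRACE = "}";
LBRACK = "[";
RBRACK = "]";
COMMA = ",";
COLON = ":";
TRUE = "true";
FALSE = "false";
NULL = "null";
STRING = '"' (!('"' | '\\') | '\\' .)* '"';
_DIGIT = '0'..'9';
NUMBER = '-'? _DIGIT+ ("." _DIGIT+)? (('e' | 'E') ('+' | '-')? _DIGIT+)?;
?WS = (' ' | '\t' | '\r' | '\n')+;
```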
What the grammar cannot express¶
Some constructs are deliberately rejected:
- Left recursion. `expr = expr PLUS term | term` is not a valid parsuna grammar. Rewrite with repetition: `expr = term (PLUS term)*;`. Left recursion is detected as a structural check and reported before any analysis runs.
- Ambiguity. If two alternatives share a common prefix that no finite `k` can distinguish, the grammar is rejected with a conflict report naming the ambiguous prefix. The analyzer tries increasing values of `k` until either all conflicts vanish or the conflict count stops dropping for several rounds in a row.
- Context-dependent lexing. Every token matches the same way wherever it appears; there is no way to enable or disable a token based on parser state. Factor your grammar so that a single DFA can disambiguate tokens by their longest match.
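As an illustrative sketch of an ambiguity that no finite `k` resolves (`STAR`, `X`, and `Y` are hypothetical tokens), consider:

```
a = STAR* X | STAR* Y;
```

For any lookahead depth `k`, the first `k` tokens can all be `STAR`, so the analyzer can never commit to an arm. Factoring out the common prefix, `a = STAR* (X | Y);`, removes the conflict.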
Fragment rules and tokens¶
Fragments let you name a sub-pattern without it showing up in the output. They are useful in two ways:
- Readability. Break a long token pattern into named pieces:

  _DIGIT = '0'..'9';
  _FRAC = "." _DIGIT+;
  _EXP = ('e' | 'E') ('+' | '-')? _DIGIT+;
  NUMBER = '-'? _DIGIT+ _FRAC? _EXP?;

  The fragments are inlined into `NUMBER` before the lexer DFA is built; they are not themselves token kinds at runtime.
- Structure without noise. Fragment rules factor common rule bodies without adding a nesting level to the event stream — a consumer that walks `Enter`/`Exit` events sees exactly the hierarchy your non-fragment rules describe.
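For example, a fragment rule can factor an argument list without adding a level to the event hierarchy (`call`, `_args`, `LPAREN`, and `RPAREN` are illustrative names; `expr`, `IDENT`, and `COMMA` echo earlier examples):

```
_args = expr (COMMA expr)*;
call = IDENT LPAREN _args? RPAREN;
```

A consumer walking the events sees a single `Enter`/`Exit` pair for `call` with the argument `expr` events directly inside it; no `_args` level ever appears.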
Naming conventions and reserved names¶
The first non-fragment rule declared in the file is the default start rule. Every non-fragment rule also becomes a public entry point in the generated parser, so the start choice is only a default: consumers can begin parsing from any public rule.
A grammar must declare at least one non-fragment rule; otherwise there would be nothing to emit.
Token and rule name spaces are separate, but parsuna uses the case of the first letter of the name to decide which one you meant, so you cannot declare a token and a rule with names that differ only in case.