Lone Star Ruby Conf 2008: Grammar - A BNF-like Ruby DSL for Parsing
A Grammar is an object similar to a Regexp. It specifies an input pattern to match and it integrates well surrounding Ruby code because it is simply an object. Instead of specifying the “pattern” using operators in a string (or file like most other parsing packages), operators and other methods to create build larger Grammar objects from smaller ones. This results in a Ruby DSL that resembles BNF. With a Ruby DSL, there is quite a bit more flexibility, better integration, and seamless actions (using blocks).
To give an idea how a complex Grammar is formed, here are some of the Grammar molecules and combinators/operators/methods along with the equivalent Regexp:
E(?a) : a
E(?a..?z) : [a..z]
ANY : .
NULL : (?:)
EOF : \Z
Recurse { |f| ... f ... } : (?<f> ... \g<f> ... )
x|y : x|y
x+y : xy
+A : (?=A)
-A : (?!A)
x.optional : x? : x|NULL
x*n : x{n}
x*(n..m) : x{n,m}
x.repeat0 : x* : Recurse { |g| g+x | NULL }
x.repeat1 : x+ : Recurse { |g| (g|NULL) + x }
x.repeat0(y) : x*?y : Recurse { |g| y | x+g }
x.repeat1(y) : x+?y : Recurse { |g| x + (y|g) }
One aspect of the grammar package that has been used extensively to increase flexibility is duck-typing. Here are the major “duck types” along with some uses:
- atomic pattern:
- needs #=== to match an element in the input
- typically exact or a range
- input:
- input sequence to parsing
- like the IO#getc method
- Backtracking wraps a buffer around the input.
- Parser input may come from lexer output.
- No need to read entire file into memory.
- output:
- output of parsing
- needs #<< and #concat methods like Array/String
- will simply mirror the input without intervention
- A producer/consumer queue is used for a multi-threaded lexer.
- An “abstract syntax forest” is formed with a hieararchy of outputs.
- engine:
- Most methods take one or more grammars.
- The reference engine directly parses the input.
- Efficient engines generate optimized parsers in different languages.
- For now only Ruby parser is generated and a C parser is generated using Ruby2CExtension.
- Generated parsers typically rival or beat the performance of
- hand-crafted parsers in the same target language (even with the help of Regexp).
- parser:
- concrete implementation of a Grammar
- API is dependent on the engine.
Like Regexp, Grammar handles ambiguous alternatives by simply letting the first match win. Unlike Regexp, Grammar does not backtrack by default when an alternative fails after it started to match. This is because Grammar is mostly an efficient LL(1) parser and this helps detect a mismatch right where it occurs. Fixed length or unlimited backtracking is still possible by using the #backtrack method on a Grammar. With Grammar, context may also be used to resolve (ambiguous) alternatives, whereas almost ever other parser generator requires context-free grammars.
The way recursion is handled and optimized is another interesting aspect of Grammar. First, a tail-call optimization is made so that right recursive Grammars will imply a loop. Secondly, and most important is that even left recursion is handled. The only other known techinque for handling left recursion in an LL parser is [Frost2008], which incurs a performance penalty. The novel techique used by Grammar results in left recursive grammars being even more efficient than right recursive ones (after tail-call optimization), similar to LR parsers.