Lone Star Ruby Conf 2008: Grammar - A BNF-like Ruby DSL for Parsing

Posted in Conferences, Development on September 17, 2008


Lone Star Ruby Conf 2008: Grammar - A BNF-like Ruby DSL for Parsing

A Grammar is an object similar to a Regexp. It specifies an input pattern to match and it integrates well surrounding Ruby code because it is simply an object. Instead of specifying the “pattern” using operators in a string (or file like most other parsing packages), operators and other methods to create build larger Grammar objects from smaller ones. This results in a Ruby DSL that resembles BNF. With a Ruby DSL, there is quite a bit more flexibility, better integration, and seamless actions (using blocks).

To give an idea how a complex Grammar is formed, here are some of the Grammar molecules and combinators/operators/methods along with the equivalent Regexp:

E(?a)        : a
E(?a..?z) : [a..z]
ANY : .
NULL : (?:)
EOF : \Z
Recurse { |f| ... f ... } : (?<f> ... \g<f> ... )
x|y : x|y
x+y : xy
+A : (?=A)
-A : (?!A)
x.optional : x? : x|NULL
x*n : x{n}
x*(n..m) : x{n,m}
x.repeat0 : x* : Recurse { |g| g+x | NULL }
x.repeat1 : x+ : Recurse { |g| (g|NULL) + x }
x.repeat0(y) : x*?y : Recurse { |g| y | x+g }
x.repeat1(y) : x+?y : Recurse { |g| x + (y|g) }

One aspect of the grammar package that has been used extensively to increase flexibility is duck-typing. Here are the major “duck types” along with some uses:

  • atomic pattern:
    • needs #=== to match an element in the input
    • typically exact or a range
  • input:
    • input sequence to parsing
    • like the IO#getc method
    • Backtracking wraps a buffer around the input.
    • Parser input may come from lexer output.
    • No need to read entire file into memory.
    • output:
  • output of parsing
    • needs #<< and #concat methods like Array/String
    • will simply mirror the input without intervention
    • A producer/consumer queue is used for a multi-threaded lexer.
    • An “abstract syntax forest” is formed with a hieararchy of outputs.
  • engine:
    • Most methods take one or more grammars.
    • The reference engine directly parses the input.
    • Efficient engines generate optimized parsers in different languages.
    • For now only Ruby parser is generated and a C parser is generated using Ruby2CExtension.
  • Generated parsers typically rival or beat the performance of
    • hand-crafted parsers in the same target language (even with the help of Regexp).
  • parser:
    • concrete implementation of a Grammar
    • API is dependent on the engine.

Like Regexp, Grammar handles ambiguous alternatives by simply letting the first match win. Unlike Regexp, Grammar does not backtrack by default when an alternative fails after it started to match. This is because Grammar is mostly an efficient LL(1) parser and this helps detect a mismatch right where it occurs. Fixed length or unlimited backtracking is still possible by using the #backtrack method on a Grammar. With Grammar, context may also be used to resolve (ambiguous) alternatives, whereas almost ever other parser generator requires context-free grammars.

The way recursion is handled and optimized is another interesting aspect of Grammar. First, a tail-call optimization is made so that right recursive Grammars will imply a loop. Secondly, and most important is that even left recursion is handled. The only other known techinque for handling left recursion in an LL parser is [Frost2008], which incurs a performance penalty. The novel techique used by Grammar results in left recursive grammars being even more efficient than right recursive ones (after tail-call optimization), similar to LR parsers.

Eric Mahurin
Eric Mahurin has 15 years of experience in IC hardware and software design with primary areas of expertise being computer arithmetic (i.e. adders, multipliers, shifters), cell design/analysis (timing, power, reliability, verification), and layout generation/analysis (a lot of computational geometry along with graph theory). He has worked at AMD (Athlon and Opteron), EDA companies, and currently works at Intel on a SIMD arithmetic unit. Eric developed an interest in parser generators after writing many ad-hoc parsers for various formats used in IC design. He stumbled upon Ruby in 2005 in a quest to write a parser/lexer generator where the grammar could be specified in the target language. Eric and his wife live in Austin.

Watch Video Watch Video

Tags: Conferences, Ruby, DSL, Parser, Confreaks, Lone Star Ruby Conf 2008, Grammar, BNF