I was asked on a mailing list why I prefer LALR to LL based parser generators. I ended up spending enough time on it to justify posting it here. Summary: "Reduced complexity in the semantic analyser is well worth the 'cost' of thinking bottom-up in the parser."
How come you prefer LALR? What about LR(k)?
I haven't really found the need for the added expressiveness of LR(k); this, combined with the dearth of good GLR and LR(k) generators for many languages, means that I haven't ended up using one for a project yet. When people write production generators (as opposed to experimental ones), they seem to be either LL(k) or yacc+.
I'm not surprised at the popularity of LL: as you say, if you're going to write your parser by hand you're going to write it top-down, recursive descent - i.e. LL. Consequently, for people coming to parsing for the first time, LL parsers are much easier to visualise and, as you point out, easier to debug. I recently had to assist a colleague with a couple of sablecc[0] bugs that were a direct result of visualising the parse top-down while using a bottom-up parser.
However, once you've written a handful of parsers, the expressive limitations of LL(k) and the software-engineering challenges top-down imposes on semantic analysis quickly teach you to appreciate LALR. Consequently I do not voluntarily use LL parsers in new projects.
I prefer LL(k) as it generates code similar to that you'd write
yourself (top down parser) and that's useful when you are debugging.
LL(k) is quicker :).
You're right: debugging of the parser is simpler with LL(k). That, of course, is beside the point. As I mentioned in my original email, the parser is the easy stage - the hard part is the semantic analysis, and LL makes this tougher. Simplifying the parser at the expense of the SA is a false economy.
The added complexity derives from two places. The first is the additional information available at non-terminal productions in bottom-up parsers: because a bottom-up parser reduces a non-terminal only after its children have been fully parsed, the children's attributes are available at the point of reduction. This additional information makes using S- or L-attribute grammars feasible for the first stage of SA. When working with LL I have come across three ways of dealing with this:
- Don't use attribute grammars, and use gradually (i.e. most of the time, partially) initialised parser-global mutable state.
- Use nested parsers as accumulators.
- Use an Euler walk to convert top-down into bottom-up by attaching the SA to the post-order events.
In my experience the general approach is a combination of 1 and 2 - and both cause serious structural problems in the resulting code, with all manner of coupling and structural aliasing between the parser and the SA that complicate ongoing maintenance. 3 wastes a lot of memory (although admittedly that is less of a problem now), but more importantly you have thrown away the very reason you went LL(k) in the first place - top-down parsing.
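To illustrate option 3, here is a minimal sketch (the tree shape and function names are my own, purely for illustration): the parser builds the tree top-down, and the SA is attached to the post-order events of the walk, recovering bottom-up evaluation at the cost of materialising the whole tree first.

```python
# Sketch of approach 3 (hypothetical names): SA fires on post-order
# events, so each node sees its children's synthesized values -- just
# as a bottom-up parser would at reduction time.
def post_order(node, visit):
    for child in node.get("children", []):
        post_order(child, visit)
    visit(node)  # SA runs only after all children are done

# a hand-built tree for "10 - 5 - 3", shaped as the L-recursive parse
tree = {"op": "-", "children": [
            {"op": "-", "children": [{"val": 10}, {"val": 5}]},
            {"val": 3}]}

def evaluate(node):
    # only '-' appears in this example, so subtraction is assumed
    if "val" not in node:
        l, r = node["children"]
        node["val"] = l["val"] - r["val"]

post_order(tree, evaluate)
print(tree["val"])  # → 2
```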
The second source of complexity is the consequence of losing access to left-recursive grammars - which has two key impacts.
- Elimination of left-recursion replaces one simple rule with two coupled rules.
- This elimination is achieved by converting left-recursion to right-recursion; which will always[4] convert an S-attribute grammar into an L-attribute grammar; and seriously complicate any L-attribute grammar.
This causes problems analogous to the interaction of non-associative operators with foldr.
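The analogy is easy to demonstrate. With a non-associative operator like subtraction, a left fold and a right fold over the same operands give different answers (Python's functools.reduce plays foldl here; foldr is hand-rolled):

```python
from functools import reduce

nums = [10, 5, 3]

# foldl: ((10 - 5) - 3) -- the grouping of the L-recursive parse
left = reduce(lambda acc, n: acc - n, nums)

# foldr: (10 - (5 - 3)) -- the grouping of the R-recursive parse
def foldr(f, xs):
    acc = xs[-1]
    for x in reversed(xs[:-1]):
        acc = f(x, acc)
    return acc

right = foldr(lambda n, acc: n - acc, nums)

print(left, right)  # → 2 8
```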
Consider the following grammar (where n is a terminal matching {digit}+):
E -> E - n | n
converted[1] becomes
E -> nT*
T -> - n
"10 - 5 - 3" is then parsed:
- L-recursive:
(((10) - 5) - 3)
- R-recursive:
(10 (- 5 (- 3)))
Now of course the SA can evaluate the R-form correctly; it just takes more work[2], less intuitive code[3], greater coupling, and greater complexity.
LL parsers can't handle the L-recursive grammar; LR parsers can.
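To make "more work" concrete, here is a minimal sketch (the helper names are mine, not from any generator) of a hand-written LL(1) parser for the converted grammar, threading an accumulator - in effect the inherited attribute the right-recursive form imposes - to recover the left-associative result:

```python
# Sketch only: LL(1) parser for the converted grammar
#   E -> n T*     T -> - n
# `acc` is the inherited attribute forced by right recursion; without
# it the natural evaluation would group as (10 (- 5 (- 3))).
def parse_E(tokens):
    acc = int(tokens.pop(0))       # E -> n ...
    return parse_T(tokens, acc)    # pass the value down the recursion

def parse_T(tokens, acc):
    if tokens and tokens[0] == '-':    # T -> - n
        tokens.pop(0)                  # consume '-'
        acc = acc - int(tokens.pop(0)) # fold the next n into acc
        return parse_T(tokens, acc)
    return acc                         # no more tail: synthesize result

print(parse_E(['10', '-', '5', '-', '3']))  # → 2
```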
Reduced complexity in the SA is well worth the 'cost' of thinking bottom-up; seriously, it's not that difficult to do.
[1] The Kleene star simplifies this a bit; using e for the empty match, the classic conversion is
E -> nE'
E' -> - nE' | e
See Aho et al. for details.
[2] The L-recursive SA is a trivial S-attribute grammar defined as
E -> E' - n { E.val := E'.val - n.val }
E -> n { E.val := n.val }
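For comparison, a sketch of how a bottom-up parse evaluates that S-attribute grammar directly: each reduction pops its children's synthesized values from a value stack and pushes the parent's. (This hand-simulates the reduce sequence for "10 - 5 - 3"; no parser generator is involved, and the function name is mine.)

```python
# Sketch: simulating the LR reductions for the S-attribute grammar
#   E -> E - n { E.val := E'.val - n.val }
#   E -> n     { E.val := n.val }
# using a value stack, as an LR parser's semantic actions would.
def evaluate_lr(tokens):
    stack = []
    for tok in tokens:
        if tok == '-':
            continue                # the '-' terminal carries no value
        stack.append(int(tok))      # shift n, reduce E -> n
        if len(stack) == 2:         # reduce E -> E - n
            r = stack.pop()
            l = stack.pop()
            stack.append(l - r)     # E.val := E'.val - n.val
    return stack[0]

print(evaluate_lr(['10', '-', '5', '-', '3']))  # → 2
```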
[3] In this case the L-recursive form is so trivial that the resulting R-form produces a trivial accumulator-based SA. However, note that even at this basic level we have converted a simple declarative form into a recursive form that is fundamentally more complex. In cases where the L-form is naturally recursive, the resulting R-form invariably introduces mutable shared state, with its attendant complications.
[4] I'm reasonably certain this is the case; I don't, however, have time to double-check it. Presumably someone will correct me if I'm mistaken here.