Friday, September 01, 2006

Why I prefer LALR parsers

I was asked on a mailing list why I prefer LALR to LL based parser generators. I ended up spending enough time on it to justify posting it here. Summary: "Reduced complexity in the semantic analyser is well worth the 'cost' of thinking bottom-up in the parser."

How come you prefer LALR? What about LR(k)?

I haven't really found the need for the added expressiveness of LR(k); this, combined with the dearth of good GLR and LR(k) generators for many languages, means that I haven't ended up using one for a project yet. When people write production generators (as opposed to experimental ones), they seem to be either LL(k) or yacc-derived LALR.

I'm not surprised at the popularity of LL. As you say, if you're going to write your parser by hand you're going to write it top-down, recursive descent - ie. LL. Consequently, for people coming to the problem of parsing for the first time, LL parsers are much easier to visualise, and as you point out, consequently easier to debug. I recently had to assist a colleague with a couple of sablecc[0] bugs that were a direct result of visualising the parse top-down while using a bottom-up parser.

However, once you've written a handful of parsers, the expressive limitations of LL(k) and the software engineering challenges top-down parsing imposes on semantic analysis quickly teach you to appreciate LALR. Consequently I do not voluntarily use LL parsers in new projects.

I prefer LL(k) as it generates code similar to what you'd write yourself (a top-down parser), and that's useful when you are debugging. LL(k) is quicker :).

You're right, debugging of the parser is simpler with LL(k). That, of course, is beside the point: as I mentioned in my original email, the parser is the easy stage - the hard part is the semantic analysis, and LL makes this tougher. Simplifying the parser at the expense of the SA is false economy.

The added complexity derives from two places. The first is the additional information available at non-terminal productions in bottom-up parsers: because a bottom-up parser reduces a non-terminal only after its children have been fully parsed, the children's results are available at the moment of reduction. This additional information makes using S- or L-attribute grammars feasible for the first stage of SA. When working with LL I have come across three ways of dealing with this:

  1. Don't use attribute grammars, and instead use gradually (ie. most of the time, partially) initialised parser-global mutable state.
  2. Use nested parsers as accumulators.
  3. Use an Euler walk to convert top-down into bottom-up by attaching the SA to the post-order events.
In my experience the general approach is a combination of 1 and 2 - and they both cause serious structural problems in the resulting code, with all manner of coupling and structural aliasing between the parser and the SA that complicate ongoing maintenance. 3 wastes a lot of memory (although admittedly that is less of a problem now), but more importantly you have thrown away the very reason you went LL(k) in the first place - top-down parsing.
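Workaround 3 amounts to building the tree top-down and then firing the SA on the post-order (exit) events of a walk over it - which is why it costs memory: the whole tree has to exist before any SA runs. A minimal sketch (Python; the tuple-shaped nodes are invented for illustration):

```python
def evaluate(node):
    # node is either a number or ('-', left, right) -- an illustrative shape
    if isinstance(node, tuple):
        _op, left, right = node
        lv = evaluate(left)     # pre-order: descend top-down
        rv = evaluate(right)
        return lv - rv          # post-order event: run the SA action here
    return node

tree = ('-', ('-', 10, 5), 3)   # the parse of "10 - 5 - 3", left-associated
print(evaluate(tree))           # 2
```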

The second source of complexity is the consequence of losing access to left-recursive grammars - which has two key impacts.

  1. Elimination of left-recursion replaces one simple rule with two coupled rules.
  2. This elimination is achieved by converting left-recursion to right-recursion, which will always[4] convert an S-attribute grammar into an L-attribute grammar, and will seriously complicate any grammar that is already L-attribute.

This causes problems analogous to the interaction of non-associative operators with foldr.
Consider the following grammar (where n is a terminal matching {digit}+ )

E -> E - n | n
converted[1] becomes
E -> nT*
T -> - n
"10 - 5 - 3" is then parsed:
  • L-recursive : (((10) - 5) - 3)
  • R-recursive : (10 (- 5 (- 3)))
Now of course the SA can evaluate the R-form correctly; it just takes more work[2], less intuitive code[3], greater coupling, and greater complexity.
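The foldr analogy can be made concrete with a small sketch (Python here; functools.reduce plays the role of foldl, and foldr is hand-rolled):

```python
from functools import reduce

nums = [10, 5, 3]

# foldl: (((10 - 5) - 3) -- the shape the L-recursive parse gives you.
left = reduce(lambda acc, n: acc - n, nums)

# foldr: (10 - (5 - 3)) -- the shape the R-recursive parse naturally yields.
def foldr(f, xs):
    acc = xs[-1]
    for x in reversed(xs[:-1]):
        acc = f(x, acc)
    return acc

right = foldr(lambda n, acc: n - acc, nums)

print(left, right)  # 2 8 -- they differ because '-' is not associative
```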

LL parsers can't handle the L-recursive grammar; LR parsers can.

Reduced complexity in the SA is well worth the 'cost' of thinking bottom-up; seriously, it's not that difficult to do.

[1] The Kleene star simplifies this a bit; using e for the empty match, the classic conversion is

    E -> nE'
    E' -> - nE' | e
See Aho et al. for details.

[2] The L-recursive SA is a trivial S-attribute grammar defined as

E -> E' - n { E.val := E'.val - n.val }
E -> n      { E.val := n.val }

[3] In this case the L-recursive form is so trivial that the resulting R-form produces a trivial accumulator-based SA. However note that even at this basic level, we have converted a simple declarative form into a recursive form that is fundamentally more complex. In cases where the L-form is naturally recursive, the resulting R-form invariably introduces mutable shared state, with its attendant complications.
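The accumulator-based SA for the R-form can be sketched like this (Python; the token representation - a flat list of numbers and '-' markers - is invented for illustration):

```python
def parse_E(tokens):
    # E -> n T*, T -> '-' n : recursive descent with an accumulator,
    # so the value still comes out as if left-associated.
    toks = list(tokens)              # don't mutate the caller's list
    val = toks.pop(0)                # the leading n
    while toks and toks[0] == '-':
        toks.pop(0)                  # consume '-'
        val = val - toks.pop(0)      # fold each T into the accumulator
    return val

print(parse_E([10, '-', 5, '-', 3]))  # 2, i.e. (((10) - 5) - 3)
```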

[4] I'm reasonably certain this is the case, I don't however have time to double check this. Presumably someone will correct me if I'm mistaken here.

7 comments:

Katherine said...

Hi,

Just to add more of a grey area to your comparisons, I thought you might enjoy taking a look at the SID parser-generator.

It reads LRish grammars, and transforms them to LL(1) (as well as applying a few optimisations).

It also provides attributes in the grammar. These work especially nicely from an implementation perspective as the actions (which are language-specific) are expressed separately from the grammar itself.

Here's the users' guide. I hope you find it interesting!

Bill Clodius said...

When I first came across this blog I was puzzled. Your comments indicate that it should be difficult to perform semantic analysis using the parse trees generated using LL parser generators. But I have not read any other similar comments in over a decade of casual interest in compiler techniques. Instead the most typical comment has been that it is very easy to perform semantic analysis using such parsers. In particular it is almost too easy, as such parsers are commonly used to implement context-dependent grammars.

The confusion arose because of the difference between formal and common usage. Your analysis is strictly correct for the formal definition of recursive descent parsers. However, what are commonly meant by such parsers do not meet that formal definition. Instead such parsers supplement recursive descent with operator precedence tables. The selective use of such tables allows them to easily generate parse trees with the proper structure, and yet remain relatively legible. In LL(k) parser generators, these precedence techniques are invoked using extensions to the BNF.

Evan Driscoll said...

One more statement in support of LL parsers:

The whole left/right recursion thing is a little bit of a red herring, I think. It's true that an LL parser cannot handle left recursion, and on its face this should make it harder to create left-associative operators.

However, look at a couple production LL parser generators, like Antlr and JavaCC. Both of those provide EBNF-like notation, including * and +.

So one way to make a rule for a sequence of operations is 'term = atom (PLUS atom)*;'. The parser will read the first atom, and then decide whether or not it will read a (PLUS atom) pair. If it does, it'll again decide if it'll read another (PLUS atom) pair. Etc.

So what you can do is accumulate your AST or whatever: reading the first atom creates the root of the tree, and each time through the (PLUS atom) loop, the new atom's value is accumulated on.

What do you get? Exactly the tree that you would have with a left-recursive grammar.
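That accumulation can be sketched as follows (Python; the token list and the tuple-based AST nodes are invented for illustration):

```python
def parse_term(tokens):
    # term = atom (PLUS atom)* -- each trip through the loop grafts the
    # old tree on as the LEFT child of a new node, so the tree leans left.
    toks = list(tokens)
    tree = toks.pop(0)                    # first atom: the initial root
    while toks and toks[0] == '+':
        toks.pop(0)                       # consume PLUS
        tree = ('+', tree, toks.pop(0))   # left-associated node
    return tree

print(parse_term(['a', '+', 'b', '+', 'c']))
# ('+', ('+', 'a', 'b'), 'c') -- same shape as the left-recursive parse
```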

Ron Burk said...

left/right recursion is not quite a red herring. The ANTLR example is still chewing up stack the way right recursion does (though it's always been true of any LL(1) parser generator that you can get productions to fire in either order you prefer -- the EBNF is just syntactic sugar that adds no power or extra behaviors).

Stack is plenty cheap, you say? Actually, most Windows apps only get 1MB of stack. Nobody would ever have an expression that uses 1MB of stack, you say? Actually, it's not at all uncommon in a language like C to do things like generate an array initialization that contains (say) a large Windows bitmap, or a large constant floating-point matrix. If you've defined initializer lists with right recursion instead of left, then you just waste (possibly enormous amounts of) stack for no good reason. Or you die with an out-of-stack error, again for no good reason.

It's not all that hard to tweak the LL(1) algorithm so that it does handle (simple, direct) left recursion. You just allow the user's grammar to contain left recursion and take two simple steps internally. a) internally store it as a right-recursive rule via the standard transformation and b) notice that the result is tail recursion that can be optimized out, producing 100% left recursive behavior. YMMV, some parser stack twiddling required, etc. But there's no serious technical hurdle to making an LL(1) parser generator that handles direct left recursion as well as any bottom-up parser, IMHO.

LL(1-ish) Tinkering-in-progress
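The trick Burk describes can be sketched like this (Python; the grammar and token handling are invented for illustration -- the loop form is what's left once the tail call in the right-recursive rule is optimised away):

```python
from collections import deque

# Naive right-recursive descent for E -> n ('-' E)? :
# one stack frame per term in the input (and '-' mis-associates too).
def parse_right(toks):
    val = toks.popleft()
    if toks and toks[0] == '-':
        toks.popleft()
        return val - parse_right(toks)
    return val

# The same rule with its tail recursion turned into a loop:
# left-recursive behaviour in constant stack depth.
def parse_loop(toks):
    val = toks.popleft()
    while toks and toks[0] == '-':
        toks.popleft()
        val = val - toks.popleft()
    return val

big = deque([0] + ['-', 1] * 50_000)   # 50,000 subtractions
print(parse_loop(big))                 # -50000, constant stack
# parse_right on the same input would exceed Python's default recursion limit
```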

David Turner said...

One advantage of LL grammars that you didn't mention is in interactive parsing: after parsing a partial sentence you have the stack of productions that are expected to be unwound, so you can usually give helpful (i.e. contextualized) advice to the user about what comes next. LR grammars of course can tell you what's in the follow set, but it's usually difficult to establish what grammatical construct they'll eventually form.

Evan Driscoll said...

Huh, I have stuff I could say to Ron Burk's response to me. Even though it's a couple years stale, might as well.

Ron states that EBNF adds no extra power. This statement comes from the (very well-founded and correct) observation that it is possible to reduce EBNF grammars to standard CFGs (without the "E"). (Actually describing this transformation formally is a homework problem I've given in a compilers class on a couple occasions.)

However, from a more practical standpoint, it is sometimes wrong. In particular, it is wrong with ANTLR. ANTLR generates top-down, LL(*) parsers from EBNF grammars. However, it does not do the above transformation to turn EBNF into a normal CFG. Instead, a star operator in a rule turns into a loop in the code. Not recursion -- a loop. (E.g. the grammar rule <expr> -> <term> (PLUS <term>)* turns into code like expr() { term(); while(...) { PLUS(); term(); } }. The ... is a lookahead check similar to what happens when it has to choose between productions.)

In that sense, the star operator does add power, because it is not possible to get a loop otherwise. This means that many of the other things you say can be refuted as well -- for instance, there's no stack growth from using a star operator in a rule.

---

I'm not sure if I had fully formed these opinions when I posted before, but I now have a very specific reason why I prefer LL parsers: testability.

It is possible to test the recognition of individual nonterminals directly. If I'm writing a C parser and want to write a test to make sure that 1+3 is an expression, I can do that easily and directly -- I don't need to say "what context can I use an expression in; well, probably the simplest is to say something like int i=1+3;". With an LL parser, I just call expr("1+3"). With any LR parser generator I know, you can only pick one start nonterminal and so I'd have to change the parser spec and rebuild the parser. And then if I want to write a test that for(;;); is a statement, I just call stmt("for(;;);"). With an LR parser, now I have to change the spec and recompile. Oh wait, that breaks my last test.

(It may be that there is a way to make this work. I've tried looking for a way to do it with Bison and couldn't come up with anything. I am not a parser expert, though I do know more than most devs.)
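To make the testability point concrete: in a hand-rolled LL parser each nonterminal is an ordinary function, so a test can drive any of them directly. A toy sketch (Python; the Parser class and its tokenisation are invented for illustration):

```python
class Parser:
    # Toy LL parser: each nonterminal is a method, directly callable in tests.
    def __init__(self, text):
        self.toks = text.replace('+', ' + ').split()
        self.pos = 0

    def expr(self):
        # expr = NUM ('+' NUM)*
        val = int(self.toks[self.pos]); self.pos += 1
        while self.pos < len(self.toks) and self.toks[self.pos] == '+':
            self.pos += 1                            # consume '+'
            val += int(self.toks[self.pos]); self.pos += 1
        return val

print(Parser("1+3").expr())  # 4 -- no enclosing statement or start symbol needed
```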

Bill Clodius said...

EBNF is very useful for LL parsers, but much less useful for LR parsers. LL parsers are incompatible with left recursion, but the most common forms of left recursion have an interpretation as iteration that can be represented by common extensions to BNF. The use of such extensions is an idiom that lets the LL parser recognize that interpretation and produce code closely matching the structure of the grammar. (FWIW, there have been parsers that mostly rely on recursive descent, use non-extended BNF, and have called themselves LL(k) parsers, but they lied: for left recursion they do not use recursive descent, and the code does not closely match the grammar.) LR parsers do not have the same problem with left recursion that LL parsers have. For some people the extended BNF is easier to understand, and some LR parsers have accepted it as input, but it has no significant effect on the resulting parser.