Tuesday, April 11, 2006

Visitor Pattern and Trees Considered Harmful

I have just finished a very fustrating week fighting a parser generator implementing some syntactic sugar for Kowari. The reason why it's been so fustrating is simply because it should have taken one day to implement, test, and document, except that the parser generator was sablecc.

Now sablecc has been stable, featureful, and reasonably performant. Documentation is a minor problem given that it really only consists of a tutorial and the author's masters thesis; still it's a LALR parser generator once you have an EBNF exactly how hard can it be? As a parser generator sablecc works wonderfully.

Unfortunately that's also where the trouble starts, because it's very rare for someone to only want a parser. Generally someone is using a parser generator because they are writing some sort of compiler, and so the parser generator feeds into some sort of semantic analysis stage. And this is where sablecc fails, cf. the topic of todays post.

Sablecc generates an AST. Sablecc always generates an AST! And of course how does your hard working, OOP entrenched, lemming of an author support inspection of this tree? Well as with all well behaved OOP lemmings the author parrots off in the true unquestioning drone of the acolyte, "The visitor pattern of course". Think I'm joking? The thesis was called "SABLECC, AN OBJECT-ORIENTED COMPILER FRAMEWORK"! At this point swearing would be appropriate.

Let's step back and look at the problem we are actually trying to solve. We have an context-free grammar that describes how we can generate an abstract syntax tree from a given input. Semantic Analysis is effectively a transform applied to the AST. Now the nature of this transform is remarkably consistent across compilers. Occasionally it amounts to a eulers walk over the tree, but generally it is a straight forward post-order traversal generating what are called 'Synthsised Attributes' at each production from it and it's children[0]. The result is called an 'Annotated Syntax Tree', and generally the attribute derivations are defined to ensure that the desired output of the semantic analyser is an attribute at the root of the AST.

Attribute derivation defintions are routinely defined as code snippets that convert child attributes into production attributes. In other words we have a series of functions that are dispatched on the AST node type, are called within the context of the results of the child production's functions, and are called as a post-order traversal.

Finally we have the added advantage that each node in the AST corresponds to a reduction in the LALR bottom-up shift-reduce parser that is the ultimate goal of an LALR parser generator. Being bottom-up, this is automatically a post-order sequence of reductions. Consequently we can choose to forgo the generation of the AST altogether and associate the functions directly with the reductions if we so desire. (and why wouldn't we?)

Now generating an AST is sometimes desirable, for instance when you require the afore mentioned eulers walk based traversal, or if you are wanting to unparse the AST back into source-code again later. But regardless of if you generate an AST or not, most compilations are a simple bottom-up generation of the root anntoation in an annotated syntax tree based on associating functions with AST-nodes/parser-reductions.

So that is the problem we want to solve, and in the strictest possible sense it is fundamentally functional in nature. This of course poses a serious problem to our aforementioned object-oriented-lemming, because the natural, straight forward solution is not object-oriented. So the lemming spits out a tree and of course doesn't even need to think about the next step. After all "the object-oriented way to traverse a tree is the visitor pattern", so we'll disregard the fact that the problem is functional in nature --- we have this hammer and by george if we aren't going to make this problem fit our solution.

The visitor pattern is kludge used by programmers in a language that doesn't support multiple-dispatch to provide themselves with a static version of double-dispatch. The cannonical example that was used by the GoF when they documented the pattern is tree traversal. And ever since that fateful day no OOL has dared consider anything else (feh). Note there is no mention of trees in the definition above. Trees are a datastructure. The visitor pattern is a language semantics idiom. THEY ARE NOT RELATED CONCEPTS! Maybe if you are implementing dynamic traversals over a static tree (which does require double-dispatch to do cleanly) you might consider the visitor pattern as a solution to THIS problem; but only if you are willing to bet the life of your first born to guarentee that you will never ever want to introduce a new tree-node type into your design.

For LALR parsers using code definitions for synthesised attributes, this problem has been solved for decades. Yacc got it right. The limitations and difficulties that programmers have with yacc fall into 2 categories: Memory management - garbage collection neatly solves that problem; Conceptual vagueness - a tool can only rarely even help you understand the problem you are trying to solve.

Why is this the right approach? Because the attribute derivations are generally trivial calls to constructors or functions --- along the lines of foobar = new Foobar(token), or combined = combineFoobars(child1.foobar, child2.foobar). And these derivations are intimately associated with individual productions (as discussed above). Consequently it is extremely awkward to seperate the grammar definition from the attribute derivation definition, as they needs must have an identical structure, and hence seperation violates the DRY principle and makes debugging/enhancement a right royal pain.

So why have I suffered this week? Because OO-Lemmings can't be bothered to consider if OO is an appropriate paradigm for the problem that presents itself to us; and because of this unfortunate association of tree-traversal with the visitor pattern. Now I can't do anything about the prevalence of lemmings that won't get me arrested, but please please please stop using the visitor pattern to traverse trees.

[0] If an attribute derivation requires information from a parent or sibling then it is referred to as an 'Inherited Attribute', and this is when you require a dependency aware eulers walk instead of a post-order traversal.

6 comments:

Paula said...

I just speculating out loud here, so take this with a grain of salt.

Functional solutions are awkward, though not impossible to code in Java. You do funky things with interfaces, and you even need to build a data structure to emulate your own stack, but you can do it.

So I'm wondering if you can apply a functional analysis to your tree?

Unfortunately, SableCC doesn't tell you what your tree looks like at all. You just find out through inference as the visitor pattern calls (or is there some way to see the tree that I don't know about?). So if you want to keep SableCC as your lexer, etc (after all, it's full of the required machinery, and is written in Java), then you'd need to build up a real tree structure in memory as the visitor pattern progresses.

Once you have a tree structure (something I've wanted to get at a few times now) then you can apply a functional analysis.

OK, it's messy, and in two stages when it would traditionally be one, but you get to use an existing tool (do you REALLY want to write a parser in Java) and you get to process it "right".

Hmmm, if you really wanted to do functional processing, then maybe you could use Jython or Kawa to manage that stage? Same JVM, but a better tool for the job.

Andrae Muys said...

Hey Quoll,
Well as you know in Kowari we do exactly that. The org.kowari.itql.* hierachy is that visitor, and the org.kowari.query.* hierachy is the functional 'tree structure' we are wanting in return.

As far as a functional solution in java is concerned, the point is that we aren't programming in pure java here, but a blend of java and ebnf. Exactly as yacc .yy files are a blend of C and ebnf. Once you get beyond the first stage semantic analysis you are well out of the parser and all you have to do is ensure that whatever the 1st stage produces is amenable to processing in your language of choice.

That first stage however is a functional problem.

Ryan Moulton said...

Just wanted to mention that I found this post when debating the design of something language-like that I'm working on at the moment, and it convinced me not to shoehorn it into the visitor pattern and instead implement the functionality recursively as methods of the node types themselves. Thanks for putting it online.

Kris Nuttycombe said...

The visitor pattern doesn't have to be harmful; as you mention, it's more or less the best way to implement multiple dispatch in OO languages. Significantly (for my code, at least) visitors can be monads and hence can be nicely composed (so long as you don't take the GOF's example of the visit methods returning void seriously.)

Lukas Eder said...
This comment has been removed by the author.
Lukas Eder said...

After all these years, the visitor pattern still comes up every now and then...

I enjoyed your reading. It reminded me of my most recent rant about bad design decisions and the visitor pattern in particular (although in a different tree-context):

blog.jooq.org/2012/04/10/the-visitor-pattern-re-visited/

Cheers
Lukas