Thursday, April 13, 2006

More on the visitor pattern

The primary reason I started this blog was that I occasionally spend non-trivial amounts of time contributing to mailing lists discussions. Some of those emails approach short articles and I got fustrated having them disappear into list-archives, sometimes never to be seen again (even when I've gone looking for them).

Not surprisingly my recent blog entry, and associated post to the kowari mailing lists has led to a need to respond with a more carefully reasoned, and less emotional defense than yesterdays rant.

First it has been pointed out that I used the term Abstract Syntax Tree when it would be more correct to refer to the output of sablecc as a Concrete Syntax Tree. I acknowledge this, although I have always found the distinction vague. It always seems to rely on the semantics of the grammar not the syntax. The closest I've come to definition that doesn't make reference to semantics is: "An AST is a CST missing some bits"; which of course implies CST is a subtype of AST forall values of some == none.

However as this post is not dealing specifically with sablecc, but rather with parser generators in general, and with their interaction with the visitor pattern in particular, I will continue to use the more general term AST as it is more appropriate given the existence of tools such as javacc that permit the definition of an AST directly within the grammar file.

I was pointed at for a discussion of new features in sablecc to make the use of the visitor pattern more palateable. Indeed these features are going to be useful, however they only demonstrate my point.

Defining the abstract syntax tree and writing transformations from to concrete to abstract tree is quite an effort, and then you have to write your own visitor classes to process that tree.
Exactly the same argument applies if what you are wanting is 3-address code, or a rmi-enabled Query object (which is want Kowari is generating). What the author of sablecc is slowly learning is that the authors of cups, yacc, and antlr whose products he dismissed so casually in his thesis actually did know what they were doing.

Ultimately the Lex -> Parse -> Semantic Analysis stage of a compiler is not suited to an OO based design, and no attempt to shoehorn this into an OO framework is going to be as effective as recognising this and working with the grain of the problem. You can see this clearly illustrated if you look at the actual file generated by sablecc. It is not object oriented in the slightest, rather it is a massive automaton (itql has 205 states), with each state associated with a small fragment of functional code. This accurately reflects the nature of the problem. That the the user is then presented an interface that ignores this underlying reality in favour of ideological purity is unfortunate.

The only justification given is that it allows the semantic analysis to be seperated from the grammar defintion. But then the seperation of the first stage SA from grammar is undesirable, so this is a moot point. A stage-1 SA must follow the exact shape of the productions in the grammar. As structure is very awkward to keep synchronised, DRY dictates that this structure should be defined in exactly one place. Moreover the as the parser is completely defined by the structure, don't think of the 'grammar' file as the parser's grammar; think of it as the stage-1 SA definition. The parser just happens to be generated directly from a simple structural analysis of the SA definition. I have written numerous parsers, they cease being difficult when you stop trying to think of them in imperative terms, but consider them as declarative definitions of symbolic tree transformations. Anyone having trouble with this concept should reread sections 5.1-5.4 of Aho/Sethi/Ullman (Dragon Book).

Parsing is easy. Semantics is not. The visitor pattern results in a very clean elegant interface to the parser, and makes the parser writers life easy. However it wasn't a fundamentally hard life to begin with, and the decision results in scattering the semantic analysis code as 1-3 line fragments across dozens of files (unless you do what kowari does and breaks encapsulation). To the extent you follow the pattern, you need a visitor for each unique synthesised attribute. What is generally a one line annotation on an EBNF grammar becomes an entire class. If you allow apply() to return a synthesised attribute, you can normally reduce the number of classes by combining visitors. Be aware though that having used this approach myself the price is in maintainability and flexibility as you have to be very careful you avoid undesired interactions between the combined visitors.

Parsing is only occasionally hard; Semantic Analysis is only occasionally easy. We should not be trying to find ways of making parsing easier just because it is easy; rather we should be trying to find ways to make semantic analysis easier precisely because it is hard. We are dealing with a functional problem and trying to find an object-oriented imperative solution is working against the grain, and it leaves me wondering, "Why you would bother?".

1 comment:

Tom Copeland said...

That's very well said... I think I started off just like you're saying not to do, following the Visitor pattern because, hey, everyone's doing it, that's just how it's done! But now I find myself moving more and more functionality out of my huge directory full of ASTFooBar classes and back into my nice JavaCC grammar. Much cleaner.