CS 3723
Programming Languages
Recursive Descent Parsing


Overview: A recursive descent parser is a top-down parser, so called because it builds a parse tree from the top (the start symbol) down, and from left to right, using an input sentence as a target as it is scanned from left to right. The actual tree is not constructed but is implicit in a sequence of function calls. (In the same way, the actual tree is implicit in the sequence of reductions used by a shift-reduce parser.) This type of parser was very popular for real compilers in the past but is less popular now. A recursive descent parser is usually written entirely by hand and does not require any sophisticated tools. It is a simple and effective technique, but it is not as powerful as some of the shift-reduce parsers -- not the one presented in class, but fancier similar ones called LR parsers.

There also exists a table-driven type of top-down parser that is sometimes used.

This parser uses a recursive function corresponding to each non-terminal symbol in the language. For simplicity one often uses the name of the non-terminal as the name of the function. The body of each recursive function mirrors the right side of the corresponding rule. If there are several rules with the same non-terminal on the left side, the code mirrors all those possibilities. In order for this method to work, one must be able to decide which function to call based on the next input symbol. (This parser, like most, looks one token ahead at all times.)

Surprisingly, one hard part of writing even a small recursive descent parser is the scanning: repeatedly fetching the next token from the scanner. It is tricky to decide when to scan, and the parser doesn't work at all if there is an extra scan or a missing scan.
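One common way to keep the scanning disciplined can be sketched as follows. (This is an illustrative sketch, not the course's posted code; the names input, next, scan, and mustbe are assumptions.) The idea is to keep the one-token lookahead in a global and maintain an invariant: every parsing function finds the first unconsumed token already waiting in that global.

```c
#include <stdio.h>

/* Sketch of a scanning discipline. Invariant: on entry to every parsing
 * function, 'next' already holds the first unconsumed token, so each
 * terminal is consumed (and scan() called) in exactly one place. */
const char *input;   /* the sentence being parsed */
int pos;             /* cursor into input */
int next;            /* the one-token lookahead */

void scan(void) {    /* fetch the next token, skipping blanks */
    do next = input[pos++]; while (next == ' ');
}

void mustbe(int token) {   /* match a required terminal, then re-scan */
    if (next != token)
        printf("Error: expected %c, found %c\n", token, next);
    scan();          /* consume the token, refill the lookahead */
}
```

With this invariant there is exactly one scan per consumed token, which makes an extra or missing scan much easier to spot.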


Initial Example: Consider the grammar used before for simple arithmetic expressions

P ---> E
E ---> E + T | E - T | T
T ---> T * S | T / S | S
S ---> F ^ S | F
F ---> ( E ) | char

The above grammar won't work for recursive descent because of the left recursion in the second and third rules. (The recursive function for E would immediately call E recursively, resulting in infinite recursion.)

In order to eliminate left recursion, one simple method is to introduce new notation: curly brackets, where {xx} means "zero or more repetitions of xx", and parentheses () used for grouping, along with the or-symbol: |. Because of the many metasymbols, it is a good idea to enclose all terminals in single quotes. Also put a '$' at the end. The resulting grammar looks as follows:

P ---> E '$'
E ---> T { ( '+' | '-' ) T }
T ---> S { ( '*' | '/' ) S }
S ---> F '^' S | F
F ---> '(' E ')' | char

Now the grammar is suitable for creation of a recursive descent parser. Notice that this is a different grammar that describes the same language, that is, the same sentences or strings of terminal symbols. A given sentence will have a similar parse tree to one given by the previous grammar, but not necessarily the same parse tree.
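As a concrete illustration, here is a bare parser for the curly-bracket grammar, sketched in C. It only accepts or rejects its input, in the spirit of the bare parsers linked below, but the conventions here are assumptions (single lower-case letters as operands, the global lookahead and scan() discipline from the sketch above), not the course's posted code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Bare recursive descent parser (illustrative sketch) for:
 *   P ---> E '$'
 *   E ---> T { ('+' | '-') T }
 *   T ---> S { ('*' | '/') S }
 *   S ---> F '^' S | F
 *   F ---> '(' E ')' | char        (char = one lower-case letter) */

const char *input;              /* the sentence being parsed */
int pos;                        /* cursor into input */
int next;                       /* one-token lookahead */

void scan(void) { do next = input[pos++]; while (next == ' '); }

void error(const char *msg) { printf("Error: %s\n", msg); exit(1); }

void E(void);                   /* forward declaration: F calls E */

void F(void) {                  /* F ---> '(' E ')' | char */
    if (next == '(') {
        scan(); E();
        if (next != ')') error("expected )");
        scan();
    } else if (next >= 'a' && next <= 'z') {
        scan();
    } else error("expected ( or letter");
}

void S(void) {                  /* S ---> F '^' S | F : right recursion */
    F();
    if (next == '^') { scan(); S(); }
}

void T(void) {                  /* the { } repetition becomes a loop */
    S();
    while (next == '*' || next == '/') { scan(); S(); }
}

void E(void) {
    T();
    while (next == '+' || next == '-') { scan(); T(); }
}

int parse(const char *s) {      /* returns 1 on accept, exits on error */
    input = s; pos = 0; scan();
    E();
    if (next != '$') error("expected $");
    return 1;
}
```

Notice how each function mirrors its rule: the curly brackets in E and T become while-loops, while S keeps its right recursion, which is exactly what makes '^' associate from right to left.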

One could alter the first grammar in other ways to make it work for recursive descent. For example, one could write the rule for E as:

E ---> T '+' E  |  T

This eliminates the left recursion, and leaves the language the same, but it changes the semantics of the language. With this change, the operator '+' would associate from right to left, instead of from left to right, so this method is not acceptable.
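The associativity problem is easy to demonstrate. Here is a sketch (illustrative names, single-digit operands, subtraction only, values synthesized as return values) of the right-recursive rule; it parses without trouble but groups the operators to the right:

```c
#include <stdio.h>

/* Sketch showing why E ---> T '-' E | T is unacceptable: the right
 * recursion makes '-' associate right-to-left. Single-digit operands. */

static const char *inp;
static int p, nxt;

static void sc(void) { do nxt = inp[p++]; while (nxt == ' '); }

static int termval(void) { int v = nxt - '0'; sc(); return v; }

static int E(void) {            /* E ---> T '-' E | T */
    int v = termval();
    if (nxt == '-') { sc(); v = v - E(); }   /* right recursion! */
    return v;
}

static int run(const char *s) { inp = s; p = 0; sc(); return E(); }
```

Here run("9 - 5 - 2 $") yields 9 - (5 - 2) = 6, where left-to-right evaluation should give (9 - 5) - 2 = 2.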


Recursive Descent Parser Coded in C and in Java:
  • First is a "Bare" Parser in C and in Java for the above grammar for arithmetic expressions. This does nothing except absorb the input source file and either respond with "accept" or reject it with an error message. (See the next item for parser runs.)

  • Next are Debug Runs in C and in Java for the above grammar for arithmetic expressions (along with a few error entries):
    • a + b + c $
    • a ^ b ^ c $
    • (a + b) * c $

  • Finally here is a diagram showing parse trees for the examples above. Notice that the first and third parse trees are different from ones we got before, when we were using left recursion. Let me say it again: The grammar is different, the parse trees are different, but the language is the same. Notice also that the rule for S is the same in the new grammar, since it didn't use left recursion. S involves the ^ operator, so the parse tree for the middle sentence is the same as it was before.



The sequence of function calls that occurs during the parse essentially does a type of traversal of the parse tree, that is, a process of visiting each node of the tree. The traversal starts at the root and follows the red arrows around the outside of the tree, ending up at the root again. (This traversal goes from left-to-right, repeatedly down and up. You can picture that the traversal starts at the root and goes counter-clockwise completely around the parse tree, ending at the root.) In this case many nodes are visited more than once. In your data structures course you studied traversals of three kinds: preorder, inorder, and postorder. Which kind of traversal do you think this parse gives for the parse tree? The answer is "all three traversals at once." It is called a complete traversal.

The parse tree and its traversal allow us to carry out the process known as syntax-directed translation, which we are studying. This process has several components:

  1. During each visit to a node the program is in the body of the function corresponding to the node. Because of the many visits, there are lots of opportunities for inserting semantic actions that will occur at different times during the parse.

  2. The program can leave information at a node and pick it up or use it or alter it during a later visit to that node.

  3. It is possible to pass information down the tree along the red arrows, using parameters in the parsing functions. This process is called inheritance.

  4. It is also possible to pass information up the tree, again along the red arrows, using the ability of a function to return a value. This process is called synthesis. At a given node, information can be assembled from more than one node below it.
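Synthesis can be sketched by having each parsing function return the value of the phrase it just parsed. The following illustrative evaluator makes two simplifying assumptions (single-digit operands instead of letters, and the '^' rule omitted for brevity); the values flow up the virtual parse tree along the red arrows.

```c
#include <stdio.h>

/* Sketch of synthesis: each parsing function returns the value of the
 * phrase it parsed. Single-digit operands; assumes well-formed input. */

static const char *inp;
static int p, nxt;

static void sc(void) { do nxt = inp[p++]; while (nxt == ' '); }

static int expr(void);          /* forward declaration */

static int factor(void) {       /* F ---> '(' E ')' | digit */
    int v;
    if (nxt == '(') { sc(); v = expr(); sc(); /* consume ')' */ }
    else { v = nxt - '0'; sc(); }
    return v;
}

static int term(void) {         /* left-to-right * and / */
    int v = factor();
    while (nxt == '*' || nxt == '/') {
        int op = nxt; sc();
        int r = factor();       /* value synthesized from the subtree */
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int expr(void) {         /* left-to-right + and - */
    int v = term();
    while (nxt == '+' || nxt == '-') {
        int op = nxt; sc();
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

static int eval(const char *s) { inp = s; p = 0; sc(); return expr(); }
```

Because the { } repetitions are loops, the synthesized values combine left to right, giving the usual associativity for +, -, *, and /.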

During the parse, there is no parse tree, and you only get one trip through the virtual parse tree. The parser could construct the parse tree if it were needed. Then you could wander around in the tree as much as you wanted to. I've only heard of one case where this was done: in the Ada compiler this was necessary to check for ambiguity in the use of overloaded functions.


Revision date: 2014-01-22. (Please use ISO 8601, the International Standard.)