CS 3723: Recursive Descent Parsing

	CS 3723 Programming Languages
Recursive Descent Parsing

Overview: A recursive descent parser is a top-down parser, so called because it builds a parse tree from the top (the start symbol) down, and from left to right, using an input sentence as a target as it is scanned from left to right. The actual tree is not constructed but is implicit in a sequence of function calls. (In the same way the actual tree is implicit in the sequence of reductions used for a shift-reduce parser.) This type of parser was very popular for real compilers in the past, but is not as popular now. A recursive descent parser is often written entirely by hand and does not require any sophisticated tools. It is a simple and effective technique, but is not as powerful as some of the shift-reduce parsers -- not the one presented in class, but fancier similar ones called LR parsers.

There also exists a table-driven type of top-down parser that is sometimes used.

This parser uses a recursive function corresponding to each non-terminal symbol in the language. For simplicity one often uses the name of the non-terminal as the name of the function. The body of each recursive function mirrors the right side of the corresponding rule. If there are several rules with the same non-terminal on the right side, the code mirrors all those possibilities. In order for this method to work, one must be able to decide which function to call based on the next input symbol. (This parser, like most, looks one token ahead all the time.)

Surprisingly, one hard part of even small recursive descent parsers is the scanning: repeatedly fetching the next token from the scanner. It is tricky to decide when to scan, and the parser doesn't work at all if there is an extra scan or a missing scan.

This parser has another advantage: the writer has complete control over a relatively simple program. If a problem comes up, one can often "cheat" by hacking in specialized code to solve the problem. As one common example, you can "peek ahead", further than the next token, to decide what to do.

Initial Example: Consider the grammar at the left that was used before for simple arithmetic expressions. The more complex grammar at the right will be part of a homework.

P ---> E
E ---> E + T | T
T ---> T * F | F
F ---> ( E ) | digit | letter

P ---> E
E ---> E + T | E - T | T
T ---> T * S | T / S | S
S ---> F ^ S | F
F ---> ( E ) | digit | letter

The above grammar won't work for recursive descent because of the left recursion in the second and third rules. (The recursive function for E would immediately call E recursively, resulting in an indefinite recursive regression.)

In order to eliminate left recursion, one simple method is to introduce new notation: curly brackets, where {xx} means "zero or more repetitions of xx", and parentheses () used for grouping, along with the or-symbol: |. Because of the many metasymbols, it is a good idea to enclose all terminals in single quotes. Also put a '$' at the end. The resulting grammar looks as follows. The more complex one is on the right, which also includes a unary '+' or '-'.

Simple Language: REs Parser below
*P ---> E '$' E ---> T { '+' T } T ---> F { '' F } F ---> '(' E ')' \| digit \| letter**

More Complex Language: REs Parser in Homework 7
*P ---> E '$' E ---> T { ( '+' \| '-' ) T } T ---> S { ( '' \| '/' ) S } S ---> F '^' S \| F F ---> {'+' \| '-'} ( '(' E ')' \| digit \| letter )**

Now the grammar is suitable for creation of a recursive descent parser. Notice that this is a different grammar that describes the same language, that is, the same sentences or strings of terminal symbols. A given sentence will have a similar parse tree to one given by the previous grammar, but not necessarily the same parse tree.

One could alter the first grammar in other ways to make it work for recursive descent. For example, one could write the rule for E as:

Wrong!
E ---> T '+' E \| T

This eliminates the left recursion, and leaves the language the same, but it changes the semantics of the language. With this change, the operator '+' would associate from right to left, instead of from left to right, so this method is not acceptable.

Recursive Descent Parser Coded in Python: This is a parser for the grammar on the left above. The code below translates into C or Java very easily.

Python Parser for Simple AEs

# parse0.py: parser for simple AEs
import sys
next = None
    
def P():
    sys.stdout.write("\nSource: ")
    scan()
    if next == '$':
        sys.exit(1)
    E()
    if next == '$':
        sys.stdout.write(", accept.")
    else:
        error(1)

def E():
    T()
    while next == '+':
        scan()
        T()

def T():
    F()
    while next == '*':
        scan()
        F()

def F():
    if next.isalnum(): # alphanum
        scan()
    elif next == '(':
        scan()
        E()
        if next == ')':
            scan()
        else:
            error(3)
    else:
        error(4)

def error(n):
    sys.stdout.write("\nError:" +
      str(n) + "\n")
    sys.exit(1)

def getch():
    c = sys.stdin.read(1)
    if len(c) > 0:
        sys.stdout.write(c) # echo input
        return c
    else:
        return None

def scan():
    global next
    next = getch()
    if next == None:
        sys.exit(1)
    while next.isspace(): # skip whitesp
        next = getch()

while True:
    P()

% cat sae.txt
a+b*c$
a * b * c $
a+b*c*d$
(a + b)*c$
a*(b + c)$
(a*b*(c + d) +
  e)+f$
((a*b*(c+(d))+e)+i)$
$
% python parse_simple.py < sae.txt
Source: a+b*c$, Accept.
Source: a * b * c $, Accept.
Source: a+b*c*d$, Accept.
Source: (a + b)*c$, Accept.
Source: a*(b + c)$, Accept.
Source: (a*b*(c + d) +
  e)+f$, Accept.
Source: ((a*b*(c+(d))+e)+i)$, Accept.
Inputs rejected:
% python parse_simple.py
Source: a+b*$
a+b*$
Error:4
Source: b*&$
b*&
Error:4
Source: *a+b$
*
Error:4
Source: a+*b$
a+*
Error:4
Source: (a+b$
(a+b$
Error:3
Source: (a+b)+c)*d)$
(a+b)+c)
Error:1

One small issue about the Python code: In the program it's convenient to use a global variable: next. Any function with this variable inside it will assume the variable is global if it isn't given a value inside the function. Otherwise you need a declaration: global next.

The code above prints the complete input if it is accepted, and the input up to an error if it is rejected. This extra output can help with debugging a parser.

The source file above is: sae.txt.

When to Scan: The code above uses an "early" or "immediate" scanning strategy, which is the easiest and least error-prone. (For me, too.) A better strategy is the "late" scan, where one scans as late as possible. (In some interactive situations with the early scan, one would have to do extra input just to get on with the processing. This isn't going to come up for us, though.)

Early Scan Strategy: As soon as you make a decision based on an input token, you immediately scan for the next token.

As mentioned above, this can be the most annoying part to debug. If you omit a scan or add an extra scan, the code won't work, but it's behavior may be hard to diagnose.

Debug Parse: Below is the same program to parse the grammar on the left above, but with a lot of extra debug output (in red below) showing the course of the parse.

Debug Parser in Python for Simple AEs

# debug.py: parser for arith (debug)
import sys
next = None
src = ""
level = 0
    
def P():
    global src
    scan()
    E()
    if next != '$':
        error(1)
    else:
        sys.stdout.write("ACCEPT, ")
        sys.stdout.write("SOURCE:" +
          src + "\n")
        src = ""

def E():
    enter('E')
    T()
    while next == '+':
        scan()
        T()
    leave('E')

def T():
    enter('T')
    F()
    while next == '*':
        scan()
        F()
    leave('T')

def F():
    enter('F')
    if next.isalnum(): # alphanum
        scan()
    elif next == '(':
        scan()
        E()
        if next == ')':
            scan()
        else:
            error(2)
    else:
        error(3)
    leave('F')

def error(n):
    sys.stdout.write("ERROR:%i, SOURCE:%s\n" %
      (n, src))
    sys.exit(1)

def getch():
    c = sys.stdin.read(1)
    if len(c) > 0:
        return c
    else:
        return None

def scan():
    global next
    global src
    next = getch()
    while next.isspace():
        next = getch()
    src += next

def enter(name):
    global level
    spaces(level)
    level += 1
    sys.stdout.write("+-%c: Enter, \t" % name)
    sys.stdout.write("Next == " + str(next) +
      "\n")

def leave(name):
    global level
    level -= 1
    spaces(level)
    sys.stdout.write("+-%c: Leave, \t" % name)
    sys.stdout.write("Next == %c\n" % next)

def spaces(local_level):
    while local_level > 0:
        local_level -= 1
        sys.stdout.write("| ")

while True:
    P()

Output

% python debug0.py
a*b*c$
+-E: Enter,     Next == a
| +-T: Enter,   Next == a
| | +-F: Enter,         Next == a
| | +-F: Leave,         Next == *
| | +-F: Enter,         Next == b
| | +-F: Leave,         Next == *
| | +-F: Enter,         Next == c
| | +-F: Leave,         Next == $
| +-T: Leave,   Next == $
+-E: Leave,     Next == $
ACCEPT, SOURCE:a*b*c$
% python debug0.py
(a+b)*c$
+-E: Enter,     Next == (
| +-T: Enter,   Next == (
| | +-F: Enter,         Next == (
| | | +-E: Enter,       Next == a
| | | | +-T: Enter,     Next == a
| | | | | +-F: Enter,   Next == a
| | | | | +-F: Leave,   Next == +
| | | | +-T: Leave,     Next == +
| | | | +-T: Enter,     Next == b
| | | | | +-F: Enter,   Next == b
| | | | | +-F: Leave,   Next == )
| | | | +-T: Leave,     Next == )
| | | +-E: Leave,       Next == )
| | +-F: Leave,         Next == *
| | +-F: Enter,         Next == c
| | +-F: Leave,         Next == $
| +-T: Leave,   Next == $
+-E: Leave,     Next == $
ACCEPT, SOURCE:(a+b)*c$

% python debug0.py
a+b+c$
+-E: Enter,     Next == a
| +-T: Enter,   Next == a
| | +-F: Enter,         Next == a
| | +-F: Leave,         Next == +
| +-T: Leave,   Next == +
| +-T: Enter,   Next == b
| | +-F: Enter,         Next == b
| | +-F: Leave,         Next == +
| +-T: Leave,   Next == +
| +-T: Enter,   Next == c
| | +-F: Enter,         Next == c
| | +-F: Leave,         Next == $
| +-T: Leave,   Next == $
+-E: Leave,     Next == $
ACCEPT, SOURCE:a+b+c$
% python debug.py
a^b^c$
+-E: Enter,     Next == a
| +-T: Enter,   Next == a
| | +-S: Enter,         Next == a
| | | +-F: Enter,       Next == a
| | | +-F: Leave,       Next == ^
| | | +-S: Enter,       Next == b
| | | | +-F: Enter,     Next == b
| | | | +-F: Leave,     Next == ^
| | | | +-S: Enter,     Next == c
| | | | | +-F: Enter,   Next == c
| | | | | +-F: Leave,   Next == $
| | | | +-S: Leave,     Next == $
| | | +-S: Leave,       Next == $
| | +-S: Leave,         Next == $
| +-T: Leave,   Next == $
+-E: Leave,     Next == $
ACCEPT, SOURCE:a^b^c$

Diagrams of the R-D Parse Trees: Finally, here is a diagram showing parse trees for the examples above, along with the example a^b^c$ which is one of the more complicated examples. Notice that the first and third parse trees are different from ones we got before, when we were using left recursion. Let me say it again: The grammar is different, the parse trees are different, but the language is the same. Notice also that the rule for S is the same in the new grammar, since it didn't use left recursion. S involves the ^ operator, so the parse tree for the middle sentence is the same as it was before.

Full-size image: .png, .ps, .pdf.

The sequence of function calls that occurs during the parse essentially does a type of traversal of the parse tree, that is, a process of visiting each node of the tree. The traversal starts at the root and follows the red arrows around the outside of the tree, ending up at the root again. (This traversal goes from left-to-right, repeatedly down and up. You can picture that the traversal starts at the root and goes counter-clockwise completely around the parse tree, ending at the root.) In this case many nodes are visited more than once. In your data structures course you studied traversals of three kinds: preorder, inorder, and postorder. Which kind of traversal do you think this parse gives for the parse tree? The answer is "all three traversals at once." It is called a complete traversal.

The parse tree and its traversal allows us to carry out the process known as syntax-directed translation, which we are studying. This process has several components:

During each visit to a node the program is in the body of the function corresponding to the node. Because of the many visits, there are lots of opportunities for inserting semantic actions that will occur at different times during the parse.
The program can leave information at a node and pick it up or use it or alter it during a later visit to that node.
It is possible to pass information down the tree along the red arrows, using parameters in the parsing functions. This process is called inheritance.
It is also possible to pass information up the tree, again along the red arrows, using the ability of a function to return a value. This process is called synthesis. At a given node, information could be assembled from more than one node below a given node.

During the parse, there is no parse tree, and you only get one trip through the virtual parse tree. The parser could construct the parse tree if it was needed. Then you could wander around in the tree as much as you wanted to. I've only heard of one case where this was done: in the Ada compiler this was necessary to check for ambiguity in the use of overloaded functions.

(Revision date: 2014-10-20. (Please use ISO 8601, the International Standard.)