CS 3723 Programming Languages
Recursive Descent Parsing
Overview:
A recursive descent parser is a top-down parser,
so called because it builds a parse tree from the
top (the start symbol) down, and from left to right, using
an input sentence as a target as it is scanned from left to right.
The actual tree is not constructed but is implicit in a sequence
of function calls. (In the same way the actual tree is
implicit in the sequence of reductions used for a shift-reduce
parser.) This type of parser was very popular for real
compilers in the past, but is not as popular now.
A recursive descent parser is often written entirely by hand and does not require
any sophisticated tools. It is a simple and effective technique,
but is not as powerful as some of the shift-reduce parsers --
not the one presented in class, but fancier
similar ones called LR parsers.
There also exists a table-driven type of top-down parser that is
sometimes used.
This parser uses a recursive function corresponding to each
non-terminal symbol in the language. For simplicity one often uses
the name of the non-terminal
as the name of the function. The body of each recursive function
mirrors the right side of the corresponding rule.
If there are several rules with the same non-terminal on the left
side, the code mirrors all of those alternatives.
In order for this method to work, one must be able to decide
which function to call based on the next input symbol.
(This parser, like most, looks one token ahead all the time.)
Surprisingly, one hard part of even small recursive descent parsers
is the scanning: repeatedly fetching the next token from the scanner.
It is tricky to decide when to scan, and the parser doesn't work at all
if there is an extra scan or a missing scan.
This parser has another advantage: the writer has complete control
over a relatively simple program. If a problem comes up, one can
often "cheat" by hacking in specialized code to solve the problem.
As one common example, you can "peek ahead", further than the next
token, to decide what to do.
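For instance, the scanner can be wrapped so that a character can be read and then pushed back, letting the parser examine an extra token without consuming it. Below is a minimal sketch of that idea, using the same character-at-a-time style as the parsers farther down; the names pushed_back and peek are made up for this illustration and are not part of the course code.

# Sketch of "peeking ahead" one extra character.
import sys

pushed_back = []                  # characters read but not yet consumed

def getch():
    if pushed_back:               # reuse a peeked character first
        return pushed_back.pop()
    c = sys.stdin.read(1)
    return c if len(c) > 0 else None

def peek():
    c = getch()                   # read one character ahead ...
    pushed_back.append(c)         # ... and push it back for the next getch()
    return c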
Initial Example:
Consider the first grammar below, which was used before for simple
arithmetic expressions. The second, more complex grammar
will be part of a homework.

Simple grammar:
P ---> E
E ---> E + T | T
T ---> T * F | F
F ---> ( E ) | digit | letter

More complex grammar:
P ---> E
E ---> E + T | E - T | T
T ---> T * S | T / S | S
S ---> F ^ S | F
F ---> ( E ) | digit | letter
These grammars won't work for recursive descent because
of the left recursion in the second and third rules.
(The recursive function for E would immediately
call E recursively, resulting in infinite recursion.)
In order to eliminate left recursion, one simple method is to
introduce new notation: curly brackets, where {xx} means
"zero or more repetitions of xx"; parentheses ( ) for grouping;
and the or-symbol: |.
Because of the many metasymbols, it is a good idea to enclose all
terminals in single quotes. Also put a '$' at the end.
The resulting grammars look as follows. The more
complex one, given second, also allows a unary '+' or '-'.

Simple Language: REs (parser below)
P ---> E '$'
E ---> T { '+' T }
T ---> F { '*' F }
F ---> '(' E ')' | digit | letter

More Complex Language: REs (parser in Homework 7)
P ---> E '$'
E ---> T { ( '+' | '-' ) T }
T ---> S { ( '*' | '/' ) S }
S ---> F '^' S | F
F ---> {'+' | '-'} ( '(' E ')' | digit | letter )
Now the grammar is suitable for creation of a recursive descent parser.
Notice that this is a different grammar that describes
the same language, that is, the same sentences or strings
of terminal symbols. A given sentence will have a similar parse
tree to one given by the previous grammar, but not necessarily
the same parse tree.
One could alter the first grammar in other ways to make it
work for recursive descent. For example, one could write
the rule for E as:
Wrong!
E ---> T '+' E | T
This eliminates the left recursion, and leaves the language the
same, but it changes the semantics of the language.
With this change, the operator '+' would associate from
right to left, instead of from left to right, so this
method is not acceptable.
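To see the problem in code, the right-recursive rule would turn into a function roughly like the sketch below (it reuses scan() and the global next from the parser in the next section; it is not the parser actually used). The recursive call means a chain of '+' operators is grouped as a + (b + (c ...)) rather than ((a + b) + c); with '+' the value happens to come out the same, but with '-' or '/' it would not. This is also why the homework rule S ---> F '^' S | F keeps its right recursion: exponentiation is conventionally right-associative, so the grouping a ^ (b ^ c) is exactly what is wanted there.

# Sketch of E() for the rejected rule  E ---> T '+' E | T.
def E():
    T()
    if next == '+':
        scan()
        E()          # right recursion: groups the rest of the expression to the right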
Recursive Descent Parser Coded in Python:
This is a parser for the simple grammar above.
The code below translates into C or Java very easily.

Python Parser for Simple AEs:
# parse0.py: parser for simple AEs
import sys

next = None

def P():
    sys.stdout.write("\nSource: ")
    scan()
    if next == '$':
        sys.exit(1)
    E()
    if next == '$':
        sys.stdout.write(", accept.")
    else:
        error(1)

def E():
    T()
    while next == '+':
        scan()
        T()

def T():
    F()
    while next == '*':
        scan()
        F()

def F():
    if next.isalnum():       # letter or digit
        scan()
    elif next == '(':
        scan()
        E()
        if next == ')':
            scan()
        else:
            error(3)
    else:
        error(4)

def error(n):
    sys.stdout.write("\nError:" + str(n) + "\n")
    sys.exit(1)

def getch():
    c = sys.stdin.read(1)
    if len(c) > 0:
        sys.stdout.write(c)  # echo input
        return c
    else:
        return None

def scan():
    global next
    next = getch()
    if next is None:         # end of input
        sys.exit(1)
    while next.isspace():    # skip whitespace
        next = getch()

while True:
    P()
% cat sae.txt
a+b*c$
a * b * c $
a+b*c*d$
(a + b)*c$
a*(b + c)$
(a*b*(c + d) +
e)+f$
((a*b*(c+(d))+e)+i)$
$
% python parse_simple.py < sae.txt
Source: a+b*c$, Accept.
Source: a * b * c $, Accept.
Source: a+b*c*d$, Accept.
Source: (a + b)*c$, Accept.
Source: a*(b + c)$, Accept.
Source: (a*b*(c + d) +
e)+f$, Accept.
Source: ((a*b*(c+(d))+e)+i)$, Accept.
Inputs rejected:
% python parse_simple.py
Source: a+b*$
a+b*$
Error:4
Source: b*&$
b*&
Error:4
Source: *a+b$
*
Error:4
Source: a+*b$
a+*
Error:4
Source: (a+b$
(a+b$
Error:3
Source: (a+b)+c)*d)$
(a+b)+c)
Error:1
One small issue about the Python code: in the program it's
convenient to use a global variable, next.
A function that only reads this variable treats it as the global
automatically; a function that assigns to it needs the declaration
global next, or else Python creates a separate local variable.
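A tiny standalone example of the rule (not part of the parser; the names here are invented for the illustration):

# Reading a global needs no declaration; assigning to one does.
counter = 0

def read_only():
    print(counter)      # fine: refers to the global counter

def wrong_update():
    counter = 1         # creates a new *local* counter; the global is unchanged

def right_update():
    global counter
    counter = 1         # updates the global, as scan() does with next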
The code above prints the complete input if it is accepted,
and the input up to an error if it is rejected.
This extra output can help with debugging a parser.
The source file above is:
sae.txt.
When to Scan:
The code above uses an "early" or "immediate" scanning
strategy, which is the easiest and least error-prone (for me, too).
An alternative strategy is the "late" scan, where one scans as
late as possible. (In some interactive situations the early scan
would force extra input just to get on with the processing, and the
late scan avoids that. This isn't going to come up for us, though.)
Early Scan Strategy: As soon as you make a decision
based on an input token, you immediately scan for the next
token.
As mentioned above, this can be the most annoying part to
debug. If you omit a scan or add an extra scan, the code
won't work, but its behavior may be hard to diagnose.
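One way to keep the scans straight is to state the early-scan convention as an invariant: every parsing function is entered with next already holding the first unconsumed token, and it returns with next holding the first token it did not use. The copy of F() below is the same function as in the parser above, with comments added here to make that invariant visible.

# F ---> '(' E ')' | digit | letter     (early scan)
def F():
    if next.isalnum():   # decide using next ...
        scan()           # ... next is now consumed, so scan immediately
    elif next == '(':
        scan()           # consumed the '(' -- scan right away
        E()              # E() starts with the token after '('
        if next == ')':
            scan()       # consumed the ')'
        else:
            error(3)
    else:
        error(4)         # no legal F starts with this token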
Debug Parse:
Below is the same program to parse the simple grammar above,
but with a lot of extra debug output showing the course of the parse.

Debug Parser in Python for Simple AEs:
# debug.py: parser for arith (debug)
import sys

next = None
src = ""
level = 0

def P():
    global src
    scan()
    E()
    if next != '$':
        error(1)
    else:
        sys.stdout.write("ACCEPT, ")
        sys.stdout.write("SOURCE:" + src + "\n")
        src = ""

def E():
    enter('E')
    T()
    while next == '+':
        scan()
        T()
    leave('E')

def T():
    enter('T')
    F()
    while next == '*':
        scan()
        F()
    leave('T')

def F():
    enter('F')
    if next.isalnum():       # letter or digit
        scan()
    elif next == '(':
        scan()
        E()
        if next == ')':
            scan()
        else:
            error(2)
    else:
        error(3)
    leave('F')

def error(n):
    sys.stdout.write("ERROR:%i, SOURCE:%s\n" % (n, src))
    sys.exit(1)

def getch():
    c = sys.stdin.read(1)
    if len(c) > 0:
        return c
    else:
        return None

def scan():
    global next
    global src
    next = getch()
    if next is None:         # end of input: quit cleanly
        sys.exit(0)
    while next.isspace():    # skip whitespace
        next = getch()
    src += next

def enter(name):
    global level
    spaces(level)
    level += 1
    sys.stdout.write("+-%c: Enter, \t" % name)
    sys.stdout.write("Next == " + str(next) + "\n")

def leave(name):
    global level
    level -= 1
    spaces(level)
    sys.stdout.write("+-%c: Leave, \t" % name)
    sys.stdout.write("Next == %c\n" % next)

def spaces(local_level):
    while local_level > 0:
        local_level -= 1
        sys.stdout.write("| ")

while True:
    P()
Output:
% python debug0.py
a*b*c$
+-E: Enter, Next == a
| +-T: Enter, Next == a
| | +-F: Enter, Next == a
| | +-F: Leave, Next == *
| | +-F: Enter, Next == b
| | +-F: Leave, Next == *
| | +-F: Enter, Next == c
| | +-F: Leave, Next == $
| +-T: Leave, Next == $
+-E: Leave, Next == $
ACCEPT, SOURCE:a*b*c$
% python debug0.py
(a+b)*c$
+-E: Enter, Next == (
| +-T: Enter, Next == (
| | +-F: Enter, Next == (
| | | +-E: Enter, Next == a
| | | | +-T: Enter, Next == a
| | | | | +-F: Enter, Next == a
| | | | | +-F: Leave, Next == +
| | | | +-T: Leave, Next == +
| | | | +-T: Enter, Next == b
| | | | | +-F: Enter, Next == b
| | | | | +-F: Leave, Next == )
| | | | +-T: Leave, Next == )
| | | +-E: Leave, Next == )
| | +-F: Leave, Next == *
| | +-F: Enter, Next == c
| | +-F: Leave, Next == $
| +-T: Leave, Next == $
+-E: Leave, Next == $
ACCEPT, SOURCE:(a+b)*c$
% python debug0.py
a+b+c$
+-E: Enter, Next == a
| +-T: Enter, Next == a
| | +-F: Enter, Next == a
| | +-F: Leave, Next == +
| +-T: Leave, Next == +
| +-T: Enter, Next == b
| | +-F: Enter, Next == b
| | +-F: Leave, Next == +
| +-T: Leave, Next == +
| +-T: Enter, Next == c
| | +-F: Enter, Next == c
| | +-F: Leave, Next == $
| +-T: Leave, Next == $
+-E: Leave, Next == $
ACCEPT, SOURCE:a+b+c$
% python debug.py
a^b^c$
+-E: Enter, Next == a
| +-T: Enter, Next == a
| | +-S: Enter, Next == a
| | | +-F: Enter, Next == a
| | | +-F: Leave, Next == ^
| | | +-S: Enter, Next == b
| | | | +-F: Enter, Next == b
| | | | +-F: Leave, Next == ^
| | | | +-S: Enter, Next == c
| | | | | +-F: Enter, Next == c
| | | | | +-F: Leave, Next == $
| | | | +-S: Leave, Next == $
| | | +-S: Leave, Next == $
| | +-S: Leave, Next == $
| +-T: Leave, Next == $
+-E: Leave, Next == $
ACCEPT, SOURCE:a^b^c$
Diagrams of the R-D Parse Trees:
Finally, here is a diagram showing parse trees for the examples above,
along with the example a^b^c$ which is one of the
more complicated examples.
Notice that the first and third parse trees are different from
the ones we got before, when we were using left recursion. Let me say it again:
The grammar is different,
the parse trees are different, but the language is
the same. Notice also that the rule for S
is the same in the new grammar, since it didn't use left recursion.
S involves the ^ operator, so the parse tree for the middle
sentence is the same as it was before.

[Figure: parse trees for the example sentences, with red arrows showing the traversal around each tree.]
The sequence of function calls that occurs during the parse
essentially does a type of traversal of the parse tree,
that is, a process of visiting each node of the tree.
The traversal starts at the root and follows
the red arrows
around the outside of the tree, ending up at the root again.
(This traversal goes from left-to-right, repeatedly down and up.
You can picture that the traversal starts at the root and goes
counter-clockwise completely around the parse tree, ending at the root.)
In this case many nodes are visited more than once.
In your data structures course you studied traversals of
three kinds: preorder, inorder, and
postorder. Which kind of traversal do you think
this parse gives for the parse tree? The answer is "all
three traversals at once." It is called a complete
traversal.
The parse tree and its traversal allow us to carry out
the process known as syntax-directed translation,
which we are studying. This process has several components:
- During each visit to a node the program is in the body of
the function corresponding to the node. Because of the many
visits, there are lots of opportunities for inserting semantic actions
that will occur at different times during the parse.
- The program can leave information at a node and pick
it up or use it or alter it during a later visit to that node.
- It is possible to pass information down the tree
along the red arrows, using parameters in the parsing functions.
This process is called inheritance.
- It is also possible to
pass information up the tree, again along the red arrows,
using the ability of a function to return a value.
This process is called synthesis. At a given node, information
can be assembled from several of the nodes below it (see the
sketch after this list).
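As a concrete illustration of synthesis, the parsing functions for the simple grammar can each return the value of the phrase they just parsed, turning the parser into a calculator; the returned values flow up the tree along the red arrows. The sketch below is not part of the course code: it reuses next, scan(), and error() from the parser above and, to keep it short, assumes the operands are single digits rather than letters. Inheritance would correspond instead to adding parameters, for example passing information down into T() and F() as arguments.

# Sketch: synthesized values -- each function returns what its subtree is worth.
def E():
    value = T()
    while next == '+':
        scan()
        value = value + T()     # combine values synthesized from the subtrees
    return value

def T():
    value = F()
    while next == '*':
        scan()
        value = value * F()
    return value

def F():
    if next.isdigit():
        value = int(next)       # a value is created at each leaf
        scan()
        return value
    elif next == '(':
        scan()
        value = E()
        if next == ')':
            scan()
            return value
        else:
            error(3)
    else:
        error(4)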
During the parse there is no actual parse tree, and you get only one
trip through the virtual parse tree. The parser could construct the
parse tree explicitly if it were needed; then you could wander around
in the tree as much as you wanted. I've only heard of one case where
this was done: in the Ada compiler this was necessary to check
for ambiguity in the use of overloaded functions.
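If an explicit tree really were wanted, the same functions could build and return one node for each phrase. Here is a minimal sketch, again reusing next, scan(), and error() from the parser above, with nested tuples standing in for tree nodes (the representation is invented here, not from the course).

# Sketch: parsing functions that return an explicit parse tree.
def F():
    if next.isalnum():
        leaf = next
        scan()
        return ('F', leaf)
    elif next == '(':
        scan()
        subtree = E()
        if next == ')':
            scan()
            return ('F', '(', subtree, ')')
        else:
            error(3)
    else:
        error(4)

def T():
    tree = F()
    while next == '*':
        scan()
        tree = ('T', tree, '*', F())    # left-associative grouping
    return tree

def E():
    tree = T()
    while next == '+':
        scan()
        tree = ('E', tree, '+', T())
    return tree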
(Revision date: 2014-10-20. Please use ISO 8601, the International Standard.)