CS 3723/3721 -- Tiny Compiler 1: Recursive Descent Parser

CS 3723/3721 Programming Languages
Tiny Compiler 1: Recursive Descent Parser

Overview: A recursive descent parser is a top-down parser, so called because it effectively builds a parse tree from the top (the start symbol) down, and from left to right. (The actual tree is not constructed but is implicit in a sequence of function calls.) This type of parser was very popular for real compilers in the past, but is not as popular now. The parser is usually written entirely by hand and does not require any sophisticated tools. It is a simple and effective technique, but is not as powerful as some of the shift-reduce parsers -- not the one in the previous recitation, but fancier similar ones called LR parsers.

This parser uses a recursive function corresponding to each grammar rule (that is, corresponding to each non-terminal symbol in the language). For simplicity one can just use the non-terminal as the name of the function. The body of each recursive function mirrors the right side of the corresponding rule. In order for this method to work, one must be able to decide which function to call based on the next input symbol.

Perhaps the hardest part of a recursive descent parser is the scanning: repeatedly fetching the next token from the scanner. It is tricky to decide when to scan, and the parser doesn't work at all if there is an extra scan or a missing scan. In the examples here and in the next two recitations, all tokens are just single characters, so the scanner is simple, but it is still hard to do the scanning at the right places in the program.

Initial Example:

Consider the grammar used before for simple arithmetric expressions:


P ---> E
E ---> E + T | E - T | T
T ---> T * S | T / S | S
S ---> F ^ S | F
F ---> ( E ) | char

The above grammar won't work for recursive descent because of the left recursion in the second and third rules. (The recursive function for E would immediately call E recursively, resulting in an indefinite recursive regression.)
In order to eliminate left recursion, one simple method is to introduce new notation: curley brackets, where {xx} means "zero or more repetitions of xx", and parentheses () used for grouping, along with the or-symbol: |. Because of the many metasymbols, it is a good idea to enclose all terminals in single quotes. Also put a '#' at the end. The resulting grammar looks as follows:
Now the grammar is suitable for creation of a recursive descent parser. Notice that this is a different grammar that describes the same language, that is the same sentences or strings of terminal symbols. A given sentence will have a similar parse tree to one given by the previous grammar, but not necessarily the same parse tree.
One could alter the first grammar in other ways to make it work for recursive descent. For example, one could write the rule for E as:
This eliminates the left recursion, and leaves the language the same, but it changes the semantics of the language. With this change, the operator '+' would associate from right to left, instead of from left to right, so this method is not acceptable.
Recursive Descent parser in C for the above grammar for arithmetic expressions:
Java version of the recursive descent parser for a grammar for arithmetic expressions:
- Java source file (.java)
- File showing a run (.txt)

Initial Recitation work: This and the next two recitations work with a very simple programming language. The language includes arithmetic expressions (as above), only integers, no declarations, if-then-else and while statements, and assignments. The following is a grammar for this language that allows recursive descent parsing:


M  --->  { S } '#'
S  --->  I  |  W  |  A  |  P  |  C  |  G
I  --->  '[' E '?' { S } ':' { S } ']'  |  '[' E '?' { S } ']'
W  --->  '{' E '?' { S } '}'
A  --->  lower-case '=' E ';'
P  --->  '<'  E  ';'
G  --->  '>'  lower-case ';'
C  --->  '<'  upper-case ';'
E  --->  T {('+' | '-') T}
T  --->  U {('*' | '/' | '%') U}
U  --->  F '^' U | F
F  --->  '(' E ')'  |  lower-case  |  digit

Here "lower-case" stands for a single lower-case letter, and "upper-case" stands for a single upper-case letter. For a more colorful grammar (in a slightly different form), see colorful grammar.

This grammar (and the language it defines) may look a little strange, but it was designed to have only single-character tokens. In particular it doesn't have any reserved words (key words), though in a sense the upper-case letters are reserved.

Just to help with understanding, here is the intuitive meaning of each of the above non-terminals:

Symbol	Meaning
M	Main Program
S	Statement
I	If-Then-[Else] Statement
W	While Statement
A	Assignment Statement
P	Put or Print (integer)
C	Print Character
G	Get (integer)
E	Expression (logical or arith)
T	Term
U	(Hmmm. Ugly Term?)
F	Factor

Notice that the languge uses [ ] for an if-then-else, because this notation is often used in BNF for an optional item, while the { } is used for a while above because in the BNF { } is used for zero or more repetitions.

You are to write a recursive descent parser for this expanded language. Notice that this includes much of the earlier grammar as a special case, so that you can just use the earlier parser for these parts. As mentioned above, all the tokens are single characters, so that the scanner (lexical analyser) can be very simple, as shown in the earlier examples.

Perhaps the hardest part for students is to decide how to handle alternatives on the right side of a rule. Each such choice must be based on the next input token. For example, the grammar rule for F has three alternatives, but these are easily distinguished because the first starts with a '(', the second with a lower-case letter, and the third starts with a digit. At this point, any other token on input is an error.

How about the alternatives in the rule for S ? Here the single rule itself does not show what next token to look for. However, following rules show this, so that the alternative A is chosen in case the next token is a lower-case letter, the alternative W is chosen in case the next token is '{', and so forth.

The part of the grammar involving { S } means "zero or statements". You could easily alter the syntax of your version of the language to match S { S } instead, that is, "one or more statements". In handling such a construct, you need to know how to stop calling the function S(), the one that handles a statement. There are two ways: you can keep calling S() as long as the next terminal is one of the six kinds of tokens that can start a statement. The other (equivalent) method is to keep calling S() until the next token is the proper one to end the sequence of statements: '#' for the whole program, ':' or ']' for an if-then or an if-then-else, and '}' for a while.

A true parser will input a sentence and either say that the sentence was legal or not. However, the parser is actually carrying out a complete traversal of the parse tree by the function calls and returns (as will be illustrated in class). The examples above have extra temporary output illustrating the calls and returns, so you can have confidence that your parser is working correctly. You should have temporary output also, though it does not have to be the same as the above examples.

Another example of a recursive-descent parser:

Here is another example of a recursive descent parser, this time handling the very simple syntax of Lisp:

What to turn in to the recitation instructor: Turn in a working version of your parser. You should start with very simple input as you are developing your parser -- perhaps just a single assignment statement at first, and then other single statements. Here is slightly more complicated sample source to use as the input to your parser:


f = 0; g = 1; n = 0;
{ n - 8*5 ?
   h = f + g;
   < n; < B; < f; < B;
   [ f%2 ? < 1; : < 2; ]
   < N;
   n = n + 1;
   f = g; g = h;
}
#

To understand what this program does, you need to know the semantics of the P (Put an integer using <) and C (Put a character using <) constructs. < prints the value of the expression following it as an integer, without any extra blanks or newlines. < also prints a special character, chosen depending on the upper-case letter following it: B for a blank, T for a tab, N for a newline, etc.

The above program prints the integers from 0 to 39, followed on the same line by the corresponding Fibonacci number, followed by a 1 if the Fibonacci number is odd, and a 2 if it is even.

Here is a much more complicated sample source to use as the input to your parser:


f = 1; g = 2; n = 3; > m;
{ m - n ?
   < n; < T; < g; < T;
   j = g; d = 2; t = 1;
   { t ?
      [ j%d ? e = 0; : e = 1; ]
      { e ?
         j = j/d; < d; < B;
         [ j%d ? e = 0; : e = 1; ]
      }
      [ j - 1 ? t = 1; : t = 0; ]
      [ d - 2 ? d = d + 2; : d = 3; ]
   }
   < N;
   n = n + 1;
   h = f + g;
   f = g; g = h;
}
#

This program produces the first m Fibonacci numbers along with their prime factorizations, where a value for m is read from the terminal using the > construct. Here is one possible result of running your parser with the above input: Parser output.

Key ideas: A recursive descent parser is particularly easy to write by hand and requires no special software tools as the other parsers do. It is easy to "cheat" with this parser, doing whatever you like. The capability of "cheating" can also lead to poorer code, particularly for a compiler that has to evolve over time.

Revision date: 2003-02-09. (Use ISO 8601, an International Standard.)

CS 3723/3721 Programming Languages Tiny Compiler 1: Recursive Descent Parser

CS 3723/3721 Programming Languages
Tiny Compiler 1: Recursive Descent Parser