CS 3723/3721 -- Tiny Compiler 1: Recursive Descent Parser

(Click or use local copy.)

CS 3723/3721
Programming Languages
Spring 2005
Recitation 5
Tiny Compiler 1: Recursive Descent Parser
    Week 5: Feb 14-18
Due (on time): 2005-02-21  23:59:59
Due (late):        2005-02-25  23:59:59

Recitation 5 must be submitted following directions at: submissions with deadlines

2005-02-21 23:59:59 (that's Monday, 21 February 2005, 11:59:59 pm) for full credit.
2005-02-25 23:59:59 (that's Friday, 25 February 2005, 11:59:59 pm) for 75% credit.

Overview: You are to write a recursive descent parser for a small language called Tiny®, which has a grammar given below. Refer to recursive descent parsing for reference material.

For a normal compiler course, you would be using the scanner from Recitation 3 for this parser, so that the language could make use of identifiers, and floating point and integer constants. However, I want the recitations to be as independent as possible, and some of you may have had trouble finishing Recitation 3. For this reasons, in this and in the next two recitations, all tokens are just single characters, so the scanner is particularly simple: just fetch the next non-whitespace character. This means that identifiers will be a single letter, and the only constants will be a single digit. These restrictions are not as great as you might think, and you can still write complex programs in this language

If you wish, you can use the scanner from Recitation 3 as part of this assignment.

Initial Recitation work: This and the next two recitations work with a very simple programming language. The language includes arithmetic expressions (as in examples elsewhere), only integers, no declarations, if-then-else and while statements, and assignments. The following is a grammar for this language that allows recursive descent parsing:

Grammar for Tiny® language
M ---> { S } '$' S ---> I \| W \| A \| P \| C \| G I ---> '[' E '?' { S } ':' { S } ']' \| '[' E '?' { S } ']' W ---> '{' E '?' { S } '}' A ---> lower-case '=' E ';' P ---> '<' E ';' G ---> '>' lower-case ';' C ---> '<' upper-case ';' E ---> T {('+' \| '-') T} T ---> U {('*' \| '/' \| '%') U} U ---> F '^' U \| F F ---> '(' E ')' \| lower-case \| digit

Grammar for Tiny® language

M  --->  { S } '$'
S  --->  I  |  W  |  A  |  P  |  C  |  G
I  --->  '[' E '?' { S } ':' { S } ']'  |  '[' E '?' { S } ']'
W  --->  '{' E '?' { S } '}'
A  --->  lower-case '=' E ';'
P  --->  '<'  E  ';'
G  --->  '>'  lower-case ';'
C  --->  '<'  upper-case ';'
E  --->  T {('+' | '-') T}
T  --->  U {('*' | '/' | '%') U}
U  --->  F '^' U | F
F  --->  '(' E ')'  |  lower-case  |  digit

Here "lower-case" stands for a single lower-case letter, and "upper-case" stands for a single upper-case letter. For a more colorful grammar (in a slightly different form), see colorful grammar.

This grammar (and the language it defines) may look a little strange, but it was designed to have only single-character tokens. In particular it doesn't have any reserved words (key words), though in a sense the upper-case letters are reserved.

Just to help with understanding, here is the intuitive meaning of each of the above non-terminals:

Symbol	Meaning
M	Main Program
S	Statement
I	If-Then-[Else] Statement
W	While Statement
A	Assignment Statement
P	Put or Print (integer)
C	Print Character
G	Get (integer)
E	Expression (logical or arith)
T	Term
U	(Hmmm. Ugly Term?)
F	Factor

Notice that the languge uses [ ] for an if-then-else statement, because this notation is often used in BNF for an optional item, while the { } is used for a while statement because in the BNF { } is used for zero or more repetitions.

You are to write a recursive descent parser for this expanded language. Notice that this includes much of the example parser for arithmetic expressions as a special case, so that you can just use this earlier parser for these parts. As mentioned above, all the tokens are single characters, so that the scanner (lexical analyser) can be very simple, as shown in the earlier examples.

Perhaps the hardest part for students is to decide how to handle alternatives on the right side of a rule. Each such choice must be based on the next input token. For example, the grammar rule for F has three alternatives, but these are easily distinguished because the first starts with a '(', the second with a lower-case letter, and the third starts with a digit. At this point, any other token on input is an error.

How about the alternatives in the rule for S ? Here the single rule itself does not show what next token to look for. However, following rules show this, so that the alternative A is chosen in case the next token is a lower-case letter, the alternative W is chosen in case the next token is '{', and so forth.

The alternatives on S of P and C pose an additional problem, since in both cases the next token is a <. Here you could in effect introduce an extra non-terminal and an extra grammar rule: PorC ---> P | C. The choice between these two is then based on the next token after a <, that is, an upper-case letter in case of C, and either a lower-cases letter, a digit, or a left paren in case of P.

The part of the grammar involving { S } means "zero or statements". You could easily alter the syntax of your version of the language to match S { S } instead, that is, "one or more statements". In handling such a construct, you need to know how to stop calling the function S(), the one that handles a statement. There are two ways: you can keep calling S() as long as the next terminal is one of the six kinds of tokens that can start a statement. The other (equivalent) method is to keep calling S() until the next token is the proper one to end the sequence of statements: '#' for the whole program, ':' or ']' for an if-then or an if-then-else, and '}' for a while.

A pure parser will input a sentence and just say whether the sentence was legal or not, without any other output. However, this parser is actually carrying out an implicit complete traversal of the parse tree by the function calls and returns (as will be illustrated in class). The examples above have extra temporary output illustrating the calls and returns, so you can have confidence that your parser is working correctly. You should have temporary output also, though it does not have to be the same as the above examples.

Sample input to your parser: You should start with very simple input as you are developing your parser -- perhaps just a single assignment statement at first, and then test other single statements one-at-a-time.

Here is slightly more complicated sample source to use as the input to your parser:

Sample Input 1
*f = 0; g = 1; n = 0; { n - 85 ? h = f + g; < n; < B; < f; < B; [ f%2 ? < 1; : < 2; ] < N; n = n + 1; f = g; g = h; } $**

To understand what this program does, you need to know the semantics of the P (Put an integer using <) and C (Put a character using <) constructs. < prints the value of the expression following it as an integer, without any extra blanks or newlines. < also prints a special character, chosen depending on the upper-case letter following it: B for a blank, T for a tab, N for a newline, etc.

The above program prints the integers from 0 to 39, followed on the same line by the corresponding Fibonacci number, followed by a 1 if the Fibonacci number is odd, and a 2 if it is even.

Here is a much more complicated sample source to use as the input to your parser:

Sample Input 2
f = 1; g = 2; n = 3; > m; { m - n ? < n; < T; < g; < T; j = g; d = 2; t = 1; { t ? [ j%d ? e = 0; : e = 1; ] { e ? j = j/d; < d; < B; [ j%d ? e = 0; : e = 1; ] } [ j - 1 ? t = 1; : t = 0; ] [ d - 2 ? d = d + 2; : d = 3; ] } < N; n = n + 1; h = f + g; f = g; g = h; } $

Sample Input 2

f = 1; g = 2; n = 3; > m;
{ m - n ?
   < n; < T; < g; < T;
   j = g; d = 2; t = 1;
   { t ?
      [ j%d ? e = 0; : e = 1; ]
      { e ?
         j = j/d; < d; < B;
         [ j%d ? e = 0; : e = 1; ]
      }
      [ j - 1 ? t = 1; : t = 0; ]
      [ d - 2 ? d = d + 2; : d = 3; ]
   }
   < N;
   n = n + 1;
   h = f + g;
   f = g; g = h;
}  $

This program produces the first m Fibonacci numbers along with their prime factorizations, where a value for m is read from the terminal using the > construct. Here is roughly what your parser output might look like with the above input (using a slightly different parser): Parser output.

What you should submit: Refer to the submissions directions and to deadlines at the top of this page. The text file that you submit should first have Your Name, the Course Number, and the Recitation Number. The rest of the file should have the following in it, in the order below, and clearly labeled, including at the beginning the appropriate item letters: a, b, and c.

Contents of submission for Recitation 5:
Last Name, First Name; Course Number; Recitation Number (5).
a. Java or C++ code for your parser, perhaps in multiple files.
b. Results of a run of your program using Sample Input 1 above, which should include debug output, though not necessarily the same as in the examples in this recitation.
c. Results of a run of your program using Sample Input 2 above, which should include debug output, though not necessarily the same as in the examples in this recitation.
Note: If you were not able to run your program with the sample inputs above, but were able to run it with simpler inputs, please do so and explain.

Key ideas: A recursive descent parser is particularly easy to write by hand and requires no special software tools as the other parsers do. It is easy to "cheat" with this parser, doing whatever you like. The capability of "cheating" can also lead to poorer code, particularly for a compiler that has to evolve over time.

Revision date: 2005-02-11. (Use ISO 8601, an International Standard.)