CS 3723, Fall 2013: Course Overview

CS 3723
Programming Languages
Fall 2013

Course Overview

The main objective of the semester's first 8 weeks is an understanding of syntax-directed translation. Here is an online definition:

In simplest terms, 'Syntax Directed Translation' means driving the entire compilation (translation) process with the syntax recognizer (the parser).

Using a number of specific examples we will see how this works in practice. This includes the entire overall approach to the work in Parts I and II below.

I. Regular Expression Processor. During the first 2 weeks we will show how a programming language can handle a regular expression. Specifically, we will show how the language can take an input regular expression R and determine the match data for an input string S.

See material about regular expressions:

Definition: regular expressions. My own write-up.
Introduction: Matching. The first third has clear exposition and definitions.
References:
- Implementation. A list of articles including the one above.
- Tutorial.
- Wikipedia.
Cartoon: xkcd/208.

As an example, suppose R = /ab|aac/ and S = cccaabaccc. Here we are looking for the first occurrence of either "ab" or "aac" in the string. In case of a match, we want what comes before the match, the data that matched, and what comes after the match. In this case, "ccca" comes before, "ab" is the match, and "accc" comes after. A simple Ruby program that carries out this match is: abaac.rb.

We want a program that will handle an arbitrarily long and complex regular expression. The running example will be /ab|aac/. In order to get the match data and to match an interior portion of a string, it is necessary to add .* to the start of R so that anything can come at the beginning of the string. This makes R look like:

R = /.*(ab|aac)/

The parentheses are required because the "or" operator ( | ) has lower precedence than the "star" operator ( * ). The concatenation operation (with only an implicit operator) has precedence between these two.

The processing has four main steps:

Unravel the Syntax of the Regular Expression R. The term for this is to parse the regular expression. Later in the course we will see how to use a formal grammar to describe all regular expressions. Then we will have a parsing algorithm that will allow to build a parse tree for R. For the specific example above, the parse tree would look like the following diagram. Here the leaves (in green) are single characters (or in the case of dot, a single character class, namely any character). The other nodes are operators (in red). A plus sign is used for concatenation.

Parse Tree for: .*(ab/aac)
+ / \ / \ * \| / / \ . + \ / \ \ a b + / \ + c / \ a a

We will be studying grammars and parsers in the weeks ahead, but we haven't done it yet. In the meantime we will simply convert the regular expression R = /.*(ab/aac)/ to RPN (reverse Polish notation): .*ab+aa+c+|+$, where $ is used as an end marker.

Process the RPN Version of the Regular Expression R. For this we use an auxiliary stack to hold indexes of the start and terminal states of intermediate NFAs. Basically, the stack holds intermediate versions of the growing final NFA. The RPN is scanned from left to right. Operands cause a stand-alone NFA to be pushed onto the stack. Operators are applied to the top NFA on the stack (in the case of *) or to the top two NFAs on the stack (in the case of + or |). As the RPN is processed, the program will carry out the next part of the construction.
Build FA. As indicated in part 2 above, the NFA will be built up during the evaluation of the RPN.
Later, after we study parsers, we will see that steps 1, 2, and 3 above can be carried out with simple additions to the parser. There will be no need for the intermediate RPN, and no need for the auxiliary stack (which is replaced by recursion.) This is the syntax-directed translation process mentioned at the start.
Simulate FA. Next simulate the actions of the NFA N as it processes the string S, character-by-character.
In particular, as the NFA processes each character in turn, our program keeps track of all the possible states that the NFA might be in.

You will be getting an overview of how the above process works. You can see how the steps work, and what an implementing program looks like. However, you will at most be programming a few small pieces of all these steps.

II. Compiler for a Small Language. During the next 5 or 6 weeks we will write a compiler program. This will be much more challenging than Part I because you will write complete working code in C or in Java. It will follow the steps in the diagram Compilers, but several steps will be left out and other parts simplified. Still, it will be a "real" compiler and will illustrate how the process works. In particular:

The lexical analyzer (also known as a scanner) will be extremely simple, and there will be no symbol table.
We will have a proper syntax analyzer (also known as a parser), although not as complex as those in current use.
Almost no semantic analysis.
No intermediate code generator.
Generate MIPS assembly code for floating point numbers (doubles). This generation will be accomplished using syntax-directed translation.

III. Investigate Several Other "Strange" Languages. These may include Lisp and Ruby.

Revision date: 2013-07-23. (Please use ISO 8601, the International Standard.)