CS 3723 Programming Languages
Shift-Reduce Parsers
Overview: Parsers are algorithms
and programs that
unravel the syntax of a sentence described by a formal grammar.
Most parsers will not handle an arbitrary grammar, but place limitations
on the form of grammar allowed. The parsers described here go through the
motions of building a parse tree without actually generating the tree.
While the parse is going on, extra code can carry out a variety of
tasks related to translating the source into some other form.
Shift-reduce (S-R) Parsers:
These include the LALR
parsers, which are the
most sophisticated parsers in common use today (the "industry standard").
What you see on this page uses some of the same ideas, but is
orders of magnitude simpler.
These parsers start with an unambiguous grammar and with
a sentence in the language defined by the grammar.
They construct a rightmost derivation backwards by
performing a series of reductions. As shown below, the setup
consists of a sequence of input symbols (all terminals)
that are used up one at a time, referred to below as
the Rest of Input. There is a place for the
Current Symbol under consideration (also a terminal), and
a Stack of terminals and non-terminals.
At each step, the symbol on top of the stack is looked up against
the current symbol in a table called the Shift-Reduce Table,
which has four kinds of entries:
- s for "shift", meaning to shift the current symbol onto
the top of the stack and let the next input symbol become the
new current symbol.
- r for "reduce", meaning to find the right side of a grammar
rule on the top of the stack and replace (reduce) it with the left
side.
- acc for "accept", meaning to accept the input.
- a blank, which means an error has occurred.
A final rule is needed when two different reductions are possible
at the top of the stack, as happens in the example below when E + T
ends up on top: one could reduce by E --> E + T or by E --> T.
In that case use the longest match, so in this example one would
reduce E + T to E, rather than reducing just T to E.
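As one way to make this loop concrete, here is a rough sketch in Python.
The function name sr_parse, the dict-of-dicts table format, and the
(left side, right-side list) rule format are just illustrative choices;
matching encodings of the example grammar and table appear further down
this page.

    def sr_parse(tokens, table, rules):
        """Shift-reduce parse of a token list that starts and ends with '$'.
        Prints one line per step: stack, current symbol, rest of input, action."""
        stack = [tokens[0]]                  # initially shift the starting $
        pos = 1                              # index of the current symbol
        while True:
            curr, rest = tokens[pos], tokens[pos + 1:]
            entry = table.get(stack[-1], {}).get(curr)   # blank entry means error
            if entry == "s":                 # shift
                action = "shift"
            elif entry == "r":               # reduce by the longest right side
                # assumes at least one rule's right side matches the stack top
                lhs, rhs = max(((l, r) for l, r in rules if stack[-len(r):] == r),
                               key=lambda rule: len(rule[1]))
                action = "reduce: " + lhs + " --> " + " ".join(rhs)
            else:
                action = "accept" if entry == "acc" else "error"
            print(f"{' '.join(stack):<14} {curr:<4} {' '.join(rest):<18} {action}")
            if entry == "s":                 # push current symbol, advance input
                stack.append(curr)
                pos += 1
            elif entry == "r":               # pop the right side, push the left side
                stack[-len(rhs):] = [lhs]
            elif entry == "acc":             # the sentence is in the language
                return True
            else:                            # blank table entry: syntax error
                raise SyntaxError(f"no table entry for ({stack[-1]}, {curr})")

The max(...) call implements the longest-match rule: among all grammar rules
whose right side sits on top of the stack, it picks the one with the longest
right side.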
The problem with the specific approach given here is that
the collection of grammars one can use is severely limited.
A refinement of this approach (the LALR parsers mentioned above)
allows a much broader collection of grammars (though not all
context-free grammars), and the grammars allowed are sufficient for all
programming language needs that I know of. (Well, as with Python, some
issues are often handled by the scanner (= lexical analyzer).)
Initial Example:
Consider the following grammar:
Grammar: Arithmetic Expressions

    P --> E
    E --> E + T | T
    T --> T * F | F
    F --> ( E ) | id
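For use with the sr_parse sketch above, this grammar can be written down
as plain data; the name RULES and the tuple format are again just
illustrative choices:

    # Each rule is (left side, list of right-side symbols).
    RULES = [
        ("P", ["E"]),
        ("E", ["E", "+", "T"]),
        ("E", ["T"]),
        ("T", ["T", "*", "F"]),
        ("T", ["F"]),
        ("F", ["(", "E", ")"]),
        ("F", ["id"]),
    ]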
Use the following table for this grammar:
Parser: Shift-Reduce Table

                   Current Symbol
       +-----+-----+-----+-----+-----+-----+
       | id  |  *  |  +  |  (  |  )  |  $  |
--+----+-----+-----+-----+-----+-----+-----+
S | P  |     |     |     |     |     | acc |   (s = "shift")
t | E  |     |     |  s  |     |  s  |  r  |
a | T  |     |  s  |  r  |     |  r  |  r  |   (r = "reduce")
c | F  |     |  r  |  r  |     |  r  |  r  |
k | id |     |  r  |  r  |     |  r  |  r  |   (acc = "accept")
  | *  |  s  |     |     |  s  |     |     |
T | +  |  s  |     |     |  s  |     |     |   (empty = "error")
o | (  |  s  |     |     |  s  |     |     |
p | )  |     |  r  |  r  |     |  r  |  r  |
  | $  |  s  |     |     |  s  |     |     |
--+----+-----+-----+-----+-----+-----+-----+
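Similarly, the table above can be written as nested dictionaries (the name
TABLE and this layout are illustrative, not anything standard); blank cells
are simply omitted, so a failed lookup plays the role of the error entry:

    # TABLE[stack_top][current_symbol] is "s", "r", or "acc"; a missing
    # entry corresponds to a blank (error) cell in the table above.
    TABLE = {
        "P":  {"$": "acc"},
        "E":  {"+": "s", ")": "s", "$": "r"},
        "T":  {"*": "s", "+": "r", ")": "r", "$": "r"},
        "F":  {"*": "r", "+": "r", ")": "r", "$": "r"},
        "id": {"*": "r", "+": "r", ")": "r", "$": "r"},
        "*":  {"id": "s", "(": "s"},
        "+":  {"id": "s", "(": "s"},
        "(":  {"id": "s", "(": "s"},
        ")":  {"*": "r", "+": "r", ")": "r", "$": "r"},
        "$":  {"id": "s", "(": "s"},
    }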
The table below shows the shift-reduce parse of the following sentence,
giving the stack, current symbol, remaining symbols,
and next action to take at each stage. The sentence has the
extra artificial symbol $ stuck in at the beginning
and at the end. Any sentence described by the grammar can be parsed
in exactly the same way as this example.
Input Sentence

    $ ( id + id ) * id $
(You should initially shift the starting $.)
Shift-Reduce Actions

Stack            Curr  Rest of Input       Action
(top at right)   Sym
----------------------------------------------------------------
$                (     id + id ) * id $   shift
$ (              id    + id ) * id $      shift
$ ( id           +     id ) * id $        reduce: F --> id
$ ( F            +     id ) * id $        reduce: T --> F
$ ( T            +     id ) * id $        reduce: E --> T
$ ( E            +     id ) * id $        shift
$ ( E +          id    ) * id $           shift
$ ( E + id       )     * id $             reduce: F --> id
$ ( E + F        )     * id $             reduce: T --> F
$ ( E + T        )     * id $             reduce: E --> E + T
$ ( E            )     * id $             shift
$ ( E )          *     id $               reduce: F --> ( E )
$ F              *     id $               reduce: T --> F
$ T              *     id $               shift
$ T *            id    $                  shift
$ T * id         $                        reduce: F --> id
$ T * F          $                        reduce: T --> T * F
$ T              $                        reduce: E --> T
$ E              $                        reduce: P --> E
$ P              $                        accept
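Assuming the sr_parse sketch together with the RULES and TABLE encodings
given earlier, a call like the following should print essentially this
same trace:

    # Assumes sr_parse, RULES, and TABLE as sketched earlier on this page.
    sentence = "$ ( id + id ) * id $".split()
    sr_parse(sentence, TABLE, RULES)    # one printed line per step; returns True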
Notice that the sequence of reductions gives the following rightmost
derivation in reverse:

Rightmost Derivation of ( id + id ) * id

P ==> E
==> T
==> T * F
==> T * id
==> F * id
==> ( E ) * id
==> ( E + T ) * id
==> ( E + F ) * id
==> ( E + id ) * id
==> ( T + id ) * id
==> ( F + id ) * id
==> ( id + id ) * id
S-R Parsers in Practice:
In actual practice, S-R parsing would be carried out with
a Parser Generator Program, using the much fancier
LALR(1) bottom-up approach (or perhaps the LL(1) top-down approach).
Such a program reads an
input grammar and outputs a skeleton parser program,
which includes a table like the one above to guide the parse.
This skeleton parser carries out
the parse on any input sentence at all,
rather than doing it by hand as we show above.
The most important work of such a parser generator program
(typically 5000 lines of C) is to
create the tables, which are much too complicated to generate by hand.
The program would also identify problems with the input
grammar, especially ambiguities, but other problems as well.
An early such program was
yacc (for "Yet Another Compiler Compiler"), which appeared as part of the
Unix system in the early 70s. There are now very many tools of this
sort:
compiler-compilers.
One time in teaching a compiler course, I used a table-generating
program, but had the students write the LALR parser itself.
(Not too hard to write by hand; almost all the difficulty is in
creating the tables.) These students actually liked this approach
because they had complete control of the parser (hey, they wrote it).
With yacc you have to live with the idiosyncrasies of
the parser it generates.
(Revision date: 2014-09-28.
Please use ISO 8601,
the International Standard Date and Time Notation.)