CS 3723 Programming Languages: Machines and Grammars

	CS 3723 Programming Languages
Machines vs Grammars

Machines (recognize a language)
versus
Grammars (generate a language):

Languages: Start with a finite set of symbols called an alphabet, such as {0, 1}, or {a, b, c, d}, or the set of all Ascii characters. A finite string of these symbols is called a sentence. A language is a set of sentences. Thus a language is just some set of strings of symbols. For example, the set E of all strings of 0s and 1s with an even number of 1s is a language. (E = even parity bit-strings.)

Notice that everything is finite except for the language, where in the interesting cases there are infinitely many sentences, as with the example above. We want to give finite descriptions of languages, including infinite languages. Consider another example: the set L of all strings of as and bs that end with "abb". The previous (English) sentence is an informal finite description. It would (of course) be impossible to list the elements of this language, but we could start: L = {abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.

Finite State Machines (FSM): These are also called Finite Automata (FA) (singular: "automaton"). One specific type of FA is called a Non-Deterministic Finite Automaton (NFA). Here is a NFA that recognizes the language L above:

NFA for Language L

In order to recognize strings from L with the NFA above, we process an input string character-at-a-time, starting with the 0th character. Start with the start state 0. For each character processed, follow a path labeled with that character. If we can manage (somehow) to end up in the terminal state 3 just as we finish all the characters in the string, we accept the string as belonging to L. Otherwise we reject the string.

This automaton is called non-deterministic because there is a case with two possible choices for the arrow to take. Suppose we are processing the string "abaabaabb". How do we know when to leave state 0 and finish the string? (We could decide with a look-ahead buffer, but we want a simpler solution.)

Below is a Deterministic Finite Automaton (DFA) that recognizes the same language L:

DFA for Language L

This automaton is called deterministic because in each case there is exactly one choice for which arrow to follow. The process is uniquely determined. If you end up in the terminal state 3, and have reached the end of the string, then you accept the input string as belonging to L and otherwise you reject it. There is a clever algorithm for constructing a DFA that recognizes the same language as a given NFA (the "subset algorithm"). These two examples of an NFA and corresponding DFA both have the same numbered states and look very similar. This similarity is an artifact of such a simple example. In general the corresponding DFA will be much more complex than the original NFA. In the worst case it will be exponentially larger.

Simulating Finite State Machines: We show first DFAs and then NFAs.

Simulating a DFA: In a language with labels and gotos, this is easy. Each state becomes a labeled location in a simulation program, and each arrow becomes a goto between the two labeled locations that correspond to the two states. You start at the location corresponding to the start state. You accept if you are in an accepting state (or terminal state) at the end of the string being processed. All this is illustrated with programs to recognize C-style comments: Comments. The above link also shows the use of a while-switch in case gotos are not available or not allowed (say, by an instructor in a course).

Simulating an NFA: As you process each input character, your simulating program should keep track of the set of all possible states that you might be in. You start out with the singleton set {start state}. At the end of the input string, if your set of states includes a terminal state, then you accept. Otherwise you reject. (An exercise in Recitation 2 will ask you to carry this process out by hand for a specific string from the language L above.) This same process is essentially the "subset algorithm" that lets one take an NFA and construct a DFA that accepts the same language as the NFA.

Regular Expressions (RE): We will work more with regular expressions later in this course. A regular expression is another way of recognizing or specifying a language. Regular expressions are exactly equivalent to finite state machines. Here is a regular expression that specifies the same language L discussed above:

Reg Expr
*(a\|b)abb**

In this simple example, the letters stand for themselves. The vertical bar means "or", the parens are used for grouping, and the star means "zero of more repetitions of." Note that below we often put in extra blanks for clarity, but here a blank in the formula represents an actual blank character, which we don't want.

Many things (notation, concepts, conventions, more complex examples) have been left off in this overview.

Formal Grammars: Finally, here is a formal grammar (or context-free grammar or just grammar) that generates exactly the strings of the language L above:

Grammar Rules
S ----> a S S ----> b S S ----> a b b

This is a particularly simple type of a grammar, called a regular grammar because this type is equivalent to regular expressions. We are going to deal with grammars a lot in this course, so this is just the first exposure. Each line above is called grammar rule. They are replacement rules, meaning that the symbol on the left can always be replaced by what is on the right. One always starts with a single start symbol, which in this case is S. One can keep making any finite number of replacements until finally one replaces the S with abb. Then no further replacements are possible.

There is a notation for explicitly showing each replacement as it is done. You use an arrow with a "double shaft": ===> . Suppose we start with S (as we always must), replace S with b S, then replace the new S with a S, then replace S with b S, and finally replace S with a b b. This replacement sequence or derivation sequence (or just derivation) looks like:

Replacement Sequence

S	===> b S
	===> b a S
	===> b a b S
	===> b a b a b b

We often write this on a single line (the grammar rules above are always on separate lines):

Replacement Sequence

S ===> b S ===> b a S ===> b a b S ===> b a b a b b

The grammar has been used to generate one of the sentences or strings in the language L. All sentences of L can be generated in this way.

Parse Trees: Corresponding to a replacement sequence, we always have a unique parse tree, with the start symbol as the root, and what is replaced at each stage the children:

Parse Tree
S / \ b S / \ a S / \ b S /\|\ a b b	S / \ / S / / \ / / S / / / \ / / / S / / / /\|\ b a b a b b

The leaves across the bottom levels make up the generated sentence. (Sometimes we write all the leaves pulled down to the bottom, as shown at the right -- it makes no difference.)

The grammars, derivation sequences, parse trees, and other concepts and refinements make up the key ideas that allow one to construct a compiler in a systematic fashion.

Revision date: 2013-09-10. (Please use ISO 8601, the International Standard.)