 |
CS 3723
Programming Languages |
Machines vs Grammars |
Machines (recognize a language)
versus
Grammars (generate a language):
|
Languages: Start with
a finite set of symbols called an alphabet, such
as {0, 1}, or {a, b, c, d}, or the set of all
Ascii characters.
A finite string of these symbols is called a sentence.
A language is a set of sentences. Thus a language is
just some set of strings of symbols. For example, the set E of
all strings of 0s and 1s with an even number
of 1s is a language. (E = even parity bit-strings.)
Notice that everything
is finite except for the language, where in the interesting
cases there are infinitely many sentences, as with the example above.
We want to give finite descriptions of languages,
including infinite languages. Consider another example:
the set L of all strings of as and bs
that end with "abb".
The previous (English) sentence is an informal finite description.
It would (of course) be impossible to list the elements of
this language, but we could start: L =
{abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.
Finite State Machines (FSM):
These are also called Finite Automata (FA) (singular: "automaton").
One specific type of FA is called a
Non-Deterministic Finite Automaton (NFA).
Here is a NFA that recognizes the language L above:

NFA for Language L
In order to recognize strings from L with the NFA above, we
process an input string character-at-a-time, starting with the
0th character. Start with the start state 0. For
each character processed, follow a path labeled with that character.
If we can manage (somehow) to end up in the terminal state 3 just
as we finish all the characters in the string, we accept
the string as belonging to L. Otherwise we reject
the string.
This automaton is called non-deterministic because there
is a case with two possible choices for the arrow to take.
Suppose we are processing the string "abaabaabb". How
do we know when to leave state 0 and finish the string?
(We could decide with a look-ahead buffer, but we want a simpler
solution.)
Below is a Deterministic Finite Automaton
(DFA) that recognizes the same language L:

DFA for Language L
This automaton is called deterministic because in each
case there is exactly one choice for which arrow to follow.
The process is uniquely determined. If you end up in the
terminal state 3, and have reached the end of the string,
then you accept the input string as belonging to L and otherwise
you reject it. There is a clever algorithm for constructing
a DFA that recognizes the same language as a given NFA
(the "subset algorithm"). These two examples of an
NFA and corresponding DFA both have the same numbered
states and look very similar. This similarity is an artifact of
such a simple example. In general the corresponding DFA
will be much more complex than the original NFA. In the worst
case it will be exponentially larger.
Simulating Finite State Machines:
We show first DFAs and then NFAs.
Simulating a DFA: In a language with labels and gotos,
this is easy. Each state becomes a labeled location in
a simulation program, and each arrow becomes a goto between
the two labeled locations that correspond to the two states.
You start at the location corresponding to the start state.
You accept if you are in an accepting state
(or terminal state) at
the end of the string being processed. All this is illustrated
with programs to recognize C-style comments:
Comments.
The above link also shows the use of a while-switch in case
gotos are not available or not allowed (say, by an instructor in
a course).
Simulating an NFA: As you process each input character, your
simulating program should keep track of the set of all possible
states that you might be in. You start out with the singleton set
{start state}. At the end of the input string, if your set
of states includes a terminal state, then you accept. Otherwise you
reject. (An exercise in Recitation 2 will ask you to carry this
process out by hand for a specific string from the language
L above.) This same process is essentially the "subset
algorithm" that lets one take an NFA and construct a DFA
that accepts the same language as the NFA.
Regular Expressions (RE):
We will work more with regular expressions later in this course.
A regular expression is another way of recognizing or specifying a language.
Regular expressions are exactly equivalent to finite state machines. Here is
a regular expression that specifies the same language L
discussed above:
In this simple example, the letters stand for themselves.
The vertical bar means "or", the parens are used for
grouping, and the star means "zero of more repetitions of."
Note that below we often put in extra blanks for clarity,
but here a blank in the formula represents an actual blank
character, which we don't want.
Many things (notation, concepts, conventions, more complex
examples) have been left off in this overview.
Formal Grammars:
Finally, here is a formal grammar (or context-free grammar
or just grammar) that generates exactly the strings
of the language L above:
Grammar Rules |
S ----> a S
S ----> b S
S ----> a b b
|
This is a particularly simple type of a grammar, called
a regular grammar because this type is equivalent to regular
expressions. We are going to deal with grammars a lot
in this course, so this is just the first exposure.
Each line above is called grammar rule. They
are replacement rules, meaning that the symbol on the
left can always be replaced by what is on the right.
One always starts with a single start symbol,
which in this case is S. One can keep making any finite
number of replacements until finally one replaces the
S with abb. Then no further replacements are possible.
There is a notation for explicitly showing each replacement
as it is done. You use an arrow with a "double shaft":
===> .
Suppose we start with S (as we always must), replace S
with b S, then replace the new S with a S,
then replace S
with b S, and finally replace S with a b b. This
replacement sequence or derivation sequence
(or just derivation) looks like:
Replacement Sequence |
S |
===> b S |
| ===> b a S |
| ===> b a b S |
| ===> b a b a b b |
|
We often write this on a single line (the grammar rules above
are always on separate lines):
Replacement Sequence |
---|
S ===> b S ===> b a S
===> b a b S ===> b a b a b b |
|
The grammar has been used to generate one of the
sentences or strings in the language L. All sentences
of L can be generated in this way.
Parse Trees:
Corresponding to a replacement sequence, we always have a unique
parse tree, with the start symbol as the root, and
what is replaced at each stage the children:
Parse Tree |
S
/ \
b S
/ \
a S
/ \
b S
/|\
a b b |
S
/ \
/ S
/ / \
/ / S
/ / / \
/ / / S
/ / / /|\
b a b a b b
|
The leaves across the bottom levels make up the generated sentence.
(Sometimes we write all the leaves pulled down to the bottom, as shown
at the right -- it makes no difference.)
The grammars, derivation sequences, parse trees, and other
concepts and refinements make up the key ideas that allow one to construct
a compiler in a systematic fashion.
Revision date: 2013-09-10.
(Please use ISO
8601, the International Standard.)
|