 |
CS 3723
Programming Languages |
I. Formal Grammars |
Material about Grammars:
This is the first of 4 pages about grammars.
Languages: Start with
a finite set of symbols called an alphabet, such
as {0, 1}, or {a, b, c, d}, or the set of all
Ascii characters.
A finite string of these symbols is called a sentence.
A language is a set of sentences. Thus a language is
just some set of strings of symbols. For example, the set E of
all strings of 0s and 1s with an even number
of 1s is a language. (E = even parity bit-strings.)
Notice that everything
is finite except for the language, where in the interesting
cases there are infinitely many sentences, as with the example above.
We want to give finite descriptions of languages,
including infinite languages. Consider another example:
the set L of all strings of as and bs
that end with "abb".
The previous (English) sentence is an informal finite description.
It would (of course) be impossible to list the elements of
this language, but we could start: L =
{abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.
Context-Free Grammars (CFG):
A CFG is also called a Formal Grammar, or a BNF Grammar,
or a Backus-Naur Grammar or just a grammar.
A context-free grammar (CFG) is a set of
recursive replacement rules (or rewriting rules, or
productions, or just rules)
that are uses to generate patterns of strings.
More formally, a CFG consists of the following components:
- a set of terminal symbols,
which are the characters of the alphabet
that appear in the strings generated by the grammar.
- a set of non-terminal symbols,
which are placeholders for patterns of terminal symbols
that can be generated by the non-terminal symbols.
- a set of replacement rules, which are rules for
replacing (or rewriting) non-terminal symbols (on the left
side of the production) in a string with other non-terminal or
terminal symbols (on the right side of the production).
(Replacements are carried out one-at-a-time, independent of
other replacements. Replacements are never carried out globally.)
- a start symbol,
which is a special non-terminal symbol
that appears in the initial string generated by the grammar.
The non-terminal on the left side of the first grammar rule
is usually the start symbol, so in this case we often won't bother to
say what the start is.
A CFG for Arithmetic Expressions:
First we give an example grammar that generates strings representing
arithmetic expressions made up of:
- Identifiers: a, b, or c
- Operators: +, -, *, or /
- Grouping symbols: ( or )
All of the above are terminal symbols. Somethings terminal symbols
are written in quotation marks to emphasize that they are concrete
entities: 'b', '+', '(', etc.
We also need a non-terminal symbol to stand for the abstract
entity "arithmetic expression". Sometimes one uses
<expression> or even <arithmetic-expression>,
but for us these are too long to write conveniently. Instead we
will usually choose a capital letter for a non-terminal, in this
case E for an arithmetic expression.
The rules of the grammar are written with a single non-terminal
on the left, and a string of terminals and non-terminals used to
replace a single occurrence of the non-terminal.
Replacements are always done one-at-a-time. We also use
an arrow (−−>) in the rule between the single non-terminal
and the string used for replacement. This arrow is called
a meta-symbol: it is neither a terminal nor a non-terminal,
but used as part of the notation for the grammar. Here is
the grammar:
Grammar: Arith. Exp. |
E −−> a
E −−> b
E −−> c
E −−> E + E
E −−> E - E
E −−> E * E
E −−> E / E
E −−> ( E )
|
The start symbol must be a non-terminal, and we only have a single
non-terminal, so it must be E.
This grammar defines a language over the alphabet of terminal symbols:
- The first three rules state that an expression
can be rewritten as (or replaced by) one of the three identifiers.
In other words, an identifier is a valid expression.
- The next four rules say that the sum, difference, product,
or division of two expressions is also an expression.
- The final rule says that an expression enclosed in
parentheses is also an expression.
Note that all but the first three rules define an expression
in terms of expressions,
an example of the use of recursion in the definition of context-free
grammars.
Generating Strings from a CFG:
In our grammar for arithmetic expressions, the start
symbol is E, so our initial string is: E.
Using the sixth rule above we can choose to replace this non-terminal,
producing the string: E * E.
We now have two non-terminals to replace. We can apply the fourth
rule to the first non-terminal, producing the string:
E + E * E.
We can apply the last rule to the first non-terminal in this string
to produce: (E) + E * E.
If we apply rule 1 three times to the remaining non-terminals
(the recursion must end somewhere!), we get:
(a) + c * b.
This is a valid arithmetic expression, as generated by the grammar.
When applying the rules above, we often face a choice as
to which production to choose.
Different choices will typically result in different strings
being generated.
Given a grammar G with start symbol S,
if there is some sequence of productions that, when applied to
the initial string S, result in the string s
of terminal symbols, then s is in L(G),
the language of the grammar.
We will use the symbol ==> to denote the result of
a replacement. The sequence of replacements above could be written
(below, the three dots stand for three more trivial replacements):
Replacement Sequence:
Deriving "(a) + c * b" |
E ==> E * E ==> E + E * E ==> (E) + E * E ==> . . . ==> (a) + c * b
|
Let's rewrite the grammar above. We often use a vertical bar
as another meta-symbol to stand for "or". If we do this, then to
represent an actual vertical bar as a terminal symbol, we need
to enclose it in quotes. Let's also use a non-terminal
A to stand for "operator". Then the grammar becomes:
Grammar: Arith. Exp. |
E −−> a | b | c
E −−> E A E | ( E )
A −−> + | - | * | /
|
Other Examples:
Here is a grammar for a list of letters. The grammar on the
left writes non-terminals enclosed in "< >", while
on the left is exactly the same grammar, using capital letters
for non-terminals.
Grammar: List of Letters |
<let-list> −−> <let> | <let> <let-list>
<let> −−> a | b | c
|
| |
Grammar: List of Letters |
M −−> L | L M
L −−> a | b | c
|
|
Grammar for Doubles:
The next example gives a grammar describing a constant double
(mostly). Here it gets hard to use capital letters for non-terminals,
because there are 6 of them and it could get confusing.
Grammar: Double (almost) |
---|
<double> −−> <digit-list> <decimal part> <exp> |
<digit-list> <decimal part> |
<decimal part> <exp> |
<digit-list> <exp> |
<digit-list> '.' <exp>
<digit-list> −−> <digit> <digit-list> | <digit>
<decimal-part> −−> '.' <digit-list>
<exp> −−> 'E' <sign> <digit-list> |
'E' <digit-list>
<sign> −−> '+' | '−'
<digit> −−> '0' | '1' | '2' ... | '9'
|
Constant doubles are never defined in a real language using a
grammar, but instead they are handled by the scanner (lexical
analyzer) for the language. An initial + or − sign at the start
of a double is never a part of the constant double, but instead
is a unary + or − operator applied to the double.
The initial
grammar rule is complicated because just a single <digit-list>
(by itself) is an int and not a double.
(It becomes a double when there is at least either a decimal point
or an <exp> part.) The grammar is also complicated because
there must be at least one digit either before or after any decimal point.
In fact, there
there must be at least one digit, and there can't be just an
<exp> part by itself. Thus the following are NOT legal
doubles:
but the following ARE legal doubles:
Some languages, such as
Ruby, require a digit before and after any decimal point.
Even now we're not done with the syntax of a double, since in
real languages they include extra possibilities at the end.
(The optional "type suffix" and an 'e' instead of 'E' .)
The above grammar for a double does not include semantic
limitations, such as the number of digits allowed in a
<digit-list>, especially the one following the E.
(The grammar allows arbitrarily many digits in an exponent.)
The discussion of formal grammars continues with
II. Derivations, Parse Trees, Ambiguity.
(Revision date: 2014-09-20.
(Please use ISO
8601, the International Standard.)
|