 |
CS 3723
Programming Languages |
Formal Grammars |
Material about Grammars:
This is the first of 3 pages about grammars.
- Formal Grammars (Introduction, this page)
- Derivations, Parse Trees,
Ambiguity
- Types of Grammars
Languages: Start with
a finite set of symbols called an alphabet, such
as {0, 1}, or {a, b, c, d}, or the set of all
Ascii characters.
A finite string of these symbols is called a sentence.
A language is a set of sentences. Thus a language is
just some set of strings of symbols. For example, the set E of
all strings of 0s and 1s with an even number
of 1s is a language. (E = even parity bit-strings.)
Notice that everything
is finite except for the language, where in the interesting
cases there are infinitely many sentences, as with the example above.
We want to give finite descriptions of languages,
including infinite languages. Consider another example:
the set L of all strings of as and bs
that end with "abb".
The previous (English) sentence is an informal finite description.
It would (of course) be impossible to list the elements of
this language, but we could start: L =
{abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.
Context-Free Grammars (CFG):
A CFG is also called a Formal Grammar, or a BNF Grammar,
or a Backus-Naur Grammar or just a grammar.
A context-free grammar (CFG) is a set of
recursive replacement rules (or rewriting rules, or
productions, or just rules)
that are uses to generate patterns of strings.
More formally, a CFG consists of the following components:
- a set of terminal symbols,
which are the characters of the alphabet
that appear in the strings generated by the grammar.
- a set of non-terminal symbols,
which are placeholders for patterns of terminal symbols
that can be generated by the non-terminal symbols.
- a set of replacement rules, which are rules for
replacing (or rewriting) non-terminal symbols (on the left
side of the production) in a string with other non-terminal or
terminal symbols (on the right side of the production).
(Replacements are carried out one-at-a-time, independent of
other replacements. Replacements are never carried out globally.)
- a start symbol,
which is a special non-terminal symbol
that appears in the initial string generated by the grammar.
A CFG for Arithmetic Expressions:
First we give an example grammar that generates strings representing
arithmetic expressions made up of:
- Identifiers: a, b, or c
- Operators: +, -, *, or /
- Grouping symbols: ( or )
All of the above are terminal symbols. Somethings terminal symbols
are written in quotation marks to emphasize that they are concrete
entities: 'b', '+', '(', etc.
We also need a non-terminal symbol to stand for the abstract
entity "arithmetic expression". Sometimes one uses
<expression> or even <arithmetic-expression>,
but for us these are too long to write conveniently. Instead we
will usually choose a capital letter for a non-terminal, in this
case E for an arithmetic expression.
The rules of the grammar are written with a single non-terminal
on the left, and a string of terminals and non-terminals used to
replace a single occurrence of the non-terminal.
Replacements are always done one-at-a-time. We also use
an arrow (---->) in the rule between the single non-terminal
and the string used for replacement. This arrow is called
a meta-symbol: it is neither a terminal nor a non-terminal,
but used as part of the notation for the grammar. Here is
the grammar:
Grammar: Arith. Exp. |
E ----> a
E ----> b
E ----> c
E ----> E + E
E ----> E - E
E ----> E * E
E ----> E / E
E ----> ( E )
|
The start symbol must be a non-terminal, and we only have a single
non-terminal, so it must be E.
This grammar defines a language over the alphabet of terminal symbols:
- The first rule states that an expression
can be rewritten as (or replaced by) a one of the three identifiers.
In other words, an identifier is a valid expression.
- The second rule says that an expression enclosed in
parentheses is also an expression.
Note that this rule defines an expression in terms of expressions,
an example of the use of recursion in the definition of context-free
grammars.
- The remaining rules say that the sum, difference, product,
or division of two expressions is also an expression.
Generating Strings from a CFG:
In our grammar for arithmetic expressions, the start
symbol is E, so our initial string is: E.
Using the sixth rule above we can choose to replace this non-terminal,
producing the string: E * E.
We now have two non-terminals to replace. We can apply the fourth
rule to the first non-terminal, producing the string: E + E * E.
We can apply the last rule to the first non-terminal in this string
to produce: (E) + E * E.
If we apply rule 1 three times to the remaining non-terminals
(the recursion must end somewhere!), we get: (a) + c * b.
This is a valid arithmetic expression, as generated by the grammar.
When applying the rules above, we often face a choice as
to which production to choose.
Different choices will typically result in different strings
being generated.
Given a grammar G with start symbol S,
if there is some sequence of productions that, when applied to
the initial string S, result in the string s
of terminal symbols, then s is in L(G),
the language of the grammar.
We will use the symbol ===> to denote the result of
a replacement. The sequence of replacements above could be denoted:
Replacement Sequence:
Deriving "(a) + c * b" |
E ===> E * E ===> E + E * E ===> (E) + E * E ===> . . . ===> (a) + c * b
|
Let's rewrite the grammar above. We often use a vertical bar
as another meta-symbol to stand for "or". If we do this, then to
represent an actual vertical bar as a terminal symbol, we need
to enclose it in quotes. Let's also use a non-terminal
A to stand for "operator". Then the grammar becomes:
Grammar: Arith. Exp. |
E ----> a | b | c
E ----> E A E | ( E )
A ----> + | - | * | /
|
Other Examples:
Grammar for a list of letters:
Grammar: List of Letters |
<let-list> ----> <let> | <let> <let-list>
<let>----> a | b | c
|
A CFG may have a production for a non-terminal in which
the right hand side is the empty string
(which is often denoted by Greek epsilon, ε).
The effect of this production is to remove the non-terminal
from the string being generated.
Grammar: List of Letters,
with ε |
<let-list> ----><let> <let-list> | ε
<let>----> a | b | c
|
Grammar: Double (almost) |
---|
<real> | ---> | <digit> <digit-list> <decimal part> <exp> |
<digit-list> | ---> | <digit> <digit-list> | ε |
<decimal part> | ---> | '.' <digit> <digit-list> | ε |
<exp> | ---> | 'E' <sign> <digit> <digit-list> | ε |
<sign> | ---> | '+' | '-' | ε |
<digit> | ---> | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' |
|
The discussion of formal grammars continues with
Derivations, Parse Trees, Ambiguity.
Revision date: 2013-09-10.
(Please use ISO
8601, the International Standard.)
|