CS 3723 Programming Languages: Formal Grammars

	CS 3723 Programming Languages
I. Formal Grammars

Material about Grammars:

This is the first of 4 pages about grammars.

I. Formal Grammars (this page).
II. Derivations, Parse Trees, Ambiguity,
III. BNF Notation.
IV. if-then-else and Ambiguity.

Languages: Start with a finite set of symbols called an alphabet, such as {0, 1}, or {a, b, c, d}, or the set of all Ascii characters. A finite string of these symbols is called a sentence. A language is a set of sentences. Thus a language is just some set of strings of symbols. For example, the set E of all strings of 0s and 1s with an even number of 1s is a language. (E = even parity bit-strings.)

Notice that everything is finite except for the language, where in the interesting cases there are infinitely many sentences, as with the example above. We want to give finite descriptions of languages, including infinite languages. Consider another example: the set L of all strings of as and bs that end with "abb". The previous (English) sentence is an informal finite description. It would (of course) be impossible to list the elements of this language, but we could start: L = {abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.

Context-Free Grammars (CFG): A CFG is also called a Formal Grammar, or a BNF Grammar, or a Backus-Naur Grammar or just a grammar.

A context-free grammar (CFG) is a set of recursive replacement rules (or rewriting rules, or productions, or just rules) that are uses to generate patterns of strings.

More formally, a CFG consists of the following components:

a set of terminal symbols, which are the characters of the alphabet that appear in the strings generated by the grammar.
a set of non-terminal symbols, which are placeholders for patterns of terminal symbols that can be generated by the non-terminal symbols.
a set of replacement rules, which are rules for replacing (or rewriting) non-terminal symbols (on the left side of the production) in a string with other non-terminal or terminal symbols (on the right side of the production). (Replacements are carried out one-at-a-time, independent of other replacements. Replacements are never carried out globally.)
a start symbol, which is a special non-terminal symbol that appears in the initial string generated by the grammar. The non-terminal on the left side of the first grammar rule is usually the start symbol, so in this case we often won't bother to say what the start is.

A CFG for Arithmetic Expressions: First we give an example grammar that generates strings representing arithmetic expressions made up of:

Identifiers: a, b, or c
Operators: +, -, *, or /
Grouping symbols: ( or )

All of the above are terminal symbols. Somethings terminal symbols are written in quotation marks to emphasize that they are concrete entities: 'b', '+', '(', etc. We also need a non-terminal symbol to stand for the abstract entity "arithmetic expression". Sometimes one uses <expression> or even <arithmetic-expression>, but for us these are too long to write conveniently. Instead we will usually choose a capital letter for a non-terminal, in this case E for an arithmetic expression.

The rules of the grammar are written with a single non-terminal on the left, and a string of terminals and non-terminals used to replace a single occurrence of the non-terminal. Replacements are always done one-at-a-time. We also use an arrow (−−>) in the rule between the single non-terminal and the string used for replacement. This arrow is called a meta-symbol: it is neither a terminal nor a non-terminal, but used as part of the notation for the grammar. Here is the grammar:

Grammar: Arith. Exp.
*E −−> a E −−> b E −−> c E −−> E + E E −−> E - E E −−> E E E −−> E / E E −−> ( E )**

The start symbol must be a non-terminal, and we only have a single non-terminal, so it must be E.

This grammar defines a language over the alphabet of terminal symbols:

The first three rules state that an expression can be rewritten as (or replaced by) one of the three identifiers. In other words, an identifier is a valid expression.
The next four rules say that the sum, difference, product, or division of two expressions is also an expression.
The final rule says that an expression enclosed in parentheses is also an expression.
Note that all but the first three rules define an expression in terms of expressions, an example of the use of recursion in the definition of context-free grammars.

Generating Strings from a CFG: In our grammar for arithmetic expressions, the start symbol is E, so our initial string is: E.

Using the sixth rule above we can choose to replace this non-terminal, producing the string: E * E.

We now have two non-terminals to replace. We can apply the fourth rule to the first non-terminal, producing the string: E + E * E.

We can apply the last rule to the first non-terminal in this string to produce: (E) + E * E.

If we apply rule 1 three times to the remaining non-terminals (the recursion must end somewhere!), we get: (a) + c * b. This is a valid arithmetic expression, as generated by the grammar.

When applying the rules above, we often face a choice as to which production to choose. Different choices will typically result in different strings being generated.

Given a grammar G with start symbol S, if there is some sequence of productions that, when applied to the initial string S, result in the string s of terminal symbols, then s is in L(G), the language of the grammar.

We will use the symbol ==> to denote the result of a replacement. The sequence of replacements above could be written (below, the three dots stand for three more trivial replacements):

Replacement Sequence: Deriving "(a) + c * b"
E ==> E * E ==> E + E * E ==> (E) + E * E ==> . . . ==> (a) + c * b

Let's rewrite the grammar above. We often use a vertical bar as another meta-symbol to stand for "or". If we do this, then to represent an actual vertical bar as a terminal symbol, we need to enclose it in quotes. Let's also use a non-terminal A to stand for "operator". Then the grammar becomes:

Grammar: Arith. Exp.
*E −−> a \| b \| c E −−> E A E \| ( E ) A −−> + \| - \| \| /**

Other Examples: Here is a grammar for a list of letters. The grammar on the left writes non-terminals enclosed in "< >", while on the left is exactly the same grammar, using capital letters for non-terminals.

Grammar: List of Letters
<let-list> −−> <let> \| <let> <let-list> <let> −−> a \| b \| c

Grammar: List of Letters
M −−> L \| L M L −−> a \| b \| c

Grammar for Doubles: The next example gives a grammar describing a constant double (mostly). Here it gets hard to use capital letters for non-terminals, because there are 6 of them and it could get confusing.

Grammar: Double (almost)

<double>       −−> <digit-list> <decimal part> <exp> |
                   <digit-list> <decimal part> |
                   <decimal part> <exp> |
                   <digit-list> <exp> |
                   <digit-list> '.' <exp>

<digit-list>   −−> <digit> <digit-list> | <digit>

<decimal-part> −−> '.' <digit-list>

<exp>          −−> 'E' <sign> <digit-list> |
                   'E' <digit-list>

<sign>         −−> '+' | '−'

<digit>        −−> '0' | '1' | '2'  ... | '9'

Constant doubles are never defined in a real language using a grammar, but instead they are handled by the scanner (lexical analyzer) for the language. An initial + or − sign at the start of a double is never a part of the constant double, but instead is a unary + or − operator applied to the double.

The initial grammar rule is complicated because just a single <digit-list> (by itself) is an int and not a double. (It becomes a double when there is at least either a decimal point or an <exp> part.) The grammar is also complicated because there must be at least one digit either before or after any decimal point. In fact, there there must be at least one digit, and there can't be just an <exp> part by itself. Thus the following are NOT legal doubles:

E2 .E2 123 1.2E −2

but the following ARE legal doubles:

.1 1e2 1.E2 1.2 2.

Some languages, such as Ruby, require a digit before and after any decimal point.

Even now we're not done with the syntax of a double, since in real languages they include extra possibilities at the end. (The optional "type suffix" and an 'e' instead of 'E' .)

The above grammar for a double does not include semantic limitations, such as the number of digits allowed in a <digit-list>, especially the one following the E. (The grammar allows arbitrarily many digits in an exponent.)

The discussion of formal grammars continues with II. Derivations, Parse Trees, Ambiguity.

(Revision date: 2014-09-20. (Please use ISO 8601, the International Standard.)