CS 3723 Programming Languages: Formal Grammars

	CS 3723 Programming Languages
Formal Grammars

Material about Grammars:

This is the first of 3 pages about grammars.

Formal Grammars (Introduction, this page)
Derivations, Parse Trees, Ambiguity
Types of Grammars

Languages: Start with a finite set of symbols called an alphabet, such as {0, 1}, or {a, b, c, d}, or the set of all Ascii characters. A finite string of these symbols is called a sentence. A language is a set of sentences. Thus a language is just some set of strings of symbols. For example, the set E of all strings of 0s and 1s with an even number of 1s is a language. (E = even parity bit-strings.)

Notice that everything is finite except for the language, where in the interesting cases there are infinitely many sentences, as with the example above. We want to give finite descriptions of languages, including infinite languages. Consider another example: the set L of all strings of as and bs that end with "abb". The previous (English) sentence is an informal finite description. It would (of course) be impossible to list the elements of this language, but we could start: L = {abb, aabb, babb, aaabb, ababb, baabb, bbabb, aaaabb, aababb, ...}.

Context-Free Grammars (CFG): A CFG is also called a Formal Grammar, or a BNF Grammar, or a Backus-Naur Grammar or just a grammar.

A context-free grammar (CFG) is a set of recursive replacement rules (or rewriting rules, or productions, or just rules) that are uses to generate patterns of strings.

More formally, a CFG consists of the following components:

a set of terminal symbols, which are the characters of the alphabet that appear in the strings generated by the grammar.
a set of non-terminal symbols, which are placeholders for patterns of terminal symbols that can be generated by the non-terminal symbols.
a set of replacement rules, which are rules for replacing (or rewriting) non-terminal symbols (on the left side of the production) in a string with other non-terminal or terminal symbols (on the right side of the production). (Replacements are carried out one-at-a-time, independent of other replacements. Replacements are never carried out globally.)
a start symbol, which is a special non-terminal symbol that appears in the initial string generated by the grammar.

A CFG for Arithmetic Expressions: First we give an example grammar that generates strings representing arithmetic expressions made up of:

Identifiers: a, b, or c
Operators: +, -, *, or /
Grouping symbols: ( or )

All of the above are terminal symbols. Somethings terminal symbols are written in quotation marks to emphasize that they are concrete entities: 'b', '+', '(', etc. We also need a non-terminal symbol to stand for the abstract entity "arithmetic expression". Sometimes one uses <expression> or even <arithmetic-expression>, but for us these are too long to write conveniently. Instead we will usually choose a capital letter for a non-terminal, in this case E for an arithmetic expression.

The rules of the grammar are written with a single non-terminal on the left, and a string of terminals and non-terminals used to replace a single occurrence of the non-terminal. Replacements are always done one-at-a-time. We also use an arrow (---->) in the rule between the single non-terminal and the string used for replacement. This arrow is called a meta-symbol: it is neither a terminal nor a non-terminal, but used as part of the notation for the grammar. Here is the grammar:

Grammar: Arith. Exp.
E ----> a E ----> b E ----> c E ----> E + E E ----> E - E E ----> E * E E ----> E / E E ----> ( E )

The start symbol must be a non-terminal, and we only have a single non-terminal, so it must be E.

This grammar defines a language over the alphabet of terminal symbols:

The first rule states that an expression can be rewritten as (or replaced by) a one of the three identifiers. In other words, an identifier is a valid expression.
The second rule says that an expression enclosed in parentheses is also an expression. Note that this rule defines an expression in terms of expressions, an example of the use of recursion in the definition of context-free grammars.
The remaining rules say that the sum, difference, product, or division of two expressions is also an expression.

Generating Strings from a CFG: In our grammar for arithmetic expressions, the start symbol is E, so our initial string is: E.

Using the sixth rule above we can choose to replace this non-terminal, producing the string: E * E.

We now have two non-terminals to replace. We can apply the fourth rule to the first non-terminal, producing the string: E + E * E.

We can apply the last rule to the first non-terminal in this string to produce: (E) + E * E.

If we apply rule 1 three times to the remaining non-terminals (the recursion must end somewhere!), we get: (a) + c * b. This is a valid arithmetic expression, as generated by the grammar.

When applying the rules above, we often face a choice as to which production to choose. Different choices will typically result in different strings being generated.

Given a grammar G with start symbol S, if there is some sequence of productions that, when applied to the initial string S, result in the string s of terminal symbols, then s is in L(G), the language of the grammar.

We will use the symbol ===> to denote the result of a replacement. The sequence of replacements above could be denoted:

Replacement Sequence: Deriving "(a) + c * b"
*E ===> E E ===> E + E * E ===> (E) + E * E ===> . . . ===> (a) + c * b**

Let's rewrite the grammar above. We often use a vertical bar as another meta-symbol to stand for "or". If we do this, then to represent an actual vertical bar as a terminal symbol, we need to enclose it in quotes. Let's also use a non-terminal A to stand for "operator". Then the grammar becomes:

Grammar: Arith. Exp.
E ----> a \| b \| c E ----> E A E \| ( E ) A ----> + \| - \| * \| /

Other Examples: Grammar for a list of letters:

Grammar: List of Letters
<let-list> ----> <let> \| <let> <let-list> <let>----> a \| b \| c

A CFG may have a production for a non-terminal in which the right hand side is the empty string (which is often denoted by Greek epsilon, ε). The effect of this production is to remove the non-terminal from the string being generated.

Grammar: List of Letters, with ε
<let-list> ----><let> <let-list> \| ε <let>----> a \| b \| c

Grammar: Double (almost)

<real>	--->	<digit> <digit-list> <decimal part> <exp>
<digit-list>	--->	<digit> <digit-list> \| ε
<decimal part>	--->	'.' <digit> <digit-list> \| ε
<exp>	--->	'E' <sign> <digit> <digit-list> \| ε
<sign>	--->	'+' \| '-' \| ε
<digit>	--->	'0' \| '1' \| '2' \| '3' \| '4' \| '5' \| '6' \| '7' \| '8' \| '9'

The discussion of formal grammars continues with Derivations, Parse Trees, Ambiguity.

Revision date: 2013-09-10. (Please use ISO 8601, the International Standard.)