CS 3723 Programming Languages: Lexical Analysis

	CS 3723 Programming Languages
Lexical Analysis

Lexical Analysis or Scanning: The first stage of a compiler uses a scanner or lexical analyzer to break the source program into a sequence of basic unit called tokens. The scanner also recognizes and discards comments and keeps track of line numbers. The different types of tokens recognized are the following (with a lot of variation in different languages):

identifier: This is usually defined as an initial letter, followed by any sequence of letters or digits. The formal definition often includes other possibilities, such as an underscore character ( _ ) in the same places where a letter can occur.
keyword or reserved word: Certain strings of letters would look just like an identifier, but are actually used as a keyword in the language and cannot be used as an identifier.
constant: Depending on the language, there are a number of different types of constants: integer, floating point, boolean, string, character, and others.
operator: Operators are often one or two special characters, though they can be more complicated. They are usually infix, prefix, or postfix. More complex operators, such at the infamous ternary operator ? : (present in C, C++, Java, and even Ruby), are treated by the scanner as separate tokens.
special character: A number of special characters are used to separate other tokens from one another, such as
( ) [ ] { } : ; , .

In addition there are lexical conventions that can sometimes be quite complicated. For example, so-called whitespace characters are normally used just to separate tokens. Tokens such as keywords or identifiers need to be separated from one another, while others such as special characters don't need to be separated from anything. As a programmer, you know this more-or-less, but for someone writing a scanner these rules need to be precisely defined.

Even comments in "traditional" languages can be confusing and complex. For example, a newline (\n) can be embedded in a /* ... */-style comment, but a //-style comment must end with a newline. What value does the following assign to x ? Is it 3 or 1.5 or 0.75? Or is it an error?

double x = 3//*4*/2; ;

Scanners are Dumb: The scanner is the "(favorite ethnic group to denigrate)-Peasant" of programming language features. Any sequence of tokens at all is OK by the scanner as long as they are properly separated. The scanner doesn't put out that many error messages by itself, either.

Recognizers versus Generators: Here and later in the course there are two well-developed, formal, and equivalent approaches: One can use a tool (machine or program) that recognizes tokens (feed in the characters of a token and the machine says "yes"), or a machine that generates tokens. Later when we get to syntax we'll be using a generator, but right now it's a recognizer.

Writing a recognizer for individual tokens: The most common way to write such a recognizer is to first write a finite state machine (FSM) that describes the token. Then the FSM is converted to a program, either by hand or using a automated software tool to produce the program (such as lex or flex or others in unix; even Java has library stuff to produce a scanner). The FSM approach is formally equivalent to using a regular expression to recognize tokens. We'll use a write-by-hand approach to a simple scanner program (in Recitation 2).

C-style comments: C-style comments, that is, material between and initial /* and a terminal */ form a comment. Anything can appear between, including a newline, a quote character, and the string "/*". A recognizer for these constructs is illustrated here: C style comments.

Here is another, more formal treatment of lexical analysis: lexical analysis

Revision date: 2013-12-20. (Please use ISO 8601, the International Standard.)