A scanner (or lexical analyzer) for a compiler or interpreter converts the source sequence of characters into a sequence of basic syntactic units called tokens. In C and C++ we will represent each token internally (inside a struct or a class) using an enumerated type along with one item of a union. For this part of the project you are to write a scanner for Lisp. With only four kinds of tokens, Lisp is very simple, but a scanner for Lisp illustrates all of the principles of more complex scanners. The goals of this part of the project are to give you a first acquaintance with scanners and to introduce the use of a finite automaton in creating a scanner.
In C you should use type declarations similar to the following for the header file that goes with your scanner. (You are not required to use these exact declarations or these names.) As we get into the use of C++, we can replace this basic struct with a class.
/* * scanner.h */ #define ALF 20 typedef enum {lprtok, rprtok, numtok, alftok} Ttype; typedef char Alftype[ALF]; typedef double Numtype; typedef struct { Ttype tokentype; union { Alftype alfval; Numtype numval; } tokenval; } Token; long gettoken (Token* tok); /*-------------------------------------------------*/As shown above in the header file, you should write a procedure gettoken that will read input characters and recognize the next token, returning it as a pointer to the struct that contains the tokentype and the value of the token if applicable. (Tokens of type alftok and numtok have associated values, named respectively, alfval and numval.)
Each time gettoken is called, the next input token should be returned. For example, if the source is "(MINUS 3)", then successive calls to gettoken should return the tokens: lprtok, alftok (with value "MINUS"), numtok (with value the double 3), and rprtok.
The construction of gettoken should be based on the finite automaton given at the end of this handout. Each time you want another token, start again at state 0 in the automaton. For some tokens you need to read one character ahead in order to recognize the end of the token. One strategy always stays one character ahead, but in C it is probably better to use ungetc to push back the source character in those cases requiring readahead. As your program reads characters, it uses each character to decide which state to proceed to. Of course for an alphabetic atom, you will need to save the characters of the atom as they are processed, and for a numeric atom you will need to create the correct integer representing the token, to store in the union part of the struct. (Notice that I show a double here rather than an int or a long.)
It is fairly simple to simulate this automaton in a language like C. Each state can become a labeled location in the program and each arrow between states can become a goto, conditional on seeing the character above the arrow. If avoiding goto's seems important, you can use a big switch statement with a separate case for each state. Finally for a scanner as simple as this one it is possible to "optimize" and simplify the code quite a bit. (If you make such "improvements" be careful that your scanner still works correctly.)
You should use something like getchar in order to isolate the reading of each character and possible machine dependencies. I suggest that you ignore end-of-file altogether.
The scanner does not know anything about the syntax of Lisp or of an S-expression, but only the syntax of individual atoms, and ")( 3 PLUS (" is fine as far as the scanner is concerned. Thus there are very few types of errors. You should allow 20 characters for alphabetic atoms as shown in the header file. In case there are more than 20, the remaining ones should be discarded and an error message issued. An alphabetic atom must start with a letter, and subsequent characters must be letters, digits, or "_". You could consider it an error for an integer to have more than say MAXDIGITS = 9 digits, but I prefer to store the resulting value in a variable of type double, so that a large number of digits could be handled, though only limited precision. An integer will be in error if a letter immediately follows a digit (e.g. 123q). The only remaining error is to encounter an illegal character.
Errors can be conveniently handled with a single function
void err(long errcode);Initially just use calls like "err(14)", and have procedure err print out something like "ERROR NUMBER 14". Then late in your debugging you can change err so that it prints out nice error messages. In this way you don't have string constants for error messages scattered all through your code. (Real compilers often use this technique also so that they can keep the text of error messages in secondary storage.) Later in the project we can change err so that it returns the null pointer NULL.
Demonstrate your scanner with a simple driver main program that repeatedly calls gettoken and prints the token type and value (i.e., the numval or alfval field if applicable) of each token detected. The driver should halt when it encounters the atom "QUIT", or "EXIT" if you like. You should turn in your source listing and a run with sample input. The run should demonstrate completely your scanner's ability to handle errors. Don't forget overview documentation. Your source should use a separate scanner header file, scanner source file, and main source file.