CS 3723, Programming Language Translation

	CS 3723 Programming Languages
Programming Language Translation

The structure of a typical compiler is illustrated in Figure 3.2 of the text. The three basic phases of translation are lexical analysis, syntactic analysis, and semantic analysis, as shown in the figure. In many cases there are additional code generation phases such as optimization and translation from an intermediate form to the code for a target platform.

Some general characteristics of the phases of translation are:

Lexical analysis (sometimes also called "scanning"): The objective of the lexical analysis phase is to convert the input character string (text) representing a program into meaningful chunks such as reserved words, identifiers, operators, symbols, and literals. The output of the lexical analyzer is a sequence of tokens, where each token encodes the type (reserved word, identifier, etc.) and value (which reserved word, what operator, etc.) of the token.
Syntactic analysis (sometimes also called "parsing"): In the syntactic analysis phase the tokens are analyzed to determine the structure of the program and its meaningful constructs, such as if statements, function definitions, while loops, and assignment statements. The output of the parser is an encoding, such as a parse tree (discussed later), that represents the syntactic structure of the program.
Semantic analysis: The semantic analysis phase gives meaning to the program's internal syntactic representation by translating the parse tree or other output from the parser into code for a virtual or actual computer. If code for an actual computer is produced then this is the first phase that is platform dependent. If code for a virtual machine is produced then the first three phases are all platform independent.
Optimization: Optimization is optional; it might be omitted or combined with another phase. Optimization can be platform independent, such as avoiding the repeated calculation of an expression whose value does not change, or platform dependent, such as eliminating redundant loads for values that are already in registers, or using an increment instruction rather than adding 1.
Code generation: If it is not a part of semantic analysis, code generation is just the process of translating the internal virtual machine code into code for a specific platform.
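
The first three phases can be sketched for a tiny expression language. This is only an illustration under stated assumptions: the token names, the nested-tuple parse tree, and the stack-machine instructions (PUSH, LOAD, ADD, MUL) are all invented here and do not come from the text.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str   # token type: "NUMBER", "IDENT", or "OP"
    value: str  # the matched characters

# Hypothetical token rules for a toy language of +, *, parentheses.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+*()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def lex(text):
    """Lexical analysis: character string -> list of tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        if m.lastgroup != "SKIP":          # whitespace produces no token
            tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """Syntactic analysis: tokens -> parse tree (nested tuples)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else Token("EOF", "")

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():                            # expr := term ('+' term)*
        node = term()
        while peek() == Token("OP", "+"):
            advance()
            node = ("+", node, term())
        return node

    def term():                            # term := factor ('*' factor)*
        node = factor()
        while peek() == Token("OP", "*"):
            advance()
            node = ("*", node, factor())
        return node

    def factor():                          # factor := NUMBER | IDENT | '(' expr ')'
        tok = advance()
        if tok.kind == "NUMBER":
            return ("num", int(tok.value))
        if tok.kind == "IDENT":
            return ("var", tok.value)
        if tok == Token("OP", "("):
            node = expr()
            if advance() != Token("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok}")

    tree = expr()
    if peek().kind != "EOF":
        raise SyntaxError(f"unexpected token {peek()}")
    return tree

def generate(tree):
    """Semantic analysis: parse tree -> code for a hypothetical stack machine."""
    op = tree[0]
    if op == "num":
        return [("PUSH", tree[1])]
    if op == "var":
        return [("LOAD", tree[1])]
    last = ("ADD",) if op == "+" else ("MUL",)
    return generate(tree[1]) + generate(tree[2]) + [last]

tree = parse(lex("x + 2 * 3"))
print(tree)            # ('+', ('var', 'x'), ('*', ('num', 2), ('num', 3)))
print(generate(tree))  # [('LOAD', 'x'), ('PUSH', 2), ('PUSH', 3), ('MUL',), ('ADD',)]
```

Nothing in this sketch depends on a target platform, which matches the remark above that the first three phases are platform independent when virtual machine code is produced.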

The symbol table and other internal tables provide overall information about identifiers and other components that are used in the program. For example, each variable that is used in the program has an entry in the symbol table giving the attributes of the variable that are known (type, scope, etc.).
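
As a rough illustration, a symbol table can be modeled as a stack of dictionaries, one per scope. The class and method names below are hypothetical, and a real translator would record many more attributes than the type and scope depth shown here.

```python
class SymbolTable:
    """A minimal scoped symbol table: one dictionary per open scope."""

    def __init__(self):
        self.scopes = [{}]              # scopes[0] is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_):
        scope = self.scopes[-1]         # declarations go in the innermost scope
        if name in scope:
            raise NameError(f"{name!r} already declared in this scope")
        scope[name] = {"type": type_, "depth": len(self.scopes) - 1}

    def lookup(self, name):
        # Search from the innermost scope outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name!r} is not declared")

table = SymbolTable()
table.declare("count", "int")
table.enter_scope()
table.declare("count", "float")        # shadows the outer declaration
print(table.lookup("count"))           # {'type': 'float', 'depth': 1}
table.exit_scope()
print(table.lookup("count"))           # {'type': 'int', 'depth': 0}
```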

Separating the lexical and syntactic phases frees the syntactic phase from the mundane task of identifying the meaningful sequences of characters, so it can concentrate on identifying the structure and the meaningful constructs of the program. The separation also allows a different algorithmic approach in each phase, which improves efficiency because typical parsing algorithms are not the most efficient algorithms for lexical analysis.

The different phases of a translator can be run as separate passes or as subprograms that pass data to each other. With separate passes, the lexical analyzer converts the entire input text to tokens, then the parser translates the whole sequence of tokens to a parse tree or other encoding, and finally the semantic analyzer translates the entire parser output to code. Alternatively, the tokens can be fed one at a time to the parser, which can either build the whole parse tree before sending it to the semantic analyzer or send parts of the parse tree to the semantic analyzer as they are produced.
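
The difference between the two arrangements can be seen in a small sketch that uses a Python generator as the lexical analyzer. The function names and the log are hypothetical, and `consume` merely stands in for a parser; it records each token it receives rather than building a tree.

```python
import re

def tokens(text, log):
    """Lexical analyzer written as a generator: yields one token at a time."""
    for m in re.finditer(r"\d+|[A-Za-z_]\w*|[^\s\w]", text):
        log.append(f"lex {m.group()}")
        yield m.group()

def consume(token_stream, log):
    """Stand-in for a parser: records each token as it is received."""
    for tok in token_stream:
        log.append(f"parse {tok}")

# Separate passes: all tokens are produced before the parser sees any of them.
two_pass_log = []
all_tokens = list(tokens("1 + 2", two_pass_log))
consume(all_tokens, two_pass_log)
print(two_pass_log)
# ['lex 1', 'lex +', 'lex 2', 'parse 1', 'parse +', 'parse 2']

# Cooperating subprograms: the parser pulls tokens on demand.
interleaved_log = []
consume(tokens("1 + 2", interleaved_log), interleaved_log)
print(interleaved_log)
# ['lex 1', 'parse 1', 'lex +', 'parse +', 'lex 2', 'parse 2']
```

Both runs handle the same tokens; only the order of the work changes. The interleaved form never needs to hold the whole token sequence in memory.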

Note that an interpreter would normally have the first three phases and perhaps some optimization as well. An interpreter usually simulates the execution of virtual machine code produced during semantic analysis, rather than directly executing source code.
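
A minimal sketch of that last step follows, assuming a hypothetical stack-based virtual machine with four instructions (PUSH, LOAD, ADD, MUL). The instruction set is invented for illustration; real virtual machines are considerably richer.

```python
def run(code, env):
    """Simulate hypothetical stack-machine code; env maps variable names to values."""
    stack = []
    for instr in code:
        op = instr[0]
        if op == "PUSH":                 # push a literal value
            stack.append(instr[1])
        elif op == "LOAD":               # push the value of a variable
            stack.append(env[instr[1]])
        elif op in ("ADD", "MUL"):       # pop two operands, push the result
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack.pop()

# Virtual machine code that semantic analysis might produce for x + 2 * 3.
code = [("LOAD", "x"), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",)]
print(run(code, {"x": 4}))   # 10
```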

It is interesting to note that some components of a translator can be shared among different translators. For example, lexical analysis depends only on the input character set and the rules for combining sequences of characters and identifying special sequences such as reserved words and operators. Thus the translators for languages such as Java, C, and C++ might all be able to use the same lexical analyzer if it were constructed properly. The syntactic analyzer (parser) is independent of the target platform, as is the lexical analyzer, so these two components can be used in a translator for any target platform.
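
One way such a shared lexical analyzer might be constructed is to make the reserved-word list a parameter. The sketch below is an assumption-laden toy: `make_lexer` is a hypothetical name, the token categories are simplified, and the reserved-word sets are small subsets chosen for the example.

```python
import re

# One lexeme pattern shared by every language handled this way (a simplification).
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def make_lexer(reserved_words):
    """Build a lexical analyzer that differs only in its reserved-word set."""
    reserved = frozenset(reserved_words)

    def lex(text):
        result = []
        for lexeme in LEXEME.findall(text):
            if lexeme in reserved:
                kind = "RESERVED"
            elif lexeme[0].isalpha() or lexeme[0] == "_":
                kind = "IDENT"
            elif lexeme.isdigit():
                kind = "NUMBER"
            else:
                kind = "SYMBOL"
            result.append((kind, lexeme))
        return result

    return lex

# Partial reserved-word lists, for illustration only.
c_lex = make_lexer({"int", "while", "return"})
cpp_lex = make_lexer({"int", "while", "return", "class"})

print(c_lex("class x"))     # [('IDENT', 'class'), ('IDENT', 'x')]
print(cpp_lex("class x"))   # [('RESERVED', 'class'), ('IDENT', 'x')]
```

The scanning code is identical for both languages; only the table of reserved words changes.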

If the semantic analyzer produces code for a virtual machine, then it too is platform independent and can be used in translators for multiple platforms. Likewise, if the parsers for two different languages produce parse trees using the same encoding, it might be possible to use the same semantic analyzer in the translator for more than one language. Similarly, a code generator is target platform specific in that it generates code for a specific computer, but if the internal code produced by the translators for two different languages is the same, then the same code generator could be used in the translators for the two languages.
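
The reuse described here can be illustrated with two code generators that accept the same internal code. Everything in the sketch is hypothetical: the internal instruction tuples, the two target "platforms", and their mnemonics are invented, and no real instruction set is being modeled.

```python
def emit_stack_target(vm_code):
    """Hypothetical target A: a machine with native stack instructions."""
    names = {"PUSH": "push", "LOAD": "load", "ADD": "add", "MUL": "mul"}
    return [" ".join([names[op], *map(str, args)]) for op, *args in vm_code]

def emit_register_target(vm_code):
    """Hypothetical target B: a register machine with three-operand arithmetic."""
    lines, stack, next_reg = [], [], 0
    for op, *args in vm_code:
        if op in ("PUSH", "LOAD"):
            reg = f"r{next_reg}"         # naive allocation: a fresh register per value
            next_reg += 1
            mnemonic = "li" if op == "PUSH" else "lw"
            lines.append(f"{mnemonic} {reg}, {args[0]}")
            stack.append(reg)
        else:                            # ADD or MUL: combine the top two registers
            right = stack.pop()
            left = stack.pop()
            lines.append(f"{op.lower()} {left}, {left}, {right}")
            stack.append(left)
    return lines

# Internal code that any front end producing this encoding could hand over.
vm_code = [("LOAD", "x"), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",)]
print(emit_stack_target(vm_code))
# ['load x', 'push 2', 'push 3', 'mul', 'add']
print(emit_register_target(vm_code))
# ['lw r0, x', 'li r1, 2', 'li r2, 3', 'mul r1, r1, r2', 'add r0, r0, r1']
```

Neither generator knows which source language produced `vm_code`, so either could serve the translators for several languages, while supporting a new platform only requires writing another generator.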


(Revision date: 2013-12-20. Please use ISO 8601, the International Standard.)