Chapter one
Basic Concepts
1.1 The Parts of a Compiler
Compiler defines as a program or group of programs that
translates one language into another in this case the source code
of a high-level computer language is translated into assembly
language.
The assembler, linker, and so forth are not considered to be part
of the compiler.
The structure of a typical four-pass compiler is shown in Figure
1.1.
1- The preprocessor is the first pass.
2- The second pass is the heart of the compiler. It is made up
of a lexical analyzer, parser, and code generator, and it
translates the source code into an intermediate language
that is much like assembly language.
3- The third pass is the optimizer, which improves the quality
of the generated intermediate code, and
4- the fourth pass, the back end, translates the optimized code
to real assembly language or some form of binary,
executable code.
Many compilers don't have preprocessors; others generate
assembly language in the second pass.
1.1.1 The Lexical Analyzer
A phase is an independent task used in the compilation process.
Typically, several phases are combined into a single pass. The
lexical analyzer phase of a compiler (often called a scanner or
tokenizer) translates the input into a form that's more useable by
the rest of the compiler.
The lexical analyzer looks at the input stream as a collection of
basic language elements called tokens.
That is, a token is an indivisible lexical unit. In C, keywords
like while or for are tokens, symbols like >,>=, and>> are
tokens, names and numbers are tokens, and so forth.
The original string that comprises the token is called a lexeme.
Note that there is not a one-to-one relationship between lexemes
and tokens.
The parser calls the lexical-analyzer every time it needs a new
token, and the analyzer returns that token and the associated
lexeme.
1.1.2 The Parser
To parse an English sentence is to break it up into its
component parts in order to analyze it grammatically. For
example, a sentence like this:
Jane sees Spot run is broken up into a subject ("Jane") and a
predicate ("sees Spot run"). The predicate is broken up into a
verb ("sees"), a direct object ("Spot"), and a participle that
modifies the direct object ("run").
A compiler performs this same process (of decomposing a
sentence into its component parts) in the parser phase, though it
usually represents the parsed sentence in a tree form rather than
as sentence diagram.
The sentence diagram itself shows the syntactic relationships
between the parts of the sentence, so this kind of graph is
formally called a syntax diagram (or, if it's in tree form, a syntax
tree). You can expand the syntax tree to show the grammatical
structure as well as the syntactic structure. This second
representation is called a parse tree. A parse tree for our earlier
sentence diagram is shown in Figure 1.3. Syntax and parse trees
for the expression A*B+C*D are shown in Figure 1.4.
A tree structure is used here primarily because it's easy to
represent in a computer program, unlike a sentence diagram.
A parser is a group of subroutines that converts a token stream
into a parse tree, and a parse tree is a structural representation of
the sentence being parsed.
The parse tree represents the sentence in a hierarchical fashion,
moving from a general description of the sentence (at the root of
the tree) down to the specific sentence being parsed (the actual
tokens) at the leaves.
There are advantages and disadvantages to an intermediate-
language approach.
The main disadvantage is lack of speed.
A parser that goes straight from tokens to binary object code
will be very fast, since an extra stage to process the intermediate
code can often double the compile time.
The advantages are usually enough to justify the loss of speed.
These are optimization and flexibility. A few optimizations,
such as simple constant folding-the evaluation of constant
expressions at compile time rather than run time-can be done in
the parser.
Most optimizations, however, are difficult, if not impossible, for
a parser to perform. Consequently, parsers for optimizing
compilers output an intermediate language that's easy for a
second pass to optimize.
Intermediate languages give you flexibility as well.
1.2 Representing Computer Languages
1.2.1 Grammars and Parse Trees
The most common method used to describe a programming
language is a formal grammar
Formal grammars are most often represented in a modified
Backus-Naur Form (also called Backus-Normal Form), BNF for
short. A strict BNF representation starts with a set of tokens,
called terminal symbols, and a set of definitions, called
nonterminal symbols.
The definitions create a system in which every legal structure in
the language can be represented. One operator is supported, the
::=operator, translated by the phrase "is defined as" or "goes to."
For example, the following BNF rule might start a grammar for
an English sentence:
sentence : := subject predicate
A sentence is defined as a subject followed by a predicate. You
can also say "a sentence goes to a subject followed by a
predicate."
Each rule of this type is called a production.
The nonterminal to the left of the ::=is the left-hand side and
everything to the right of the ::= is the right-hand side of the
production.
The left-hand side of a production always consists of a single,
nonterminal symbol, and every nonterminal that's used on a
right-hand side must also appear on a left-hand side. All
symbols that don't appear on a left-hand side, such as the tokens
in the input language, are terminal symbols.
Grammars must be as flexible as possible, and one of the ways
to get that flexibility is to make the application of certain rules
optional.
A rule like this:
article →THE
says that THE is an article, and you can use that production like
this:
object →article noun
In English, an object is an article followed by a noun. A rule
like the foregoing requires that all nouns that comprise an object
be preceded by a participle. But what if you want the article to
be optional? You can do this by saying that an article can either
be the noun "the" or an empty string. The following is used to
do this:
article →THE |ε
The ε (pronounced "epsilon") represents an empty string. If the
THE token is present in the input, then the
article →THE
production is used. If it is not there, however, then the article
matches an empty string, and
article → ε
is used.
A grammar that recognizes a limited set of English sentences is
shown below:
An input sentence can be recognized using this grammar, with a
series of replacements, as follows:
(1) Start out with the topmost symbol in the grammar, the goal
symbol.
(2) Replace that symbol with one of its right-hand sides.
(3) Continue replacing nonterminals, always replacing the
leftmost nonterminal with its right-hand side, until there are no
more nonterminals to replace.
1.2.2 An Expression Grammar
Table 1.1 shows a grammar that recognizes a list of one or more
statements, each of which is an arithmetic expression followed
by a semicolon. Statements are made up of a series of
semicolon-delimited expressions, each comprising a series of
numbers separated either by asterisks (for multiplication) or plus
signs (for addition).
Note that the grammar is recursive. For example, Production 2
has statements on both the left- and right-hand sides. There's
also third-order recursion in Production 8, since it contains an
expression, but the only way to get to it is through Production 3,
which has an expression on its left-hand side. This last recursion
is made clear if you make a few algebraic substitutions in the
grammar. You can substitute the right-hand side of Production 6
in place of the reference to term in Production 4, yielding
expression →factor
and then substitute the right-hand side of Production 8 in place
of the factor:
expression → ( expression )
In Production 3, expression appears both on the left-hand side
and at the far left of the right-hand side.
The property is called left recursion, and certain parsers can't
handle left-recursive productions. They just loop forever,
repetitively replacing the leftmost symbol in the right-hand side
with the entire right-hand side.
You can understand the problem by considering how the parser
decides to apply a particular production when it is replacing a
nonterminal that has more than one right hand side.
The simple case is evident in Productions 7 and 8. The parser
can choose which production to apply when it's expanding a
factor by looking at the next input symbol.
If this symbol is a number, then the compiler applies
Production 7 and replaces the factor with a number. If the next
input symbol was an open parenthesis, the parser would use
Production 8. The choice between Productions 5 and 6 cannot
be solved in this way, however. In the case of Production 6, the
right-hand side of term starts with a factor which, in tum, starts
with either a number or left parenthesis. Consequently, the
parser would like to apply Production 6 when a term is being
replaced and the next input symbol is a number or left
parenthesis. Production 5-the other right-hand side-starts with a
term, which can start with a factor, which can start with a
number or left parenthesis, and these are the same symbols that
were used to choose Production 6.
Chapter Two
Lexical Analysis
2.1 Languages
An alphabet is any finite set of symbols. For example, the
ASCII character set is an alphabet; the set {' 0' ,' 1' } is binary
alphabet. A string or word is a sequence of alphabetic symbols.
In practical terms, a string is an array of characters.
There is also the special case of an empty string, represented by
the symbol ε (pronounced epsilon). In C, the ' \0' is not part of
the input alphabet. As a consequence, it can be used as an end-
of-string marker because it cannot be confused with any of the
characters in the string itself, all of which are part of the input
alphabet. An empty string in C can then be represented by an
array containing a single ' \0' character. Note, here, that there's
an important difference between ε, an empty string, and a null
string. The former is an array containing the end-of-string
marker. The latter is represented by a NULL pointer-a pointer
that doesn't point anywhere. In other words, there is no array
associated with a null string.
A language is a set of strings that can be formed from the input
alphabet. A sentence is a sequence of the strings that comprise a
language. The ordering of strings within the sentence is defined
by a collection of syntactic rules called a grammar. The lexical
analyzer doesn't understand meaning.
A prefix is a string composed of the characters remaining after
zero or more symbols have been deleted from the end of a
string: "in" is a prefix of" inconsequential". Officially, ε is a
prefix of every string.
A suffix is a string formed by deleting zero or more symbols
from the front of a string. "ible" is a suffix of"
incomprehensible". The suffix is what's left after you've
removed a prefix. ε is a suffix of every string.
A substring is what's left when you remove both a suffix and
prefix: "age" is a substring of "unmanageable". Note that
suffixes and prefixes are substrings. Also ε, the empty string, is
a substring of every string.
A proper prefix, suffix, or substring of the string x has at least
one element and it is not the same as x. That is, it can't be ε, and
it can't be identical to the original string.
A sub-sequence of a string is formed by deleting zero or more
symbols from the string. The symbols don't have to be
contiguous, so "iiii" and "ssss" are both sub-sequences of
"Mississippi".
Several useful operations can be performed on strings. The
concatenation of two String concatenation.
Strings is formed by appending all characters of one string to the
end of another string. The concatenation of" fire" and "water"
is" firewater". The empty string, ε, can be concatenated to any
other string without modifying it.
If you look at concatenation as a sort of multiplication, then
exponentiation makes String exponentiation sense. An
expression like Xn represents x, repeated n times. You could
define a language consisting of the eight legal octal digits with
the following:
L(octal) ={ 0, I, 2, 3, 4, 5, 6, 7 }
and then you could specify a three-digit octal number with
L(octal)3•
The exponentiation process can be generalized into the closure
operations. If L is language, then the Kleene closure of L is L
repeated zero or more times. This operation is usually
represented as L*. In the case of a language comprised of a
single character, L* is that character repeated zero or more
times. If the language elements are strings rather than single
characters, L* are the strings repeated zero or more times. For
example, L(octal)* is zero or more octal digits. If L(v1) is a
language comprised of the string Va and L(v2) is a language
comprised of the string Voom, then L(v1)* · L(v2)
describes all of the following strings:
Voom VaVoom VaVaVoom VaVaVaVoom etc.
The positive closure of L is L repeated one or more times,
usually denoted L+.
Since languages are sets of symbols, most of the standard set
operations can be applied to them. The most useful of these is
union, denoted with the u operator. For example, if letters is a
language containing all 26 letters [denoted by L(letters)] and
digits is a set containing all 10 digits [denoted by L(digits)], then
{ L(letters) L(digits)} is the set of alphanumeric characters.
Union is the equivalent of a logical OR operator.
Other set operations (like intersection) are, of course possible,
but have less practical application.
L(digit) = {0,1, 2, 3, 4, 5, 6, 7, 8, 9 }
L(alpha) = { a, b, c, ... , z }
you can say:
L(digit)+ is a decimal constant in C (one or more digits).
L(digit)* is an optional decimal constant (zero or more digits).
L(alpha) L(digit) is the set of alphanumeric characters.
(L(alpha) L(digit))* is any number of alphanumeric
characters.
L(alpha) · (L(alpha) L(digit) )* is a C identifier.