Module -I
Introduction to Compiling:
1.1 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Fig 1.1: Language Processing System
Preprocessor
A preprocessor produces input to a compiler. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are shorthand for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessors: These preprocessors augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by what
amounts to built-in macros.
COMPILER
A compiler is a translator program that takes a program written in a high-level language (HLL),
the source program, and translates it into an equivalent program in machine-level language (MLL),
the target program. An important part of a compiler's job is reporting errors to the programmer.
Fig 1.2: Structure of Compiler
Executing a program written in an HLL is basically a two-part process. The source program must
first be compiled (translated) into an object program. Then the resulting object program is
loaded into memory and executed.
Fig 1.3: Execution process of source program in Compiler
ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use a
mnemonic (symbol) for each machine instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now called an assembly language.
Programs known as assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler is called the source program; the output is a
machine language translation (the object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.
Fig1.4: Execution in Interpreter
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. Java also uses an
interpreter. The process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modifications to the user program can easily be made and take effect as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is simpler when the program is interpreted.
The interpreter for the language makes the program machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
LOADER AND LINK-EDITOR:
Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control to it,
thereby causing the machine language program to be executed. This would waste core (main memory)
by leaving the assembler in memory while the user’s program was being executed. Also, the
programmer would have to retranslate the program with each execution, thus wasting translation
time. To overcome these problems of wasted translation time and memory, system programmers
developed another component called the loader.
“A loader is a program that places programs into memory and prepares them for execution.” It would
be more efficient if subroutines could be translated into object form that the loader could “relocate”
directly behind the user’s program. The task of adjusting programs so that they may be placed in
arbitrary core locations is called relocation. Relocation loaders perform four functions.
1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Besides program translation, the translator performs another
very important role: error detection. Any violation of the HLL specification is detected and
reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine language program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
1.3 LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
1.4 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation
that takes the source program in one representation and produces output in another representation.
The phases of a compiler are shown below.
Compilation has two parts.
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis (Machine Dependent/Language Independent)
The compilation process is partitioned into a number of sub-processes called ‘phases’.
Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving
the source program into a sequence of atomic units called tokens.
Fig 1.5: Phases of Compiler
Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this phase expressions,
statements, declarations etc… are identified by using the results of lexical analysis. Syntax analysis is
aided by using techniques based on formal grammar of the programming language.
Intermediate Code Generations:-
An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.
Code Optimization :-
This is an optional phase designed to improve the intermediate code so that the output runs faster
and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of
machine language program are carried out during this phase. The output of the code generator is
the machine language program of the specified computer.
Table Management (or) Book-keeping:- This portion of the compiler keeps track of the names used
by the program and records essential information about each. The data structure used to record
this information is called a ‘Symbol Table’.
Error Handlers:-
It is invoked when an error in the source program is detected. The output of the LA is a stream of
tokens, which is passed to the next phase, the syntax analyzer (SA) or parser. The SA groups the
tokens together into syntactic structures called expressions. Expressions may further be combined to
form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such
trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in
patterns that are permitted by the specification of the source language. It also imposes on the
tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, after lexical analysis this expression might
appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer
should detect an error, because the presence of these two adjacent binary operators violates
the formation rules of an expression. The second function of syntax analysis is to make explicit the
hierarchical structure of the incoming token stream by identifying which parts of the token stream
should be grouped.
For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
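The two interpretations of A/B*C can be sketched as two trees. This is only an illustration: the nested-tuple encoding, the `evaluate` helper and the sample values are assumptions made for the sketch, not part of the notes.

```python
# Each interior node is (operator, left, right); leaves are variable names.
tree_1 = ("*", ("/", "A", "B"), "C")   # interpretation 1: (A / B) * C
tree_2 = ("/", "A", ("*", "B", "C"))   # interpretation 2: A / (B * C)

def evaluate(node, env):
    """Evaluate a tree bottom-up, looking variable values up in env."""
    if isinstance(node, str):
        return env[node]
    op, left, right = node
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    return lhs / rhs if op == "/" else lhs * rhs

env = {"A": 8.0, "B": 4.0, "C": 2.0}   # sample values, chosen arbitrarily
result_1 = evaluate(tree_1, env)       # (8 / 4) * 2
result_2 = evaluate(tree_2, env)       # 8 / (4 * 2)
```

The same token sequence yields different values depending on which tree the parser builds, which is why the grammar must fix the grouping.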
Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to create a
stream of simple instructions. Many styles of intermediate code are possible. One common style uses
instructions with one operator and a small number of operands. The output of the syntax analyzer is
some representation of a parse tree. The intermediate code generation phase transforms this parse
tree into an intermediate language representation of the source program.
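A minimal sketch of this transformation, assuming the parse tree is encoded as nested tuples and that temporaries are named T1, T2, ... (both are assumptions of the sketch, as is the function name `gen_three_address`):

```python
import itertools

def gen_three_address(tree):
    """Return (instructions, result_name) for an expression tree.

    A tree is either a variable name (str) or a tuple (op, left, right).
    Each emitted instruction has one operator and two operands.
    """
    counter = itertools.count(1)
    code = []

    def walk(node):
        if isinstance(node, str):
            return node                      # a leaf needs no instruction
        op, left, right = node
        lhs = walk(left)
        rhs = walk(right)
        temp = f"T{next(counter)}"           # fresh temporary for this node
        code.append(f"{temp} := {lhs} {op} {rhs}")
        return temp

    result = walk(tree)
    return code, result

tree = ("+", "A", ("*", "B", "C"))           # the expression A + B * C
instructions, result = gen_three_address(tree)
```

For this tree the walk emits the multiplication first and then the addition that uses its temporary, so the instruction order follows the tree from the leaves upward.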
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster
and takes less space. Its output is another intermediate code program that does the same job as the
original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For
example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by a single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions.
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
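Common sub-expression elimination within one straight-line block can be sketched as follows. The sketch assumes statements are already in three-address form as `(dest, left, op, right)` tuples, that the caller supplies the set of compiler temporaries, and that each temporary is assigned exactly once; none of these conventions come from the notes.

```python
def eliminate_common_subexpressions(block, temps):
    """Remove temporaries that recompute an expression already held by another."""
    available = {}   # (left, op, right) -> temporary already holding that value
    replaced = {}    # removed temporary -> temporary that replaces it
    out = []
    for dest, left, op, right in block:
        left = replaced.get(left, left)
        right = replaced.get(right, right)
        key = (left, op, right)
        if dest in temps and key in available:
            replaced[dest] = available[key]      # drop the recomputation
            continue
        out.append((dest, left, op, right))
        # Assigning dest invalidates every recorded expression that reads dest.
        available = {k: v for k, v in available.items()
                     if dest not in (k[0], k[2])}
        if dest in temps:
            available[key] = dest
    return out

block = [
    ("T1", "B", "+", "C"),
    ("A", "T1", "+", "D"),
    ("T2", "B", "+", "C"),    # same expression as T1
    ("E", "T2", "+", "F"),
]
optimized = eliminate_common_subexpressions(block, temps={"T1", "T2"})
```

Only temporaries are removed here, so assignments to program variables such as A and E are always kept.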
b. Loop Optimization:-
Another important source of optimization is increasing the speed of loops. A typical loop
improvement is to move a computation that produces the same result each time around the loop to a
point in the program just before the loop is entered.
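The effect of this loop improvement can be shown at source level. The two functions below are hypothetical and only illustrate the before and after shapes; a compiler would perform the same move on intermediate code.

```python
def scaled_before(values, x, y):
    out = []
    for v in values:
        factor = x * y          # same result on every iteration
        out.append(v * factor)
    return out

def scaled_after(values, x, y):
    factor = x * y              # moved to just before the loop: computed once
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

Both versions return the same list; the second performs one multiplication of x and y instead of one per iteration.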
Code generator :-
Code Generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear in the source program.
This information is collected by the early phases of the compiler: the lexical and syntactic
analyzers. The data structure used to record this information is called the Symbol Table.
Error Handling :-
One of the most important functions of a compiler is the detection and reporting of errors in the
source program. The error message should allow the programmer to determine exactly where the
errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic message. Both the table-management and error-handling
routines interact with all phases of the compiler.
Example:
Fig 1.6: Compilation Process of a source code through phases
3. Lexical Analysis:
3.1 OVERVIEW OF LEXICAL ANALYSIS
• To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expressions, a notation
that can be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some mechanism to recognize
these in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.
3.2 ROLE OF LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters and produce as
output a sequence of tokens that the parser uses for syntax analysis.
Fig. 3.1: Role of Lexical analyzer
Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads
the input characters until it can identify the next token. The LA returns to the parser a
representation for the token it has found. The representation will be an integer code if the token
is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program the comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the source
program.
3.3 TOKEN, LEXEME, PATTERN:
Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Fig. 3.2: Example of Token, Lexeme and Pattern
3.4. LEXICAL ERRORS:
Lexical errors are the errors raised by the lexer when it is unable to continue, which means that
there is no way to recognise a lexeme as a valid token for the lexer. Syntax errors, on the other
hand, are raised by the parser when a given sequence of already recognised valid tokens does not
match any of the right-hand sides of the grammar rules. A simple panic-mode error handling system
requires that we return to a high-level parsing function when a parsing or lexical error is
detected.
Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
3.5. REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. The components of a
regular expression are:
x         the character x
.         any character, usually except a newline
[xyz]     any of the characters x, y, z, …
R?        an R or nothing (i.e. an optional R)
R*        zero or more occurrences of R
R+        one or more occurrences of R
R1R2      an R1 followed by an R2
R1|R2     either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as a language, we can use the regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular expression notation we would write.
Identifier = letter (letter | digit)*
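The identifier pattern can be tried out with Python's standard `re` module. The sketch assumes ASCII letters and digits only; the name `is_identifier` is hypothetical.

```python
import re

# letter (letter | digit)* with ASCII letters and digits
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(text):
    """True when the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None
```

`fullmatch` is used so that a string such as `1count` is rejected rather than matched from its second character.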
Here are the rules that define the regular expressions over an alphabet Σ.
• ε is a regular expression denoting { ε }, that is, the language containing only the empty
string.
• For each ‘a’ in Σ, a is a regular expression denoting { a }, the language with only one string
consisting of the single symbol ‘a’.
• If R and S are regular expressions, then
(R) | (S) denotes L(R) ∪ L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*
3.6. REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following regular
definition provides a precise specification for this class of strings.
Example-1,
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier
letter → A | B | …… | Z | a | b | …… | z
digit → 0 | 1 | 2 | …. | 9
id → letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study how to take
the patterns for all the needed tokens and build a piece of code that examines the input string and
finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
| if expr then stmt else stmt
| ε
expr → term relop term
| term
term → id
| number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals”
and <> is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for the tokens are described using
regular definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the
“token” ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that follows
the white space. It is the following token that gets returned to the parser.
Lexeme        Token Name    Attribute Value
Any ws        -             -
if            if            -
then          then          -
else          else          -
Any id        id            Pointer to table entry
Any number    number        Pointer to table entry
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
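A minimal sketch of a lexical analyzer for these token classes, built on Python's `re` module. It makes several assumptions that are not in the notes: `=` is the equality lexeme (as the prose above describes), the attribute codes GT and GE are named by analogy with LT and LE, the lexeme text stands in for the symbol-table pointer, and the names `TOKEN_RE` and `tokenize` are hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOP_NAMES = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# One alternative per token class; longer relops are listed before shorter ones.
TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<number>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)

def tokenize(source):
    """Return a list of (token_name, attribute) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}: {source[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                          # white space is not returned to the parser
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))     # keywords carry no attribute
        elif kind == "relop":
            tokens.append(("relop", RELOP_NAMES[lexeme]))
        else:
            tokens.append((kind, lexeme))     # lexeme stands in for a table pointer
    return tokens
```

For instance, `tokenize("if x1 <= 25 then y")` yields the keyword `if`, an id, a relop with attribute LE, a number, the keyword `then` and another id, with the white space dropped.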
3.7. TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the lexemeBegin
and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge labeled “start”
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been used.
Fig. 3.3: Transition diagram of Relational operators
As an intermediate step in the construction of an LA, we first produce a stylized flowchart,
called a transition diagram. Positions in a transition diagram are drawn as circles and are
called states.
Fig. 3.4: Transition diagram of Identifier
The above TD is for an identifier, defined to be a letter followed by any number of letters or
digits. A sequence of transition diagrams can be converted into a program to look for the tokens
specified by the diagrams. Each state gets a segment of code.
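As a sketch of "each state gets a segment of code", the function below hand-codes a transition diagram for the relational operators. The state numbers (0, 1, 6), the function name `scan_relop` and the attribute names GT and GE are assumptions of this sketch and need not match Fig. 3.3.

```python
def scan_relop(text, start=0):
    """Return (attribute, next_position) for a relop at text[start:], or None."""
    state = 0
    forward = start
    while True:
        ch = text[forward] if forward < len(text) else ""   # "" marks end of input
        if state == 0:                      # start state
            if ch == "<":
                state = 1
            elif ch == "=":
                return "EQ", forward + 1
            elif ch == ">":
                state = 6
            else:
                return None                 # no relop begins here
        elif state == 1:                    # seen "<"
            if ch == "=":
                return "LE", forward + 1
            if ch == ">":
                return "NE", forward + 1
            return "LT", forward            # retract: ch is not part of the lexeme
        elif state == 6:                    # seen ">"
            if ch == "=":
                return "GE", forward + 1
            return "GT", forward            # retract, as with the * convention
        forward += 1
```

The two "retract" returns correspond to accepting states marked with a *: the character that ended the lexeme is left in the input for the next token.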
3.8. FINITE AUTOMATON
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens as a finite automaton.
• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; Then we convert them into a DFA to get a
lexical analyzer for our tokens.
3.9. Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function that maps state-symbol pairs to sets of states
o s0 - a start (initial) state
o F- a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the
accepting states such that the edge labels along this path spell out x.
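The acceptance condition can be checked by tracking the set of states the NFA could be in after each symbol. The small NFA below is a hand-written assumption (it accepts strings over {a, b} that end in a), and using the empty string as the ε label is a convention of this sketch only.

```python
EPSILON = ""   # label used here for ε-transitions (an assumption of this sketch)

# move: (state, symbol) -> set of states
MOVE = {
    (0, EPSILON): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
}
START, ACCEPTING = 0, {2}

def epsilon_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in MOVE.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def accepts(text):
    """True if some path from START to an accepting state spells out text."""
    current = epsilon_closure({START})
    for ch in text:
        reached = set()
        for s in current:
            reached |= MOVE.get((s, ch), set())
        current = epsilon_closure(reached)
    return bool(current & ACCEPTING)
```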
Example:
3.10. Deterministic Finite Automaton (DFA)
• A Deterministic Finite Automaton (DFA) is a special form of an NFA.
• No state has an ε-transition.
• For each symbol a and state s, there is at most one edge labeled a leaving s, i.e. the transition
function maps a state-symbol pair to a single state (not a set of states).
Example:
3.11. Converting RE to NFA
• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson’s Construction is a simple and systematic method.
• It guarantees that the resulting NFA will have exactly one final state and one start state.
• Construction starts from the simplest parts (alphabet symbols).
• To create an NFA for a complex regular expression, the NFAs of its sub-expressions are
combined to create its NFA.
• To recognize an empty string ε:
• To recognize a symbol a in the alphabet Σ:
• For regular expression r1 | r2:
N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
• For regular expression r1 r2
Here, the start state of N(r1) becomes the start state of N(r1r2), and the final state of N(r2)
becomes the final state of N(r1r2).
• For regular expression r*
Example:
For the RE (a|b)*a, the NFA construction is shown below.
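A sketch of the construction as a set of combinator functions, one per rule, applied to (a|b)*a. The dictionary representation of an NFA fragment, the use of `None` as the ε label and all function names are assumptions of this sketch; the `accepts` helper is included only to check the result.

```python
import itertools

EPS = None                       # label for ε-edges (assumption of this sketch)
_ids = itertools.count()

def symbol(a):
    """NFA for a single symbol a: start --a--> final."""
    s, f = next(_ids), next(_ids)
    return {"start": s, "final": f, "edges": [(s, a, f)]}

def union(n1, n2):
    """NFA for r1 | r2: new start and final states joined by ε-edges."""
    s, f = next(_ids), next(_ids)
    edges = n1["edges"] + n2["edges"] + [
        (s, EPS, n1["start"]), (s, EPS, n2["start"]),
        (n1["final"], EPS, f), (n2["final"], EPS, f)]
    return {"start": s, "final": f, "edges": edges}

def concat(n1, n2):
    """NFA for r1 r2: the final state of N(r1) is merged with the start of N(r2)."""
    merged, old = n1["final"], n2["start"]
    edges = n1["edges"] + [
        (merged if s == old else s, a, merged if t == old else t)
        for s, a, t in n2["edges"]]
    return {"start": n1["start"], "final": n2["final"], "edges": edges}

def star(n):
    """NFA for r*: ε-edges allow skipping N(r) or repeating it."""
    s, f = next(_ids), next(_ids)
    edges = n["edges"] + [
        (s, EPS, n["start"]), (s, EPS, f),
        (n["final"], EPS, n["start"]), (n["final"], EPS, f)]
    return {"start": s, "final": f, "edges": edges}

def accepts(nfa, text):
    """Simulate the NFA by tracking the set of reachable states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for src, label, dst in nfa["edges"]:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({nfa["start"]})
    for ch in text:
        current = closure({dst for src, label, dst in nfa["edges"]
                           if src in current and label == ch})
    return nfa["final"] in current

nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))   # (a|b)*a
```

Every combinator returns a fragment with exactly one start and one final state, which is what lets the fragments be combined mechanically.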
3.12. Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented by
the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can regard
a transition on a symbol as moving from a state to a set of states (i.e. the union of all those
states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
To perform this operation, let us define two functions:
• The ε-closure function takes a state and returns the set of states reachable from it by
zero or more ε-transitions. Note that this will always include the state itself. We should be
able to get from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable by
one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the
application to individual states.
For example, if A, B and C are states, move({A,B,C},'a') = move(A,'a') ∪ move(B,'a') ∪
move(C,'a').
The Subset Construction Algorithm is as follows:
put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1,a))
        if (S2 is not in DS) then
            add S2 into DS as an unmarked state
        transfunc[S1,a] ← S2
    end
end
• a state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA
• the start state of the DFA is ε-closure({s0})
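The algorithm can be sketched in Python as below. The representation is an assumption: the NFA transition function is a dictionary from (state, symbol) to a set of states, `None` labels ε-transitions, and DFA states are frozensets of NFA states. Unlike the pseudocode, the sketch leaves out the empty set of states, so the resulting transition function is partial.

```python
EPS = None   # label for ε-transitions (assumption of this sketch)

def epsilon_closure(nfa_move, states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa_move.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(nfa_move, states, symbol):
    reached = set()
    for s in states:
        reached |= set(nfa_move.get((s, symbol), ()))
    return reached

def subset_construction(nfa_move, start, accepting, alphabet):
    d_start = epsilon_closure(nfa_move, {start})
    dstates = {d_start}                  # DS in the pseudocode
    unmarked = [d_start]
    transfunc = {}
    while unmarked:
        s1 = unmarked.pop()              # taking S1 off the list marks it
        for a in alphabet:
            s2 = epsilon_closure(nfa_move, move(nfa_move, s1, a))
            if not s2:
                continue                 # deviation: skip the empty (dead) state
            if s2 not in dstates:
                dstates.add(s2)
                unmarked.append(s2)
            transfunc[(s1, a)] = s2
    d_accepting = {s for s in dstates if s & set(accepting)}
    return d_start, d_accepting, transfunc

def dfa_accepts(dfa, text):
    start, accepting, transfunc = dfa
    state = start
    for ch in text:
        state = transfunc.get((state, ch))
        if state is None:
            return False
    return state in accepting

# A hand-written NFA for strings over {a, b} ending in a (an assumed example).
NFA_MOVE = {(0, EPS): {1}, (1, "a"): {1, 2}, (1, "b"): {1}}
dfa = subset_construction(NFA_MOVE, start=0, accepting={2}, alphabet="ab")
```

For this NFA the construction produces three DFA states, {0,1}, {1} and {1,2}, of which only {1,2} is accepting because it contains the NFA's accepting state 2.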
3.13. Lexical Analyzer Generator
3.18. Lex specifications:
A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest
constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14),
and regular definitions.
2. The translation rules of a Lex program are statements of the form :
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
Where, each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
3.19. INPUT BUFFERING
The LA scans the characters of the source program one at a time to discover tokens. Because a large
amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover tokens.
Often, however, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of,
say, 100 characters each. One pointer marks the beginning of the token being discovered. A
lookahead pointer scans ahead of the beginning point until the token is discovered. We view the
position of each pointer as being between the character last read and the character next to be read.
In practice each buffering scheme adopts one convention: either a pointer is at the symbol last
read or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be large.
For example, in a PL/I program we may see
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that
follows the right parenthesis. In either case, the token itself ends at the second E. If the
lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded
with the next characters from the source file. Since the buffer shown in the above figure is of
limited size, there is an implied constraint on how much lookahead can be used before the next
token is discovered. In the above example, if the lookahead traveled to the left half and all the
way through the left half to the middle, we could not reload the right half, because we would lose
characters that had not yet been grouped into tokens. While we can make the buffer larger if we
choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
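A minimal sketch of the two-half buffer, assuming a text stream as input. It tracks only the lookahead (forward) pointer and the alternate reloading of the halves; the token-beginning pointer and retraction are left out. The class name `BufferPair`, the tiny half size and the use of an empty string as an end-of-input marker are all assumptions of the sketch.

```python
import io

class BufferPair:
    """Two halves of `half` characters each, refilled alternately."""

    def __init__(self, stream, half=4):
        self.stream = stream
        self.half = half
        self.buf = [""] * (2 * half)
        self.forward = 0                         # index of the next character to read
        self._load(0)

    def _load(self, which):
        """Fill half 0 or half 1 with the next characters of the stream."""
        chunk = self.stream.read(self.half)
        base = which * self.half
        for i in range(self.half):
            # "" marks positions past the end of the input
            self.buf[base + i] = chunk[i] if i < len(chunk) else ""

    def next_char(self):
        """Return the next character, or "" at end of input."""
        ch = self.buf[self.forward]
        if ch == "":
            return ""
        self.forward += 1
        if self.forward == self.half:            # leaving the first half
            self._load(1)
        elif self.forward == 2 * self.half:      # leaving the second half
            self._load(0)
            self.forward = 0
        return ch

def read_all(reader):
    """Drain the reader; used here only to exercise the buffer."""
    chars = []
    while True:
        ch = reader.next_char()
        if ch == "":
            return "".join(chars)
        chars.append(ch)

text = read_all(BufferPair(io.StringIO("abcdefghij"), half=4))
```

Because a half is overwritten as soon as the forward pointer crosses into it, a lexeme that began in that half would be lost, which is the lookahead limit the paragraph describes.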