Compilers: Principles and Techniques
Compilers: Principles and Techniques
Compilers
Principles & Techniques
By
Dr. Esam T. Yassen
[Link]-Anbar,Computers College
Compilers Anbar University – Computer College
Principle ,Techniques, and Tools
Introduction
Programming languages :-
Interactions involving humans are most effectively carried out
through the medium of language . language permits the expression of
thoughts and ideas , and without it , communication as we know it
would be very difficult indeed .
In computer programming , programming language serves as
means of communication between the person with a problem and the
computer used to solve it . programming language is a set of symbols ,
words , and rules used to instruct the computer .
A hierarchy of programming languages based on increasing
machine independence include the following :-
1- machine language : is the actual language in which the
computer carries out the instructions of program . otherwise , " it is
the lowest from of computer language , each instruction in
program is represented by numeric cod , and numeric of addresses
are used throughout the program to refer to memory location in the
computer memory .
2- Assembly languages : is a symbolic version of a machine
language ,each operation code is given a symbolic code such a
Add , SUB ,…. Moreover , memory location are given symbolic
name , such as PAY , RATE .
3-high – level language :Is a programming language where the
programming not require knowledge of the actual computing
machine to write a program in the language .H.L.L . offer a more
enriched set of language features such as control structures , nested
statements , block …
4- problem-oriented language : It provides for the expression of
problems in a specific application . Examples of such language
are SQL for Database application and COGO for civil engineering
applications .
(compilation Process)
Data
( Interpretive process )
Compilation concepts
What is compiler?
A compiler is a program that translates a computer
program(source program) written in H.L.L (such as Pascal,C++)
into an equivalent program (target program) written in L.L.L.
Error messages
Model of Compiler:
The task of constructing a compiler for a particular source
language is complex. The complexity of the compilation process
depend on the source language.A compiler must perform two
major tasks:
1. Analysis :deals with the decomposition of the source program
into its basic parts.
2. Synthesis:builds their equivalent object program using these
basic parts.
To perform these tasks, compiler operates in phases each of
which transforms the source program from one representation to
another. A typical decomposition of a compiler is shown in the
following figure.
Lexical analyzer
Syntax analyzer
Semantic analyzer
Intermediate code
generation
Code optimization
Code generation
Target program
Compiler phases
(+,a,b,t1)
Y=(a+b)*(c+b) (+,c,d,t2)
(*,t1,t2,t3)
• Types of Errors
Lexical errors: The lexical phase can detect errors where the
characters remaining in the input do not form any token of the
language.
Syntax errors: The syntax analysis phase can detect errors
Errors where the token stream violates the structure rules (syntax)
of the language .
Semantic errors: During semantic analysis the compiler tries to
detect constructs that have the right syntactic structure but no
meaning to the operation involved, e.g. to add two identifiers, one
of which is the name of an array, and the other the name of a
procedure .
Runtime Errors
Runtime errors occur while the program is running, although the
compilation is successful. The causes of Runtime Errors are [5]:
1) Errors that only become apparent during the course of execution of the
program
2) External Factors – e.g.
Out of memory
Hard disk full
Insufficient i/o privileges
etc.
3) Internal Factors – e.g.
Arithmetic errors
Attempts to read beyond the end of a file
Attempt to open a non-existent file
Attempts to read beyond the end of an array
etc.
Preprocessor
Compiler
Assembly program
Assembler
Machine program
Machine code
Lexical Analyzer
The analysis of source program during compilation is often
complex . The construction of compiler can often be made easier
if the analysis of source program is separated into two parts , with
one part identifying the low – level language constructs , such as
variable names , keyword , labels , and operations , and the
second part determine the syntactic organization of the program .
SYMBOL
Table
Symbol Table ( ST )
A symbol table is a data structure containing a record for
each identifier, with fields for attributes of the [Link]
data structure allows us to find the record for each identifier
quickly and to store or retrieve data from that record quickly.
When an identifier in source program is detected by the
lexical analyzer, the identifier is entered into ST. however, the
attributes of an identifier can not be determined during lexical
analysis, the remaining phases enter information about identifier
into ST and then use this information in various ways.
Symbol Table Contents :-
A symbol table is most often conceptualized as series of
rows, each row containing a list of attributes values that are
associated with a particular variable. The kinds of attributes
appearing in ST are dependent to some degree on the nature of
language for which compiler is written. For example, a language
may be typeless, and therefore the type attribute need not appear
in ST .The following list of attribute are not necessary for all
compilers, however, each should be considered for a particular
compiler:-
1- Variable Name: A variable's name most always reside in
the ST. major problem in ST organization can be the
variability in the length of identifier names. For languages
such as BASIC with its one – and two – character names
and FORTRAN with names up to six characters in length,
this problem is minimal and can usually be handled by
storing the complete identifier in a fixed – size maximum
length fields. While there are many ways of handling the
storage of variable names, two popular approaches will be
outlined, one which facilitates quick table access and
another which supports the efficient storage of variable
names. The provide quick access, yet sufficiently large,
maximum variable name length. A length of sixteen or
greater is very likely adequate ()ﻣﻼﺋ ﻢ, the complete
identifier can then be stored in a fixed – length fields in
Operation on ST:-
The two operations that are most commonly performed on
ST are: Insertion & Lookup (Retrieval) .For language in
which explicit declaration of all variables is mandatory
()إﺟﺒ ﺎري, an insertion is required when processing a
declaration. If ST is Ordered, then insertion may also involve
a lookup operation to find allocations at which the variable's
attributes are to be placed. In such a situation an insertion is
ST Organizations:-
The primary measure which is used to determine the
complexity of a ST operation is the average length of search.
This measure is the average number of comparisons required
to Retrieve a ST record in a particular table organization, the
- E
( E )
E + E
id id
Derivations :-
This derivational of view gives a precise description of the
top-down construction of parse tree. The central idea here is that
a production is treated as rewriting rule in which the
nonterminal on the left is replaced by the string on the right side
of the production. For example, consider the following
grammar:
E E+E
E E*E
E (E)
E -E
E id
The derivation of the input string id + id* id is:
E + E
id E * E
id id
Ambiguity :-
A grammar that produce more that one parse tree for same
sentence is said to be Ambiguous. In the another way, by
produced more that one left-most derivation or more that one
Right-most derivation for the same sentence.
E
E + E
id E * E
id id
(1)
E * E
E + E id
id id
(2)
two parse tree for id+id*id
E E
E+E E*E
id+E E+E*E
id+E*E id+E*E
id*id*E id+id*E
id+id*id id+id*id
(1) (2)
Two left-most derivations for id+id*id
With My Best Wishes 22 Esam
Compilers Anbar University – Computer College
Principle ,Techniques, and Tools
E E+E E E+T T
E E*E E (E)
E (E) E -E
E -E T T*F F
E id F id
E + T
T T * F
F F id
id id
parse tree for id+id*id
Left-Recursion :-
A grammar is left-recursion if has a nonterminal A such
that there is a derivation A Aα for some string α . Top-
down parser cannot handle left-recursion grammars, so a
transformation that eliminates left-recursion is needed:
A → Aα ß
A → ß A'
A' → αA' ε
OR:
Example:
E→ E+T T
T→ T*F F
F→ (E) id
E→ T E'
E' → +T E' ∊
T→ FT'
T'→ *F T' ∊
F→ (E) id
Left-Factoring :-
The basic idea is that when is not clear which of two
alternative production to use to expand a nonterminal A . We
may be able to rewrite the A-productions to defer the decision
until we have seen enough of the input to make the right choice.
A→ α ß1 α ß2 where α ≠ ∊
A→ α A'
A'→ß1 ß2
OR :-
A→ α ß1 α ß2 .. α ßn γ where α ≠ ∊
A→ α A' γ
A'→ ß1 ß2 .. ßn
Example:
S→ iEtS iEtSeS a
E→b
S → iEtSS' a
S' → eS ∊
E→ b
Top-Down Parsing :-
Top-down parsing can be viewed as an attempt to
find a leftmost derivation for an input string. Equivalently, A top
down parser, such as LL(1) parsing, move from the goal symbol
to a string of terminal symbols. in the terminology of trees, this
is moving from the root of the tree to a set of the leaves in the
syntax tree for a program. in using full backup we are willing to
attempt to create a syntax tree by following branches until the
correct set of terminals is reached. in the worst possible case,
that of trying to parse a string which is not in the language, all
possible combinations are attempted before the failure to parse
is recognized. the nature of top down parsing technique is
characterized by:
1-Recursive-Descent Parsing : It is a general form of Top-
Down Parsing that may involve " Backtracking ",that is ,making
repeated scans of the input.
Example: consider the grammar
S cAd
A ab a
Input : cad
Then the implementation of Recursive-Descent Parsing is:
S S S
c A d c A d c A d
a b a
E→ E+T T
T→ T*F F // Original grammar
F→ (E) id
Eliminate left-recursion and left factoring
E→ T E'
E' → +T E' ∊
T→ FT'
T'→ *F T' ∊
F→ (E) id
Transition Diagrams
E: T E'
0 1 2
E' : + T E'
3 4 5 6
ε
T: F T'
7 8 9
* F T'
T' : 10 11 12 13
ε
( E )
F: 14 15 16 17
id
E→ T E'
E' → +T E' ∊
T→ FT'
T'→ *F T' ∊
F→ (E) id
Input a + b $
X Predictive Parsing
Output
Program
Y
Z
$ Parsing Table
M
Stack
E→ E+T T
T→ T*F F // Original grammar
F→ (E) id
Eliminate left-recursion and left factoring
E→ T E'
E' → +T E' ∊
T→ FT'
T'→ *F T' ∊
F→ (E) id
Predictive Parsing Table M
Input symbol
Nonterminals
id + * ( ) $
E TE' TE'
E' +TE' є є
T FT' FT'
T' є *FT' є є
F id (E)
S' eS ε
E b
Bottom-Up Parsing
The term "Bottom-Up Parsing" refer to the order in which
nodes in the parse tree are constructed, construction starts at the
leaves and proceeds towards the root. Bottom-Up Parsing can
handle a large class of grammars.
1. Shift-Reduce Parsing: Is a general style of Bottom-up
syntax analysis , it attempts to construct a parse tree for an
input string beginning at leaves and working up towards
the root,(reducing a string w to the start symbol of
grammar).At each reduction step a particular substring
matching the right side of production is replaced by the
symbol on the left of that production.
Example : consider the grammar
S aABe
A Abc b
B d
Action
Stack Input Buffer
$ id+id*id$ Shift
$id +id*id$ Reduce: E→id
$E +id*id$ Shift
$E+ id*id$ Shift
$E+id *id$ Reduce: E→id
$E+E *id$ Shift(*)
$E+E* id$ Shift
$E+E*id $ Reduce: E→id
$E+E*E $ Reduce: E→E*E
$E+E $ Reduce: E→E+E
$E $ Accept
LR Parsers
This section presents an efficient Bottom-Up syntax
analysis technique that can be used to parse a large class of
context-free grammars. The technique is called LR(k) parsing,
the "L" is for left to right scanning of input, the "R" for
constructing a rightmost derivation in reverse, and "k" for the
number of input symbols of lookahead that are used in making
parsing decisions-when "k" is omitted , k is assumed to be 1).
LR parsing is attractive for a variety of reasons:-
1. LR parsers can be constructed to recognize virtually all
programming language constructs for which context-free
grammars can be written.
2. The LR parsing method is the most general
nonbacktracking shift-reduce parsing method known, yet it
can be implemented as efficiently as other shift-reduce
methods.
3. The class of grammars that can be parsed using LR
methods is a proper superset of the class of grammars that
can be parsed with predictive parsers.
The schematic form of an LR parser is shown in following
figure .It consists of an Input,an Output,a Stack,a Driver
program,and a Parsing table that has two parts (action and
goto) .
a1 …… ai ……. an $
Input
LR
Sm Output
Parsing program
Xm
Sm-1
Xm-1
………
action goto
S0
Stack
Model of an LR parser
With My Best Wishes 35 Esam
Compilers Anbar University – Computer College
Principle ,Techniques, and Tools
state x + $ E T
0 S4 g1 g2
1 Accept
2 S3 r2
3 S4 g5 g2
4 r3 r3
5 r1
Semantic Analysis
The semantic analysis phase of compiler connects variable
definition to their uses ,and checks that each expression has a
correct type.
This checking called "static type checking" to distinguish it
from "dynamic type checking" during execution of target
program. This phase is characterized be the maintenance of
symbol tables mapping identifiers to their types and locations.
Examples of static type checking:-
1. Type checks : A compiler should report an error if an
operator is applied to an incompatible operand.
2. Flow of control checks:- Statements that cause flow of
control to leave a construct must have some place to which
to transfer the flow of control. For example, a "break"
statement in 'C' language causes control to leave the
smallest enclosing while , for , or switch statement ;an
error occurs if such an enclosing statement does not exist.
3. Uniqueness checks:- There are situations in which an
object must be defined exactly once. For example, in
'Pascal' language, an identifier must be declared uniquely.
4. Name-related checks:- Sometimes, the same name must
appear two or more times. For example, in 'Ada' language
a loop or block may have a name that appear at the
beginning and end of the construct. The compiler must
check that the same name is used at both places.
Type system:-
The design of type checker for a language is based on
information about the syntactic constructs in the language, the
notation of types, and the rules for assigning types to language
constructs.
The following excerpts are examples of information that a
compiler writer might have to start with.
• If both operands of the arithmetic operators "addition",
"subtraction", and "multiplication" are of type integer ,
then the result is of type integer.
D D ; D id : T
P D;E
D D;D
D id:T {addtype([Link],[Link])}
T char {[Link]=char}
T int {[Link]=int}
T T1 {[Link]=pointer([Link])}
T array[num] of T1 {[Link]=array(1..[Link],[Link])}
= =
a + a +
* * *
b - b - b -
c c c
0 id b
= • •
1 id c
id a 2 - 1
3 * 0 2
+ • • 4 id b
5 id c
6 - 5
7 * 4 6
* • • * • •
8 + 3 7
id b id b 9 id a
10 = 9 8
…. ….. …..
- • - •
….. ….. …..
id c id c
1 2
t1 = y * z
t2 = x + t1
t1 = - c t1 = - c
t2 = b * t1 t2 = b * t1
t3 = - c t5 = t2 + t2
t4 = b * t3 a = t5
t5 = t2 + t4
a = t5
Three address code Three address code
For syntax tree For DAG
• OP // operator
• arg1 , arg2 // operands
Example: a = b * -c + b * -c
i. By Quadruples
Position OP arg1 arg2 result
0 - c t1
1 * b t1 t2
2 - c t3
3 * b t3 t4
4 + t2 t4 t5
5 = t5 a
ii. By Triples
Position OP arg1 arg2
0 - c
1 * b (0)
2 - c
3 * b (2)
4 + (1) (3)
5 = a (4)
Code Optimization
Compilers should produce target code that is as good as can be
written by hand. This goal is achieved by program
transformations that are called " Optimization " . Compilers that
apply code improving transformations are called " Optimizing
Compilers ".
Code optimization attempts to increase program efficiency by
restructuring code to simplify instruction sequences and take
advantage of machine specific features:-
• Run Faster , or
• Less Space , or
• Both ( Run Faster & Less Space ).
The transformations that are provided by an optimizing compiler
should have several properties:-
1. A transformation must preserve the meaning of
program. That is , an optimizer must not change the
output produce by program for an given input, such
as division by zero.
2. A transformation must speed up programs by a
measurable amount.
Control Data
Flow Flow Transformations
Analysis Analysis
Basic Blocks:-
The code is typically devided into a sequence of "Basic
Blocks". A Basic Block is a sequence of straight-line code,with
no branches " In " or " Out " except a branch "In" at the top of
block and a branch "Out" at the bottom of block.
• Set of Basic Block : The following steps are used to set
the Basic Block:
1. Determine the Block beginning:
i- The First instruction
ii- Target of conditional & unconditional
Jumps.
iii- Instruction follow Jumps.
Example\\ 1) i=0
B1
2) t=0
1) i=0
2) t=0
3) t=t+1
3) t=t+1
B2 4) i=i+1
4) i=i+1
5) if I < 10 then goto 3
5) if I < 10 then goto 3
6) x=t
6) x=t
B3
Basic Blocks
1) i=0
B1
2) t=0
3) t=t+1
B2 4) i=i+1
5) if I < 10 then goto B2
6) x=t
B3
Control Flow
With My Best Wishes 51 Esam
Compilers Anbar University – Computer College
Principle ,Techniques, and Tools
B2 d4 : b= d5 : c= B3
d6 : b=
d7 : c=
B4
d8 : a=
B5
With My Best Wishes 52 Esam
Compilers Anbar University – Computer College
Principle ,Techniques, and Tools
B1
B2
Loop No. Header Preheader Blocks
1 B2 B1 2-3-4-5-2
B3 2 B2 B1 2-2
3 B3 B2 3-3
B4
Loop Information
B5 B6
Flow Graph
Global Transformations:
1. Common Subexpression Elimination
a=b+c a=b+c
c=b+c c=a
d=b+c d=b+c
2. Dead Code Elimination: Variable is dead if never used
x=y+1
y=1 y=1
x=2*z x=2*z
3. Copy Propagation
Origin Copy Propagation Dead Code
x=t3 x=t3
a[t2]=t5 a[t2]=t5 a[t2]=t5
a[4]=x a[4]=t3 a[4]=t3
Goto B2 Goto B2 Goto B2
4. Constant Propagation
Origin Copy Propagation Dead Code
x=3 x=3
a[t2]=t5 a[t2]=t5 a[t2]=t5
a[4]=x a[4]=3 a[4]=3
Goto B2 Goto B2 Goto B2
5. Loop Optimization
• Code Motion: An important modification that
decreases the amount of code in a loop is Code
Motion. If result of expression does not change
during loop( Invariant Computation ),can hoist its
computation out of the loop.
For(i=0;i<n;i++)
A[i]=a[i]+( x*x )/( y*y );
c=( x*x )/( y*y );
For(i=0;i<n;i++)
A[i]=a[i]+c;
Code Generation
In computer science, code generation is the process by
which a compiler's code generator converts some internal
representation of source code into a form( e.g., machine
code)that can be readily executed by a machine.
Issues in the Design of a Code Generator:-
1. Input to the Code Generator :The input to the code
generator consists of the intermediate representation of the
source program(Optimized IR),together with information
in ST that is used to determine the Run Time Addresses of
the data objects denoted by the names in IR. Finally, the
code generation phase can therefore proceed on the
assumption that its input is free of the errors.
2. Target Programs : The output of the code generator is the
target program. The output code must be Correct and of
high Quality, meaning that it should make effective use of
the resources of the target machine. Like the IR ,this
output may take on a variety of forms:
a. Absolute Machine Language // Producing this form
as output has the advantage that it can placed in a
fixed location in memory and immediately executed.
A small program can be compiled and executed
quickly.
b. Relocatable Machine Language // This form of the
output allows subprograms to be compiled
separately. A set of relocatable object modules can
be linked together and loaded for execution by
linking-loader.
3. Memory Management : Mapping names in the source
program to addresses of data objects in run time memory.
This process is done cooperatively by the Front-end &
code generator.
4. Major tasks in code generation : In addition to the basic
conversion from IR into a linear sequence of machine
instructions, a typical code generator tries to optimize the
generated code in some way. The generator may try to use
MOV y,R0
ADD z,R0
MOV R0,x
If three-address code is :
a=b+c
d=a+e
then the target code is :
MOV b,R0
ADD c,R0
MOV R0,a
MOV a,R0
ADD e,R0
MOV R0,d
Finally, A target machine with "Rich" instruction set
may be provide several ways of implementing a given
operation. For example, if the target machine has an
"increment" instruction ( INC ) ,then the IR a=a+1
may be implemented by the single instruction ( INC a )
rather than by a more obvious sequence :
MOV a,R0
ADD #1,R0
MOV R0,a
ii. Instruction Scheduling : In which order to put
those instructions. Scheduling is a speed optimization.
The order in which computations are performed can