CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
CamScanner
UNIT 3(chapter-2)
Syllabus:
Compiler: Definition of Compiler, Language processing systems ,Phases of Compiler, Lexical
Analysis, Input Buffering.
Compilers
Compiler is a language translator or is a program which translates a program written in
one language (the source language) to an equivalent program in other language (the
target language).
The source language usually is a high-level language like Java, C, Fortran etc. whereas
the target language is machine code that a computer's processor understands.
The source language is optimized for humans. It is more user-friendly, to some extent
platform-independent. They are easier to read, write and maintain. Hence, it is easy to
avoid errors. Ultimately, programs written in a high-level language must be translated into
machine language by a compiler. The target machine language is efficient for hardware
but lacks readability.
Translates from one representation of the program to another.
Typically, from high level source code to low level machine code or object code.
Source code is normally optimized for human readability.
Machine code is optimized for hardware.
Redundancy is reduced.
Information about the intent is lost.
Goals of translation
Good performance for generated code: The metric for the quality of the generated code
is the ratio between the size of handwritten code and compiled machine code for same
program. A better compiler is one which generates smaller code. For optimizing compilers
this ratio will be lesser.
Good compile time performance: A handwritten machine code is more efficient than a
compiled code in terms of the performance it produces.
Correctness: A compiler's most important goal is correctness - all valid programs must
compile correctly.
Pictorial Representation
Compiler is part of program development environment
The other typical components of this environment are editor, assembler, linker,
loader, debugger, profiler etc.
The compiler (and all other tools) must support each other for easy program
[Link]- UNIT-3
development
All development systems are essentially a combination of many tools. For compiler, the
other tools are debugger, assembler, linker, loader, profiler, editor etc. If these tools have
support for each other than the program development becomes a lot easier.
Language processing systems:
Language processing systems in compiler design play a crucial role in transforming high-level
programming languages into machine-readable code
In a language processing system, the source code is first preprocessed. The modified source program is
processed by the compiler to form the target assembly program which is then translated by the assembler
to create relocatable object codes that are processed by linker and loader to create the target program. It is
based on the input the translator takes and the output it produces, and a language translator can be defined
as any of the following.
Preprocessor:
A preprocessor produce input to compilers. They may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands
for longer constructs. Eg: #define PI 3.14 Whenever the PI is encountered in a program, it is
replaced by the value 3.
2. File inclusion: A preprocessor may include header files into the program text. Eg: #include
By this statement, the header file stdio.h can be included and user can make use of the
functions in this header file. This task of preprocessor is called file inclusion.
3. Rational preprocessor: these preprocessors augment older languages with more modern
flow-of-control and data structuring facilities.
4. Language Extensions: These preprocessor attempts to add capabilities to the language by
certain amounts to build-in macro.
[Link]- UNIT-3
Compiler :
Compiler is a program that can read a program in one language the source language and translate it
into an equivalent program in another language the target language;
If some errors are encountered during the process of translation, then compiler displays them as
error messages.
The compiler takes the source program in high level language such as C, PASCAL, FORTRAN and
converts into low level language or machine level language such as assembly language. Assembler .
Programmers found difficult to write or read programs in machine language. They begin to use a
mnemonic (symbols) for each machine instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now called an assembly language.
Programs known as assembler were written to automate the translation of assembly language into
machine language.
The input to an assembler program is called source program, the output is a machine language
translation (object program).
Loader/Linker – editor
Loader is a program which performs two functions, loading and link editing.
Loading is a process in which the relocatable machine code is read and the relocatable addresses are
altered.
Then that code with altered instructions and data is placed in the memory at proper location.
The job of link editor is to make a single program from several files of relocatable machine code.
If code in one file refers the location in another file, then such a reference is called external
reference. The link editor resolves such external references also.
[Link]- UNIT-3
PHASES OF A COMPILER:
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the
next phase of the compiler. Let us understand the phases of a compiler.
Compiler operates in various phases , each phase transforms the source program from one
representation to [Link] phase takes input from its previous phase and feeds its
output to the next phase of the compiler.
It is desirable to have relatively few phases, since it takes time to read and write immediate files.
Following diagram depicts the phases of a compiler through which t goes during the compilation.
There are 6 phases in a [Link] of this phase help in converting the high-level langue the
machine code. The phases of a compiler are:
1. Lexical Analyzer (Scanner),
2. Syntax Analyzer (Parser),
3. Semantic Analyzer,
4. Intermediate Code Generator(ICG),
5. Code Optimizer(CO) , and
6. Code Generator(CG).
In addition to these, it also has Symbol table management, and Error handler phases.
The Phases of compiler divided in to two parts, first three phases we are called as Analysis part
remaining three called as Synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates as a sequence of
phases, each of which transforms one representation of the source program to another.
A typical decomposition of a compiler into phases is shown in Fig. In practice, several phases
may be grouped together, and the intermediate representations between the grouped phases need
not be constructed explicitly .
PHASE, PASSES OF A COMPILER:
In some application we can have a compiler that is organized into what is called passes. Where a pass
is a collection of phases that convert the input from one representation to a completely different
representation. Each pass makes a complete scan of the input and produces its output to be processed
by the subsequent [Link] example a two pass Assembler.
[Link]- UNIT-3
Figure : Phases of a Compiler
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as
interfacebetween the compiler and the Source language program and performs the
following functions:
o Reads the characters in the Source program and groups them into a stream of
tokens in which each token specifies a logically cohesive sequence of characters,
such as an identifier, a Keyword , a punctuation mark, a multi character operator
like := .
o The character sequence forming a token is called a lexeme of the token.
o The Scanner generates a token-id, and also enters that identifiers name in the
Symboltable if it doesn‘t exist.
o If the token is not valid i.e., does not fall into any of the
identifiable groups, then thelexical analyser reports an error.
o Lexical analysis involves recognizing the tokens in the source
program and reporting errors, if any.
o Also removes the Comments, and unnecessary spaces.
Token: A token is a syntactic category. Sentences consist of a string of
tokens. Forexample constants, identifier, special symbols, keyword,
operators etc are tokens.
[Link]- UNIT-3
Lexeme: Sequence of characters in a token is a lexeme. For example,
100.01,counter, const, "How are you?" etc are lexemes.
The format of the token representation < Token name, Attribute value>
The token type tells the category of token and token value gives us the information
regarding token. The token value is also called token attribute.
Lexical analysis created the symbol table. The token value can be a pointer to
symbol table in case of identifier and constant.
The lexical a n a l y z e r reads the input program and generates table for tokens.
Consider the example a= b+c* 60 or position=initial+rate *60
Lexeme Token Token
representation
a Identifier <id,1>
= Assignment <=>
Operator
b Identifier <id,2>
+ Additional <+>
Operator
c Identifier <id,3>
* Multiplication <*>
Operator
60 Constant 60
Then the input of a=b+c* 60 becomes as id1=id2+id3*60 and the information about
the token stored in the symbol table(only identifiers excluding the all other data)
The symbol table is mainly known as the data structure of the compiler. It helps in storing the
identifiers with their name and types. It makes it very easy to operate the searching and fetching process.
The symbol table connects or interacts with all phases of the compiler and error handler for updates.
It is also accountable for scope management.
Example: a= b+c* 60 or position=initial+rate *60
Symbol name Type scope address
a / position float global
b / initial float global
c / rate float global
[Link]- UNIT-3
SYNTAX ANALYZER (PARSER):
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production rules
to detect any errors in the code. The output of this phase is a parse tree .
The parser (syntax analyzer) receives the source code in the form of tokens from the
lexical analyzer and performs syntax analysis, which create a tree-like intermediate
representation that depicts the grammatical structure of the token stream.
The Parser interacts with the Scanner, and its subsequentphase Semantic Analyzer and
performs the following functions:
A typical representation is a abstract syntax tree in
which each interior node represents an operation
the children of the node represent the arguments of the operation
Parse tree or Syntax tree:
SEMANTICANALYZER: This phase receives the syntax tree as input, and checks
the semantically correctness of the program. Though the tokens are valid and
syntactically correct, itmay happen that they are not correct semantically.
Therefore the semantic analyzer checks the semantics (meaning) of the statements
formed.
Semantic Analyzer will check for
Type mismatches,
incompatible operands,
a function called with improper arguments,
an undeclared variable, etc.
Here in the parse tree id2 ,id3 are float and result stored in float.
[Link]- UNIT-3
Type mismatching with 60 , which is integer, so converted in to real/float
The Syntactically and Semantically correct structures are produced here the updated parse tree.
INTERMEDIATE CODE GENERATOR(ICG): This phase takes the syntactically and
semantically correct structure as input, and produces its equivalent intermediate notation
of the source program. The Intermediate Code should have two important properties
specified below:
o It should be easy to produce, and Easy to translate into the target program.
Example intermediate code forms are:
o Three address codes,
o Polish notations, etc.
The interior nodes corresponding to inttoreal, +, *, are assigned temporary variables.
The temporary variable corresponding to the interior node will be one of the 3 addresses that
comprises of the three address code.
This is typically the left hand side variable of the 3-address code.
The right hand side expression is derived using the left and right children of the interior node and the
operator to join them is based on the interior node. If, there is only one child for the interior node, then
the right hand side of the 3-address code will have only one operand.
In this example, “inttoreal” is an interior node which is assigned temporary variable “t1”.
The operator is “inttoreal” and the only right hand side operand is “60”. So, the corresponding 3 –
address code is given in statement
t1 = inttoreal(60)
Consider one more interior node, “*”. It is assigned temporary variable “t2” and its left and
right children are “id3” and temporary variable “t1”. Hence, the corresponding 3-address
code is given in statement
t2= id3 * t1
for the other two interior nodes, the corresponding three-address codes are given in statement
[Link]- UNIT-3
t3 = id2 + t2
id1 = t3
So output of Intermediate code generator is as follows:
t1 = inttoreal(60)
t2= id3 * t1
t3 = id2 + t2
id1 = t3
CODE OPTIMIZER: This phase is optional in some Compilers, but so useful and
beneficial in terms of saving development time, effort, and cost. This phase performs the
following specific functions:
Attempts to improve the IC so as to have a faster machine code. Typical functions include
–Loop Optimization, Removal of redundant computations, Strength reduction, Frequency
reductions etc.
Sometimes the data structures used in representing the intermediate
forms mayalso be changed.
The modified code is given as:
t1 = id3 * 60.0
id1 = id2 + t1
CODE GENERATOR:
This is the final phase of the compiler and generates the target code, normally consisting of the
relocatable machine code or Assembly code or absolute machine code.
Memory locations are selected for each variable used, and assignment of
variables toregisters is done.
Intermediate instructions are translated into a sequence of machine instructions as
follow:
LDF R2 , id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The Compiler also performs the Symbol table management and Error handling
throughout thecompilation process. Symbol table is nothing but a data structure that stores
different source language constructs, and tokens generated during the compilation. These
two interact with all phases of the Compiler.
[Link]- UNIT-3
For example the source program is an assignment statement; the following figure shows
how thephases of compiler will process the program.
Fig: Translation of assignment statement for Position=initial +rate*60
The input source program is Position=initial +rate*60
Lexical Analysis
[Link]- UNIT-3
Refer the content from Phase1
INPUT BUFFERING:
The lexical analyzer reads characters from the input and passes the tokens to the
syntax
analyzer whenever it is asked for a token.
Lexical Analysis has to access hard disk/ secondary memory each time to identify tokens. It is time-
consuming and costly. So, the input strings are stored into a buffer and then scanned by Lexical Analysis.
Input buffering in a compiler is a technique used to manage the flow of characters from the source program
to the lexical analyzer efficiently. It ensures that the reading of the source code is performed smoothly
without excessive I/O operations
Lexical Analysis scans input string from left to right one character at a time to identify
tokens. It uses two pointers begin ptr(bp) and forward pointer(fp) to keep track of the pointer
of the input scanned.
Ex:
bp
i n t x = 5 ;
fp
Initially both the pointers point to the first character of the input string
bp
i n t x = 5 ;
fp
The forward ptr moves ahead to search for end of lexeme. As soon as the blank space is encountered,
it indicates end of lexeme. In above example as soon as ptr (fp) encounters a blank space the lexeme
[Link]- UNIT-3
“int” is identified.
bp
fp
The fp will be moved ahead at white space, when fp encounters white space, it ignore and moves
ahead. then both the begin ptr(bp) and forward ptr(fp) are set at next token.
The input character is thus read from secondary storage, but reading in this way from secondary
storage is costly. hence buffering technique is used.A block of data is first read into a buffer, and then
second by lexical analyzer. there are two methods used in this context:
1. One Buffer Scheme, and
2. Two Buffer Scheme.
One buffer scheme
In one buffer scheme, only one buffer is used store the input string.
If the lexeme is very long then it crosses the buffer boundary, to scan rest
of thelexeme the buffer has to refilled, that makes the overwriting the first
part of lexeme.
bp
↓
i n t a = a + 5
↑
f
p
one buffer scheme
It is the problem with this scheme.
Two buffer scheme
To overcome the above said problem, two buffers are used to store the input
[Link] first buffer and second buffer are scanned alternately.
When the end of the current buffer is reached the other buffer is filled.
The problem with this scheme is that if length of the lexeme is longer than
[Link]- UNIT-3
length ofthe buffer then scanning input cannot be scanned completely.
Initially, both the pointers bp and fp are pointing the first buffer. Then the fp
movestowards right in search of the end of lexeme.
When blank character is identified, the string between bp and fp is
identified ascorresponding token.
To identify the boundary of the first buffer end of buffer character should be
placedat the end of first buffer.
In the same way, end of second buffer is also recognized by the end of first
buffermark present at the end of second buffer.
When fp encounters first eof, then one can recognize end of first buffer and
hencefilling up of second buffer is started.
In the same way when second eof is obtained then it indicates end of second
buffer.
Alternately, both the buffers can be filled up until end of the input
program andstream of tokens identified.
This eof character introduced at the end is called sentinel which is used to
identifythe end of buffer.
Buffer 1 bp
↓
i n t a = a + 5
; b = b + 1 ; eof
↑
f
p
Code for input buffering
If forward at end
of first half then begin reload second half;
forward := forward + 1 end
else if forward at end of second half then
begin reload second half;
move forward to beginning of first half end
else forward := forward + 1;
[Link]- UNIT-3