0% found this document useful (0 votes)
6 views195 pages

UNIT1 Compiler

The document outlines the fundamentals of compiler design, including the necessity of compilers, their working mechanisms, and various phases such as lexical analysis, syntax analysis, and semantic analysis. It discusses different types of language processors, including compilers, interpreters, and assemblers, and details the structure and functioning of a standard compiler. Additionally, it covers concepts like regular expressions, token recognition, and the role of lexical analyzers in the compilation process.

Uploaded by

numblopunguri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views195 pages

UNIT1 Compiler

The document outlines the fundamentals of compiler design, including the necessity of compilers, their working mechanisms, and various phases such as lexical analysis, syntax analysis, and semantic analysis. It discusses different types of language processors, including compilers, interpreters, and assemblers, and details the structure and functioning of a standard compiler. Additionally, it covers concepts like regular expressions, token recognition, and the role of lexical analyzers in the compilation process.

Uploaded by

numblopunguri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

21CSC304J-Compiler Design

WHY WE NEED COMPILER?

HOW IT IS WORKING?
Pre-requisite
• Basic knowledge of programming languages.
• Basic knowledge of FSA and CFG.
• Knowledge of a high programming language for the programming
assignments.

Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers:
Principles, Techniques, and Tools” Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
– Context Free Grammars
– Top-Down Parsing, LL Parsing
– Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
– Attribute Definitions
– Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
TRANSLATOR
Translate high level language into equivalent machine language

Types
• Compiler

• Interpreter

• Assembler
COMPILERS
• A compiler is a program takes a program written in a source
language and translates it into an equivalent program in a target
language.

Source COMPILER Target


Program Program

(A program written in ( The equivalent program in


a high-level programming language) machine code – relocatable object
Error message file)

(Take correct actions to remove the errors for further compilation


process)
11
INTERPRETER
• Translates line by line
• Better error diagnostics

Source
Program
Interpreter Output
Input

Error messages

12
ASSEMBLER
Assembler is a translator which is used to translate the assembly
language code into machine language code

Assembly
Assembler Machine
language
language code
code
Language Processor
Language processor
Source Program

Pre processor
Modified source program

Compiler
Target assembly

program

Assembler

Relocatable machine
code
library files, relocatable
Linker/Loader files
Target machine code
Cousins of the Compiler
STRUCTURE OF THE COMPILER
Two Parts:

Compiler

Analysis Synthesis
Part Part
Phases of the
Compiler
Standard Compiler Structure
Source code
(character stream)
Lexical analysis
Token stream
Parsing Front end
(machine-independent)
Abstract syntax tree
Intermediate Code Generation
Intermediate code
Optimization
Intermediate code Back end
(machine-dependent)
Code generation
Assembly code
24
Translation of an Assignment
Statement
LEXICAL ANALYSIS
□ Lexical analysis
• Lexical analysis is the first phase of compiler which is also termed as scanning.
• Source program is scanned to read the stream of characters and those characters are
grouped to form a sequence called lexemes which produces token as output.
• Token: Token is a sequence of characters that represent lexical unit, which matches
with the pattern, such as keywords, operators, identifiers etc.
• Lexeme: Lexeme is instance of a token i.e., group of characters forming a token. ,
• Pattern: Pattern describes the rule that the lexemes of a token takes. It is the
structure that must be matched by strings.
• Once a token is generated the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
Example : c=a+b*5;
The primary functions of Lexical Phase:
•Identify the lexical units in a source code
•Classify lexical units into classes like constants, reserved words, and
enter them in different tables. It will Ignore comments in the source
program

•Identify token which is not a part of the language


Example:
x = y + 10
Tokens
X identifier

= Assignment operator

Y identifier

+ Addition operator

10 Number
SYNTAX ANALYSIS
□ Syntax Analysis
• Syntax analysis is the second phase of compiler which is also called as parsing.

• Constructing the parse tree with the help of tokens. It also determines the structure of
source language and grammar or syntax of the language.

• Parser converts the tokens produced by lexical analyser into a tree like representation called
parse tree.

• Syntax tree is a compressed representation of the parse tree in which the operators appear
as interior nodes and the operands of the operator are the children of the node for
that
operator.

Input: Tokens
Output: Syntax tree
Here, is a list of tasks performed in this phase:
•Obtain tokens from the lexical analyser
•Checks if the expression is syntactically correct or not
•Report all syntax errors
•Construct a hierarchical structure which is known as a parse tree
SEMANTIC ANALYSIS
□ Semantic Analysis

❖ Type conversion and Type Coercion

EXAMPLE
INTERMEDIATE CODE GENERATOR
□ Intermediate Code Generator
CODE OPTIMIZATION
□ Code Optimization
CODE GENERATOR
□ Code Generator
Symbol Table
Q1

Phases of a compiler
Translation of an assignment statement

EXERCISE NO 1

Position = initial +rate *60


Translation of Statement

EXERCISE NO 1
Grouping of Phases
Front End Back End
Front End
•Front end comprises of phases which are dependent on the input (source language) and
independent on the target machine (Target lang)

•It includes lexical and syntactic analysis, symbol table management, semantic analysis and the
generation of intermediate code.

•Code optimization can also be done by the front end. It also includes error handling at the phases
concerned.

Back End
• Dependent on the target machine and independent on the source language.

• This includes code optimization, code generation.


PASSES
• The phases of compiler can be implemented in a single pass by marking the
primary actions viz. reading of input file and writing to the output file.

• Several phases of compiler are grouped into one pass in such a way that the
operations in each and every phase are incorporated during the pass.

• Lexical analysis, syntax analysis, semantic analysis and intermediate code


generation might be grouped into one pass. If so, the token stream after
lexical analysis may be translated directly into intermediate code.
Types of Compiler

•Single Pass Compilers


•Two Pass Compilers
•Multi pass Compilers
Compiler Construction Tool
1. Scanner Generator
2. Parser Generator
3. Syntax –Directed Translation engine
4. Automatic code generators
5. Data-flow analysis engines
6. Compiler-Construction toolkits
1
2
of a program to another
Role Of Lexical Analyzer
Interaction of Lexical analyzer with parser

token
Source Lexical parser To semantic
program analyzer getNexttoken() analysis

symbol
table
Attributes for Token
Lexical Analysis
• Lexical analyzer: reads input characters and produces a sequence of tokens as
output (nexttoken()).
• Trying to understand each element in a program.
• Token: a group of characters having a collective
meaning. const pi = 3.14159;

Token 1: (const, -)
Token 2: (identifier, ‘pi’)
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (;, -)
Buffering Techniques
Input Buffering
• How to increase the speed of reading the
source program, character by character ?

• A specialized buffering techniques have been


developed to reduce the amount of overhead
required to process a single input character.

• There are two type of buffering scheme –


One buffer scheme and two buffer scheme

79
Cont..,
Cont..,
Cont..,
Main Concept of Input Buffering
• Two pointers to the input are maintained
• Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
• Pointer forward scans ahead until a pattern match is found;
• Sentinels
• To reload the buffer and to advance the forward pointer effectively we have
to make two tests
• one for the end of the buffer and other to determine what character is read
• This can be done in a combine way if we hold a sentinel character at the end
• A sentinel is a special character that cannot be part of the source program
(Eg: eof)
84
▪ Each buffer is of the same size N, and N is usually the size
of a disk block, e.g., 4096 bytes.

▪ Using one system read command we can read N


characters in a buffer, rather than using one system call
per character.

▪ If fewer than N characters remain in the input file, then a


Conclusion
• Lex is a tool in lexical analysis phase to recognize tokens using regular
expression.

• Lex tool itself is a lex compiler.


• lex.l is an a input file written in a language which describes the generation of
lexical analyzer. The lex compiler transforms lex.l to a C program known

as [Link].c.

• [Link].c is compiled by the C compiler to a file called [Link].

• The output of C compiler is the working lexical analyzer which takes stream of

input characters and produces a stream of tokens.


• yylval is a global variable which is shared by lexical analyzer and
parser to return the name and an attribute value of token.

• The attribute value can be numeric code, pointer to symbol table or


nothing.

• Another tool for lexical analyzer generation is Flex


Structure of Lex Programs
Lex program will be in following form
Declarations This section includes declaration of variables, constants and regular
definitions.
Translation rules It contains regular expressions and code segments.
Form : Pattern {Action}
Pattern is a regular expression or regular definition.
Action refers to segments of code.
▪Auxiliary functions This section holds additional functions which are used in
actions. These functions are compiled separately and loaded with lexical analyzer.
▪Lexical analyzer produced by lex starts its process by reading one character at
a time until a valid match for a pattern is found.
▪Once a match is found, the associated action takes place to produce token.
▪The token is then given to parser for further processing.
Conflict Resolution in Lex
❖ Conflict arises when several prefixes of input matches one or
more patterns. This can be resolved by the following:

❖ Always prefer a longer prefix than a shorter prefix.

❖ If two or more patterns are matched for the longest prefix, then
the first pattern listed in lex program is preferred.
Lookahead Operator
• Lookahead operator is the additional operator that is read by lex in order to distinguish
additional pattern for a token.
• Lexical analyzer is used to read one character ahead of valid lexeme and then retracts to
produce token.
• At times, it is needed to have certain characters at the end of input to match with a
pattern. In such cases, slash (/) is used to indicate end of part of pattern that matches
the lexeme.
❖ (eg.) In some languages keywords are not reserved. So the statements
❖ IF (I, J) = 5 and IF(condition) THEN results in conflict whether to produce IF as an array
name or a keyword. To resolve this the lex rule for keyword IF can be written as,
❖ IF/\ (.* \) { letter }
Issues in Lexical Analysis
1. Lookahead
[Link]
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑then a is a regular expression, L(a) = {a}

• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)

• (r)(s) is a regular expression denoting the language L(r)L(s)

• (r)* is a regular expression denoting (L(r))*

• (r) is a regular expression denting L(r)


121

Regular Expressions
• Regular expressions are a notation to represent lexeme pattern for a token.

• We use regular expressions to describe tokens of a programming language.


• A regular expression is built up of simpler regular expressions (using defining

rules)

• Each regular expression denotes a language.

• A language denoted by a regular expression is called as a regular set.


122

Regular Expressions (Rules)


Regular expressions over alphabet Σ

Reg. Expr Language it denotes


ε {ε}
a∈ Σ {a}
(r1) | (r2) L(r1) ∪ L(r2)
(r1) (r2) L(r1) L(r2)
(r)* (L(r))*
(r) L(r)

• (r)+ = (r)(r)*
• (r)? = (r) | ε
123

Regular Expressions (cont.)


• We may remove parentheses by using precedence rules.
• * highest
• concatenation next
• | lowest
• ab*|c means (a(b)*)|(c)

• Ex:
• Σ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {ε ,0,00,000,0000,....}
• (0|1)* => all strings with 0 and 1, including the empty string
124

Regular Definitions
• To write regular expression for some languages can be difficult, because their regular

expressions can be quite complex. In those cases, we may use regular definitions.
• We can give names to regular expressions, and we can use these names as symbols to

define other regular expressions.

• A regular definition is a sequence of the definitions of the form:

d 1 → r1 where di is a distinct name and


d 2 → r2 ri is a regular expression over symbols in Σ∪{d1,d2,...,di-1}
d n → rn
Regular definitions
d1 -> r1
If we try to write the regular expression
d2 -> r2 representing identifiers without using regular
… definitions, that regular expression will be

dn -> rn complex.

(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *

• Example:
letter_ -> A | B | … | Z | a | b | … | Z | _
digit -> 0 | 1 | … | 9

id -> letter_ (letter_ | digit)*


• Ex: Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9

digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent
Extensions
• One or more instances: (r)+

• Zero or one instances: r?

• Character classes: [abc]

• Example:

• letter_ -> [A-Za-z_]

• digit -> [0-9]


• id -> letter_(letter|digit)*
Recognition of tokens
• Starting point is the language grammar to understand the tokens:

stmt -> if expr then stmt

| if expr then stmt else stmt



expr -> term relop term
| term
term -> id

| number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
Digits -> digit+
number -> digit(.digits)? (E[+-]? Digit)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespaces:
ws -> (blank | tab | newline)+
131

Operations on Languages
• Concatenation:
• L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }

• Union
• L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }

• Exponentiation:
• L0 = {ε} L1 = L L2 = LL

• Kleene Closure

• L* =

• Positive Closure

• L+ =
Transition diagrams
• Transition diagram for relop

The pattern are converted


into transition diagram
while constructing the
lexical analyzer
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers

KEYWORD

t h e n Nonlet|dig
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer ---A recognizer for a language is a program that takes as
input a string x answers ‘yes’ if x is a sentence of the language and ‘no’
otherwise.

• A better way to convert a regular expression to a recognizer is to


construct a generalized transition diagram from the expression. This
diagram is called a finite automaton.

• Finite Automaton can be


• Deterministic
• Non-deterministic
Finite Automata

• A finite automaton consists of

• An input alphabet Σ

• A set of states S

• A start state q0

• A set of accepting states F ⊆ S

• A set of transitions state →input state


Finite Automata
• Transition
a
s1 → s2
• Is read

In state s1 on input “a” go to state s2

• If end of input

• If in accepting state => accept, otherwise => reject

• If no transition possible => reject


Finite Automata State Graphs
• A state

• The start state

• An accepting state

a
• A transition
CS416 Compiler Design 140

Finite Automata
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of

that language, and “no” otherwise.

• We call the recognizer of the tokens as a finite automaton.

• A finite automaton can be: deterministic(DFA) or non-deterministic (NFA)

• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.

• Both deterministic and non-deterministic finite automaton recognize regular sets.

• Which one?

• deterministic – faster recognizer, but it may take more space

• non-deterministic – slower, but it may take less space

• Deterministic automatons are widely used lexical analyzers.


• First, we define regular expressions for tokens; Then we convert them into a DFA to get a lexical

analyzer for our tokens.

• Algorithm1: Regular Expression NFA DFA (two steps: first to NFA, then to DFA)
• Algorithm2: Regular Expression DFA (directly convert a regular expression into a DFA)
Non-Deterministic Finite Automaton
(NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
• Q - a set of states
• Σ - a set of input symbols (alphabet)
• 𝞭-move – a transition function move to map state-symbol pairs to sets of states.

• q0 - a start (initial) state


• F – a set of accepting states (final states)

• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• A NFA accepts a string x, if and only if there is a path from the starting state to one of

accepting states such that edge labels along this path spell out x.
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)

• One transition per input per state

• No ε-moves

• Nondeterministic Finite Automata (NFA)


• Can have multiple transitions for one input in a given state

• Can have ε-moves

• Finite automata have finite memory


• Need only to encode the current state
143
Jeya R 144
DFA
• For every string x, there is a unique path from initial state and associated with x.

x is accepted if and only if this path ends at a final state.


NFA
• For any string x, there may exist none or more than one path from initial state

and associated with x.


A Simple Example
• A finite automaton that accepts only “1”

• A finite automaton accepts a string if we can follow transitions labeled


with the characters in the string from the start to some accepting state
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}

• Check that “1110” is accepted.


NFA
NFA
Transition Table
CS416 Compiler Design 152

Converting A Regular Expression into A NFA (Thomson’s


Construction)
• This is one way to convert a regular expression into a NFA.

• There can be other ways (much efficient) for the conversion.

• Thomson’s Construction is simple and systematic method. It


guarantees that the resulting NFA will have exactly one final state, and one start
state.

• Construction starts from simplest parts (alphabet symbols).


• To create a NFA for a complex regular expression, NFAs of its
sub-expressions are combined to create its NFA,
CS416 Compiler Design 153

Thomson’s Construction (cont.)


ε
i f
• To recognize an empty string ε

• To recognize a symbol a in the alphabet Σ a


i f

• If N(r1) and N(r2) are NFAs for regular expressions r1 and r2


• For regular expression r1 | r2

ε N(r1) ε
NFA for r1 | r2
i ε f
ε
N(r2)
Thomson’s Construction (Example - (a|b) * a)
CS416 Compiler Design 154

Thomson’s Construction (cont.)


• For regular expression r1 r2

i N(r1) N(r2) f Final state of N(r2) become


final state of N(r 1r2)

NFA for r1 r2

• For regular expression r*

ε ε
i N(r) f

NFA for
r*
CS416 Compiler Design 155

Thomson’s Construction (Example - (a|b) * a )


a (a|b) *
a:
ε

a ε
ε
ε ε
ε ε
b
b: b

(a | b)
(a|b) * a ε
a ε
ε a
ε ε
ε ε ε a
ε
b ε ε
b

ε
156
RE to NFA
(a|b) * abb ε

a ε
ε 2 3
ε ε b b 1
0 1 6 7 a
8
ε ε 9 0
4 b
5
ε
Jeya R 158

(a|b) * abb
RE to DFA –USING THOMPSONS SUBSET
C
GivO
en N
reguS
laT
r eR
xpU
ressC
ioT
n isIO
conN
verted into NFA..
Then. NFA is converted into DFA.
STEPS
1. Convert into NFA using above rules for operators (union, concatenation and closure)and precedence.
2. Find Ɛ-closure of all states.
3. Start with epsilon closure of start state of NFA.
(a|b) * abb
4. Apply the input symbols and find its epsilon closure.
Dtran[state, input symbol] = Ɛ-closure(move(state, input symbol))
where Dtran transition function of DFA

5. Analyze the output state to find whether it is a new state.


6. If new state is found, repeat step 4 and step 5 until no more new states are found.
7. Construct the transition table for Dtran function.
8. Draw the transition diagram with start state as the s-closure (start state of NFA) andfinal state is the
state that contains final state of NFA drawn.
STEP1: RE to NFA
(a|b) * abb

Jeya R 160
Step 2: Start with finding Ɛ-closure of state 0
Ɛ-closure(0) = {0,1,2,4,7} = A

Step 3: Apply input symbols a, b to A


Dtran(A, a) = Ɛ-closure(move(A, a))

= Ɛ-closure(move({0,1,2,4,7), a))= Ɛ-closure(3,8)= {3,6,7,1,2,4,8}= {1,2,3,4,6,7,8} = B


Dtran A, a=B
Dtran A, b] = Ɛ-closure( move(A,b))
= Ɛ-closure(move({0,1,2,4,7), b))
= Ɛ-closure(5)
{5,6,7,1,2,4,7}
{1,2,4,5,6,7} = C
Dtran(A,b] =C
Step 4: Apply input symbols to new state B

Dtran B, a] = Ɛ-closure (move(B, a))


= Ɛ-closure(move({1,2,3,4,6,7,8}, a))
= Ɛ-closure(3,8)
= {1,2,3,4,6,7,8} = B
Dtran [B, a] =B

Dtran(B, b) = Ɛ-closure(move(B, b))


= Ɛ-closure(move({1,2,3,4,6,7,8},b))
= Ɛ-closure(5,9)
={1,2,4,5,6,7,9} =D

Dtran(B, 6] =D
Step 5: Apply input symbols to new state C
Dtran(C, a) = Ɛ-closure(move(C, a))
= Ɛ-closure(move({1,2,4,5,6,7), a))
= Ɛ-closure(3,8)
={1,2,3,4,6,7,8}= B
Dtran(C, a] =B

Dtran [C, b) = Ɛ-closure(move(C, b))


= Ɛ-closure(move({1,2,4,5,6,7),b))
= Ɛ-closure(5)= {1,2,4,5,6,7} = C
Dtran(C, b] =C
Step 6: Apply input symbols to new state D
Dtran[ D, a] = Ɛ-closure(move(D, a))
= Ɛ-closure(move({1,2,4,5,6,7,9}, a))
= Ɛ-closure(3,8)
{1,2,3,4,6,7,8} = B
Dtran[ D, a] = B

Dtran(D, b] = Ɛ-closure(move(D,b))
= Ɛ-closure(move({1,2,4,5,6,7,9},b))
= Ɛ-closure(5, 10)= {1,2,4,5,6,7,10} = E
Dtran[ D, b] =E

Step 7: Apply input symbols to new state E Dtran [E, b] = E-closure(move(E, 6))
Dtran[ E, a) =Ɛ -closure(move(E, a))
= E-closure(move({1,2,4,5,6,7,10),b))
= Ɛ-closure(move({1,2,4,5,6,7,10), a))
= Ɛ-closure(3,8) = E-closure(5)= {1,2,4,5,6,7} = C
= {1,2,3,4,6,7,8} = B Dtran [E, b]=C
Dtran [E, a] =B
Minimization of DFA
Minimization of DFA
Minimization of DFA
Minimization of DFA
Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Regular Expression to DFA
(Direct Method)
Regular Expression to DFA
(Direct Method)- Example
• Regular Expression: (a/b)*abb
• Augmented Grammar : (a/b)*abb# =? (a/b)*.a.b.b.#
Regular Expression to DFA
(Direct Method)- Example
Computation of Nullable,
Firstpos, LastPos:
Example:
Direct Method
Direct Method
Direct Method
Direct Method
Direct Method
Direct Method
Direct method RE to DFA

(a|b)*abb(a|b)*

Find firstpos,lastpos,followpos
Draw DFA
{1,2,3} {8}
Finding followpos by rules
NFA TO DFA ( SUBSET CONSTRUCTION)

You might also like