21CSC304J-Compiler Design
WHY WE NEED COMPILER?
HOW IT IS WORKING?
Pre-requisite
• Basic knowledge of programming languages.
• Basic knowledge of FSA and CFG.
• Knowledge of a high programming language for the programming
assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers:
Principles, Techniques, and Tools” Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
– Context Free Grammars
– Top-Down Parsing, LL Parsing
– Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
– Attribute Definitions
– Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
TRANSLATOR
Translate high level language into equivalent machine language
Types
• Compiler
• Interpreter
• Assembler
COMPILERS
• A compiler is a program takes a program written in a source
language and translates it into an equivalent program in a target
language.
Source COMPILER Target
Program Program
(A program written in ( The equivalent program in
a high-level programming language) machine code – relocatable object
Error message file)
(Take correct actions to remove the errors for further compilation
process)
11
INTERPRETER
• Translates line by line
• Better error diagnostics
Source
Program
Interpreter Output
Input
Error messages
12
ASSEMBLER
Assembler is a translator which is used to translate the assembly
language code into machine language code
Assembly
Assembler Machine
language
language code
code
Language Processor
Language processor
Source Program
Pre processor
Modified source program
Compiler
Target assembly
program
Assembler
Relocatable machine
code
library files, relocatable
Linker/Loader files
Target machine code
Cousins of the Compiler
STRUCTURE OF THE COMPILER
Two Parts:
Compiler
Analysis Synthesis
Part Part
Phases of the
Compiler
Standard Compiler Structure
Source code
(character stream)
Lexical analysis
Token stream
Parsing Front end
(machine-independent)
Abstract syntax tree
Intermediate Code Generation
Intermediate code
Optimization
Intermediate code Back end
(machine-dependent)
Code generation
Assembly code
24
Translation of an Assignment
Statement
LEXICAL ANALYSIS
□ Lexical analysis
• Lexical analysis is the first phase of compiler which is also termed as scanning.
• Source program is scanned to read the stream of characters and those characters are
grouped to form a sequence called lexemes which produces token as output.
• Token: Token is a sequence of characters that represent lexical unit, which matches
with the pattern, such as keywords, operators, identifiers etc.
• Lexeme: Lexeme is instance of a token i.e., group of characters forming a token. ,
• Pattern: Pattern describes the rule that the lexemes of a token takes. It is the
structure that must be matched by strings.
• Once a token is generated the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
Example : c=a+b*5;
The primary functions of Lexical Phase:
•Identify the lexical units in a source code
•Classify lexical units into classes like constants, reserved words, and
enter them in different tables. It will Ignore comments in the source
program
•Identify token which is not a part of the language
Example:
x = y + 10
Tokens
X identifier
= Assignment operator
Y identifier
+ Addition operator
10 Number
SYNTAX ANALYSIS
□ Syntax Analysis
• Syntax analysis is the second phase of compiler which is also called as parsing.
• Constructing the parse tree with the help of tokens. It also determines the structure of
source language and grammar or syntax of the language.
• Parser converts the tokens produced by lexical analyser into a tree like representation called
parse tree.
• Syntax tree is a compressed representation of the parse tree in which the operators appear
as interior nodes and the operands of the operator are the children of the node for
that
operator.
Input: Tokens
Output: Syntax tree
Here, is a list of tasks performed in this phase:
•Obtain tokens from the lexical analyser
•Checks if the expression is syntactically correct or not
•Report all syntax errors
•Construct a hierarchical structure which is known as a parse tree
SEMANTIC ANALYSIS
□ Semantic Analysis
❖ Type conversion and Type Coercion
EXAMPLE
INTERMEDIATE CODE GENERATOR
□ Intermediate Code Generator
CODE OPTIMIZATION
□ Code Optimization
CODE GENERATOR
□ Code Generator
Symbol Table
Q1
Phases of a compiler
Translation of an assignment statement
EXERCISE NO 1
Position = initial +rate *60
Translation of Statement
EXERCISE NO 1
Grouping of Phases
Front End Back End
Front End
•Front end comprises of phases which are dependent on the input (source language) and
independent on the target machine (Target lang)
•It includes lexical and syntactic analysis, symbol table management, semantic analysis and the
generation of intermediate code.
•Code optimization can also be done by the front end. It also includes error handling at the phases
concerned.
Back End
• Dependent on the target machine and independent on the source language.
• This includes code optimization, code generation.
PASSES
• The phases of compiler can be implemented in a single pass by marking the
primary actions viz. reading of input file and writing to the output file.
• Several phases of compiler are grouped into one pass in such a way that the
operations in each and every phase are incorporated during the pass.
• Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after
lexical analysis may be translated directly into intermediate code.
Types of Compiler
•Single Pass Compilers
•Two Pass Compilers
•Multi pass Compilers
Compiler Construction Tool
1. Scanner Generator
2. Parser Generator
3. Syntax –Directed Translation engine
4. Automatic code generators
5. Data-flow analysis engines
6. Compiler-Construction toolkits
1
2
of a program to another
Role Of Lexical Analyzer
Interaction of Lexical analyzer with parser
token
Source Lexical parser To semantic
program analyzer getNexttoken() analysis
symbol
table
Attributes for Token
Lexical Analysis
• Lexical analyzer: reads input characters and produces a sequence of tokens as
output (nexttoken()).
• Trying to understand each element in a program.
• Token: a group of characters having a collective
meaning. const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, ‘pi’)
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (;, -)
Buffering Techniques
Input Buffering
• How to increase the speed of reading the
source program, character by character ?
• A specialized buffering techniques have been
developed to reduce the amount of overhead
required to process a single input character.
• There are two type of buffering scheme –
One buffer scheme and two buffer scheme
79
Cont..,
Cont..,
Cont..,
Main Concept of Input Buffering
• Two pointers to the input are maintained
• Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
• Pointer forward scans ahead until a pattern match is found;
• Sentinels
• To reload the buffer and to advance the forward pointer effectively we have
to make two tests
• one for the end of the buffer and other to determine what character is read
• This can be done in a combine way if we hold a sentinel character at the end
• A sentinel is a special character that cannot be part of the source program
(Eg: eof)
84
▪ Each buffer is of the same size N, and N is usually the size
of a disk block, e.g., 4096 bytes.
▪ Using one system read command we can read N
characters in a buffer, rather than using one system call
per character.
▪ If fewer than N characters remain in the input file, then a
Conclusion
• Lex is a tool in lexical analysis phase to recognize tokens using regular
expression.
• Lex tool itself is a lex compiler.
• lex.l is an a input file written in a language which describes the generation of
lexical analyzer. The lex compiler transforms lex.l to a C program known
as [Link].c.
• [Link].c is compiled by the C compiler to a file called [Link].
• The output of C compiler is the working lexical analyzer which takes stream of
input characters and produces a stream of tokens.
• yylval is a global variable which is shared by lexical analyzer and
parser to return the name and an attribute value of token.
• The attribute value can be numeric code, pointer to symbol table or
nothing.
• Another tool for lexical analyzer generation is Flex
Structure of Lex Programs
Lex program will be in following form
Declarations This section includes declaration of variables, constants and regular
definitions.
Translation rules It contains regular expressions and code segments.
Form : Pattern {Action}
Pattern is a regular expression or regular definition.
Action refers to segments of code.
▪Auxiliary functions This section holds additional functions which are used in
actions. These functions are compiled separately and loaded with lexical analyzer.
▪Lexical analyzer produced by lex starts its process by reading one character at
a time until a valid match for a pattern is found.
▪Once a match is found, the associated action takes place to produce token.
▪The token is then given to parser for further processing.
Conflict Resolution in Lex
❖ Conflict arises when several prefixes of input matches one or
more patterns. This can be resolved by the following:
❖ Always prefer a longer prefix than a shorter prefix.
❖ If two or more patterns are matched for the longest prefix, then
the first pattern listed in lex program is preferred.
Lookahead Operator
• Lookahead operator is the additional operator that is read by lex in order to distinguish
additional pattern for a token.
• Lexical analyzer is used to read one character ahead of valid lexeme and then retracts to
produce token.
• At times, it is needed to have certain characters at the end of input to match with a
pattern. In such cases, slash (/) is used to indicate end of part of pattern that matches
the lexeme.
❖ (eg.) In some languages keywords are not reserved. So the statements
❖ IF (I, J) = 5 and IF(condition) THEN results in conflict whether to produce IF as an array
name or a keyword. To resolve this the lex rule for keyword IF can be written as,
❖ IF/\ (.* \) { letter }
Issues in Lexical Analysis
1. Lookahead
[Link]
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denting L(r)
121
Regular Expressions
• Regular expressions are a notation to represent lexeme pattern for a token.
• We use regular expressions to describe tokens of a programming language.
• A regular expression is built up of simpler regular expressions (using defining
rules)
• Each regular expression denotes a language.
• A language denoted by a regular expression is called as a regular set.
122
Regular Expressions (Rules)
Regular expressions over alphabet Σ
Reg. Expr Language it denotes
ε {ε}
a∈ Σ {a}
(r1) | (r2) L(r1) ∪ L(r2)
(r1) (r2) L(r1) L(r2)
(r)* (L(r))*
(r) L(r)
• (r)+ = (r)(r)*
• (r)? = (r) | ε
123
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules.
• * highest
• concatenation next
• | lowest
• ab*|c means (a(b)*)|(c)
• Ex:
• Σ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {ε ,0,00,000,0000,....}
• (0|1)* => all strings with 0 and 1, including the empty string
124
Regular Definitions
• To write regular expression for some languages can be difficult, because their regular
expressions can be quite complex. In those cases, we may use regular definitions.
• We can give names to regular expressions, and we can use these names as symbols to
define other regular expressions.
• A regular definition is a sequence of the definitions of the form:
d 1 → r1 where di is a distinct name and
d 2 → r2 ri is a regular expression over symbols in Σ∪{d1,d2,...,di-1}
d n → rn
Regular definitions
d1 -> r1
If we try to write the regular expression
d2 -> r2 representing identifiers without using regular
… definitions, that regular expression will be
dn -> rn complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
• Example:
letter_ -> A | B | … | Z | a | b | … | Z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
• Ex: Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent
Extensions
• One or more instances: (r)+
• Zero or one instances: r?
• Character classes: [abc]
• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_(letter|digit)*
Recognition of tokens
• Starting point is the language grammar to understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
Digits -> digit+
number -> digit(.digits)? (E[+-]? Digit)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespaces:
ws -> (blank | tab | newline)+
131
Operations on Languages
• Concatenation:
• L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
• L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation:
• L0 = {ε} L1 = L L2 = LL
• Kleene Closure
• L* =
• Positive Closure
• L+ =
Transition diagrams
• Transition diagram for relop
The pattern are converted
into transition diagram
while constructing the
lexical analyzer
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
KEYWORD
t h e n Nonlet|dig
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer ---A recognizer for a language is a program that takes as
input a string x answers ‘yes’ if x is a sentence of the language and ‘no’
otherwise.
• A better way to convert a regular expression to a recognizer is to
construct a generalized transition diagram from the expression. This
diagram is called a finite automaton.
• Finite Automaton can be
• Deterministic
• Non-deterministic
Finite Automata
• A finite automaton consists of
• An input alphabet Σ
• A set of states S
• A start state q0
• A set of accepting states F ⊆ S
• A set of transitions state →input state
Finite Automata
• Transition
a
s1 → s2
• Is read
In state s1 on input “a” go to state s2
• If end of input
• If in accepting state => accept, otherwise => reject
• If no transition possible => reject
Finite Automata State Graphs
• A state
• The start state
• An accepting state
a
• A transition
CS416 Compiler Design 140
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of
that language, and “no” otherwise.
• We call the recognizer of the tokens as a finite automaton.
• A finite automaton can be: deterministic(DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automaton recognize regular sets.
• Which one?
• deterministic – faster recognizer, but it may take more space
• non-deterministic – slower, but it may take less space
• Deterministic automatons are widely used lexical analyzers.
• First, we define regular expressions for tokens; Then we convert them into a DFA to get a lexical
analyzer for our tokens.
• Algorithm1: Regular Expression NFA DFA (two steps: first to NFA, then to DFA)
• Algorithm2: Regular Expression DFA (directly convert a regular expression into a DFA)
Non-Deterministic Finite Automaton
(NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
• Q - a set of states
• Σ - a set of input symbols (alphabet)
• 𝞭-move – a transition function move to map state-symbol pairs to sets of states.
• q0 - a start (initial) state
• F – a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• A NFA accepts a string x, if and only if there is a path from the starting state to one of
accepting states such that edge labels along this path spell out x.
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)
• One transition per input per state
• No ε-moves
• Nondeterministic Finite Automata (NFA)
• Can have multiple transitions for one input in a given state
• Can have ε-moves
• Finite automata have finite memory
• Need only to encode the current state
143
Jeya R 144
DFA
• For every string x, there is a unique path from initial state and associated with x.
x is accepted if and only if this path ends at a final state.
NFA
• For any string x, there may exist none or more than one path from initial state
and associated with x.
A Simple Example
• A finite automaton that accepts only “1”
• A finite automaton accepts a string if we can follow transitions labeled
with the characters in the string from the start to some accepting state
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
• Check that “1110” is accepted.
NFA
NFA
Transition Table
CS416 Compiler Design 152
Converting A Regular Expression into A NFA (Thomson’s
Construction)
• This is one way to convert a regular expression into a NFA.
• There can be other ways (much efficient) for the conversion.
• Thomson’s Construction is simple and systematic method. It
guarantees that the resulting NFA will have exactly one final state, and one start
state.
• Construction starts from simplest parts (alphabet symbols).
• To create a NFA for a complex regular expression, NFAs of its
sub-expressions are combined to create its NFA,
CS416 Compiler Design 153
Thomson’s Construction (cont.)
ε
i f
• To recognize an empty string ε
• To recognize a symbol a in the alphabet Σ a
i f
• If N(r1) and N(r2) are NFAs for regular expressions r1 and r2
• For regular expression r1 | r2
ε N(r1) ε
NFA for r1 | r2
i ε f
ε
N(r2)
Thomson’s Construction (Example - (a|b) * a)
CS416 Compiler Design 154
Thomson’s Construction (cont.)
• For regular expression r1 r2
i N(r1) N(r2) f Final state of N(r2) become
final state of N(r 1r2)
NFA for r1 r2
• For regular expression r*
ε ε
i N(r) f
NFA for
r*
CS416 Compiler Design 155
Thomson’s Construction (Example - (a|b) * a )
a (a|b) *
a:
ε
a ε
ε
ε ε
ε ε
b
b: b
(a | b)
(a|b) * a ε
a ε
ε a
ε ε
ε ε ε a
ε
b ε ε
b
ε
156
RE to NFA
(a|b) * abb ε
a ε
ε 2 3
ε ε b b 1
0 1 6 7 a
8
ε ε 9 0
4 b
5
ε
Jeya R 158
(a|b) * abb
RE to DFA –USING THOMPSONS SUBSET
C
GivO
en N
reguS
laT
r eR
xpU
ressC
ioT
n isIO
conN
verted into NFA..
Then. NFA is converted into DFA.
STEPS
1. Convert into NFA using above rules for operators (union, concatenation and closure)and precedence.
2. Find Ɛ-closure of all states.
3. Start with epsilon closure of start state of NFA.
(a|b) * abb
4. Apply the input symbols and find its epsilon closure.
Dtran[state, input symbol] = Ɛ-closure(move(state, input symbol))
where Dtran transition function of DFA
5. Analyze the output state to find whether it is a new state.
6. If new state is found, repeat step 4 and step 5 until no more new states are found.
7. Construct the transition table for Dtran function.
8. Draw the transition diagram with start state as the s-closure (start state of NFA) andfinal state is the
state that contains final state of NFA drawn.
STEP1: RE to NFA
(a|b) * abb
Jeya R 160
Step 2: Start with finding Ɛ-closure of state 0
Ɛ-closure(0) = {0,1,2,4,7} = A
Step 3: Apply input symbols a, b to A
Dtran(A, a) = Ɛ-closure(move(A, a))
= Ɛ-closure(move({0,1,2,4,7), a))= Ɛ-closure(3,8)= {3,6,7,1,2,4,8}= {1,2,3,4,6,7,8} = B
Dtran A, a=B
Dtran A, b] = Ɛ-closure( move(A,b))
= Ɛ-closure(move({0,1,2,4,7), b))
= Ɛ-closure(5)
{5,6,7,1,2,4,7}
{1,2,4,5,6,7} = C
Dtran(A,b] =C
Step 4: Apply input symbols to new state B
Dtran B, a] = Ɛ-closure (move(B, a))
= Ɛ-closure(move({1,2,3,4,6,7,8}, a))
= Ɛ-closure(3,8)
= {1,2,3,4,6,7,8} = B
Dtran [B, a] =B
Dtran(B, b) = Ɛ-closure(move(B, b))
= Ɛ-closure(move({1,2,3,4,6,7,8},b))
= Ɛ-closure(5,9)
={1,2,4,5,6,7,9} =D
Dtran(B, 6] =D
Step 5: Apply input symbols to new state C
Dtran(C, a) = Ɛ-closure(move(C, a))
= Ɛ-closure(move({1,2,4,5,6,7), a))
= Ɛ-closure(3,8)
={1,2,3,4,6,7,8}= B
Dtran(C, a] =B
Dtran [C, b) = Ɛ-closure(move(C, b))
= Ɛ-closure(move({1,2,4,5,6,7),b))
= Ɛ-closure(5)= {1,2,4,5,6,7} = C
Dtran(C, b] =C
Step 6: Apply input symbols to new state D
Dtran[ D, a] = Ɛ-closure(move(D, a))
= Ɛ-closure(move({1,2,4,5,6,7,9}, a))
= Ɛ-closure(3,8)
{1,2,3,4,6,7,8} = B
Dtran[ D, a] = B
Dtran(D, b] = Ɛ-closure(move(D,b))
= Ɛ-closure(move({1,2,4,5,6,7,9},b))
= Ɛ-closure(5, 10)= {1,2,4,5,6,7,10} = E
Dtran[ D, b] =E
Step 7: Apply input symbols to new state E Dtran [E, b] = E-closure(move(E, 6))
Dtran[ E, a) =Ɛ -closure(move(E, a))
= E-closure(move({1,2,4,5,6,7,10),b))
= Ɛ-closure(move({1,2,4,5,6,7,10), a))
= Ɛ-closure(3,8) = E-closure(5)= {1,2,4,5,6,7} = C
= {1,2,3,4,6,7,8} = B Dtran [E, b]=C
Dtran [E, a] =B
Minimization of DFA
Minimization of DFA
Minimization of DFA
Minimization of DFA
Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Example-Minimization of DFA
Regular Expression to DFA
(Direct Method)
Regular Expression to DFA
(Direct Method)- Example
• Regular Expression: (a/b)*abb
• Augmented Grammar : (a/b)*abb# =? (a/b)*.a.b.b.#
Regular Expression to DFA
(Direct Method)- Example
Computation of Nullable,
Firstpos, LastPos:
Example:
Direct Method
Direct Method
Direct Method
Direct Method
Direct Method
Direct Method
Direct method RE to DFA
(a|b)*abb(a|b)*
Find firstpos,lastpos,followpos
Draw DFA
{1,2,3} {8}
Finding followpos by rules
NFA TO DFA ( SUBSET CONSTRUCTION)