0% found this document useful (0 votes)
10 views78 pages

Module 5

The document provides an overview of compilers, detailing their phases including lexical analysis, syntax analysis, and semantic analysis. It explains the role of finite state automata in lexical analysis, types of parsers in syntax analysis, and introduces concepts like tokens, patterns, and lexemes. Additionally, it covers the design and working of a lexical analyzer using tools like Lex, along with the importance of context-free grammars and directed acyclic graphs in compiler construction.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views78 pages

Module 5

The document provides an overview of compilers, detailing their phases including lexical analysis, syntax analysis, and semantic analysis. It explains the role of finite state automata in lexical analysis, types of parsers in syntax analysis, and introduces concepts like tokens, patterns, and lexemes. Additionally, it covers the design and working of a lexical analyzer using tools like Lex, along with the importance of context-free grammars and directed acyclic graphs in compiler construction.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

System Programming and

Compiler construction Subject Code


:CSC 601

Mrs. Pallavi Patil


(Computer Engineering)
Module 5. Compilers: Analysis
Phase
• Introduction to compilers, Phases of compilers:
Lexical Analysis- Role of Finite State Automata in
Lexical Analysis, Design of Lexical analyser, data
structures used .
• Syntax Analysis- Role of Context Free Grammar
in Syntax analysis, Types of Parsers: Top down
parser- LL(1), Bottom up parser- Operator
precedence parser, SLR
• Semantic Analysis, Syntax directed definitions
What is Compilers?
• A compiler is a translator that converts the
high-level language into the machine
language.
• High-level language is written by a
developer and machine language can be
understood by the processor.
• Compiler is used to show errors to the
programmer.
• In the first part, the source program
compiled and translated into the object
program (low level language).
• In the second part, object program
translated into the target program through
the assembler.
Features of a Compiler
Phases of compilers
Lexical Analysis
• Lexical Analysis is the first phase of the compiler also known as a
scanner. It converts the High level input program into a sequence of
Tokens.

• Lexical Analysis can be implemented with the Deterministic finite


Automata.
• The output is a sequence of tokens that is sent to the parser for
syntax analysis
What is a token
• What is a token? A lexical token is a sequence of characters that can
be treated as a unit in the grammar of the programming languages.
Example of tokens:

• Type token (id, number, real, . . . )


• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
• Keywords; Examples-for, while, if etc.
• Identifier; Examples-Variable name, function name, etc.
• Operators; Examples '+', '++', '-' etc.
• Separators; Examples ',' ';' etc
How a Lexical Analyzer Works
Reads the source code: It reads the program character by
character.
Groups characters into tokens:It combines characters to form
meaningful words called tokens.
Ignores unnecessary items:Removes spaces, tabs, comments, and
new lines.
Identifies token types:Each token is classified into categories like:
○ Keywords (if, while, int)
○ Identifiers (variable names)
○ Operators (+, -, =)
○ Constants (10, 5.5)
Sends tokens to the Syntax Analyzer: The tokens are passed to the
next phase of the compiler.
How Lexical Analyzer works-
• Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
• Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
• Token classification: In this stage, the lexer determines the type of each
token. For example, in a programming language, the lexer might classify
keywords, identifiers, operators, and punctuation symbols as separate token
types.
• Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
• Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation.
Compare Pattern,Lexeme and token with Example
Criteria Token Lexeme Pattern
Definition Token is basically a It is a sequence of characters in the It specifies a set of
sequence of characters that source code that are matched by rules that a scanner
are treated as a unit as it given predefined language rules follows to create a
cannot be further broken for every lexeme to be specified as token.
down. a valid token.

Interpretation of all the reserved keywords of int, goto The sequence of


type Keyword that language(main, printf, characters that make
etc.) the keyword.
Interpretation of name of a variable, main, a it must start with the
type Identifier function, etc alphabet, followed by
the alphabet or a digit.
Interpretation of all the operators are +, = +, =
type Operator considered tokens.

Interpretation of each kind of punctuation is (, ), {, } (, ), {, }


type considered a token. e.g.
Punctuation semicolon, bracket, comma,
etc.
Interpretation of a grammar rule or boolean “Welcome to GeeksforGeeks!” any string of
type Literal literal. characters (except ‘ ‘)
between ” and “
Compare Pattern,Lexeme and token with Example
Design of Lexical analyser.
Lex is a tool officially known for Lexical Analysis It's main job is to break
up an inputstream into more usable elements called as tokens. Or in, other
words, to identify the "interesting bits" in a text file. It uses regular
expression matching; typically to ‘tokenize’ the contents of the file.
working process of a LEX program step by
step.
Step 1: Lex Source Program
● You write a LEX program (for example: program.l).
● This file contains rules and patterns for token recognition.
Step 2: Lexical Compiler (LEX Tool)
● The Lex tool processes the source program.
● It generates a C file called [Link].c.
Step 3: C Compiler
● The C compiler compiles [Link].c.
● It produces an executable file called [Link].
working process of a LEX program step by
step.
Step 4: Input Stream
● When you run [Link], you give it some input
(text/program).
Step 5: Output – Sequence of Tokens
● The program reads the input.
● It breaks the input into tokens (like keywords,
identifiers, numbers, operators).
● These tokens are the output.
Structure of Lex
• Structure of Lex lex file looks like
• ...definition section...
• %%...
• rules section...
• %%...
• code section...
Definition Section
• %{Definition Section}%

• The definition section, introduces any initial C Program


code we want to copy into the final program For eg we
want header files that must be included for code later in
the file to work Lex copies the material between “%{“
and“}%” directly to the generated C file ([Link].c) The
%% marks the end of this section
• Rules Section • Code Section
• The rules section is made up of • The final section is the
two parts: A pattern An Action user subroutines section
separated by whitespace The or code section, which
lexer that lex generates will consist of any legal C
execute the action when it code.
recognizes the pattern These • Lex copies it to the C file
patterns are UNIX-style regular after the end of the lex
expressions The %% marks the generated code
end of this section • The lexer produced by lex
is a C routine called
Simple Lex Program
%{/* simple C program*/%}%% • yylex().
“Hello" {
printf
(“Welcome");}
%%
Syntax Analysis
Syntax Analysis
• Syntax Analysis or Parsing is the second phase, i.e. after lexical
analysis. It checks the syntactic structure of the given input, i.e.
whether the given input is in the correct syntax or not. It does so by
building a data structure, called a Parse tree or Syntax tree.
• The parse tree is constructed by using the predefined Grammar of the
language and the input string. If the given input string can be produced
with the help of the syntax tree the input string is found to be in the
correct syntax. if not, the error is reported by the syntax analyzer.
• Syntax analysis, also known as parsing, is a process in compiler design
where the compiler checks if the source code follows the grammatical
rules of the programming language. This is typically the second stage
of the compilation process, following lexical analysis.
• The main goal of syntax analysis is to create a parse tree or abstract
syntax tree (AST) of the source code, which is a hierarchical
representation of the source code that reflects the grammatical
structure of the program.
Type Of Parser
Basis Top-Down Parsing Bottom-Up Parsing

Meaning Starts parsing from the start symbol Starts parsing from the input string and
and tries to derive the input string. tries to reduce it to the start symbol.

Parse Tree Constructed from root to leaves. Constructed from leaves to root.

Derivation Uses Leftmost derivation. Uses Rightmost derivation in reverse.


Used

Main Idea Expands non-terminals into Reduces productions to non-terminals.


productions.

Backtracki May require backtracking. Does not require backtracking (in


ng shift-reduce parsing).

Complexity Simple and easy to understand. More powerful but slightly complex.

Grammar Cannot handle left recursion. Can handle left recursion.


Restriction

Examples Recursive Descent Parser, LL(1) Parser Shift-Reduce Parser, LR(0), SLR, LALR
Parser
Basic For Syntax Analysis(Parser)
• Context Free Grammars
• Derivation
Left-most Derivation:
Right-most Derivation:
• Left Recursive
• Left Factoring
• First and Follows
Context Free Grammars
A Context Free Grammar (CFG) is a formal grammar used to describe the syntax of
programming languages.
It is called context free because the production rules can be applied without considering the
surrounding symbols (context).
2⃣ Terminals
G = (V, T, P, S)
● Actual symbols of the language.
Where:
● Example: a, b, +, *, id
● V → Set of Variables (Non-terminals)

● T → Set of Terminals 3⃣ Production Rules


● Rules that replace non-terminals.
● P → Set of Production Rules ● Written using →
● Example: S → aSb
● S → Start Symbol
1⃣ Non-Terminals (Variables) 4⃣ Start Symbol
● Represented by capital letters. ● The symbol from which derivation
begins.
● Example: S, A, B ● Usually S
Derivation
• Derivation is a sequence of production rules. It is used to get the input
string through these production rules. During parsing we have to take
two decisions. These are as follows:
• We have to decide the non-terminal which is to be replaced.
• We have to decide the production rule by which the non-terminal will be
replaced.
• We have two options to decide which non-terminal to be replaced with
production rule.
Left-most Derivation
• In the leftmost derivation, the input is scanned and replaced with the
production rule from left to right. So in left most derivatives we read the
input string from left to right.
Right-most Derivation
• In the right most derivation, the input is scanned and replaced with the
production rule from right to left. So in right most derivatives we read
the input string from right to left.
Derivation Examples
[Link] the following [Link] the following
Grammer Grammer
S=S+S
S=S+S
S=S-S
S=S-S S = a | b |c derive a - b + c
S = a | b |c derive using RMD
a - b + c using LMD ans;
ans: S=S-S
S=S+S S=S-S+S
S=S-S+c
S=S-S+S
S=S-b+c
S=a-S+S S=a-b+c
S=a-b+S
S=a-b+c
Left Recursive
• A Grammar G (V, T, P, S) is left recursive if it has a production in the form.
• A → A α |β.
• A → A α 1|A α 2|........|β1β2..........
• The above Grammar is left recursive because the left of production is occurring at
a first position on the right side of production. It can eliminate left recursion by
replacing a pair of production with
• A → βA′
• A → αA′|ϵ
Elimination of Left Recursion
Left Recursion can be eliminated by introducing new non-terminal A such that.
This type of recursion is also called Immediate Left Recursion.
Left Recursive Examples
Ex-1 − Eliminate the left recursion from the Ex-2− Remove the left recursion from the grammar
grammar
E → E(T)|T
E → E + T|T T → T(F)|F
T → T * F|F F → id
F → (E)|id Solution
Solution:
The production after removing the left recursion will Eliminating immediate left-recursion among all Aα
be productions, we obtain

E → TE′
E → TE′
E′ → +TE′| ∈
E → (T)E′|ε
T → FT′
T → FT′
T′ →∗ FT′| ∈
T′ → (F)T′|ε
F → (E)|id
F → id
Left Factoring
Left Factoring
Left Factoring is a grammar transformation technique
used to remove common prefixes from production rules.
It is mainly used to make a grammar suitable for
Top-Down Parsing (like LL(1) parsing).
Why Left Factoring is Needed?
When two or more productions of a non-terminal begin
with the same prefix, the parser cannot decide which
production to use.
Left factoring removes this confusion.
First and Follows
FIRST(X) = Set of terminals that can appear at the beginning of strings derived from X.

FOLLOW(A) = Set of terminals that can appear immediately after A in some sentential
form. FOLLOW is defined only for non-terminals.

• Consider the production rule- Rule-01:


• A → abc / def / ghi For the start symbol S, place $
Then, we have- in Follow(S).
• First(A) = { a , d , g }
• Rules For Calculating First Rule-02:
Function-
Rule-01 For any production rule A →
• :For a production rule X → ∈, αB,
• First(X) = { ∈ } Follow(B) = Follow(A)
Rule-02:
For any terminal symbol ‘a’,
DAG (Directed Acyclic Graph)
A Directed Acyclic Graph (DAG) is a directed graph with no
cycles used in compiler design to represent expressions and
eliminate common sub-expressions for [Link] a graph that:
● Has directed edges (arrows show direction)
● Has no cycles (no path returns to the same node)
In Compiler Design, a DAG is used to represent expressions and
helps in optimization by removing repeated calculations.
● To detect common sub-expressions
● To eliminate duplicate computations
● To optimize intermediate code
DAG (Directed Acyclic Graph)
Consider the expression:
a = (b + c) * (b + c)
Step 1: Normal Expression Tree
Normally, (b + c) is calculated two times.
Step 2: DAG Representation
In DAG, (b + c) is calculated only once and reused.

(*)
/ \
(+) (+) ❌ (In tree – repeated)
/ \ / \
b cb c
DAG (Directed Acyclic Graph)
(*)
/ \
+ (same +)
/\
b c
Here:

● Node (b + c) is created once


● It is shared by multiplication
● Duplicate calculation is removed
Construct DAG for the Following Expression
X=M+P/Q-T+P/Q*Y
step 1: Identify Common Sub-Expression
In the given expression:
P/QP/Q
appears two times.
So in DAG, we will create the node for P/Q only once
and reuse it.
Step 2: Break the Expression
Expression can be written as:
X=((M+(P/Q))−T)+((P/Q)∗Y)
Construct DAG for the Following Expression
X=M+P/Q-T+P/Q*Y
(+) ← X
Step 3: Construct DAG
/ \
Leaf Nodes: (-) (*)
M, P, Q, T, Y / \ / \
Internal Nodes: (+) T (/) Y
/ \ /\
/, +, −, *, +
M (/) P Q
🔹 DAG Structure (Text / \
Form) P Q
Construct DAG for the Following Expression
X=M+P/Q-T+P/Q*Y
Important Observation
● The node (P/Q) is created only once
● It is shared by:
○ M + P/Q
○ P/Q * Y

● This removes duplicate computation


● That is why DAG is useful in code optimization
🔹 Final Answer (Short Exam Format)
A DAG is constructed by creating leaf nodes for M, P, Q, T, Y,
creating one node for P/Q, and reusing it in both places where it appears.
This eliminates the common sub-expression and optimizes computation.
Ambiguous Grammar
An ambiguous grammar is a context-free grammar (CFG) in
which a single string can have more than one parse tree (or more
than one leftmost/rightmost derivation).
Example
E→E+E
E→E*E
E → id Now take the string:id + id * id
This string can be parsed in two different ways:
1. (id + id) * id
2. id + (id * id)
So it has two different parse trees, therefore this grammar is
ambiguous.
Ambiguous Grammar
Why Ambiguity is a Problem?
● Compiler cannot decide which parse tree is correct.
● It creates confusion in syntax analysis phase.
● It may lead to wrong meaning of expressions.
🔹 How to Remove Ambiguity?
We rewrite the grammar by giving precedence and associativity:
E→E+T|T
T→T*F|F
F → id
Now * has higher precedence than +, so the grammar becomes
unambiguous.
Type Of Parser
Recursive Descent Parser
It is a kind of Top-Down Parser. A top-down parser builds
the parse tree from the top to down, starting with the start
non-terminal. A Predictive Parser is a special case of
Recursive Descent Parser, where no Backtracking is
required.
By carefully writing a grammar means eliminating left
recursion and left factoring from it, the resulting grammar
will be a grammar that can be parsed by a recursive descent
parser.
Ex. Grammer:
S---> cAd
A--->ab|a
String S=cad
Top-down parser(LL(1)
• LL(1) Parsing: Here the 1st L represents that the
scanning of the Input will be done from the Left to
Right manner and the second L shows that in this
parsing technique, we are going to use the Leftmost
Derivation Tree. And finally, the 1 represents the
number of look-ahead, which means how many
symbols are you going to see when you want to make a
decision.
• Essential conditions to check first are as follows:
• The grammar is free from left recursion.
• The grammar should not be ambiguous.
• The grammar has to be left factored in so that the
grammar is deterministic grammar.
Algorithm to construct LL(1) Parsing
Step 1:First check all the essential conditions mentioned above and
go to step2.
Step 2: Calculate First() and Follow() for all non-terminals.
First(): If there is a variable, and from that variable, if we try to
drive all the strings then the beginning Terminal Symbol is called the
First.
Follow(): What is the Terminal Symbol which follows a variable in
the process of derivation.
Step 3: For each production A –> α. (A tends to alpha)
Find First(α) and for each terminal in First(α), make entry A –> α in
the table.
If First(α) contains ε (epsilon) as terminal, then find the Follow(A)
and for each terminal in Follow(A), make entry A –> ε in the table.
If the First(α) contains ε and Follow(A) contains $ as terminal, then
make entry A –> ε in the table for the $.
Example 1: Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
Step 1: The grammar satisfies all
properties in step 1.
Step 2: Calculate first() and follow().
Find their First and Follow sets:
Step 3: Make a parser table.
Now, the LL(1) Parsing Table is:
Since each cell of the parsing table
contains at most one production,
there is no conflict.
Therefore, the grammar is LL(1).
What is Operator Precedence Parsing?
Operator Precedence Parsing is also a type of Bottom-Up Parsing that can be
used to a class of Grammars known as Operator Grammar.
A Grammar G is Operator Grammar if it has the following properties −
Production should not contain ϵ on its right side.
There should not be two adjacent non-terminals at the right side of production
Example1 − Verify whether the following Grammar is operator Grammar or
not.
E → E A E |(E)|id
A → +| − | ∗
Solution
No, it is not an operator Grammar as it does not satisfy property 2 of operator
Grammar.
As it contains two adjacent Non-terminals on R.H.S of production E → E A E.
We can convert it into the operator Grammar by substituting the value of A in E
→ E A E.
E → E + E |E − E |E * E |(E) | id.
Operator Precedence Relations
Three precedence relations exist between the pair of terminals.

Relation Meaning
p <. q p has less precedence than q.
p >. q p has more precedence than q.
p =. q p has equal precedence than q.
Depending upon these precedence Relations, we can decide which operations
will be executed or parsed first.
Association and Precedence Rules
If operators have different precedence
Since * has higher precedence than +
Example−
In a statement a + b * c
∴ + <. *
In statement a * b + c
∴∗.>+
LR Parser -LR(0)
The LR parser is an efficient bottom-up syntax analysis technique that can be used for
a large class of context-free grammar. This technique is also called LR(0) parsing.
L stands for the left to right scanning
R stands for rightmost derivation in reverse
0 stands for no. of input symbols of lookahead.
Augmented grammar :
If G is a grammar with starting symbol S, then G’ (augmented grammar for G) is a
grammar with a new starting symbol S‘ and productions S’-> .S . The purpose of this
new starting production is to indicate to the parser when it should stop parsing. The ‘ .
‘ before S indicates the left side of ‘ . ‘ has been read by a compiler and the right side
of ‘ . ‘ is yet to be read by a compiler.

Steps for constructing the LR parsing table :


Writing augmented grammar
LR(0) collection of items to be found
Defining 2 functions: goto(list of non-terminals) and action(list of terminals) in the
parsing table.
Ex. S → AA
A → aA | b
Step [Link] Augment Production and insert '•' symbol at the first position for every production in G

S` → •S
S → •AA
A → •aA
A → •b Drawing DFA:
SLR (1) Parsing
• SLR (1) refers to simple LR Parsing. It is same as LR(0) parsing. The only
difference is in the parsing [Link] construct SLR (1) parsing table, we use
canonical collection of LR (0) item.

• In the SLR (1) parsing, we place the reduce move only in the follow of left hand
side.

• Various steps involved in the SLR (1) Parsing:


• For the given input string write a context free grammar
• Check the ambiguity of the grammar
• Add Augment production in the given grammar
• Create Canonical collection of LR (0) items
• Draw a data flow diagram (DFA)
• Construct a SLR (1) parsing table
SLR ( 1 ) Grammar
S→E
E→E+T|T
T→T*F|F
F → id

Add Augment Production and insert '•' symbol at the first position for every
production in G

S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id

Drawing DFA:
CLR (1) Parsing
• CLR refers to canonical lookahead. CLR parsing use the canonical
collection of LR (1) items to build the CLR (1) parsing table. CLR (1)
parsing table produces the more number of states as compare to the
SLR (1) parsing.

• In the CLR (1), we place the reduce node only in the lookahead
symbols.
• Various steps involved in the CLR (1) Parsing:
• For the given input string write a context free grammar
• Check the ambiguity of the grammar
• Add Augment production in the given grammar
• Create Canonical collection of LR (0) items
• Draw a data flow diagram (DFA)
• Construct a CLR (1) parsing table
CLR ( 1 ) Grammar
• S → AA
• A → aA
• A→b
• Add Augment Production, insert '•' symbol at the first position for every
production in G and also add the lookahead. Drawing DFA:
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
CLR (1) Parsing table:
LALR (1) Parsing:
• LALR refers to the lookahead LR. To construct the LALR (1) parsing
table, we use the canonical collection of LR (1) items.
• In the LALR (1) parsing, the LR (1) items which have same
productions but different look ahead are combined to form a single
set of items
• LALR (1) parsing is same as the CLR (1) parsing, only difference in
the parsing table.
Example LALR ( 1 ) Grammar
• S → AA
• A → aA
• A→b
Add Augment Production, insert '•' symbol at the first position for every
production in G and also add the look ahead.
Condu...
• S` → •S, $
• S → •AA, $
• A → •aA, a/b
• A → •b, a/b
Syntax Directed Translation
• In syntax directed translation, along with the grammar we associate
some informal notations and these notations are called as semantic
rules. So we can say that
• Grammar + semantic rule = SDT (syntax directed translation)
• In syntax directed translation, every non-terminal can get one or more
than one attribute or sometimes 0 attribute depending on the type of
the attribute. The value of these attributes is evaluated by the
semantic rules associated with the production rule.
• In the semantic rule, attribute is VAL and an attribute may hold
anything like a string, a number, a memory location and a complex
record
• In Syntax directed translation, whenever a construct encounters in the
programming language then it is translated according to the semantic
rules define in that particular programming language.
Example
• Production Semantic Rules
• E→E+T [Link] := [Link] + [Link]
• E→T [Link] := [Link]
• T→T*F [Link] := [Link] + [Link]
• T→F [Link] := [Link]
• F → (F) [Link] := [Link]
• F → num [Link] := [Link]

• [Link] is one of the attributes of E.


• [Link] is the attribute returned by the lexical
analyzer.

You might also like