Compiler Design
Compiler: A system program that translates high-level source code written in one programming language
into machine code or another target language. The compilation happens before program execution,
producing an executable file that can run independently.
Operations of Compiler
It breaks source programs into smaller parts.
It enables the creation of symbol tables and intermediate representations.
It helps in code compilation and error detection.
it saves all codes and variables.
It analyses the full program and translates it.
Convert source code to machine code.
Advantages of Compiler Design
Efficiency
Portability
Error Checking
Optimizations
Translator: A program that converts code from one language to another. This includes compilers,
interpreters, and assemblers. Translators bridge the gap between human-readable code and machine-
executable instructions.
Interpreter: A program that directly executes source code line-by-line or statement-by-statement
without prior translation to machine code. It reads, analyzes, and executes code simultaneously during
runtime. A compiler scans the entire program and translates it as a whole into machine
code whereas an interpreter translates the program one statement at a time.
Phases of Compiler:
Lexical Analysis (Scanning): Breaks the source code into tokens (keywords, identifiers, operators,
literals). Removes whitespace and comments while identifying the basic building blocks of the program.
Syntax Analysis (Parsing): Checks if tokens follow the grammar rules of the programming language.
Creates a parse tree or abstract syntax tree (AST) representing the program's structure.
Semantic Analysis: Verifies semantic correctness, including type checking, scope resolution, and
declaration validation. Ensures the program follows language semantics beyond just syntax.
Intermediate Code Generation: Produces an intermediate representation (IR) of the source code that is
independent of the target machine. This facilitates optimization and portability.
Code Optimization: Improves the intermediate code for better performance by eliminating
redundancies, reducing execution time, and minimizing memory usage while preserving program
semantics.
Code Generation: Converts optimized intermediate code into target machine code or assembly language
specific to the target processor architecture.
Symbol Table Management: Maintains information about identifiers (variables, functions, classes)
throughout compilation, including their types, scope, and memory locations.
Error Handling: Detects and reports errors at each phase, providing meaningful error messages to help
developers identify and fix issues.
Bootstrapping: Bootstrapping is an important technique in compiler design, where a basic compiler
is used to create a more advanced version of itself. This process helps in building compilers for new
programming languages and improving the ones already in use.
step-by-step look at how bootstrapping works in compiler design:
Step 1: Start with a Basic Compiler
The first step is to create a basic compiler that can handle the most essential features of a
programming language.
Step 2: Use the Basic Compiler to Create a More Advanced Version
Once the basic compiler is ready, it is used to compile a more advanced version of itself.
Step 3: Gradually Improve the Compiler
With each new version, the compiler becomes more capable. The process is repeated, and each
iteration adds more features, making the compiler stronger and more efficient.
Advantages
Improved Efficiency
Portability
Reduced Dependency
Finite Automata in Lexical
Analysis
Definition:
Finite Automata (FA) are mathematical models used in lexical
analyzers to recognize patterns in input strings. They consist of
states, transitions, and accept/reject mechanisms to identify valid
tokens according to specified rules.
🔄 Finite Automata Processing Flow
Input String → Start State → Read Character → State Transition → Check
Accept State → Token/Reject
Advantages:
Efficient pattern matching with linear time complexity
Memory efficient - requires minimal storage
Deterministic behavior ensures consistent results
Easy to implement in hardware or software
Mathematical foundation provides formal verification
Disadvantages:
Limited to regular languages only
Cannot handle nested structures or context-sensitive patterns
State explosion problem for complex patterns
Difficult to modify once constructed
Cannot count or remember arbitrary amounts of information
Uses:
Keyword recognition in programming languages
Identifier and number pattern matching
String literal processing
Comment detection and removal
Operator and delimiter recognition
Input Processing in Lexical
Analysis
Definition:
Input processing involves reading and managing the source code
stream, handling character-by-character analysis, and maintaining
position information for error reporting and token generation.
📥 Input Processing Flow
Source File→ Buffer Loading → Character Reading → Position
Tracking → Lookahead → Token Formation
Advantages:
Efficient memory usage through buffering
Supports lookahead for complex token recognition
Maintains line and column information for debugging
Handles different character encodings
Can process large files without loading entirely into memory
Disadvantages:
I/O operations can be performance bottlenecks
Buffer management complexity
Character encoding issues
End-of-file handling complications
Memory overhead for position tracking
Uses:
Reading source code from files or streams
Preprocessing directives handling
Multi-file compilation support
Interactive compiler environments
IDE integration for real-time analysis
Recognition of Tokens
Definition:
Token recognition is the process of identifying and classifying
meaningful sequences of characters in source code into predefined
categories like keywords, identifiers, operators, and literals.
Token Recognition Flow
Character Stream → Pattern Matching → Longest Match → Token
Classification → Attribute Assignment → Token Output
Token Categories Example: Keywords: if, while, for,
class, public Identifiers: variable_name, functionName,
Class1 Operators: +, -, *, /, ==, !=, &&, || Literals:
123, 3.14, "hello", 'a', true Delimiters: ;, {, }, (, ),
[, ], ,
Advantages:
Separates lexical concerns from syntax analysis
Simplifies parser design and implementation
Enables efficient symbol table management
Supports different token types with attributes
Facilitates error recovery and reporting
Disadvantages:
Ambiguity resolution can be complex
Context-sensitive tokens are challenging
Requires careful precedence handling
May need backtracking for some patterns
Token attribute management overhead
🎯 Uses:
Programming language compilation
Text processing and analysis tools
Configuration file parsing
Query language processing
Data format validation
LEX: A Lexical Analyzer
Generator
Definition:
LEX is a tool that automatically generates lexical analyzers from
regular expression specifications. It takes pattern-action rules as
input and produces C code for a scanner that recognizes those
patterns.
⚙️LEX Processing Flow
LEX Specification → LEX Compiler → [Link].c → C Compiler → Executable
Scanner → Token Stream
LEX Specification Example: %{ #include <stdio.h> %} %%
[0-9]+ { printf("NUMBER: %s\n", yytext); } [a-zA-Z_][a-
zA-Z0-9_]* { printf("IDENTIFIER: %s\n", yytext); }
"if"|"while"|"for" { printf("KEYWORD: %s\n", yytext); } [
\t\n]+ { /* ignore whitespace */ } . { printf("UNKNOWN:
%s\n", yytext); } %% int main() { yylex(); return 0; }
Advantages:
Rapid development of lexical analyzers
Regular expression support for pattern specification
Automatic DFA generation and optimization
Integration with YACC parser generator
Well-tested and reliable code generation
Handles complex tokenization rules automatically
Disadvantages:
Generated code may not be optimal for all cases
Limited debugging capabilities
Platform dependency (primarily UNIX-based)
Learning curve for specification syntax
Conflicts resolution can be non-intuitive
Limited error reporting customization
🎯 Uses:
Compiler front-end development
Text processing applications
Configuration file parsers
Log file analyzers
Data extraction tools
Protocol analyzers and validators
Error Handling in Lexical Analysis
Definition:
Error handling in lexical analysis involves detecting, reporting, and
recovering from lexical errors such as invalid characters, malformed
tokens, and unterminated strings while maintaining compilation
continuity.
🚨 Error Handling Flow
Error Detection→ Error Classification → Error Reporting → Recovery
Strategy → Continue Analysis → Error Summary
Common Lexical Errors: • Invalid Characters: @#$% (in
languages that don't support them) • Unterminated
Strings: "hello world • Invalid Numbers: 3.14.15, 0x,
123abc • Comment Errors: /* unterminated comment • Escape
Sequences: "\q" (invalid escape)
Advantages:
Early error detection saves compilation time
Clear error messages improve developer experience
Recovery mechanisms allow continued analysis
Position tracking aids in debugging
Prevents cascading errors in later phases
Disadvantages:
Complex error recovery logic
May mask subsequent errors
Recovery can lead to incorrect token streams
Performance overhead for error checking
Difficult to balance error detail vs. clarity
Uses:
Compiler error reporting systems
IDE real-time error highlighting
Code quality tools and linters
Educational programming environments
Automated testing and validation tools
Summary & Integration
Integration Overview:
All these components work together to form a complete lexical
analysis system. Finite automata provide the theoretical foundation,
input processing manages the source stream, token recognition
identifies meaningful units, LEX automates the implementation, and
error handling ensures robustness.
🔄 Complete Lexical Analysis System
Phase 1: Source Code → Input Processing → Character Stream
Phase 2: Character Stream → Finite Automata → Pattern Matching
Phase 3: Pattern Matching → Token Recognition → Classified Tokens
Phase 4: Error Detection → Error Handling → Recovery/Reporting
Output: Token Stream → Parser → Syntax Analysis
🎯 Key Takeaways:
Lexical analysis is the first and fundamental phase of compilation
Finite automata provide efficient pattern recognition capabilities
Proper error Syntax-directed translation
Syntax Trees: Tree structures representing the grammatical structure of source code where nodes are
operators/constructs and leaves are operands/variables rob
A syntax tree (or parse tree) is an abstract representation of the source code's structure,
showing how grammar rules are applied.
Internal nodes represent operators or non-terminals.
Leaf nodes represent operands, tokens, or terminals.
Construction of syntax trees: b
Syntax Trees: Tree data structures that represent the hierarchical grammatical structure of
source code, where internal nodes represent operations/constructs and leaf nodes represent
operands/identifiers.
Detailed Explanation
What are Syntax Trees?
Syntax trees (also called parse trees or abstract syntax trees) are tree representations of the
syntactic structure of source code according to a formal grammar. They show how the grammar
rules are applied to generate the program.
Components:
Root: Start symbol of the grammar
Internal Nodes: Non-terminals (operators, statements, expressions)
Leaf Nodes: Terminals (identifiers, constants, keywords)
Edges: Represent derivation relationships
Construction Process Flowchart
Input Expression/Code
↓
Parse according to Grammar Rules
↓
Identify Operators and Operands
↓
Create Internal Nodes (Operators)
↓
Create Leaf Nodes (Operands)
↓
Connect based on Precedence/Associativity
↓
Complete Syntax Tree
Example Construction
Input: a + b * c
Grammar Rules:
E→E+T|T
T→T*F|F
F → id | (E)
Step-by-step Construction:
1. Start with expression a + b * c
2. Apply precedence (* higher than +)
3. Build tree bottom-up:
+
/\
a *
/\
b c
Types of Syntax Trees
1. Concrete Syntax Tree (Parse Tree)
Shows exact grammar derivation
Includes all grammar symbols
More detailed but larger
2. Abstract Syntax Tree (AST)
Simplified version
Removes unnecessary details
More compact and efficient
Construction Methods
Bottom-Up Construction
1. Start with tokens
2. Combine using grammar rules
3. Build upward to root
Top-Down Construction
1. Start with start symbol
2. Expand using productions
3. Build downward to leaves
Advantages (Simple Words)
Visual: Easy to understand program structure
Hierarchical: Shows operator precedence clearly
Evaluable: Can be directly evaluated
Analyzable: Supports semantic analysis
Transformable: Easy to modify and optimize
Disadvantages (Simple Words)
Memory-intensive: Requires significant storage
Complex: Difficult for large programs
Overhead: Extra processing time needed
Space-consuming: Takes more memory than linear forms
Uses (Simple Words)
Compilers: Code generation and optimization
Interpreters: Direct execution of programs
Analyzers: Static analysis tools
IDEs: Syntax highlighting and error detection
Refactoring: Code transformation tools
Debuggers: Program structure visualization
Construction Algorithm (Simplified)
1. Tokenize input source code
2. For each token/expression:
- If operand: Create leaf node
- If operator: Create internal node
3. Connect nodes based on:
- Grammar rules
- Operator precedence
- Associativity rules
4. Build tree recursively
5. Return root node
Practical Example
Code: if (x > 0) y = x + 1;
Syntax Tree:
if-statement
/ | \
> = ;
/\ /\
x 0 y +
/\
x 1
Memory Representation
Each node typically contains:
Node Type: Operator, identifier, constant
Value: For leaf nodes
Children Pointers: Links to child nodes
Attributes: Type information, line numbers
Syntax trees are fundamental in compiler construction as they provide a structured representation
that bridges the gap between textual source code and executable machine code, enabling various
compiler phases like semantic analysis, optimization, and code generation to work effectively.
[Link] Synthesized Attributes Inherited Attributes
An attribute is said to be Synthesized An attribute is said to be Inherited
attribute if its parse tree node value is attribute if its parse tree node value is
determined by the attribute value at child determined by the attribute value at parent
1. nodes. and/or siblings node.
The production must have non-terminal The production must have non-terminal as
2. as its head. a symbol in its body.
A synthesized attribute at node n is A Inherited attribute at node n is defined
defined only in terms of attribute values only in terms of attribute values of n's
3. at the children of n itself. parent, n itself, and n's siblings.
It can be evaluated during a single top-
It can be evaluated during a single down and sideways traversal of parse
4. bottom-up traversal of parse tree. tree.
5. Synthesized attributes can be contained Inherited attributes can't be contained by
both, It is only contained by non-
by both the terminals or non-terminals. terminals.
Synthesized attribute is used by both S- Inherited attribute is used by only L-
6. attributed SDT and L-attributed SDT. attributed SDT.
7.
Top-down translation: Top-Down Translation is a method in compiler design where the syntax
tree and corresponding intermediate code are generated during a top-down parse (like a
recursive descent parser).
Key Characteristics:
Direction: Root to leaves expansion
Prediction: Uses lookahead to decide productions
Simultaneous: Parsing and translation happen together
Recursive: Often implemented using recursive procedures
Advantages (Simple Words)
Intuitive: Natural programming approach
Predictive: No backtracking needed
Controllable: Easy to add semantic actions
Flexible: Can handle various translation tasks
Debuggable: Easy to trace execution
Efficient: Linear time complexity
Disadvantages (Simple Words)
Left-recursive: Cannot handle left recursion directly
Backtracking: Some grammars need backtracking
Limited: Not all grammars are suitable
Predictive-only: Requires LL(k) grammars
Lookahead: May need multiple symbol lookahead
Grammar-dependent: Grammar must be specially designed
Algorithm Steps
1. Initialize: Set up parsing environment
2. Predict: Choose production based on lookahead
3. Expand: Replace non-terminal with production body
4. Translate: Execute semantic actions
5. Recurse: Process child nodes
6. Match: Verify terminal symbols
7. Return: Pass results upward
S->AS{PRINTF(1)} INPUT:aadbc
S->AB{PRINTF(2)}
A->a{PRINTF(3)}
B->bC{PRINTF(4)}
B->dB{PRINTF(5)}
C->c{PRINTF(6)}
Top down approach
/ \ \
A S 1
/ \ / \ \
a 3A B 2
/ \ \
d B 5
/ \ \
b C 4
/ \
c 6
Output:3364521
Bottom up approach
/ \
A S
/ / \
a A B
/ \
d B
/ \
b C
Intermediate Code Forms Using Postfix Notation
Simple Definition
Postfix Notation: A way of writing expressions where operators come after their operands (e.g.,
a b + instead of a + b). It's used as an intermediate code form because it's easier for machines
to evaluate and doesn't need parentheses.
What is Postfix Notation?
Regular vs Postfix
Infix: a + b * c (operator between operands)
Postfix: a b c * + (operator after operands)
Why Use Postfix in Compilers?
No parentheses needed
Easy to evaluate using a stack
Unambiguous operator precedence
Efficient for code generation
Conversion Process Flowchart
Infix Expression
↓
Parse with Operator Precedence
↓
Use Stack for Operators
↓
Output Operands Immediately
↓
Pop Operators by Precedence
↓
Postfix Expression Generated
Algorithm for Infix to Postfix Conversion
Steps:
1. Scan infix expression left to right
2. Operands: Output directly
3. Operators: Push to stack based on precedence
4. Parentheses: Handle special cases
5. End: Pop remaining operators
Precedence Rules:
Highest: ( )
High: */%
Medium: + -
Low: = (assignment)
Examples
Example 1: Simple Expression
Input: a + b * c
Step-by-step:
Symbol | Stack | Output
-------|-------|--------
a |[] |a
+ | [+] | a
b | [+] | a b
* | [+*] | a b
c | [+*] | a b c
End |[] |abc*+
Result: a b c * +
Example 2: With Parentheses
Input: (a + b) * c
Step-by-step:
Symbol | Stack | Output
-------|-------|--------
( | [(] |
a | [(] | a
+ | [(+] | a
b | [(+] | a b
) |[] |ab+
* | [*] | a b +
c | [*] | a b + c
End |[] |ab+c*
Result: a b + c *
Evaluation of Postfix Expression
Algorithm:
1. Scan postfix expression left to right
2. Operand: Push to stack
3. Operator: Pop two operands, compute, push result
4. End: Stack contains final result
Example Evaluation:
Expression: a b c * + where a=2, b=3, c=4
Symbol | Stack Operation | Stack
-------|----------------|-------
a | push 2 | [2]
b | push 3 | [2,3]
c | push 4 | [2,3,4]
* | pop 4,3 → 3*4=12, push 12 | [2,12]
+ | pop 12,2 → 2+12=14, push 14 | [14]
Result: 14
Intermediate Code Generation
Three-Address Code from Postfix
Postfix: a b c * +
Three-Address Code:
t1 = b * c
t2 = a + t1
Assembly Code Generation
Postfix: a b +
Assembly:
LOAD a
LOAD b
ADD
STORE result
Advantages (Simple Words)
Stack-friendly: Easy evaluation using stack
Unambiguous: No precedence confusion
Parentheses-free: No brackets needed
Efficient: Fast evaluation
Machine-oriented: Good for code generation
Disadvantages (Simple Words)
Human-unfriendly: Hard to read
Debugging: Difficult to trace
Complex-expressions: Long for nested expressions
Learning-curve: Takes time to understand
Uses (Simple Words)
Compilers: Intermediate code representation
Calculators: HP calculators use RPN
Interpreters: Easy expression evaluation
Code-generators: Assembly code production
Optimization: Easy to optimize expressions
DAG (Directed Acyclic Graph)
Simple Definition
DAG: A tree-like structure where nodes can have multiple parents, used in compilers to
represent expressions efficiently by sharing common subexpressions. Unlike trees, DAGs
eliminate duplicate computations by reusing nodes.
What is DAG?
Basic Concept:
Directed: Edges have direction (parent → child)
Acyclic: No cycles or loops
Graph: Nodes connected by edges
Shared Nodes: Common subexpressions share same node
Difference from Syntax Tree:
Syntax Tree: Each subexpression has separate node
DAG: Common subexpressions share single node
DAG Construction Flowchart
Expression Input
↓
Parse into Subexpressions
↓
Check for Existing Nodes
↓
Reuse if Found, Create if New
↓
Connect with Directed Edges
↓
Optimized DAG Structure
Example Construction
Input Expression: a + a * (b - c) + (b - c) * d
Step-by-step DAG Construction:
1. Parse subexpressions:
o a
o b - c (appears twice)
o a * (b - c)
o (b - c) * d
Final addition
o
2. DAG Structure:
+
/|\
+*d
/| |/
a * (b-c) ← Shared node
/| /|\
a (b-c) b - c
/|\
b-c
Benefit: (b - c) computed only once, not twice!
Construction Algorithm
Steps:
1. Initialize: Create empty DAG
2. For each operand: Check if node exists
o If exists: Reuse node
o If new: Create node
3. For each operator: Create operator node
4. Connect: Link operator to operand nodes
5. Optimize: Merge identical subexpressions
Algorithm Pseudocode:
function buildDAG(expression) {
nodeTable = empty hashtable
for each subexpression in expression {
key = computeKey(subexpression)
if (nodeTable contains key) {
node = nodeTable[key] // Reuse
} else {
node = createNode(subexpression)
nodeTable[key] = node // Store
}
connectToParents(node)
}
return rootNode
}
Types of Nodes in DAG
1. Leaf Nodes
Variables: a, b, c
Constants: 123, 3.14, "hello"
2. Internal Nodes
Operators: +, -, *, /, %
Functions: sin, cos, sqrt
Assignments: =
3. Root Node
Final result of expression
Advantages (Simple Words)
Space-efficient: Less memory usage
Optimization-friendly: Easy to spot common subexpressions
Code-quality: Better generated code
Elimination: Removes redundant computations
Sharing: Reuses computed values
Compact: Smaller representation
Disadvantages (Simple Words)
Complex: Harder to construct than trees
Maintenance: Difficult to modify
Debugging: Complex to trace
Implementation: More coding required
Memory-management: Pointer complexity
Analysis: Harder static analysis
Uses (Simple Words)
Compilers: Expression optimization
Interpreters: Efficient evaluation
Optimizers: Common subexpression elimination
Calculators: Avoiding recalculation
Code-generators: Better assembly code
Analyzers: Program analysis tools
Practical Example
Expression: x = a + b; y = a + b + c;
Without DAG (Syntax Trees):
Tree 1: = Tree 2: =
/\ /\
x + y +
/\ /\
a b + c
/\
a b
Computation: a + b calculated twice
With DAG:
= =
/\ /\
x \ y +
\ /\
+ ←----+ c
/\
a b
Computation: a + b calculated once, reused
Three Address Code
Simple Definition
Three Address Code: An intermediate code form where each instruction has at most three
addresses (two operands and one result). It's like simplified assembly language that's easier for
compilers to work with than complex expressions.
What is Three Address Code?
Basic Format:
result = operand1 operator operand2
Examples:
t1 = a + b (addition)
t2 = x * y (multiplication)
z = t1 - t2 (using temporary variables)
Key Features:
At most 3 addresses per instruction
One operator per instruction
Temporary variables (t1, t2, t3...) for intermediate results
Linear sequence of instructions
Three Address Code Generation Flowchart
High-Level Expression
↓
Parse Expression Tree/DAG
↓
Break Complex Operations
↓
Generate Temporary Variables
↓
Create Linear Instructions
↓
Three Address Code Sequence
Types of Three Address Instructions
1. Assignment Statements
x = y op z (binary operation)
x = op y (unary operation)
x=y (copy)
2. Control Flow
goto L (unconditional jump)
if x goto L (conditional jump)
if x op y goto L (conditional with operation)
3. Procedure Calls
call p, n (call procedure p with n parameters)
return y (return with value)
param x (parameter passing)
4. Array Operations
x = y[i] (array access)
x[i] = y (array assignment)
5. Pointer Operations
x = &y (address of)
x = *y (dereference)
*x = y (indirect assignment)
Example Conversions
Example 1: Simple Expression
Input: a = b + c * d
Three Address Code:
t1 = c * d
t2 = b + t1
a = t2
Example 2: Complex Expression
Input: a = (b + c) * (d - e)
Three Address Code:
t1 = b + c
t2 = d - e
t3 = t1 * t2
a = t3
TAC for Control Structures, Triples, Quadruples, and
Boolean Expressions
Simple Definitions
TAC for Control Structures: Three Address Code representation of if-else, loops, and switch
statements using labels and conditional jumps.
Triples: TAC representation using 3 fields - operator and two operand positions (no explicit
result field).
Quadruples: TAC representation using 4 fields - operator, two operands, and result.
Boolean Expressions: Logical expressions (AND, OR, NOT) converted to TAC with short-
circuit evaluation.
1. TAC for Various Control Structures
IF-ELSE Statement
Input Code:
c
if (a > b)
x = y + z;
else
x = y - z;
TAC:
if a <= b goto L1
t1 = y + z
x = t1
goto L2
L1: t2 = y - z
x = t2
L2: // continue
Flowchart:
Condition Check → True Branch → False Branch → Merge Point
WHILE Loop
Input Code:
c
while (i < n) {
sum = sum + i;
i = i + 1;
}
TAC:
L1: if i >= n goto L2
t1 = sum + i
sum = t1
t2 = i + 1
i = t2
goto L1
L2: // exit loop
FOR Loop
Input Code:
c
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}
TAC:
i=0
L1: if i >= n goto L3
t1 = 4 * i
t2 = b[t1]
t3 = c[t1]
t4 = t2 + t3
a[t1] = t4
t5 = i + 1
i = t5
goto L1
L3: // exit loop
SWITCH Statement
Input Code:
c
switch (x) {
case 1: y = a; break;
case 2: y = b; break;
default: y = c;
}
TAC:
if x == 1 goto L1
if x == 2 goto L2
goto L3
L1: y = a
goto L4
L2: y = b
goto L4
L3: y = c
L4: // continue
2. Representing TAC using Triples and Quadruples
Example Expression: a = b + c * d
Quadruples (4 fields: op, arg1, arg2, result)
Index Op Arg1 Arg2 Result
0 * c d t1
1 + b t1 t2
2 = t2 a
Structure:
cpp
struct Quadruple {
string op; // operator
string arg1; // first operand
string arg2; // second operand
string result; // result
};
Triples (3 fields: op, arg1, arg2)
Index Op Arg1 Arg2
0 * c d
1 + b (0)
2 = (1) a
Structure:
cpp
struct Triple {
string op; // operator
string arg1; // first operand
string arg2; // second operand
// Result is implicit (instruction index)
};
Comparison: Triples vs Quadruples
Advantages:
Triples:
Space-efficient: Uses less memory
Compact: No explicit result field
Index-based: References by position
Quadruples:
Clear: Explicit result field
Relocatable: Easy to reorder
Readable: Self-contained instructions
Disadvantages:
Triples:
Reordering-difficult: Hard to optimize
Index-dependent: Position matters
Complex: Indirect references
Quadruples:
Space-intensive: More memory needed
Redundant: Extra result field
Temporaries: Many temp variables
3. Boolean Expressions
Simple Boolean Expression
Input: a && b || c
TAC with Short-Circuit:
if a == false goto L1
if b == true goto L2
L1: if c == true goto L2
t1 = false
goto L3
L2: t1 = true
L3: // result in t1
Complex Boolean Expression
Input: (a > b) && (c < d) || (e == f)
TAC:
if a <= b goto L1
if c >= d goto L1
goto L2
L1: if e != f goto L3
L2: t1 = true
goto L4
L3: t1 = false
L4: // result in t1
Boolean Expression Flowchart:
First Condition → Short-Circuit Check → Second Condition → Final Result
4. Control Structures with Boolean Expressions
Nested IF-ELSE
Input Code:
c
if (a > b && c < d) {
if (x == y)
z = 1;
else
z = 2;
} else {
z = 3;
}
TAC:
if a <= b goto L3
if c >= d goto L3
if x != y goto L2
z=1
goto L4
L2: z = 2
goto L4
L3: z = 3
L4: // continue
WHILE with Complex Condition
Input Code:
c
while (i < n && array[i] != 0) {
sum += array[i];
i++;
}
TAC:
L1: if i >= n goto L3
t1 = 4 * i
t2 = array[t1]
if t2 == 0 goto L3
t3 = sum + t2
sum = t3
t4 = i + 1
i = t4
goto L1
L3: // exit loop
Implementation Examples
Quadruple Generation:
cpp
class QuadrupleGenerator {
vector<Quadruple> quads;
int tempCounter = 1;
string newTemp() {
return "t" + to_string(tempCounter++);
}
void emit(string op, string arg1, string arg2, string result) {
quads.push_back({op, arg1, arg2, result});
}
string genExpression(Node* node) {
if (node->isLeaf()) {
return node->name;
}
string left = genExpression(node->left);
string right = genExpression(node->right);
string temp = newTemp();
emit(node->op, left, right, temp);
return temp;
}
};
Triple Generation:
cpp
class TripleGenerator {
vector<Triple> triples;
int emit(string op, string arg1, string arg2) {
triples.push_back({op, arg1, arg2});
return [Link]() - 1; // Return index
}
string genExpression(Node* node) {
if (node->isLeaf()) {
return node->name;
}
string left = genExpression(node->left);
string right = genExpression(node->right);
int index = emit(node->op, left, right);
return "(" + to_string(index) + ")";
}
};
Memory Representations
Quadruple Table:
┌───┬────┬──────┬──────┬────────┐
│ # │ Op │ Arg1 │ Arg2 │ Result │
├───┼────┼──────┼──────┼────────┤
│0│+ │a │b │ t1 │
│ 1 │ * │ t1 │ c │ t2 │
│ 2 │ = │ t2 │ │x │
└───┴────┴──────┴──────┴────────┘
Triple Table:
┌───┬────┬──────┬──────┐
│ # │ Op │ Arg1 │ Arg2 │
├───┼────┼──────┼──────┤
│0│+ │a │b │
│ 1 │ * │ (0) │ c │
│ 2 │ = │ (1) │ x │
└───┴────┴──────┴──────┘
Uses in Compiler Design
Advantages (Simple Words):
Optimization-friendly: Easy to apply optimizations
Machine-independent: Portable representation
Analysis-suitable: Good for data flow analysis
Debugging: Step-by-step execution tracking
Disadvantages (Simple Words):
Verbose: Many instructions generated
Temporaries: Lots of temporary variables
Complexity: Simple code becomes multi-step
Uses (Simple Words):
Compilers: Intermediate code generation
Optimizers: Code optimization phase
Interpreters: Direct execution engine
Analyzers: Program analysis tools
This comprehensive approach to TAC representation provides the foundation for efficient
compilation, optimization, and code generation in modern compiler systems.
develop Runtime environments
1. Storage Allocation
Definition:
Management of memory space for program variables, constants, and temporary data during
execution.
Storage Areas:
Code Area (Text Segment)
Content: Machine instructions
Properties: Read-only, shared
Size: Fixed at compile time
Static/Global Area
Content: Global variables, constants
Properties: Fixed location, entire program lifetime
Size: Known at compile time
Stack Area
Content: Local variables, parameters, return addresses
Properties: LIFO, automatic management
Size: Dynamic, grows/shrinks with calls
Heap Area
Content: Dynamic memory allocation
Properties: Manual management, fragmentation
Size: Dynamic, programmer controlled
Storage Allocation Flowchart:
Program Start → Initialize Static Area → Stack Frame Creation →
Local Variable Allocation → Execution → Stack Frame Destruction → Program End
Advantages (Simple Words):
Organized: Clear memory layout
Efficient: Optimized access patterns
Automatic: Stack management handled automatically
Predictable: Known memory requirements
Disadvantages (Simple Words):
Limited: Stack size constraints
Fragmentation: Heap memory fragmentation
Overhead: Management overhead
Complexity: Multiple allocation strategies
Uses (Simple Words):
Compilers: Memory layout planning
Runtime-systems: Memory management
Operating-systems: Process memory management
Garbage-collectors: Automatic memory cleanup
2. Storage Allocation Strategies
Static Allocation
Characteristics:
Memory allocated at compile time
Fixed locations throughout execution
No runtime allocation overhead
Example:
c
int global_var = 10; // Static allocation
static int count = 0; // Static allocation
Advantages:
Fast: No allocation overhead
Simple: Direct addressing
Predictable: Known memory layout
Disadvantages:
Inflexible: Cannot change size
Wasteful: Memory always allocated
Limited: No dynamic structures
Stack Allocation
Characteristics:
Memory allocated on procedure calls
LIFO (Last In, First Out) order
Automatic deallocation on return
Example:
c
void function() {
int local_var; // Stack allocation
char buffer[100]; // Stack allocation
}
Stack Growth:
High Memory
┌─────────────┐
│ Free │
├─────────────┤ ← Stack Pointer
│ Function C │
├─────────────┤
│ Function B │
├─────────────┤
│ Function A │
├─────────────┤
│ Main │
└─────────────┘
Low Memory
Dynamic/Heap Allocation
Characteristics:
Memory allocated at runtime
Programmer controlled
Explicit allocation/deallocation
Example:
c
int* ptr = malloc(sizeof(int) * 100); // Heap allocation
free(ptr); // Explicit deallocation
3. Heap Management
Definition:
Management of dynamically allocated memory including allocation, deallocation, and garbage
collection.
Heap Management Flowchart:
Allocation Request → Find Free Block → Split if Necessary →
Return Pointer → Program Use → Deallocation → Merge Adjacent Blocks
Allocation Strategies:
First Fit
Method: Use first block that fits
Advantage: Fast allocation
Disadvantage: External fragmentation
Best Fit
Method: Use smallest block that fits
Advantage: Minimizes waste
Disadvantage: Slow search, fragmentation
Worst Fit
Method: Use largest available block
Advantage: Large remaining blocks
Disadvantage: Wastes large blocks
Fragmentation Types:
Internal Fragmentation
Cause: Allocated block larger than needed
Example: Request 100 bytes, get 128 bytes
Solution: Better size classes
External Fragmentation
Cause: Free blocks too small
Example: Many small scattered free blocks
Solution: Compaction, coalescing
Garbage Collection:
Mark and Sweep
Phase 1 (Mark): Traverse reachable objects from roots
Phase 2 (Sweep): Deallocate unmarked objects
Copying Collection
From-Space → Scan Live Objects → Copy to To-Space → Swap Spaces
Advantages (Simple Words):
Flexible: Dynamic memory allocation
Efficient: Memory reuse
Automatic: Garbage collection available
Scalable: Handles varying memory needs
Disadvantages (Simple Words):
Fragmentation: Memory becomes fragmented
Overhead: Management overhead
Unpredictable: Allocation time varies
Leaks: Memory leak possibilities
Uses (Simple Words):
Dynamic-structures: Lists, trees, graphs
Object-oriented: Object allocation
Interpreters: Runtime object creation
Databases: Buffer management
4. Activation Records
Definition:
Stack frame containing all information needed for a procedure call including parameters, local
variables, and control information.
Activation Record Structure:
┌─────────────────┐ ← Frame Pointer (FP)
│ Return Address │
├─────────────────┤
│ Saved Registers │
├─────────────────┤
│ Parameters │
├─────────────────┤
│ Local Variables │
├─────────────────┤
│ Temporaries │
└─────────────────┘ ← Stack Pointer (SP)
Components:
Return Address
Purpose: Where to return after procedure
Size: Typically one word
Usage: Restored on procedure exit
Parameters
Purpose: Arguments passed to procedure
Order: Usually reverse order
Access: Via frame pointer offset
Local Variables
Purpose: Procedure's local data
Allocation: Space reserved at call
Access: Via frame pointer offset
Temporaries
Purpose: Intermediate computations
Allocation: Dynamic during execution
Management: Compiler responsibility
Activation Record Creation Flowchart:
Procedure Call → Push Parameters → Save Return Address →
Allocate Locals → Execute Body → Restore State → Return
Example:
c
int factorial(int n) { // Parameter: n
int result; // Local variable: result
if (n <= 1) // Temporary for comparison
return 1;
else
return n * factorial(n-1);
}
Activation Record:
┌─────────────────┐
│ Return Address │ (where to return)
├─────────────────┤
│ Parameter: n │ (input value)
├─────────────────┤
│ Local: result │ (local variable)
├─────────────────┤
│ Temporaries │ (n-1, comparison result)
└─────────────────┘
5. Accessing Local and Non-Local Names
Block Structured Languages:
Languages where procedures can be nested, creating scope hierarchies.
Example:
pascal
program Main;
var x: integer; // Level 0 (global)
procedure A;
var y: integer; // Level 1
procedure B;
var z: integer; // Level 2
begin
z := x + y; // Access: local z, non-local y, global x
end;
begin
B;
end;
begin
A;
end.
Access Methods:
Static Links
Method: Each frame points to enclosing scope
Access: Follow chain to find variable
Cost: O(nesting depth)
┌─────────────────┐
│ Procedure B │
│ Static Link ────┼──┐
├─────────────────┤ │
│ Procedure A │ ←┘
│ Static Link ────┼──┐
├─────────────────┤ │
│ Main Program │ ←┘
└─────────────────┘
Display
Method: Array of pointers to each nesting level
Access: Direct access via display[level]
Cost: O(1) access time
Display Array:
┌───┬─────────────┐
│ 0 │ Main Frame │
├───┼─────────────┤
│ 1 │ Proc A Frame│
├───┼─────────────┤
│ 2 │ Proc B Frame│
└───┴─────────────┘
Access Flowchart:
Variable Reference → Check Local Scope → Check Static Link/Display →
Find Variable → Calculate Address → Access Value
6. Parameter Passing
Definition:
Methods of transferring arguments from caller to called procedure.
Parameter Passing Methods:
Call by Value
Method: Copy argument value
Modification: Changes don't affect original
Cost: Copying overhead for large structures
c
void func(int x) {
x = 10; // Original unchanged
}
int main() {
int a = 5;
func(a); // a still equals 5
}
Call by Reference
Method: Pass address of argument
Modification: Changes affect original
Cost: Indirection overhead
c
void func(int* x) {
*x = 10; // Original changed
}
int main() {
int a = 5;
func(&a); // a now equals 10
}
Call by Name
Method: Substitute argument expression
Evaluation: Each access re-evaluates
Cost: High evaluation overhead
Call by Value-Result
Method: Copy in, copy out
Modification: Changes copied back
Cost: Two copying operations
Parameter Passing Flowchart:
Procedure Call → Evaluate Arguments → Apply Passing Method →
Set Up Parameters → Execute Procedure → Handle Results
Advantages (Simple Words):
Call by Value: Safe, predictable, simple Call by Reference: Efficient, allows modification Call
by Name: Flexible, lazy evaluation
Disadvantages (Simple Words):
Call by Value: Copying overhead, no modification Call by Reference: Aliasing issues, safety
concerns Call by Name: Expensive, complex semantics
Uses (Simple Words):
Call by Value: Default in C, Java Call by Reference: C pointers, Pascal var parameters Call by
Name: Algol 60, macro systems
7. Symbol Table Organization
Definition:
Data structure storing information about identifiers including their names, types, scope, and
memory locations.
Symbol Table Entry Structure:
cpp
struct SymbolEntry {
string name; // Identifier name
Type type; // Data type
int scope; // Scope level
int offset; // Memory offset
bool isFunction; // Function flag
int paramCount; // Parameter count (if function)
};
Organization Methods:
Linear List
Structure: Sequential array/list
Search: Linear search O(n)
Insert: O(1) at end
Suitable: Small programs
Hash Table
Structure: Hash function + buckets
Search: Average O(1)
Insert: Average O(1)
Suitable: Large programs
Binary Search Tree
Structure: Ordered tree
Search: O(log n) average
Insert: O(log n) average
Suitable: Medium programs
Scope Stack
Structure: Stack of symbol tables
Search: Current scope first
Insert: Current scope only
Suitable: Block-structured languages
Symbol Table Flowchart:
Identifier Declaration → Check Current Scope → Insert if New →
Identifier Reference → Search Scopes → Return Information/Error
8. Data Structures Used in Symbol Tables
Hash Table Implementation:
cpp
class SymbolTable {
static const int TABLE_SIZE = 211;
vector<SymbolEntry> table[TABLE_SIZE];
int hash(string name) {
int sum = 0;
for (char c : name) {
sum += c;
}
return sum % TABLE_SIZE;
}
public:
void insert(SymbolEntry entry) {
int index = hash([Link]);
table[index].push_back(entry);
}
SymbolEntry* lookup(string name) {
int index = hash(name);
for (auto& entry : table[index]) {
if ([Link] == name) {
return &entry;
}
}
return nullptr;
}
};
Scope Management:
cpp
class ScopeManager {
stack<SymbolTable*> scopeStack;
public:
void enterScope() {
[Link](new SymbolTable());
}
void exitScope() {
delete [Link]();
[Link]();
}
SymbolEntry* lookup(string name) {
// Search from current scope outward
stack<SymbolTable*> temp = scopeStack;
while (![Link]()) {
SymbolEntry* entry = [Link]()->lookup(name);
if (entry) return entry;
[Link]();
}
return nullptr;
}
};
Memory Layout Example:
Symbol Table Memory:
┌─────────────────┐
│ Hash Table │
│ ┌─────┬───────┐ │
│ │ 0 │ var a │ │
│ ├─────┼───────┤ │
│ │ 1 │ func f│ │
│ ├─────┼───────┤ │
│ │ ... │ ... │ │
│ └─────┴───────┘ │
└─────────────────┘
Scope Stack:
┌─────────────────┐ ← Current Scope
│ Local Symbols │
├─────────────────┤
│ Function Scope │
├─────────────────┤
│ Global Scope │
└─────────────────┘
Advantages (Simple Words):
Fast-access: Quick identifier lookup
Organized: Structured information storage
Scalable: Handles large programs
Flexible: Supports various language features
Disadvantages (Simple Words):
Memory-overhead: Extra storage needed
Complexity: Multiple data structures
Maintenance: Requires careful management
Conflicts: Name collision handling
Uses (Simple Words):
Compilers: Identifier resolution
IDEs: Code completion, error checking
Debuggers: Variable inspection
Interpreters: Runtime symbol resolution
Integration Example
Complete Runtime Environment:
┌─────────────────────────────────────┐
│ Code Segment │
├─────────────────────────────────────┤
│ Static Data │
├─────────────────────────────────────┤
│ │
│ Stack │ ← Growing Down
│ ┌─────────────────────┐ │
│ │ Activation Record │ │
│ │ ┌───────────────┐ │ │
│ │ │ Return Addr │ │ │
│ │ ├───────────────┤ │ │
│ │ │ Parameters │ │ │
│ │ ├───────────────┤ │ │
│ │ │ Local Vars │ │ │
│ │ └───────────────┘ │ │
│ └─────────────────────┘ │
│ │
├─────────────────────────────────────┤
│ │
│ Heap │ ← Growing Up
│ │
└─────────────────────────────────────┘
This comprehensive runtime environment framework provides the foundation for efficient
program execution, memory management, and symbol resolution in compiled and interpreted
languages.
Definition of basic block control flow
graphs
Definition of Basic Block Control Flow Graphs
Basic Block: A sequence of consecutive statements with one entry point and one exit point. No
jumps in or out of the middle.
Control Flow Graph (CFG): A directed graph where nodes are basic blocks and edges
represent possible flow of control between blocks.
DAG Representation of Basic Block
DAG (Directed Acyclic Graph): A tree-like structure representing computations within a basic
block.
Simple Definition: Shows how variables and expressions are computed, avoiding redundant
calculations.
Example:
a=b+c
d=b+c
e=d*2
DAG shows b + c computed once, used for both a and d.
Flowchart Structure: Input → Parse Statements → Build DAG Nodes → Connect
Dependencies → Output Optimized Code
Advantages of DAG
1. Eliminates redundant computations
2. Identifies common subexpressions
3. Reduces code size
4. Improves execution speed
5. Simplifies further optimizations
Disadvantages:
1. Memory overhead for DAG construction
2. Complex for large basic blocks
3. Time consuming to build
Sources of Optimization
1. Source Level: High-level language optimizations
2. Intermediate Code: Three-address code optimizations
3. Machine Level: Assembly/machine code optimizations
4. Target Specific: Architecture-dependent optimizations
Use: Improve program performance at different compilation stages.
Loop Optimization
Definition: Techniques to improve loop performance since loops execute repeatedly.
Types:
1. Code Motion: Move invariant code outside loop
2. Induction Variables: Optimize loop counters
3. Loop Unrolling: Reduce loop overhead
4. Loop Fusion: Combine similar loops
Flowchart: Identify Loop → Analyze Dependencies → Apply Optimization → Generate
Optimized Loop
Advantages: Faster execution, reduced iterations Disadvantages: Increased code size, complex
analysis
Loop Invariant Computation
Definition: Expressions whose values don't change during loop execution.
Example:
for(i=0; i<n; i++)
a[i] = x + y * z; // x + y * z is loop invariant
Optimization: Move temp = x + y * z outside loop.
Advantages: Reduces redundant calculations Disadvantages: Requires careful dependency
analysis
Peephole Optimization
Definition: Local optimization technique examining small "peephole" of instructions.
Common Optimizations:
1. Eliminate redundant loads/stores
2. Combine operations
3. Remove unnecessary jumps
4. Simplify algebraic expressions
Example:
Before: MOV R1, R2
MOV R2, R1
After: MOV R1, R2 (second instruction removed)
Advantages: Simple, effective for local improvements Disadvantages: Limited scope, misses
global optimizations
Issues in Code Generator Design
1. Instruction Selection: Choose appropriate machine instructions
2. Register Allocation: Assign variables to registers efficiently
3. Instruction Ordering: Sequence instructions optimally
4. Memory Management: Handle stack and heap efficiently
Challenges: Limited registers, complex instruction sets, target dependencies
A Simple Code Generator
Definition: Converts intermediate code to target machine code.
Steps:
1. Parse intermediate code
2. Select machine instructions
3. Allocate registers
4. Generate final code
Flowchart: Intermediate Code → Instruction Selection → Register Allocation → Code
Emission → Machine Code
Example: Three-address code a = b + c → ADD R1, R2, R3
Code Generation from DAG
Process: Convert DAG representation directly to efficient machine code.
Steps:
1. Traverse DAG in topological order
2. Generate code for each node
3. Reuse computed values
4. Optimize register usage
Advantages: Eliminates redundancy, efficient code Disadvantages: Complex traversal
algorithms
Machine Independent Optimization
Definition: Optimizations not specific to target architecture.
Global Data Flow Analysis
Definition: Analysis of how data flows through entire program across basic blocks.
Purpose: Understand variable usage patterns for optimization.
Flowchart: Program → Build CFG → Compute Data Flow → Apply Optimizations →
Optimized Program
Uses: Enables advanced optimizations like dead code elimination
Constant Propagation
Definition: Replace variables with constant values when possible.
Example:
x = 5;
y = x + 3; // becomes y = 8;
Advantages: Reduces computations, enables further optimizations Disadvantages: Requires
complex analysis for conditionals
Liveness Analysis
Definition: Determines which variables are "live" (will be used later) at each program point.
Live Variable: A variable whose value may be used before redefinition.
Uses:
Register allocation
Dead code elimination
Flowchart: CFG → Backward Analysis → Compute Live Sets → Optimization Decisions
Advantages: Efficient register use, removes dead code Disadvantages: Complex for large
programs
Common Subexpression Elimination
Definition: Identify and eliminate redundant computations of same expression.
Example:
a = b + c;
d = b + c; // eliminated, use value of a
Types:
1. Local: Within basic block
2. Global: Across basic blocks
Advantages: Reduces computations, faster execution Disadvantages: Increased register
pressure, complex analysis
Flowchart: Code → Identify Common Expressions → Build Expression Table → Replace
Redundant Computations → Optimized Code
All these optimizations work together to produce efficient, fast-executing machine code while
maintaining program correctness.
Token recognition bridges the gap between text and structured
analysis
Review of CFG, Ambiguity, and Parsing Techniques
Simple Definitions
CFG (Context Free Grammar): A formal grammar with production rules where left side has
single non-terminal and right side has any combination of terminals and non-terminals.
Ambiguity: A grammar that can generate the same string with multiple different parse trees.
Parsing: Process of analyzing input string to determine its grammatical structure according to
grammar rules.
Top-Down Parsing: Building parse tree from root (start symbol) to leaves (terminals).
Bottom-Up Parsing: Building parse tree from leaves (terminals) to root (start symbol).
1. Context Free Grammar (CFG)
Definition:
Grammar consisting of terminals, non-terminals, production rules, and start symbol.
CFG Components:
Terminals (T): Basic symbols (keywords, operators, identifiers)
Non-terminals (N): Syntactic categories (Expr, Stmt, Term)
Productions (P): Rules of form A → α
Start Symbol (S): Root non-terminal
CFG Example:
E→E+T|T
T→T*F|F
F → (E) | id
CFG Flowchart:
Start Symbol → Apply Production → Expand Non-terminals →
Continue Until Terminals → Parse Tree Complete
Advantages (Simple Words):
Expressive: Handles most programming constructs
Recursive: Supports nested structures
Parseable: Efficient parsing algorithms exist
Standard: Widely used in compilers
Disadvantages (Simple Words):
Limited: Cannot handle context-sensitive rules
Ambiguous: May have multiple interpretations
Complex: Can become unwieldy for large languages
Uses (Simple Words):
Compilers: Language syntax definition
Parsers: Syntax analysis tools
Validators: Input validation systems
Generators: Parser generator tools
2. Ambiguity of Grammars
Definition:
A grammar is ambiguous if there exists at least one string that can be derived using two or more
different parse trees.
Ambiguous Grammar Example:
E → E + E | E * E | (E) | id
String: id + id * id
Parse Tree 1 (+ first):
E
/|\
E+E
| /|\
id E * E
| |
id id
Parse Tree 2 (* first):
E
/|\
E*E
/|\ |
E + E id
| |
id id
Removing Ambiguity:
E→E+T|T (+ is left associative, lower precedence)
T→T*F|F (* is left associative, higher precedence)
F → (E) | id (Factors: parentheses and identifiers)
Ambiguity Detection Flowchart:
Grammar Input → Generate Strings → Check Multiple Parse Trees →
If Multiple Trees → Ambiguous | If Single Tree → Unambiguous
Types of Ambiguity:
Operator Precedence Ambiguity
Problem: Unclear which operator binds first
Solution: Define precedence levels
Associativity Ambiguity
Problem: Unclear grouping for same precedence
Solution: Define left/right associativity
Dangling Else Problem
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Ambiguous: if E1 then if E2 then S1 else S2
Advantages of Removing Ambiguity:
Deterministic: Single interpretation
Predictable: Consistent parsing
Implementable: Can build parsers
Disadvantages of Ambiguous Grammar:
Confusion: Multiple meanings
Unparseable: Cannot build deterministic parser
Errors: Unexpected behavior
3. Introduction to Parsing
Definition:
Process of analyzing input token sequence to determine syntactic structure according to grammar
rules.
Parsing Process Flowchart:
Token Stream → Parser → Parse Tree/AST → Semantic Analysis
Types of Parsers:
Syntax-Directed
Build explicit parse tree
Good for educational purposes
Memory intensive
Table-Driven
Use parsing tables
Efficient execution
Complex construction
Recursive Descent
Use recursive procedures
Easy to understand
Manual construction
Parser Classification:
Parsers
├── Top-Down
│ ├── Recursive Descent
│ ├── LL(k)
│ └── Predictive
└── Bottom-Up
├── LR(k)
├── SLR
├── LALR
└── Operator Precedence
4. Top-Down Parsing
Definition:
Parsing technique that starts from start symbol and tries to derive the input string by applying
production rules.
Top-Down Parsing Flowchart:
Start Symbol → Choose Production → Expand Non-terminal →
Match Terminals → Backtrack if Mismatch → Success/Failure
Problems with Top-Down Parsing:
Left Recursion
Problem: A → Aa | b Issue: Infinite recursion
Solution: Convert to right recursion
A → bA'
A' → aA' | ε
Left Factoring
Problem: A → ab | ac Issue: Cannot decide which production
Solution: Factor common prefix
A → aA'
A' → b | c
Advantages (Simple Words):
Intuitive: Natural top-down approach
Predictive: Can predict productions
Error-friendly: Good error recovery
Disadvantages (Simple Words):
Left-recursion: Cannot handle left recursion
Backtracking: May need backtracking
Inefficient: Can be slow
Uses (Simple Words):
Compilers: Simple language parsers
Interpreters: Expression evaluators
Validators: Input format checkers
5. LL Grammars & Parsers
Definition:
LL(k) means Left-to-right scan, Leftmost derivation, k symbols lookahead.
LL(1) Grammar Requirements:
1. No left recursion
2. Left factored
3. Unique productions for each (A, a) pair
FIRST and FOLLOW Sets:
FIRST(α)
Set of terminals that can begin strings derived from α.
Algorithm:
1. If α is terminal, FIRST(α) = {α}
2. If α → ε, add ε to FIRST(α)
3. If α → X₁X₂...Xₙ, add FIRST(X₁) to FIRST(α)
4. If ε ∈ FIRST(X₁), add FIRST(X₂), and so on
FOLLOW(A)
Set of terminals that can appear immediately after non-terminal A.
Algorithm:
1. Add $ to FOLLOW(start symbol)
2. If A → αBβ, add FIRST(β) - {ε} to FOLLOW(B)
3. If A → αB or A → αBβ where ε ∈ FIRST(β),
add FOLLOW(A) to FOLLOW(B)
LL(1) Parsing Table Construction:
For each production A → α:
1. For each terminal a in FIRST(α), add A → α to M[A, a]
2. If ε in FIRST(α), for each terminal b in FOLLOW(A),
add A → α to M[A, b]
LL(1) Parsing Algorithm:
stack = [S, $]
input = w$
while stack ≠ [$]:
if top(stack) = input[0]:
pop stack, advance input
else if top(stack) is non-terminal A:
use table entry M[A, input[0]]
pop A, push production right side
else:
error
6. Error Handling in LL Parsers
Error Recovery Strategies:
Panic Mode
Method: Skip tokens until synchronizing symbol
Synchronizing: Statement terminators (;, }, end)
Advantage: Simple implementation
Disadvantage: May skip too much
Phrase Level
Method: Replace/insert/delete tokens locally
Example: Insert missing semicolon
Advantage: Precise recovery
Disadvantage: Complex implementation
Error Productions
Method: Add productions for common errors
Example: stmt → if expr then stmt else
Advantage: Specific error messages
Disadvantage: Grammar bloat
Global Correction
Method: Find minimum token changes
Advantage: Optimal correction
Disadvantage: Computationally expensive
Error Handling Flowchart:
Parse Error → Identify Error Type → Apply Recovery Strategy →
Resume Parsing → Continue Until Success/Fatal Error
7. Recursive Descent Parsing
Definition:
Top-down parsing technique using recursive procedures for each non-terminal.
Implementation Structure:
cpp
class RecursiveDescentParser {
Token lookahead;
void expression() {
term();
while ([Link] == PLUS) {
match(PLUS);
term();
}
}
void term() {
factor();
while ([Link] == MULT) {
match(MULT);
factor();
}
}
void factor() {
if ([Link] == ID) {
match(ID);
} else if ([Link] == LPAREN) {
match(LPAREN);
expression();
match(RPAREN);
} else {
error();
}
}
void match(TokenType expected) {
if ([Link] == expected) {
lookahead = nextToken();
} else {
error();
}
}
};
Grammar to Code Translation:
Grammar Rule: A → aBc
Code:
void A() {
a();
B();
match('c');
}
Grammar Rule: A → a | b
Code:
void A() {
if (lookahead == 'a') {
match('a');
} else if (lookahead == 'b') {
match('b');
} else {
error();
}
}
Advantages (Simple Words):
Simple: Easy to write and understand
Flexible: Can add semantic actions easily
Debuggable: Easy to trace execution
Hand-coded: Full control over parsing
Disadvantages (Simple Words):
Manual: Must write by hand
Limited: Only works for LL grammars
Maintenance: Hard to modify grammar
8. Predictive Parsers
Definition:
LL(1) parsers that use parsing table to predict which production to use.
Predictive Parser Structure:
┌─────────────┐
│Input Buffer │ → a + b $ ...
└─────────────┘
↓
┌─────────────┐ ┌─────────────┐
│ Stack │ ←→ │Parsing Table│
│ E │ │ M[A,a] │
│ $ │ └─────────────┘
└─────────────┘
↓
┌─────────────┐
│ Output │ → Production sequence
└─────────────┘
Parsing Table Example:
│ id │ + │ * │ ( │ ) │ $
─────┼─────┼─────┼─────┼─────┼─────┼─────
E │E→TE'│ │ │E→TE'│ │
E' │ │E'→+TE'│ │ │E'→ε │E'→ε
T │T→FT'│ │ │T→FT'│ │
T' │ │T'→ε │T'→*FT'│ │T'→ε │T'→ε
F │F→id │ │ │F→(E)│ │
Advantages (Simple Words):
Efficient: Linear time parsing
Deterministic: No backtracking
Table-driven: Systematic approach
Automated: Can be generated automatically
Disadvantages (Simple Words):
Grammar-restricted: Only LL(1) grammars
Table-size: Large tables for complex grammars
Construction: Complex table construction
9. Bottom-Up Parsing
Definition:
Parsing technique that starts from input tokens and builds parse tree toward start symbol.
Bottom-Up Process:
Input: id + id * id
Step 1: id + id * id (reduce id to F)
Step 2: F + id * id (reduce F to T)
Step 3: T + id * id (reduce T to E)
Step 4: E + id * id (reduce id to F)
Step 5: E + F * id (reduce F to T)
Step 6: E + T * id (reduce id to F)
Step 7: E + T * F (reduce T*F to T)
Step 8: E + T (reduce E+T to E)
Step 9: E (accept)
Bottom-Up Flowchart:
Input Tokens → Shift to Stack → Look for Handle →
Reduce by Production → Continue Until Start Symbol
Key Concepts:
Handle
Definition: Substring matching right side of production
Property: Can be reduced to left side
Detection: Core of bottom-up parsing
Viable Prefix
Definition: Prefix of right-sentential form
Property: No handle extends past right end
Use: Determines when to shift vs reduce
10. Shift-Reduce Parsing
Definition:
Bottom-up parsing using shift and reduce operations on a stack.
Operations:
Shift
Move next input symbol onto stack
Used when no handle is found
Reduce
Replace handle on stack with left side of production
Used when handle is identified
Accept
Input consumed and stack contains start symbol
Successful parse completion
Error
No valid action available
Parse failure
Shift-Reduce Example:
Stack Input Action
$ id+id$ shift
$id +id$ reduce F→id
$F +id$ reduce T→F
$T +id$ reduce E→T
$E +id$ shift
$E+ id$ shift
$E+id $ reduce F→id
$E+F $ reduce T→F
$E+T $ reduce E→E+T
$E $ accept
Conflicts:
Shift-Reduce Conflict
Problem: Can either shift or reduce
Example: Dangling else problem
Resolution: Define precedence/associativity
Reduce-Reduce Conflict
Problem: Multiple reductions possible
Cause: Grammar ambiguity
Resolution: Rewrite grammar
11. LR Parsers
Definition:
LR(k) means Left-to-right scan, Rightmost derivation in reverse, k symbols lookahead.
LR Parser Structure:
┌─────────────┐
│Input Buffer │ → a b $ ...
└─────────────┘
↓
┌─────────────┐ ┌─────────────┐
│ Stack │ ←→ │ Action/Goto │
│ s₀s₁s₂... │ │ Table │
└─────────────┘ └─────────────┘
LR Parsing Algorithm:
stack = [0]
while true:
state = top(stack)
symbol = lookahead
action = ACTION[state, symbol]
if action = "shift s":
push symbol, push s
advance input
elif action = "reduce A → β":
pop 2*|β| symbols
state = top(stack)
push A, push GOTO[state, A]
elif action = "accept":
return success
else:
error
LR Item:
A→α•β
where • indicates current position in production
Item Types:
Kernel Items: Core items that define state
Closure Items: Items derived from kernel items
12. Construction of SLR Parsing Tables
SLR (Simple LR):
Extension of LR(0) using FOLLOW sets to resolve conflicts.
SLR Construction Steps:
1. Construct LR(0) Items
I₀: S' → •S, S → •CC, C → •cC, C → •d
2. Compute GOTO Function
GOTO(I₀, S) = {S' → S•}
GOTO(I₀, C) = {S → C•C, C → •cC, C → •d}
3. Build ACTION Table
If A → α•aβ in Iᵢ and GOTO(Iᵢ, a) = Iⱼ:
ACTION[i, a] = shift j
If A → α• in Iᵢ and A ≠ S':
For all a in FOLLOW(A):
ACTION[i, a] = reduce A → α
SLR Example:
Grammar: S → CC, C → cC | d
States:
I₀: S' → •S, S → •CC, C → •cC, C → •d
I₁: S' → S•
I₂: S → C•C, C → •cC, C → •d
I₃: C → c•C, C → •cC, C → •d
I₄: C → d•
I₅: S → CC•
I₆: C → cC•
13. Canonical LR & LALR Parsing Tables
Canonical LR (CLR):
Uses full LR(1) items with complete lookahead information.
LR(1) Item:
[A → α•β, a]
where a is lookahead symbol
LALR (Look-Ahead LR):
Merges CLR states with same core but different lookaheads.
Comparison:
States Table Size Power
SLR Small Small Weak
LALR Medium Medium Medium
CLR Large Large Strong
LALR Construction:
1. Build CLR parser
2. Merge states with identical cores
3. Union lookahead sets
4. Check for conflicts
14. Parsing with Ambiguous Grammar
Precedence and Associativity:
Used to resolve conflicts in ambiguous grammars.
Precedence Declaration:
yacc
%left '+' '-'
%left '*' '/'
%right UMINUS
Conflict Resolution:
Shift-Reduce: Use operator precedence
Reduce-Reduce: Use rule ordering
Example:
E → E + E | E * E | -E | (E) | id
Conflicts resolved by:
- * has higher precedence than +
- + and * are left associative
- - (unary) has highest precedence
15. Operator Precedence Parsing
Definition:
Bottom-up parsing using operator precedence relations.
Precedence Relations:
a ⋖ b: a has lower precedence than b
a ≐ b: a has same precedence as b
a ⋗ b: a has higher precedence than b
Precedence Table:
│ + │ * │ ( │ ) │ id│ $
────┼───┼───┼───┼───┼───┼───
+ │⋗│⋖│⋖│⋗│⋖│⋗
* │⋗│⋗│⋖│⋗│⋖│⋗
( │⋖│⋖│⋖│≐│⋖│
) │⋗│⋗│ │⋗│ │⋗
id │ ⋗ │ ⋗ │ │ ⋗ │ │ ⋗
$ │⋖│⋖│⋖│ │⋖│
Parsing Algorithm:
1. Find handle using precedence relations
2. Reduce when right end is found
3. Continue until input consumed
16. YACC (Yet Another Compiler Compiler)
Definition:
Automatic parser generator that creates LALR parsers from grammar specifications.
YACC File Structure:
yacc
%{
/* C declarations */
%}
/* YACC declarations */
%token NUMBER ID
%left '+' '-'
%left '*' '/'
%%
/* Grammar rules */
expr: expr '+' expr { $$ = $1 + $3; }
| expr '*' expr { $$ = $1 * $3; }
| NUMBER { $$ = $1; }
| ID { $$ = lookup($1); }
;
%%
/* C subroutines */
YACC Features:
Automatic: Generates parser code
LALR: Uses LALR parsing algorithm
Actions: Supports semantic actions
Error: Built-in error recovery
YACC Usage:
bash
yacc grammar.y # Generate [Link].c
cc [Link].c -ly # Compile with YACC library
17. Error Handling in LR Parsers
Error Detection:
Error detected when no valid action in parsing table.
Error Recovery Strategies:
Panic Mode
Method: Pop stack until error recovery state
Synchronizing: Use statement separators
Implementation: Error productions in grammar
Phrase Level
Method: Modify stack/input locally
Actions: Insert, delete, or replace tokens
Goal: Resume normal parsing quickly
Error Productions
Method: Add productions for common errors
Example: Missing semicolon, unmatched parentheses
Advantage: Specific error messages
Global Correction
Method: Find minimal cost correction
Algorithm: Dynamic programming approach
Cost: Computationally expensive
Error Recovery Example:
yacc
stmt: expr ';'
| if_stmt
| error ';' { yyerror("Invalid statement"); }
;
expr: term
| expr '+' term
| expr error term { yyerror("Missing operator"); }
;
LR Error Recovery Flowchart:
Error Detected → Pop Stack States → Find Recovery State →
Discard Input Tokens → Insert Error Token → Resume Parsing
Summary Comparison
Parser Comparison Table:
Feature │ LL(1) │ SLR │ LALR │ CLR │ Precedence
───────────────┼───────┼──────┼──────┼──────┼───────────
Grammar Class │ Small │ Med │ Large│ Very │ Limited
Table Size │ Small │ Med │ Med │ Large│ Small
Parse Power │ Weak │ Med │ Good │ Best │ Limited
Construction │ Easy │ Med │ Hard │ Hard │ Easy
Error Handling │ Good │ Med │ Good │ Good │ Poor
Advantages and Disadvantages Summary:
Top-Down:
Advantages: Intuitive, good-errors, predictive
Disadvantages: Left-recursion, limited-grammars
Bottom-Up:
Advantages: Powerful, handles-more grammars, efficient
Disadvantages: Complex-construction, harder-errors
Uses: Compilers, interpreters, syntax-validators, language-processors
This comprehensive coverage provides the foundation for understanding parsing techniques and
their applications in compiler design and language processing systems.