0% found this document useful (0 votes)
9 views73 pages

Understanding Compiler Design Basics

The document provides an overview of compiler design, detailing the functions and phases of compilers, including lexical analysis, syntax analysis, and code generation. It discusses the roles of compilers, interpreters, and translators, as well as techniques like bootstrapping and finite automata in lexical analysis. Additionally, it covers syntax trees, attributes in parsing, and methods of translation, highlighting their advantages and disadvantages.

Uploaded by

Nistha choudhary
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views73 pages

Understanding Compiler Design Basics

The document provides an overview of compiler design, detailing the functions and phases of compilers, including lexical analysis, syntax analysis, and code generation. It discusses the roles of compilers, interpreters, and translators, as well as techniques like bootstrapping and finite automata in lexical analysis. Additionally, it covers syntax trees, attributes in parsing, and methods of translation, highlighting their advantages and disadvantages.

Uploaded by

Nistha choudhary
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Compiler Design

Compiler: A system program that translates high-level source code written in one programming language
into machine code or another target language. The compilation happens before program execution,
producing an executable file that can run independently.

Operations of Compiler
 It breaks source programs into smaller parts.
 It enables the creation of symbol tables and intermediate representations.
 It helps in code compilation and error detection.
 it saves all codes and variables.
 It analyses the full program and translates it.
 Convert source code to machine code.
Advantages of Compiler Design
Efficiency

Portability

Error Checking

Optimizations

Translator: A program that converts code from one language to another. This includes compilers,
interpreters, and assemblers. Translators bridge the gap between human-readable code and machine-
executable instructions.

Interpreter: A program that directly executes source code line-by-line or statement-by-statement


without prior translation to machine code. It reads, analyzes, and executes code simultaneously during
runtime. A compiler scans the entire program and translates it as a whole into machine
code whereas an interpreter translates the program one statement at a time.

Phases of Compiler:

Lexical Analysis (Scanning): Breaks the source code into tokens (keywords, identifiers, operators,
literals). Removes whitespace and comments while identifying the basic building blocks of the program.

Syntax Analysis (Parsing): Checks if tokens follow the grammar rules of the programming language.
Creates a parse tree or abstract syntax tree (AST) representing the program's structure.

Semantic Analysis: Verifies semantic correctness, including type checking, scope resolution, and
declaration validation. Ensures the program follows language semantics beyond just syntax.

Intermediate Code Generation: Produces an intermediate representation (IR) of the source code that is
independent of the target machine. This facilitates optimization and portability.

Code Optimization: Improves the intermediate code for better performance by eliminating
redundancies, reducing execution time, and minimizing memory usage while preserving program
semantics.

Code Generation: Converts optimized intermediate code into target machine code or assembly language
specific to the target processor architecture.
Symbol Table Management: Maintains information about identifiers (variables, functions, classes)
throughout compilation, including their types, scope, and memory locations.

Error Handling: Detects and reports errors at each phase, providing meaningful error messages to help
developers identify and fix issues.

Bootstrapping: Bootstrapping is an important technique in compiler design, where a basic compiler


is used to create a more advanced version of itself. This process helps in building compilers for new
programming languages and improving the ones already in use.

step-by-step look at how bootstrapping works in compiler design:

Step 1: Start with a Basic Compiler


The first step is to create a basic compiler that can handle the most essential features of a
programming language.
Step 2: Use the Basic Compiler to Create a More Advanced Version
Once the basic compiler is ready, it is used to compile a more advanced version of itself.
Step 3: Gradually Improve the Compiler
With each new version, the compiler becomes more capable. The process is repeated, and each
iteration adds more features, making the compiler stronger and more efficient.
Advantages
Improved Efficiency
Portability
Reduced Dependency

Finite Automata in Lexical


Analysis
Definition:
Finite Automata (FA) are mathematical models used in lexical
analyzers to recognize patterns in input strings. They consist of
states, transitions, and accept/reject mechanisms to identify valid
tokens according to specified rules.
🔄 Finite Automata Processing Flow

Input String → Start State → Read Character → State Transition → Check


Accept State → Token/Reject

Advantages:
 Efficient pattern matching with linear time complexity
 Memory efficient - requires minimal storage
 Deterministic behavior ensures consistent results
 Easy to implement in hardware or software
 Mathematical foundation provides formal verification
Disadvantages:
 Limited to regular languages only
 Cannot handle nested structures or context-sensitive patterns
 State explosion problem for complex patterns
 Difficult to modify once constructed
 Cannot count or remember arbitrary amounts of information
 Uses:
 Keyword recognition in programming languages
 Identifier and number pattern matching
 String literal processing
 Comment detection and removal
 Operator and delimiter recognition

Input Processing in Lexical


Analysis
Definition:
Input processing involves reading and managing the source code
stream, handling character-by-character analysis, and maintaining
position information for error reporting and token generation.
📥 Input Processing Flow

Source File→ Buffer Loading → Character Reading → Position


Tracking → Lookahead → Token Formation

Advantages:
 Efficient memory usage through buffering
 Supports lookahead for complex token recognition
 Maintains line and column information for debugging
 Handles different character encodings
 Can process large files without loading entirely into memory
Disadvantages:
 I/O operations can be performance bottlenecks
 Buffer management complexity
 Character encoding issues
 End-of-file handling complications
 Memory overhead for position tracking
Uses:
 Reading source code from files or streams
 Preprocessing directives handling
 Multi-file compilation support
 Interactive compiler environments
 IDE integration for real-time analysis

Recognition of Tokens
Definition:
Token recognition is the process of identifying and classifying
meaningful sequences of characters in source code into predefined
categories like keywords, identifiers, operators, and literals.
Token Recognition Flow

Character Stream → Pattern Matching → Longest Match → Token


Classification → Attribute Assignment → Token Output

Token Categories Example: Keywords: if, while, for,


class, public Identifiers: variable_name, functionName,
Class1 Operators: +, -, *, /, ==, !=, &&, || Literals:
123, 3.14, "hello", 'a', true Delimiters: ;, {, }, (, ),
[, ], ,

Advantages:
 Separates lexical concerns from syntax analysis
 Simplifies parser design and implementation
 Enables efficient symbol table management
 Supports different token types with attributes
 Facilitates error recovery and reporting
Disadvantages:
 Ambiguity resolution can be complex
 Context-sensitive tokens are challenging
 Requires careful precedence handling
 May need backtracking for some patterns
 Token attribute management overhead
🎯 Uses:
 Programming language compilation
 Text processing and analysis tools
 Configuration file parsing
 Query language processing
 Data format validation

LEX: A Lexical Analyzer


Generator
Definition:
LEX is a tool that automatically generates lexical analyzers from
regular expression specifications. It takes pattern-action rules as
input and produces C code for a scanner that recognizes those
patterns.
⚙️LEX Processing Flow

LEX Specification → LEX Compiler → [Link].c → C Compiler → Executable


Scanner → Token Stream

LEX Specification Example: %{ #include <stdio.h> %} %%


[0-9]+ { printf("NUMBER: %s\n", yytext); } [a-zA-Z_][a-
zA-Z0-9_]* { printf("IDENTIFIER: %s\n", yytext); }
"if"|"while"|"for" { printf("KEYWORD: %s\n", yytext); } [
\t\n]+ { /* ignore whitespace */ } . { printf("UNKNOWN:
%s\n", yytext); } %% int main() { yylex(); return 0; }

Advantages:
 Rapid development of lexical analyzers
 Regular expression support for pattern specification
 Automatic DFA generation and optimization
 Integration with YACC parser generator
 Well-tested and reliable code generation
 Handles complex tokenization rules automatically
Disadvantages:
 Generated code may not be optimal for all cases
 Limited debugging capabilities
 Platform dependency (primarily UNIX-based)
 Learning curve for specification syntax
 Conflicts resolution can be non-intuitive
 Limited error reporting customization
🎯 Uses:
 Compiler front-end development
 Text processing applications
 Configuration file parsers
 Log file analyzers
 Data extraction tools
 Protocol analyzers and validators

Error Handling in Lexical Analysis


Definition:
Error handling in lexical analysis involves detecting, reporting, and
recovering from lexical errors such as invalid characters, malformed
tokens, and unterminated strings while maintaining compilation
continuity.
🚨 Error Handling Flow

Error Detection→ Error Classification → Error Reporting → Recovery


Strategy → Continue Analysis → Error Summary

Common Lexical Errors: • Invalid Characters: @#$% (in


languages that don't support them) • Unterminated
Strings: "hello world • Invalid Numbers: 3.14.15, 0x,
123abc • Comment Errors: /* unterminated comment • Escape
Sequences: "\q" (invalid escape)

Advantages:
 Early error detection saves compilation time
 Clear error messages improve developer experience
 Recovery mechanisms allow continued analysis
 Position tracking aids in debugging
 Prevents cascading errors in later phases
Disadvantages:
 Complex error recovery logic
 May mask subsequent errors
 Recovery can lead to incorrect token streams
 Performance overhead for error checking
 Difficult to balance error detail vs. clarity
Uses:
 Compiler error reporting systems
 IDE real-time error highlighting
 Code quality tools and linters
 Educational programming environments
 Automated testing and validation tools

Summary & Integration


Integration Overview:
All these components work together to form a complete lexical
analysis system. Finite automata provide the theoretical foundation,
input processing manages the source stream, token recognition
identifies meaningful units, LEX automates the implementation, and
error handling ensures robustness.
🔄 Complete Lexical Analysis System
Phase 1: Source Code → Input Processing → Character Stream
Phase 2: Character Stream → Finite Automata → Pattern Matching
Phase 3: Pattern Matching → Token Recognition → Classified Tokens
Phase 4: Error Detection → Error Handling → Recovery/Reporting
Output: Token Stream → Parser → Syntax Analysis
🎯 Key Takeaways:
 Lexical analysis is the first and fundamental phase of compilation
 Finite automata provide efficient pattern recognition capabilities

 Proper error Syntax-directed translation


 Syntax Trees: Tree structures representing the grammatical structure of source code where nodes are
operators/constructs and leaves are operands/variables rob

A syntax tree (or parse tree) is an abstract representation of the source code's structure,
showing how grammar rules are applied.

 Internal nodes represent operators or non-terminals.


 Leaf nodes represent operands, tokens, or terminals.
 Construction of syntax trees: b
Syntax Trees: Tree data structures that represent the hierarchical grammatical structure of
source code, where internal nodes represent operations/constructs and leaf nodes represent
operands/identifiers.

Detailed Explanation
What are Syntax Trees?

Syntax trees (also called parse trees or abstract syntax trees) are tree representations of the
syntactic structure of source code according to a formal grammar. They show how the grammar
rules are applied to generate the program.

Components:

 Root: Start symbol of the grammar


 Internal Nodes: Non-terminals (operators, statements, expressions)
 Leaf Nodes: Terminals (identifiers, constants, keywords)
 Edges: Represent derivation relationships

Construction Process Flowchart


Input Expression/Code

Parse according to Grammar Rules

Identify Operators and Operands

Create Internal Nodes (Operators)

Create Leaf Nodes (Operands)

Connect based on Precedence/Associativity

Complete Syntax Tree

Example Construction
Input: a + b * c

Grammar Rules:

E→E+T|T
T→T*F|F
F → id | (E)

Step-by-step Construction:

1. Start with expression a + b * c


2. Apply precedence (* higher than +)
3. Build tree bottom-up:

+
/\
a *
/\
b c

Types of Syntax Trees


1. Concrete Syntax Tree (Parse Tree)

 Shows exact grammar derivation


 Includes all grammar symbols
 More detailed but larger

2. Abstract Syntax Tree (AST)

 Simplified version
 Removes unnecessary details
 More compact and efficient

Construction Methods
Bottom-Up Construction

1. Start with tokens


2. Combine using grammar rules
3. Build upward to root

Top-Down Construction

1. Start with start symbol


2. Expand using productions
3. Build downward to leaves

Advantages (Simple Words)


 Visual: Easy to understand program structure
 Hierarchical: Shows operator precedence clearly
 Evaluable: Can be directly evaluated
 Analyzable: Supports semantic analysis
 Transformable: Easy to modify and optimize

Disadvantages (Simple Words)


 Memory-intensive: Requires significant storage
 Complex: Difficult for large programs
 Overhead: Extra processing time needed
 Space-consuming: Takes more memory than linear forms

Uses (Simple Words)


 Compilers: Code generation and optimization
 Interpreters: Direct execution of programs
 Analyzers: Static analysis tools
 IDEs: Syntax highlighting and error detection
 Refactoring: Code transformation tools
 Debuggers: Program structure visualization

Construction Algorithm (Simplified)


1. Tokenize input source code
2. For each token/expression:
- If operand: Create leaf node
- If operator: Create internal node
3. Connect nodes based on:
- Grammar rules
- Operator precedence
- Associativity rules
4. Build tree recursively
5. Return root node

Practical Example
Code: if (x > 0) y = x + 1;

Syntax Tree:

if-statement
/ | \
> = ;
/\ /\
x 0 y +
/\
x 1

Memory Representation
Each node typically contains:

 Node Type: Operator, identifier, constant


 Value: For leaf nodes
 Children Pointers: Links to child nodes
 Attributes: Type information, line numbers

Syntax trees are fundamental in compiler construction as they provide a structured representation
that bridges the gap between textual source code and executable machine code, enabling various
compiler phases like semantic analysis, optimization, and code generation to work effectively.

[Link] Synthesized Attributes Inherited Attributes

An attribute is said to be Synthesized An attribute is said to be Inherited


attribute if its parse tree node value is attribute if its parse tree node value is
determined by the attribute value at child determined by the attribute value at parent
1. nodes. and/or siblings node.

The production must have non-terminal The production must have non-terminal as
2. as its head. a symbol in its body.

A synthesized attribute at node n is A Inherited attribute at node n is defined


defined only in terms of attribute values only in terms of attribute values of n's
3. at the children of n itself. parent, n itself, and n's siblings.

It can be evaluated during a single top-


It can be evaluated during a single down and sideways traversal of parse
4. bottom-up traversal of parse tree. tree.

5. Synthesized attributes can be contained Inherited attributes can't be contained by


both, It is only contained by non-
by both the terminals or non-terminals. terminals.

Synthesized attribute is used by both S- Inherited attribute is used by only L-


6. attributed SDT and L-attributed SDT. attributed SDT.

7.

Top-down translation: Top-Down Translation is a method in compiler design where the syntax
tree and corresponding intermediate code are generated during a top-down parse (like a
recursive descent parser).

Key Characteristics:

 Direction: Root to leaves expansion


 Prediction: Uses lookahead to decide productions
 Simultaneous: Parsing and translation happen together
 Recursive: Often implemented using recursive procedures

Advantages (Simple Words)


 Intuitive: Natural programming approach
 Predictive: No backtracking needed
 Controllable: Easy to add semantic actions
 Flexible: Can handle various translation tasks
 Debuggable: Easy to trace execution
 Efficient: Linear time complexity

Disadvantages (Simple Words)


 Left-recursive: Cannot handle left recursion directly
 Backtracking: Some grammars need backtracking
 Limited: Not all grammars are suitable
 Predictive-only: Requires LL(k) grammars
 Lookahead: May need multiple symbol lookahead
 Grammar-dependent: Grammar must be specially designed
Algorithm Steps
1. Initialize: Set up parsing environment
2. Predict: Choose production based on lookahead
3. Expand: Replace non-terminal with production body
4. Translate: Execute semantic actions
5. Recurse: Process child nodes
6. Match: Verify terminal symbols
7. Return: Pass results upward

S->AS{PRINTF(1)} INPUT:aadbc

S->AB{PRINTF(2)}

A->a{PRINTF(3)}

B->bC{PRINTF(4)}

B->dB{PRINTF(5)}

C->c{PRINTF(6)}

Top down approach

/ \ \

A S 1

/ \ / \ \

a 3A B 2

/ \ \

d B 5

/ \ \

b C 4

/ \
c 6

Output:3364521

Bottom up approach

/ \

A S

/ / \

a A B

/ \

d B

/ \

b C

Intermediate Code Forms Using Postfix Notation

Simple Definition
Postfix Notation: A way of writing expressions where operators come after their operands (e.g.,
a b + instead of a + b). It's used as an intermediate code form because it's easier for machines
to evaluate and doesn't need parentheses.

What is Postfix Notation?


Regular vs Postfix

 Infix: a + b * c (operator between operands)


 Postfix: a b c * + (operator after operands)
Why Use Postfix in Compilers?

 No parentheses needed
 Easy to evaluate using a stack
 Unambiguous operator precedence
 Efficient for code generation

Conversion Process Flowchart


Infix Expression

Parse with Operator Precedence

Use Stack for Operators

Output Operands Immediately

Pop Operators by Precedence

Postfix Expression Generated

Algorithm for Infix to Postfix Conversion


Steps:

1. Scan infix expression left to right


2. Operands: Output directly
3. Operators: Push to stack based on precedence
4. Parentheses: Handle special cases
5. End: Pop remaining operators

Precedence Rules:
Highest: ( )
High: */%
Medium: + -
Low: = (assignment)

Examples
Example 1: Simple Expression

Input: a + b * c

Step-by-step:

Symbol | Stack | Output


-------|-------|--------
a |[] |a
+ | [+] | a
b | [+] | a b
* | [+*] | a b
c | [+*] | a b c
End |[] |abc*+

Result: a b c * +

Example 2: With Parentheses

Input: (a + b) * c

Step-by-step:

Symbol | Stack | Output


-------|-------|--------
( | [(] |
a | [(] | a
+ | [(+] | a
b | [(+] | a b
) |[] |ab+
* | [*] | a b +
c | [*] | a b + c
End |[] |ab+c*

Result: a b + c *

Evaluation of Postfix Expression


Algorithm:

1. Scan postfix expression left to right


2. Operand: Push to stack
3. Operator: Pop two operands, compute, push result
4. End: Stack contains final result

Example Evaluation:

Expression: a b c * + where a=2, b=3, c=4

Symbol | Stack Operation | Stack


-------|----------------|-------
a | push 2 | [2]
b | push 3 | [2,3]
c | push 4 | [2,3,4]
* | pop 4,3 → 3*4=12, push 12 | [2,12]
+ | pop 12,2 → 2+12=14, push 14 | [14]

Result: 14

Intermediate Code Generation


Three-Address Code from Postfix

Postfix: a b c * +

Three-Address Code:

t1 = b * c
t2 = a + t1

Assembly Code Generation

Postfix: a b +

Assembly:

LOAD a
LOAD b
ADD
STORE result

Advantages (Simple Words)


 Stack-friendly: Easy evaluation using stack
 Unambiguous: No precedence confusion
 Parentheses-free: No brackets needed
 Efficient: Fast evaluation
 Machine-oriented: Good for code generation

Disadvantages (Simple Words)


 Human-unfriendly: Hard to read
 Debugging: Difficult to trace
 Complex-expressions: Long for nested expressions
 Learning-curve: Takes time to understand

Uses (Simple Words)


 Compilers: Intermediate code representation
 Calculators: HP calculators use RPN
 Interpreters: Easy expression evaluation
 Code-generators: Assembly code production
 Optimization: Easy to optimize expressions

DAG (Directed Acyclic Graph)

Simple Definition
DAG: A tree-like structure where nodes can have multiple parents, used in compilers to
represent expressions efficiently by sharing common subexpressions. Unlike trees, DAGs
eliminate duplicate computations by reusing nodes.

What is DAG?
Basic Concept:

 Directed: Edges have direction (parent → child)


 Acyclic: No cycles or loops
 Graph: Nodes connected by edges
 Shared Nodes: Common subexpressions share same node

Difference from Syntax Tree:

 Syntax Tree: Each subexpression has separate node


 DAG: Common subexpressions share single node

DAG Construction Flowchart


Expression Input

Parse into Subexpressions

Check for Existing Nodes

Reuse if Found, Create if New

Connect with Directed Edges

Optimized DAG Structure

Example Construction
Input Expression: a + a * (b - c) + (b - c) * d

Step-by-step DAG Construction:

1. Parse subexpressions:
o a
o b - c (appears twice)
o a * (b - c)
o (b - c) * d
Final addition
o
2. DAG Structure:

+
/|\
+*d
/| |/
a * (b-c) ← Shared node
/| /|\
a (b-c) b - c
/|\
b-c

Benefit: (b - c) computed only once, not twice!

Construction Algorithm
Steps:

1. Initialize: Create empty DAG


2. For each operand: Check if node exists
o If exists: Reuse node
o If new: Create node
3. For each operator: Create operator node
4. Connect: Link operator to operand nodes
5. Optimize: Merge identical subexpressions

Algorithm Pseudocode:
function buildDAG(expression) {
nodeTable = empty hashtable

for each subexpression in expression {


key = computeKey(subexpression)

if (nodeTable contains key) {


node = nodeTable[key] // Reuse
} else {
node = createNode(subexpression)
nodeTable[key] = node // Store
}

connectToParents(node)
}

return rootNode
}

Types of Nodes in DAG


1. Leaf Nodes

 Variables: a, b, c
 Constants: 123, 3.14, "hello"

2. Internal Nodes

 Operators: +, -, *, /, %
 Functions: sin, cos, sqrt
 Assignments: =

3. Root Node

 Final result of expression

Advantages (Simple Words)


 Space-efficient: Less memory usage
 Optimization-friendly: Easy to spot common subexpressions
 Code-quality: Better generated code
 Elimination: Removes redundant computations
 Sharing: Reuses computed values
 Compact: Smaller representation

Disadvantages (Simple Words)


 Complex: Harder to construct than trees
 Maintenance: Difficult to modify
 Debugging: Complex to trace
 Implementation: More coding required
 Memory-management: Pointer complexity
 Analysis: Harder static analysis

Uses (Simple Words)


 Compilers: Expression optimization
 Interpreters: Efficient evaluation
 Optimizers: Common subexpression elimination
 Calculators: Avoiding recalculation
 Code-generators: Better assembly code
 Analyzers: Program analysis tools

Practical Example
Expression: x = a + b; y = a + b + c;

Without DAG (Syntax Trees):

Tree 1: = Tree 2: =
/\ /\
x + y +
/\ /\
a b + c
/\
a b

Computation: a + b calculated twice

With DAG:

= =
/\ /\
x \ y +
\ /\
+ ←----+ c
/\
a b

Computation: a + b calculated once, reused

Three Address Code

Simple Definition
Three Address Code: An intermediate code form where each instruction has at most three
addresses (two operands and one result). It's like simplified assembly language that's easier for
compilers to work with than complex expressions.

What is Three Address Code?


Basic Format:
result = operand1 operator operand2

Examples:

 t1 = a + b (addition)
 t2 = x * y (multiplication)
 z = t1 - t2 (using temporary variables)

Key Features:

 At most 3 addresses per instruction


 One operator per instruction
 Temporary variables (t1, t2, t3...) for intermediate results
 Linear sequence of instructions
Three Address Code Generation Flowchart
High-Level Expression

Parse Expression Tree/DAG

Break Complex Operations

Generate Temporary Variables

Create Linear Instructions

Three Address Code Sequence

Types of Three Address Instructions


1. Assignment Statements
x = y op z (binary operation)
x = op y (unary operation)
x=y (copy)

2. Control Flow
goto L (unconditional jump)
if x goto L (conditional jump)
if x op y goto L (conditional with operation)

3. Procedure Calls
call p, n (call procedure p with n parameters)
return y (return with value)
param x (parameter passing)

4. Array Operations
x = y[i] (array access)
x[i] = y (array assignment)

5. Pointer Operations
x = &y (address of)
x = *y (dereference)
*x = y (indirect assignment)
Example Conversions
Example 1: Simple Expression

Input: a = b + c * d

Three Address Code:

t1 = c * d
t2 = b + t1
a = t2

Example 2: Complex Expression

Input: a = (b + c) * (d - e)

Three Address Code:

t1 = b + c
t2 = d - e
t3 = t1 * t2
a = t3

TAC for Control Structures, Triples, Quadruples, and


Boolean Expressions

Simple Definitions
TAC for Control Structures: Three Address Code representation of if-else, loops, and switch
statements using labels and conditional jumps.

Triples: TAC representation using 3 fields - operator and two operand positions (no explicit
result field).

Quadruples: TAC representation using 4 fields - operator, two operands, and result.

Boolean Expressions: Logical expressions (AND, OR, NOT) converted to TAC with short-
circuit evaluation.

1. TAC for Various Control Structures


IF-ELSE Statement

Input Code:
c
if (a > b)
x = y + z;
else
x = y - z;

TAC:
if a <= b goto L1
t1 = y + z
x = t1
goto L2
L1: t2 = y - z
x = t2
L2: // continue

Flowchart:
Condition Check → True Branch → False Branch → Merge Point

WHILE Loop

Input Code:
c
while (i < n) {
sum = sum + i;
i = i + 1;
}

TAC:
L1: if i >= n goto L2
t1 = sum + i
sum = t1
t2 = i + 1
i = t2
goto L1
L2: // exit loop
FOR Loop

Input Code:
c
for (i = 0; i < n; i++) {
a[i] = b[i] + c[i];
}

TAC:
i=0
L1: if i >= n goto L3
t1 = 4 * i
t2 = b[t1]
t3 = c[t1]
t4 = t2 + t3
a[t1] = t4
t5 = i + 1
i = t5
goto L1
L3: // exit loop

SWITCH Statement

Input Code:
c
switch (x) {
case 1: y = a; break;
case 2: y = b; break;
default: y = c;
}

TAC:
if x == 1 goto L1
if x == 2 goto L2
goto L3
L1: y = a
goto L4
L2: y = b
goto L4
L3: y = c
L4: // continue

2. Representing TAC using Triples and Quadruples


Example Expression: a = b + c * d

Quadruples (4 fields: op, arg1, arg2, result)


Index Op Arg1 Arg2 Result

0 * c d t1

1 + b t1 t2

2 = t2 a

Structure:

cpp
struct Quadruple {
string op; // operator
string arg1; // first operand
string arg2; // second operand
string result; // result
};

Triples (3 fields: op, arg1, arg2)


Index Op Arg1 Arg2

0 * c d

1 + b (0)

2 = (1) a

Structure:

cpp
struct Triple {
string op; // operator
string arg1; // first operand
string arg2; // second operand
// Result is implicit (instruction index)
};

Comparison: Triples vs Quadruples

Advantages:

Triples:

 Space-efficient: Uses less memory


 Compact: No explicit result field
 Index-based: References by position

Quadruples:

 Clear: Explicit result field


 Relocatable: Easy to reorder
 Readable: Self-contained instructions

Disadvantages:

Triples:

 Reordering-difficult: Hard to optimize


 Index-dependent: Position matters
 Complex: Indirect references

Quadruples:

 Space-intensive: More memory needed


 Redundant: Extra result field
 Temporaries: Many temp variables

3. Boolean Expressions
Simple Boolean Expression

Input: a && b || c

TAC with Short-Circuit:


if a == false goto L1
if b == true goto L2
L1: if c == true goto L2
t1 = false
goto L3
L2: t1 = true
L3: // result in t1

Complex Boolean Expression

Input: (a > b) && (c < d) || (e == f)

TAC:
if a <= b goto L1
if c >= d goto L1
goto L2
L1: if e != f goto L3
L2: t1 = true
goto L4
L3: t1 = false
L4: // result in t1

Boolean Expression Flowchart:


First Condition → Short-Circuit Check → Second Condition → Final Result

4. Control Structures with Boolean Expressions


Nested IF-ELSE

Input Code:
c
if (a > b && c < d) {
if (x == y)
z = 1;
else
z = 2;
} else {
z = 3;
}

TAC:
if a <= b goto L3
if c >= d goto L3
if x != y goto L2
z=1
goto L4
L2: z = 2
goto L4
L3: z = 3
L4: // continue

WHILE with Complex Condition

Input Code:
c
while (i < n && array[i] != 0) {
sum += array[i];
i++;
}

TAC:
L1: if i >= n goto L3
t1 = 4 * i
t2 = array[t1]
if t2 == 0 goto L3
t3 = sum + t2
sum = t3
t4 = i + 1
i = t4
goto L1
L3: // exit loop

Implementation Examples
Quadruple Generation:
cpp
class QuadrupleGenerator {
vector<Quadruple> quads;
int tempCounter = 1;

string newTemp() {
return "t" + to_string(tempCounter++);
}
void emit(string op, string arg1, string arg2, string result) {
quads.push_back({op, arg1, arg2, result});
}

string genExpression(Node* node) {


if (node->isLeaf()) {
return node->name;
}

string left = genExpression(node->left);


string right = genExpression(node->right);
string temp = newTemp();

emit(node->op, left, right, temp);


return temp;
}
};

Triple Generation:
cpp
class TripleGenerator {
vector<Triple> triples;

int emit(string op, string arg1, string arg2) {


triples.push_back({op, arg1, arg2});
return [Link]() - 1; // Return index
}

string genExpression(Node* node) {


if (node->isLeaf()) {
return node->name;
}

string left = genExpression(node->left);


string right = genExpression(node->right);
int index = emit(node->op, left, right);
return "(" + to_string(index) + ")";
}
};

Memory Representations
Quadruple Table:
┌───┬────┬──────┬──────┬────────┐
│ # │ Op │ Arg1 │ Arg2 │ Result │
├───┼────┼──────┼──────┼────────┤
│0│+ │a │b │ t1 │
│ 1 │ * │ t1 │ c │ t2 │
│ 2 │ = │ t2 │ │x │
└───┴────┴──────┴──────┴────────┘

Triple Table:
┌───┬────┬──────┬──────┐
│ # │ Op │ Arg1 │ Arg2 │
├───┼────┼──────┼──────┤
│0│+ │a │b │
│ 1 │ * │ (0) │ c │
│ 2 │ = │ (1) │ x │
└───┴────┴──────┴──────┘

Uses in Compiler Design


Advantages (Simple Words):

 Optimization-friendly: Easy to apply optimizations


 Machine-independent: Portable representation
 Analysis-suitable: Good for data flow analysis
 Debugging: Step-by-step execution tracking

Disadvantages (Simple Words):

 Verbose: Many instructions generated


 Temporaries: Lots of temporary variables
 Complexity: Simple code becomes multi-step

Uses (Simple Words):

 Compilers: Intermediate code generation


 Optimizers: Code optimization phase
 Interpreters: Direct execution engine
 Analyzers: Program analysis tools

This comprehensive approach to TAC representation provides the foundation for efficient
compilation, optimization, and code generation in modern compiler systems.

 develop Runtime environments


1. Storage Allocation
Definition:

Management of memory space for program variables, constants, and temporary data during
execution.

Storage Areas:

Code Area (Text Segment)

 Content: Machine instructions


 Properties: Read-only, shared
 Size: Fixed at compile time

Static/Global Area

 Content: Global variables, constants


 Properties: Fixed location, entire program lifetime
 Size: Known at compile time

Stack Area

 Content: Local variables, parameters, return addresses


 Properties: LIFO, automatic management
 Size: Dynamic, grows/shrinks with calls
Heap Area

 Content: Dynamic memory allocation


 Properties: Manual management, fragmentation
 Size: Dynamic, programmer controlled

Storage Allocation Flowchart:


Program Start → Initialize Static Area → Stack Frame Creation →
Local Variable Allocation → Execution → Stack Frame Destruction → Program End

Advantages (Simple Words):

 Organized: Clear memory layout


 Efficient: Optimized access patterns
 Automatic: Stack management handled automatically
 Predictable: Known memory requirements

Disadvantages (Simple Words):

 Limited: Stack size constraints


 Fragmentation: Heap memory fragmentation
 Overhead: Management overhead
 Complexity: Multiple allocation strategies

Uses (Simple Words):

 Compilers: Memory layout planning


 Runtime-systems: Memory management
 Operating-systems: Process memory management
 Garbage-collectors: Automatic memory cleanup

2. Storage Allocation Strategies


Static Allocation

Characteristics:

 Memory allocated at compile time


 Fixed locations throughout execution
 No runtime allocation overhead

Example:
c
int global_var = 10; // Static allocation
static int count = 0; // Static allocation

Advantages:

 Fast: No allocation overhead


 Simple: Direct addressing
 Predictable: Known memory layout

Disadvantages:

 Inflexible: Cannot change size


 Wasteful: Memory always allocated
 Limited: No dynamic structures

Stack Allocation

Characteristics:

 Memory allocated on procedure calls


 LIFO (Last In, First Out) order
 Automatic deallocation on return

Example:
c
void function() {
int local_var; // Stack allocation
char buffer[100]; // Stack allocation
}

Stack Growth:
High Memory
┌─────────────┐
│ Free │
├─────────────┤ ← Stack Pointer
│ Function C │
├─────────────┤
│ Function B │
├─────────────┤
│ Function A │
├─────────────┤
│ Main │
└─────────────┘
Low Memory
Dynamic/Heap Allocation

Characteristics:

 Memory allocated at runtime


 Programmer controlled
 Explicit allocation/deallocation

Example:
c
int* ptr = malloc(sizeof(int) * 100); // Heap allocation
free(ptr); // Explicit deallocation

3. Heap Management
Definition:

Management of dynamically allocated memory including allocation, deallocation, and garbage


collection.

Heap Management Flowchart:


Allocation Request → Find Free Block → Split if Necessary →
Return Pointer → Program Use → Deallocation → Merge Adjacent Blocks

Allocation Strategies:

First Fit

 Method: Use first block that fits


 Advantage: Fast allocation
 Disadvantage: External fragmentation

Best Fit

 Method: Use smallest block that fits


 Advantage: Minimizes waste
 Disadvantage: Slow search, fragmentation

Worst Fit

 Method: Use largest available block


 Advantage: Large remaining blocks
 Disadvantage: Wastes large blocks
Fragmentation Types:

Internal Fragmentation

 Cause: Allocated block larger than needed


 Example: Request 100 bytes, get 128 bytes
 Solution: Better size classes

External Fragmentation

 Cause: Free blocks too small


 Example: Many small scattered free blocks
 Solution: Compaction, coalescing

Garbage Collection:

Mark and Sweep


Phase 1 (Mark): Traverse reachable objects from roots
Phase 2 (Sweep): Deallocate unmarked objects

Copying Collection
From-Space → Scan Live Objects → Copy to To-Space → Swap Spaces

Advantages (Simple Words):

 Flexible: Dynamic memory allocation


 Efficient: Memory reuse
 Automatic: Garbage collection available
 Scalable: Handles varying memory needs

Disadvantages (Simple Words):

 Fragmentation: Memory becomes fragmented


 Overhead: Management overhead
 Unpredictable: Allocation time varies
 Leaks: Memory leak possibilities

Uses (Simple Words):

 Dynamic-structures: Lists, trees, graphs


 Object-oriented: Object allocation
 Interpreters: Runtime object creation
 Databases: Buffer management
4. Activation Records
Definition:

Stack frame containing all information needed for a procedure call including parameters, local
variables, and control information.

Activation Record Structure:


┌─────────────────┐ ← Frame Pointer (FP)
│ Return Address │
├─────────────────┤
│ Saved Registers │
├─────────────────┤
│ Parameters │
├─────────────────┤
│ Local Variables │
├─────────────────┤
│ Temporaries │
└─────────────────┘ ← Stack Pointer (SP)

Components:

Return Address

 Purpose: Where to return after procedure


 Size: Typically one word
 Usage: Restored on procedure exit

Parameters

 Purpose: Arguments passed to procedure


 Order: Usually reverse order
 Access: Via frame pointer offset

Local Variables

 Purpose: Procedure's local data


 Allocation: Space reserved at call
 Access: Via frame pointer offset

Temporaries

 Purpose: Intermediate computations


 Allocation: Dynamic during execution
 Management: Compiler responsibility

Activation Record Creation Flowchart:


Procedure Call → Push Parameters → Save Return Address →
Allocate Locals → Execute Body → Restore State → Return

Example:
c
int factorial(int n) { // Parameter: n
int result; // Local variable: result
if (n <= 1) // Temporary for comparison
return 1;
else
return n * factorial(n-1);
}

Activation Record:

┌─────────────────┐
│ Return Address │ (where to return)
├─────────────────┤
│ Parameter: n │ (input value)
├─────────────────┤
│ Local: result │ (local variable)
├─────────────────┤
│ Temporaries │ (n-1, comparison result)
└─────────────────┘

5. Accessing Local and Non-Local Names


Block Structured Languages:

Languages where procedures can be nested, creating scope hierarchies.

Example:
pascal
program Main;
var x: integer; // Level 0 (global)
procedure A;
var y: integer; // Level 1

procedure B;
var z: integer; // Level 2
begin
z := x + y; // Access: local z, non-local y, global x
end;

begin
B;
end;

begin
A;
end.

Access Methods:

Static Links

 Method: Each frame points to enclosing scope


 Access: Follow chain to find variable
 Cost: O(nesting depth)

┌─────────────────┐
│ Procedure B │
│ Static Link ────┼──┐
├─────────────────┤ │
│ Procedure A │ ←┘
│ Static Link ────┼──┐
├─────────────────┤ │
│ Main Program │ ←┘
└─────────────────┘

Display

 Method: Array of pointers to each nesting level


 Access: Direct access via display[level]
 Cost: O(1) access time
Display Array:
┌───┬─────────────┐
│ 0 │ Main Frame │
├───┼─────────────┤
│ 1 │ Proc A Frame│
├───┼─────────────┤
│ 2 │ Proc B Frame│
└───┴─────────────┘

Access Flowchart:
Variable Reference → Check Local Scope → Check Static Link/Display →
Find Variable → Calculate Address → Access Value

6. Parameter Passing
Definition:

Methods of transferring arguments from caller to called procedure.

Parameter Passing Methods:

Call by Value

 Method: Copy argument value


 Modification: Changes don't affect original
 Cost: Copying overhead for large structures

c
void func(int x) {
x = 10; // Original unchanged
}

int main() {
int a = 5;
func(a); // a still equals 5
}

Call by Reference

 Method: Pass address of argument


 Modification: Changes affect original
 Cost: Indirection overhead

c
void func(int* x) {
*x = 10; // Original changed
}

int main() {
int a = 5;
func(&a); // a now equals 10
}

Call by Name

 Method: Substitute argument expression


 Evaluation: Each access re-evaluates
 Cost: High evaluation overhead

Call by Value-Result

 Method: Copy in, copy out


 Modification: Changes copied back
 Cost: Two copying operations

Parameter Passing Flowchart:


Procedure Call → Evaluate Arguments → Apply Passing Method →
Set Up Parameters → Execute Procedure → Handle Results

Advantages (Simple Words):

Call by Value: Safe, predictable, simple Call by Reference: Efficient, allows modification Call
by Name: Flexible, lazy evaluation

Disadvantages (Simple Words):

Call by Value: Copying overhead, no modification Call by Reference: Aliasing issues, safety
concerns Call by Name: Expensive, complex semantics

Uses (Simple Words):

Call by Value: Default in C, Java Call by Reference: C pointers, Pascal var parameters Call by
Name: Algol 60, macro systems
7. Symbol Table Organization
Definition:

Data structure storing information about identifiers including their names, types, scope, and
memory locations.

Symbol Table Entry Structure:


cpp
struct SymbolEntry {
string name; // Identifier name
Type type; // Data type
int scope; // Scope level
int offset; // Memory offset
bool isFunction; // Function flag
int paramCount; // Parameter count (if function)
};

Organization Methods:

Linear List

 Structure: Sequential array/list


 Search: Linear search O(n)
 Insert: O(1) at end
 Suitable: Small programs

Hash Table

 Structure: Hash function + buckets


 Search: Average O(1)
 Insert: Average O(1)
 Suitable: Large programs

Binary Search Tree

 Structure: Ordered tree


 Search: O(log n) average
 Insert: O(log n) average
 Suitable: Medium programs

Scope Stack

 Structure: Stack of symbol tables


 Search: Current scope first
 Insert: Current scope only
 Suitable: Block-structured languages

Symbol Table Flowchart:


Identifier Declaration → Check Current Scope → Insert if New →
Identifier Reference → Search Scopes → Return Information/Error

8. Data Structures Used in Symbol Tables


Hash Table Implementation:
cpp
class SymbolTable {
static const int TABLE_SIZE = 211;
vector<SymbolEntry> table[TABLE_SIZE];

int hash(string name) {


int sum = 0;
for (char c : name) {
sum += c;
}
return sum % TABLE_SIZE;
}

public:
void insert(SymbolEntry entry) {
int index = hash([Link]);
table[index].push_back(entry);
}

SymbolEntry* lookup(string name) {


int index = hash(name);
for (auto& entry : table[index]) {
if ([Link] == name) {
return &entry;
}
}
return nullptr;
}
};

Scope Management:
cpp
class ScopeManager {
stack<SymbolTable*> scopeStack;

public:
void enterScope() {
[Link](new SymbolTable());
}

void exitScope() {
delete [Link]();
[Link]();
}

SymbolEntry* lookup(string name) {


// Search from current scope outward
stack<SymbolTable*> temp = scopeStack;
while (![Link]()) {
SymbolEntry* entry = [Link]()->lookup(name);
if (entry) return entry;
[Link]();
}
return nullptr;
}
};

Memory Layout Example:


Symbol Table Memory:
┌─────────────────┐
│ Hash Table │
│ ┌─────┬───────┐ │
│ │ 0 │ var a │ │
│ ├─────┼───────┤ │
│ │ 1 │ func f│ │
│ ├─────┼───────┤ │
│ │ ... │ ... │ │
│ └─────┴───────┘ │
└─────────────────┘

Scope Stack:
┌─────────────────┐ ← Current Scope
│ Local Symbols │
├─────────────────┤
│ Function Scope │
├─────────────────┤
│ Global Scope │
└─────────────────┘

Advantages (Simple Words):

 Fast-access: Quick identifier lookup


 Organized: Structured information storage
 Scalable: Handles large programs
 Flexible: Supports various language features

Disadvantages (Simple Words):

 Memory-overhead: Extra storage needed


 Complexity: Multiple data structures
 Maintenance: Requires careful management
 Conflicts: Name collision handling

Uses (Simple Words):

 Compilers: Identifier resolution


 IDEs: Code completion, error checking
 Debuggers: Variable inspection
 Interpreters: Runtime symbol resolution

Integration Example
Complete Runtime Environment:
┌─────────────────────────────────────┐
│ Code Segment │
├─────────────────────────────────────┤
│ Static Data │
├─────────────────────────────────────┤
│ │
│ Stack │ ← Growing Down
│ ┌─────────────────────┐ │
│ │ Activation Record │ │
│ │ ┌───────────────┐ │ │
│ │ │ Return Addr │ │ │
│ │ ├───────────────┤ │ │
│ │ │ Parameters │ │ │
│ │ ├───────────────┤ │ │
│ │ │ Local Vars │ │ │
│ │ └───────────────┘ │ │
│ └─────────────────────┘ │
│ │
├─────────────────────────────────────┤
│ │
│ Heap │ ← Growing Up
│ │
└─────────────────────────────────────┘

This comprehensive runtime environment framework provides the foundation for efficient
program execution, memory management, and symbol resolution in compiled and interpreted
languages.

 Definition of basic block control flow


graphs
Definition of Basic Block Control Flow Graphs
Basic Block: A sequence of consecutive statements with one entry point and one exit point. No
jumps in or out of the middle.

Control Flow Graph (CFG): A directed graph where nodes are basic blocks and edges
represent possible flow of control between blocks.
DAG Representation of Basic Block
DAG (Directed Acyclic Graph): A tree-like structure representing computations within a basic
block.

Simple Definition: Shows how variables and expressions are computed, avoiding redundant
calculations.

Example:

a=b+c
d=b+c
e=d*2

DAG shows b + c computed once, used for both a and d.

Flowchart Structure: Input → Parse Statements → Build DAG Nodes → Connect


Dependencies → Output Optimized Code

Advantages of DAG
1. Eliminates redundant computations
2. Identifies common subexpressions
3. Reduces code size
4. Improves execution speed
5. Simplifies further optimizations

Disadvantages:

1. Memory overhead for DAG construction


2. Complex for large basic blocks
3. Time consuming to build

Sources of Optimization
1. Source Level: High-level language optimizations
2. Intermediate Code: Three-address code optimizations
3. Machine Level: Assembly/machine code optimizations
4. Target Specific: Architecture-dependent optimizations

Use: Improve program performance at different compilation stages.

Loop Optimization
Definition: Techniques to improve loop performance since loops execute repeatedly.

Types:

1. Code Motion: Move invariant code outside loop


2. Induction Variables: Optimize loop counters
3. Loop Unrolling: Reduce loop overhead
4. Loop Fusion: Combine similar loops

Flowchart: Identify Loop → Analyze Dependencies → Apply Optimization → Generate


Optimized Loop

Advantages: Faster execution, reduced iterations Disadvantages: Increased code size, complex
analysis

Loop Invariant Computation


Definition: Expressions whose values don't change during loop execution.

Example:

for(i=0; i<n; i++)


a[i] = x + y * z; // x + y * z is loop invariant

Optimization: Move temp = x + y * z outside loop.

Advantages: Reduces redundant calculations Disadvantages: Requires careful dependency


analysis

Peephole Optimization
Definition: Local optimization technique examining small "peephole" of instructions.
Common Optimizations:

1. Eliminate redundant loads/stores


2. Combine operations
3. Remove unnecessary jumps
4. Simplify algebraic expressions

Example:

Before: MOV R1, R2


MOV R2, R1
After: MOV R1, R2 (second instruction removed)

Advantages: Simple, effective for local improvements Disadvantages: Limited scope, misses
global optimizations

Issues in Code Generator Design


1. Instruction Selection: Choose appropriate machine instructions
2. Register Allocation: Assign variables to registers efficiently
3. Instruction Ordering: Sequence instructions optimally
4. Memory Management: Handle stack and heap efficiently

Challenges: Limited registers, complex instruction sets, target dependencies

A Simple Code Generator


Definition: Converts intermediate code to target machine code.

Steps:

1. Parse intermediate code


2. Select machine instructions
3. Allocate registers
4. Generate final code

Flowchart: Intermediate Code → Instruction Selection → Register Allocation → Code


Emission → Machine Code

Example: Three-address code a = b + c → ADD R1, R2, R3


Code Generation from DAG
Process: Convert DAG representation directly to efficient machine code.

Steps:

1. Traverse DAG in topological order


2. Generate code for each node
3. Reuse computed values
4. Optimize register usage

Advantages: Eliminates redundancy, efficient code Disadvantages: Complex traversal


algorithms

Machine Independent Optimization


Definition: Optimizations not specific to target architecture.

Global Data Flow Analysis


Definition: Analysis of how data flows through entire program across basic blocks.

Purpose: Understand variable usage patterns for optimization.

Flowchart: Program → Build CFG → Compute Data Flow → Apply Optimizations →


Optimized Program

Uses: Enables advanced optimizations like dead code elimination

Constant Propagation
Definition: Replace variables with constant values when possible.

Example:

x = 5;
y = x + 3; // becomes y = 8;

Advantages: Reduces computations, enables further optimizations Disadvantages: Requires


complex analysis for conditionals

Liveness Analysis
Definition: Determines which variables are "live" (will be used later) at each program point.

Live Variable: A variable whose value may be used before redefinition.

Uses:

 Register allocation
 Dead code elimination

Flowchart: CFG → Backward Analysis → Compute Live Sets → Optimization Decisions

Advantages: Efficient register use, removes dead code Disadvantages: Complex for large
programs

Common Subexpression Elimination


Definition: Identify and eliminate redundant computations of same expression.

Example:

a = b + c;
d = b + c; // eliminated, use value of a

Types:

1. Local: Within basic block


2. Global: Across basic blocks

Advantages: Reduces computations, faster execution Disadvantages: Increased register


pressure, complex analysis

Flowchart: Code → Identify Common Expressions → Build Expression Table → Replace


Redundant Computations → Optimized Code
All these optimizations work together to produce efficient, fast-executing machine code while
maintaining program correctness.

 Token recognition bridges the gap between text and structured


analysis

Review of CFG, Ambiguity, and Parsing Techniques

Simple Definitions
CFG (Context Free Grammar): A formal grammar with production rules where left side has
single non-terminal and right side has any combination of terminals and non-terminals.

Ambiguity: A grammar that can generate the same string with multiple different parse trees.

Parsing: Process of analyzing input string to determine its grammatical structure according to
grammar rules.

Top-Down Parsing: Building parse tree from root (start symbol) to leaves (terminals).

Bottom-Up Parsing: Building parse tree from leaves (terminals) to root (start symbol).

1. Context Free Grammar (CFG)


Definition:

Grammar consisting of terminals, non-terminals, production rules, and start symbol.

CFG Components:

 Terminals (T): Basic symbols (keywords, operators, identifiers)


 Non-terminals (N): Syntactic categories (Expr, Stmt, Term)
 Productions (P): Rules of form A → α
 Start Symbol (S): Root non-terminal

CFG Example:
E→E+T|T
T→T*F|F
F → (E) | id
CFG Flowchart:
Start Symbol → Apply Production → Expand Non-terminals →
Continue Until Terminals → Parse Tree Complete

Advantages (Simple Words):

 Expressive: Handles most programming constructs


 Recursive: Supports nested structures
 Parseable: Efficient parsing algorithms exist
 Standard: Widely used in compilers

Disadvantages (Simple Words):

 Limited: Cannot handle context-sensitive rules


 Ambiguous: May have multiple interpretations
 Complex: Can become unwieldy for large languages

Uses (Simple Words):

 Compilers: Language syntax definition


 Parsers: Syntax analysis tools
 Validators: Input validation systems
 Generators: Parser generator tools

2. Ambiguity of Grammars
Definition:

A grammar is ambiguous if there exists at least one string that can be derived using two or more
different parse trees.

Ambiguous Grammar Example:


E → E + E | E * E | (E) | id

String: id + id * id

Parse Tree 1 (+ first):

E
/|\
E+E
| /|\
id E * E
| |
id id

Parse Tree 2 (* first):

E
/|\
E*E
/|\ |
E + E id
| |
id id

Removing Ambiguity:
E→E+T|T (+ is left associative, lower precedence)
T→T*F|F (* is left associative, higher precedence)
F → (E) | id (Factors: parentheses and identifiers)

Ambiguity Detection Flowchart:


Grammar Input → Generate Strings → Check Multiple Parse Trees →
If Multiple Trees → Ambiguous | If Single Tree → Unambiguous

Types of Ambiguity:

Operator Precedence Ambiguity

 Problem: Unclear which operator binds first


 Solution: Define precedence levels

Associativity Ambiguity

 Problem: Unclear grouping for same precedence


 Solution: Define left/right associativity

Dangling Else Problem


stmt → if expr then stmt
| if expr then stmt else stmt
| other

Ambiguous: if E1 then if E2 then S1 else S2


Advantages of Removing Ambiguity:

 Deterministic: Single interpretation


 Predictable: Consistent parsing
 Implementable: Can build parsers

Disadvantages of Ambiguous Grammar:

 Confusion: Multiple meanings


 Unparseable: Cannot build deterministic parser
 Errors: Unexpected behavior

3. Introduction to Parsing
Definition:

Process of analyzing input token sequence to determine syntactic structure according to grammar
rules.

Parsing Process Flowchart:


Token Stream → Parser → Parse Tree/AST → Semantic Analysis

Types of Parsers:

Syntax-Directed

 Build explicit parse tree


 Good for educational purposes
 Memory intensive

Table-Driven

 Use parsing tables


 Efficient execution
 Complex construction

Recursive Descent

 Use recursive procedures


 Easy to understand
 Manual construction

Parser Classification:
Parsers
├── Top-Down
│ ├── Recursive Descent
│ ├── LL(k)
│ └── Predictive
└── Bottom-Up
├── LR(k)
├── SLR
├── LALR
└── Operator Precedence

4. Top-Down Parsing
Definition:

Parsing technique that starts from start symbol and tries to derive the input string by applying
production rules.

Top-Down Parsing Flowchart:


Start Symbol → Choose Production → Expand Non-terminal →
Match Terminals → Backtrack if Mismatch → Success/Failure

Problems with Top-Down Parsing:

Left Recursion

Problem: A → Aa | b Issue: Infinite recursion

Solution: Convert to right recursion

A → bA'
A' → aA' | ε

Left Factoring

Problem: A → ab | ac Issue: Cannot decide which production

Solution: Factor common prefix

A → aA'
A' → b | c
Advantages (Simple Words):

 Intuitive: Natural top-down approach


 Predictive: Can predict productions
 Error-friendly: Good error recovery

Disadvantages (Simple Words):

 Left-recursion: Cannot handle left recursion


 Backtracking: May need backtracking
 Inefficient: Can be slow

Uses (Simple Words):

 Compilers: Simple language parsers


 Interpreters: Expression evaluators
 Validators: Input format checkers

5. LL Grammars & Parsers


Definition:

LL(k) means Left-to-right scan, Leftmost derivation, k symbols lookahead.

LL(1) Grammar Requirements:

1. No left recursion
2. Left factored
3. Unique productions for each (A, a) pair

FIRST and FOLLOW Sets:

FIRST(α)

Set of terminals that can begin strings derived from α.

Algorithm:

1. If α is terminal, FIRST(α) = {α}


2. If α → ε, add ε to FIRST(α)
3. If α → X₁X₂...Xₙ, add FIRST(X₁) to FIRST(α)
4. If ε ∈ FIRST(X₁), add FIRST(X₂), and so on
FOLLOW(A)

Set of terminals that can appear immediately after non-terminal A.

Algorithm:

1. Add $ to FOLLOW(start symbol)


2. If A → αBβ, add FIRST(β) - {ε} to FOLLOW(B)
3. If A → αB or A → αBβ where ε ∈ FIRST(β),
add FOLLOW(A) to FOLLOW(B)

LL(1) Parsing Table Construction:


For each production A → α:
1. For each terminal a in FIRST(α), add A → α to M[A, a]
2. If ε in FIRST(α), for each terminal b in FOLLOW(A),
add A → α to M[A, b]

LL(1) Parsing Algorithm:


stack = [S, $]
input = w$
while stack ≠ [$]:
if top(stack) = input[0]:
pop stack, advance input
else if top(stack) is non-terminal A:
use table entry M[A, input[0]]
pop A, push production right side
else:
error

6. Error Handling in LL Parsers


Error Recovery Strategies:

Panic Mode

 Method: Skip tokens until synchronizing symbol


 Synchronizing: Statement terminators (;, }, end)
 Advantage: Simple implementation
 Disadvantage: May skip too much
Phrase Level

 Method: Replace/insert/delete tokens locally


 Example: Insert missing semicolon
 Advantage: Precise recovery
 Disadvantage: Complex implementation

Error Productions

 Method: Add productions for common errors


 Example: stmt → if expr then stmt else
 Advantage: Specific error messages
 Disadvantage: Grammar bloat

Global Correction

 Method: Find minimum token changes


 Advantage: Optimal correction
 Disadvantage: Computationally expensive

Error Handling Flowchart:


Parse Error → Identify Error Type → Apply Recovery Strategy →
Resume Parsing → Continue Until Success/Fatal Error

7. Recursive Descent Parsing


Definition:

Top-down parsing technique using recursive procedures for each non-terminal.

Implementation Structure:
cpp
class RecursiveDescentParser {
Token lookahead;

void expression() {
term();
while ([Link] == PLUS) {
match(PLUS);
term();
}
}

void term() {
factor();
while ([Link] == MULT) {
match(MULT);
factor();
}
}

void factor() {
if ([Link] == ID) {
match(ID);
} else if ([Link] == LPAREN) {
match(LPAREN);
expression();
match(RPAREN);
} else {
error();
}
}

void match(TokenType expected) {


if ([Link] == expected) {
lookahead = nextToken();
} else {
error();
}
}
};

Grammar to Code Translation:


Grammar Rule: A → aBc
Code:
void A() {
a();
B();
match('c');
}

Grammar Rule: A → a | b
Code:
void A() {
if (lookahead == 'a') {
match('a');
} else if (lookahead == 'b') {
match('b');
} else {
error();
}
}

Advantages (Simple Words):

 Simple: Easy to write and understand


 Flexible: Can add semantic actions easily
 Debuggable: Easy to trace execution
 Hand-coded: Full control over parsing

Disadvantages (Simple Words):

 Manual: Must write by hand


 Limited: Only works for LL grammars
 Maintenance: Hard to modify grammar

8. Predictive Parsers
Definition:

LL(1) parsers that use parsing table to predict which production to use.

Predictive Parser Structure:


┌─────────────┐
│Input Buffer │ → a + b $ ...
└─────────────┘

┌─────────────┐ ┌─────────────┐
│ Stack │ ←→ │Parsing Table│
│ E │ │ M[A,a] │
│ $ │ └─────────────┘
└─────────────┘

┌─────────────┐
│ Output │ → Production sequence
└─────────────┘

Parsing Table Example:


│ id │ + │ * │ ( │ ) │ $
─────┼─────┼─────┼─────┼─────┼─────┼─────
E │E→TE'│ │ │E→TE'│ │
E' │ │E'→+TE'│ │ │E'→ε │E'→ε
T │T→FT'│ │ │T→FT'│ │
T' │ │T'→ε │T'→*FT'│ │T'→ε │T'→ε
F │F→id │ │ │F→(E)│ │

Advantages (Simple Words):

 Efficient: Linear time parsing


 Deterministic: No backtracking
 Table-driven: Systematic approach
 Automated: Can be generated automatically

Disadvantages (Simple Words):

 Grammar-restricted: Only LL(1) grammars


 Table-size: Large tables for complex grammars
 Construction: Complex table construction

9. Bottom-Up Parsing
Definition:

Parsing technique that starts from input tokens and builds parse tree toward start symbol.

Bottom-Up Process:
Input: id + id * id
Step 1: id + id * id (reduce id to F)
Step 2: F + id * id (reduce F to T)
Step 3: T + id * id (reduce T to E)
Step 4: E + id * id (reduce id to F)
Step 5: E + F * id (reduce F to T)
Step 6: E + T * id (reduce id to F)
Step 7: E + T * F (reduce T*F to T)
Step 8: E + T (reduce E+T to E)
Step 9: E (accept)

Bottom-Up Flowchart:
Input Tokens → Shift to Stack → Look for Handle →
Reduce by Production → Continue Until Start Symbol

Key Concepts:

Handle

 Definition: Substring matching right side of production


 Property: Can be reduced to left side
 Detection: Core of bottom-up parsing

Viable Prefix

 Definition: Prefix of right-sentential form


 Property: No handle extends past right end
 Use: Determines when to shift vs reduce

10. Shift-Reduce Parsing


Definition:

Bottom-up parsing using shift and reduce operations on a stack.

Operations:

Shift

 Move next input symbol onto stack


 Used when no handle is found
Reduce

 Replace handle on stack with left side of production


 Used when handle is identified

Accept

 Input consumed and stack contains start symbol


 Successful parse completion

Error

 No valid action available


 Parse failure

Shift-Reduce Example:
Stack Input Action
$ id+id$ shift
$id +id$ reduce F→id
$F +id$ reduce T→F
$T +id$ reduce E→T
$E +id$ shift
$E+ id$ shift
$E+id $ reduce F→id
$E+F $ reduce T→F
$E+T $ reduce E→E+T
$E $ accept

Conflicts:

Shift-Reduce Conflict

 Problem: Can either shift or reduce


 Example: Dangling else problem
 Resolution: Define precedence/associativity

Reduce-Reduce Conflict

 Problem: Multiple reductions possible


 Cause: Grammar ambiguity
 Resolution: Rewrite grammar
11. LR Parsers
Definition:

LR(k) means Left-to-right scan, Rightmost derivation in reverse, k symbols lookahead.

LR Parser Structure:
┌─────────────┐
│Input Buffer │ → a b $ ...
└─────────────┘

┌─────────────┐ ┌─────────────┐
│ Stack │ ←→ │ Action/Goto │
│ s₀s₁s₂... │ │ Table │
└─────────────┘ └─────────────┘

LR Parsing Algorithm:
stack = [0]
while true:
state = top(stack)
symbol = lookahead
action = ACTION[state, symbol]

if action = "shift s":


push symbol, push s
advance input
elif action = "reduce A → β":
pop 2*|β| symbols
state = top(stack)
push A, push GOTO[state, A]
elif action = "accept":
return success
else:
error

LR Item:
A→α•β
where • indicates current position in production
Item Types:

 Kernel Items: Core items that define state


 Closure Items: Items derived from kernel items

12. Construction of SLR Parsing Tables


SLR (Simple LR):

Extension of LR(0) using FOLLOW sets to resolve conflicts.

SLR Construction Steps:

1. Construct LR(0) Items


I₀: S' → •S, S → •CC, C → •cC, C → •d

2. Compute GOTO Function


GOTO(I₀, S) = {S' → S•}
GOTO(I₀, C) = {S → C•C, C → •cC, C → •d}

3. Build ACTION Table


If A → α•aβ in Iᵢ and GOTO(Iᵢ, a) = Iⱼ:
ACTION[i, a] = shift j

If A → α• in Iᵢ and A ≠ S':
For all a in FOLLOW(A):
ACTION[i, a] = reduce A → α

SLR Example:
Grammar: S → CC, C → cC | d

States:
I₀: S' → •S, S → •CC, C → •cC, C → •d
I₁: S' → S•
I₂: S → C•C, C → •cC, C → •d
I₃: C → c•C, C → •cC, C → •d
I₄: C → d•
I₅: S → CC•
I₆: C → cC•
13. Canonical LR & LALR Parsing Tables
Canonical LR (CLR):

Uses full LR(1) items with complete lookahead information.

LR(1) Item:
[A → α•β, a]
where a is lookahead symbol

LALR (Look-Ahead LR):

Merges CLR states with same core but different lookaheads.

Comparison:
States Table Size Power
SLR Small Small Weak
LALR Medium Medium Medium
CLR Large Large Strong

LALR Construction:

1. Build CLR parser


2. Merge states with identical cores
3. Union lookahead sets
4. Check for conflicts

14. Parsing with Ambiguous Grammar


Precedence and Associativity:

Used to resolve conflicts in ambiguous grammars.

Precedence Declaration:
yacc
%left '+' '-'
%left '*' '/'
%right UMINUS
Conflict Resolution:

 Shift-Reduce: Use operator precedence


 Reduce-Reduce: Use rule ordering

Example:
E → E + E | E * E | -E | (E) | id

Conflicts resolved by:


- * has higher precedence than +
- + and * are left associative
- - (unary) has highest precedence

15. Operator Precedence Parsing


Definition:

Bottom-up parsing using operator precedence relations.

Precedence Relations:

a ⋖ b: a has lower precedence than b


a ≐ b: a has same precedence as b

a ⋗ b: a has higher precedence than b



Precedence Table:
│ + │ * │ ( │ ) │ id│ $
────┼───┼───┼───┼───┼───┼───
+ │⋗│⋖│⋖│⋗│⋖│⋗
* │⋗│⋗│⋖│⋗│⋖│⋗
( │⋖│⋖│⋖│≐│⋖│
) │⋗│⋗│ │⋗│ │⋗
id │ ⋗ │ ⋗ │ │ ⋗ │ │ ⋗
$ │⋖│⋖│⋖│ │⋖│

Parsing Algorithm:

1. Find handle using precedence relations


2. Reduce when right end is found
3. Continue until input consumed
16. YACC (Yet Another Compiler Compiler)
Definition:

Automatic parser generator that creates LALR parsers from grammar specifications.

YACC File Structure:


yacc
%{
/* C declarations */
%}

/* YACC declarations */
%token NUMBER ID
%left '+' '-'
%left '*' '/'

%%
/* Grammar rules */
expr: expr '+' expr { $$ = $1 + $3; }
| expr '*' expr { $$ = $1 * $3; }
| NUMBER { $$ = $1; }
| ID { $$ = lookup($1); }
;
%%

/* C subroutines */

YACC Features:

 Automatic: Generates parser code


 LALR: Uses LALR parsing algorithm
 Actions: Supports semantic actions
 Error: Built-in error recovery

YACC Usage:
bash
yacc grammar.y # Generate [Link].c
cc [Link].c -ly # Compile with YACC library

17. Error Handling in LR Parsers


Error Detection:

Error detected when no valid action in parsing table.

Error Recovery Strategies:

Panic Mode

 Method: Pop stack until error recovery state


 Synchronizing: Use statement separators
 Implementation: Error productions in grammar

Phrase Level

 Method: Modify stack/input locally


 Actions: Insert, delete, or replace tokens
 Goal: Resume normal parsing quickly

Error Productions

 Method: Add productions for common errors


 Example: Missing semicolon, unmatched parentheses
 Advantage: Specific error messages

Global Correction

 Method: Find minimal cost correction


 Algorithm: Dynamic programming approach
 Cost: Computationally expensive

Error Recovery Example:


yacc
stmt: expr ';'
| if_stmt
| error ';' { yyerror("Invalid statement"); }
;

expr: term
| expr '+' term
| expr error term { yyerror("Missing operator"); }
;

LR Error Recovery Flowchart:


Error Detected → Pop Stack States → Find Recovery State →
Discard Input Tokens → Insert Error Token → Resume Parsing

Summary Comparison
Parser Comparison Table:
Feature │ LL(1) │ SLR │ LALR │ CLR │ Precedence
───────────────┼───────┼──────┼──────┼──────┼───────────
Grammar Class │ Small │ Med │ Large│ Very │ Limited
Table Size │ Small │ Med │ Med │ Large│ Small
Parse Power │ Weak │ Med │ Good │ Best │ Limited
Construction │ Easy │ Med │ Hard │ Hard │ Easy
Error Handling │ Good │ Med │ Good │ Good │ Poor

Advantages and Disadvantages Summary:

Top-Down:

 Advantages: Intuitive, good-errors, predictive


 Disadvantages: Left-recursion, limited-grammars

Bottom-Up:

 Advantages: Powerful, handles-more grammars, efficient


 Disadvantages: Complex-construction, harder-errors

Uses: Compilers, interpreters, syntax-validators, language-processors

This comprehensive coverage provides the foundation for understanding parsing techniques and
their applications in compiler design and language processing systems.

You might also like