0% found this document useful (0 votes)
15 views156 pages

Compiler Structure and Lexical Analysis

The document provides an overview of compiler structure and the phases involved in translating high-level source code into machine code, including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimization, and code generation. It also discusses the roles of interpreters, including recursive and iterative interpreters, in executing programs directly without generating machine code. Additionally, it details the process of lexical analysis, including tokenization, error detection, and input buffering, which enhances the efficiency of the compilation process.

Uploaded by

paavanireddy842
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views156 pages

Compiler Structure and Lexical Analysis

The document provides an overview of compiler structure and the phases involved in translating high-level source code into machine code, including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, optimization, and code generation. It also discusses the roles of interpreters, including recursive and iterative interpreters, in executing programs directly without generating machine code. Additionally, it details the process of lexical analysis, including tokenization, error detection, and input buffering, which enhances the efficiency of the compilation process.

Uploaded by

paavanireddy842
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Module-I

INTRODUCTION TO COMPILER AND LEXICAL ANALYSIS


1.1 STRUCTURE OF A COMPILER
A compiler is a complex software program that translates high-level source code written in programming
languages (like C, Java, or Python) into machine code or intermediate code that can be executed by a
computer. The structure of a compiler is typically broken down into several distinct phases or components,
each responsible for a specific task in the translation process. Here's an overview of the main components
of a compiler:
Source Code

Lexical Analysis (Scanner) → Tokens

Syntax Analysis (Parser) → Parse Tree/AST

Semantic Analysis → Annotated AST or Intermediate Representation

Intermediate Code Generation → Intermediate Code

Optimization → Optimized Intermediate Code

Code Generation → Assembly/Machine Code

Code Linking and Assembly → Executable (Optional)
This process ensures that a high-level program is translated efficiently into executable machine code that
can be run on the desired hardware.
1. Lexical Analysis (Scanner)
• Purpose: The lexical analyzer (or scanner) reads the source code and converts it into a
sequence of tokens. Tokens are the smallest meaningful units of the language (e.g.,
keywords, operators, identifiers, literals).
• Output: A list of tokens.
• Functionality:
o Recognizes the basic syntactic components of the language.
o Ignores irrelevant details such as whitespace and comments.
2. Syntax Analysis (Parser)
• Purpose: The parser takes the list of tokens produced by the lexical analyzer and arranges
them into a syntax tree (also known as a parse tree) according to the grammar rules of the
programming language.
• Output: A parse tree or abstract syntax tree (AST).
• Functionality:
o Checks the syntactic correctness of the token sequence.
o Detects syntax errors and provides error messages.
3. Semantic Analysis
• Purpose: This phase checks the logical consistency of the program. It verifies whether the
program follows the semantic rules of the language (e.g., type checking, variable declaration,
scope rules).
• Output: Annotated syntax tree or intermediate representation (IR).
• Functionality:
o Ensures that operations are valid (e.g., checking types in expressions, ensuring
variables are defined before use).
o Resolves symbol references (e.g., variables, functions).
4. Intermediate Code Generation
• Purpose: The intermediate code generator translates the abstract syntax tree or the semantic
information into an intermediate representation (IR) that is easier to manipulate than the
original high-level code, but not yet machine-specific.
• Output: Intermediate representation (IR), which could be a form of assembly or another
lower-level code.
• Functionality:
o Simplifies the target machine’s instruction set while still being independent of a
specific machine architecture.
o Allows optimization and code generation to be more flexible.
5. Optimization
• Purpose: This phase improves the intermediate code to make the generated program run
more efficiently. Optimizations can be done at various levels, such as optimizing loops,
removing dead code, or reducing memory usage.
• Output: Optimized intermediate code.
• Functionality:
o Reduces the program's execution time or memory usage.
o Examples: Constant folding, dead code elimination, loop unrolling, etc.
6. Code Generation
• Purpose: The code generator translates the optimized intermediate code into the target
machine code or assembly code, which can be executed on the computer's hardware.
• Output: Machine code or assembly code.
• Functionality:
o Maps the intermediate representation to specific instructions for the target
architecture (e.g., x86, ARM).
o Ensures that the generated machine code follows the conventions of the target
platform, such as register allocation, memory addressing, etc.
7. Code Optimization (Optional)
• Purpose: Further optimization after code generation to make the machine code even more
efficient.
• Output: Optimized machine code.
• Functionality:
o Can involve optimizations like instruction scheduling, register allocation, etc., to
improve the performance of the machine code.
8. Code Linking and Assembly (Optional)
• Purpose: Linking involves combining the compiled program with libraries or other modules
that are needed to create an executable. The assembler converts assembly code into machine
code.
• Output: Executable file.
• Functionality:
o Resolves symbols and addresses in the code.
o Produces an executable or a relocatable object file.
9. Error Handling and Reporting
• Throughout all phases, the compiler needs to identify errors (e.g., lexical, syntax, semantic)
and provide useful feedback to the programmer.
• Functionality: Reports errors or warnings and provides information to help the programmer
correct issues.
1.2 INTERPRETATION – INTERPRETERS, RECURSIVE INTERPRETERS AND ITERATIVE
INTERPRETERS
In the context of compiler design, interpretation refers to the process where the compiler or interpreter
executes a program directly, rather than generating a complete machine code executable. This execution
is done either through a recursive interpreter or an iterative interpreter, each with different methods
of interpreting and executing the program. Below is an overview of each in the context of compiler
design.
1. Interpreter
An interpreter in the context of compiler design refers to a system that directly executes instructions in
a high-level programming language without first converting them into machine code. It processes the
source code line-by-line or statement-by-statement and performs the specified actions immediately.
Interpreters are often used in languages that are designed to be executed directly without a compilation
step (such as Python, JavaScript, or Ruby).
• Role in Compiler Design:
o Interpreters can be used in just-in-time compilation (JIT), where some parts of the
code are interpreted and then compiled into machine code at runtime for
optimization.
o In interpreted languages, the interpreter is responsible for managing and executing
the program without an intermediate machine code step.
o Interpreters generally focus on runtime execution, error handling, and providing
feedback as the code is executed.
• Features of an Interpreter:
o No Compilation to Machine Code: It translates and executes code directly.
o Line-by-Line Execution: The program is executed step by step.
o Error Reporting: Errors are identified during execution, rather than at compile time.
2. Recursive Interpreters
A recursive interpreter uses recursion to handle the execution of nested expressions, function calls, and
subprograms. In recursive interpreters, the interpreter repeatedly calls itself to handle nested expressions
or recursive function calls. This is particularly useful when interpreting languages that heavily rely on
recursive constructs (such as functional programming languages like Lisp or Scheme).
Characteristics of Recursive Interpreters:
• Recursive Evaluation: A recursive interpreter will invoke itself to handle expressions that
depend on recursive data structures or recursive functions. For example, evaluating a
recursive function will cause the interpreter to call itself with new arguments.
• Handling Recursion: Recursive interpreters naturally handle recursive calls without
needing a separate mechanism, such as a stack, to simulate recursion.
Role in Compiler Design:
• Abstract Syntax Trees (ASTs): Recursive interpreters can be used to traverse and evaluate
Abstract Syntax Trees (ASTs) produced by a parser.
• Symbol Tables: They often utilize symbol tables to track variables, functions, and scopes in
recursive function calls.
• Memory Consumption: Recursive interpreters tend to consume more memory and stack
space because each recursive call requires additional memory on the call stack.
Example:
In languages like Lisp or Scheme, where functions are often recursive by nature, a recursive interpreter is
well-suited for interpreting such recursive constructs. The interpreter might recursively evaluate an
expression that contains further nested expressions or function calls.
3. Iterative Interpreters
An iterative interpreter, as opposed to a recursive interpreter, processes the source code using loops
rather than recursive function calls. Instead of invoking itself recursively to evaluate sub-expressions, it
uses an explicit loop to process the program step by step.
Characteristics of Iterative Interpreters:
• Loop-Based Execution: Iterative interpreters use loops (such as while or for loops) to
sequentially fetch, parse, and execute each statement in the source code.
• Memory Efficiency: Iterative interpreters typically consume less memory because they
avoid the overhead of maintaining deep recursion stacks.
• Handling Recursion: While iterative interpreters do not inherently support recursion, they
can manage recursion manually by implementing an explicit stack to simulate the behavior.
Role in Compiler Design:
• Execution of Intermediate Representations: Iterative interpreters may execute the
intermediate representations generated during compilation (such as bytecode or intermediate
code), rather than the original source code.
• Optimization: In some cases, iterative interpreters may be used in a just-in-time (JIT)
compilation approach to optimize execution on the fly.
Example:
An example of an iterative interpreter could be one that interprets a simplified version of a programming
language, where statements like assignments, arithmetic operations, and function calls are processed in a
loop without recursion. An iterative interpreter might be used in earlier compiler phases or when dealing
with languages that are not highly recursive.
Comparison of Recursive vs. Iterative Interpreters in Compiler Design

1.3 LEXICAL ANALYSIS


Lexical analysis is the first phase of a compiler. During this phase, the lexical analyzer (also called a
scanner) breaks the input source code into a sequence of tokens. Tokens are the smallest units of
meaningful data in the programming language (such as keywords, identifiers, operators, literals, and
punctuation).
The primary role of lexical analysis is to ensure that the source code is divided into these manageable,
meaningful components for further processing by the syntax analyzer (parser). The lexical analyzer also
helps catch basic errors like unrecognized characters or malformed tokens.
1.3.1 THE ROLE OF THE LEXICAL ANALYZER
The lexical analyzer, also known as a scanner, is the first phase of the compiler and plays a crucial role
in transforming raw source code into a form that can be further processed by subsequent phases of the
compiler, such as the syntax analyzer (parser). The primary responsibility of the lexical analyzer is to
break the input source code into tokens, which are the smallest units of meaningful data in a
programming language.
Key Responsibilities of the Lexical Analyzer:
1. Tokenization:
o Tokenization is the process of dividing the source code into tokens, which are
categorized sequences of characters that represent syntactically meaningful elements
of the program.
o These tokens can represent a variety of constructs such as keywords (e.g., if, while),
identifiers (e.g., variable names), literals (e.g., numbers, strings), operators (e.g., +, -
, *), and punctuation (e.g., parentheses, semicolons).
o The lexical analyzer reads the raw source code character by character, groups
characters together based on predefined rules (such as regular expressions), and
produces these tokens.
2. Eliminating Whitespace and Comments:
o The lexical analyzer discards irrelevant elements like whitespace (spaces, tabs,
newlines) and comments, which do not contribute to the syntactic structure of the
program.
o These elements are useful for human readability but are unnecessary for the
compilation process. By removing them, the lexical analyzer simplifies the input
stream and reduces the work for later stages of the compiler.
3. Error Detection:
o The lexical analyzer helps catch lexical errors. A lexical error occurs when a
sequence of characters cannot be recognized as a valid token, such as an illegal
character or an invalid number format.
o For example, if the source code contains an invalid character (e.g., # in a language
where it is not allowed), the lexical analyzer will detect this and report an error before
the parsing phase.
o Error handling is a key part of the lexical analysis process. Lexical errors are
typically flagged, and informative error messages are provided, often pointing out
the exact location (line and column) of the issue.
4. Token Classification:
o The lexical analyzer classifies tokens into different token classes such as:
▪ Keywords: Reserved words with predefined meanings in the language (e.g.,
if, for, return).
▪ Identifiers: Names used to identify variables, functions, types, etc.
▪ Literals: Constant values such as numbers (5, 3.14) and strings ("hello").
▪ Operators: Symbols used to perform operations (e.g., +, -, *, =, ==).
▪ Delimiters and Punctuation: Symbols like semicolons (;), parentheses (()),
curly braces ({}), and commas (,).
o By recognizing these tokens, the lexical analyzer helps structure the source code into
meaningful units that the parser can later use to build a syntax tree.
5. Providing a Stream of Tokens to the Parser:
o After processing the input source code and breaking it into tokens, the lexical
analyzer passes the token stream to the next phase of the compiler, which is the
syntax analyzer (parser).
o The parser expects the input in the form of a sequence of tokens, and it is the lexical
analyzer’s job to ensure that the tokens are correctly identified and passed on to the
parser.
How the Lexical Analyzer Works:
• Input Processing: The lexical analyzer takes the source code as input, usually a sequence
of characters, and processes them to identify tokens.
• Token Recognition: Using patterns specified by regular expressions or finite automata, the
lexical analyzer matches substrings in the source code to specific token patterns (e.g.,
identifiers, literals, keywords, etc.).
• Token Generation: Each time a token is recognized, the lexical analyzer creates a token
object (or a simple representation of the token) and passes it to the parser.
Example of Lexical Analysis:
Consider the following simple source code:
int sum = 10 + 5;
The lexical analyzer would break this into the following tokens:
• Keyword: int
• Identifier: sum
• Operator: =
• Integer Literal: 10
• Operator: +
• Integer Literal: 5
• Punctuation: ;
These tokens are passed to the syntax analyzer (parser) for further processing.
Benefits of Lexical Analysis:
1. Simplifies Parsing:
o By breaking the input into tokens, the lexical analyzer simplifies the job of the syntax
analyzer. The parser only needs to handle structured tokens instead of raw characters.
2. Error Detection Early:
o It allows for early detection of certain types of errors, particularly lexical errors,
before the program is passed to more complex stages like parsing or semantic
analysis.
3. Optimized Execution:
o Lexical analysis enables more efficient execution in modern compilers. Techniques
like input buffering (e.g., double-buffering) improve performance by reducing the
number of input operations needed.
4. Modularity:
o Lexical analysis modularizes the compilation process by separating the task of
recognizing tokens from more complex tasks like syntax checking, making the
compiler easier to design and maintain.
By performing these tasks efficiently, the lexical analyzer ensures that the compiler can process the source
code correctly and efficiently, setting the stage for subsequent phases of compilation.
1.3.2 INPUTBUFFERING
Input buffering is a technique used by the lexical analyzer (scanner) in a compiler to efficiently read
and process characters from the source code during lexical analysis. The main goal of input buffering is
to minimize the overhead of repeated input operations and improve the performance of the lexical
analyzer.
Since the lexical analyzer processes the source code character by character, if it reads directly from the
input source (e.g., a file or a stream) every time a character is needed, it could result in a lot of unnecessary
I/O operations. To address this, buffering is used to hold a block of input characters in memory, allowing
the lexical analyzer to process them efficiently in bulk.
Key Concepts of Input Buffering
1. Buffering Basics:
o Buffer: A contiguous block of memory used to hold a portion of the input source
code. The lexical analyzer reads from the buffer instead of directly from the source
file.
o By using a buffer, the program can access characters from memory, which is faster
than accessing characters from disk or other input devices.
2. The Role of Input Buffering:
o The lexical analyzer needs to read the source code in chunks and process tokens.
Instead of constantly reading from the source (which can be slow), the input is read
into a buffer. The buffer acts as an intermediary between the input source and the
lexical analyzer.
o The buffer allows the lexical analyzer to look ahead and consume characters as
needed without having to frequently perform input operations.
3. Challenges:
o The main challenge is that the lexical analyzer often needs to look at multiple
characters ahead in the input (to determine where a token starts and ends).
o It also needs to handle situations where a token might span multiple reads from the
buffer or the source code.
Types of Input Buffering
There are two common types of input buffering techniques used in lexical analysis:
1. Single Buffering:
o In single buffering, only one buffer is used to hold the input characters.
o The lexical analyzer reads from this buffer character by character.
o Once the buffer is exhausted, it needs to be refilled from the source code, which may
cause some delays as the analyzer waits for new input.
Drawback: Single buffering requires the analyzer to frequently access the input source (e.g., disk or
terminal), which can be inefficient.
2. Double Buffering:
o In double buffering, two buffers are used to store input characters.
o While the lexical analyzer is reading from one buffer, the other buffer is
simultaneously being filled with the next block of characters from the source code.
o This method minimizes the idle time spent waiting for new input because one buffer
is always ready to be processed while the other is being populated with data from the
source code.
Advantage: Double buffering significantly improves performance by reducing the frequency of input
operations and ensuring that characters are always available for processing.
How Double Buffering Works
In double buffering, the input is divided into two buffers (usually named Buffer 1 and Buffer 2):
1. Initial Setup: The lexical analyzer starts by filling Buffer 1 with a chunk of characters from
the input source.
2. Processing: The lexical analyzer processes the characters in Buffer 1. While it is reading
and processing Buffer 1, Buffer 2 is simultaneously filled with the next chunk of input
characters from the source.
3. Buffer Swap: Once Buffer 1 is fully processed (or when more input is needed), the lexical
analyzer switches to Buffer 2 and starts reading from it. The roles of the buffers are swapped,
with Buffer 1 being filled with the next chunk of input while the lexical analyzer processes
Buffer 2.
4. Loop: This process continues, with the buffers alternating between being read and being
filled, allowing the lexical analyzer to process characters efficiently without waiting for
input operations.
Example of Double Buffering:
Let’s say the input source is the string "int sum = 10 + 5;". In double buffering:
• Buffer 1 could initially contain: "int sum = ".
• Buffer 2 could be filled simultaneously with the next part of the input: "10 + 5;".
The lexical analyzer reads the characters from Buffer 1 while Buffer 2 is being filled. Once Buffer 1 is
exhausted, the analyzer switches to Buffer 2, and Buffer 1 is filled with the next chunk of data.
Advantages of Input Buffering
1. Improved Efficiency:
o Input buffering reduces the number of times the lexical analyzer has to perform an
I/O operation, which makes the process of lexical analysis much faster, especially
for large programs.
o Double buffering is particularly effective because it ensures that the lexical analyzer
is never waiting for input; there's always a buffer ready for processing.
2. Reduced Latency:
o By having a second buffer ready, double buffering minimizes delays that would occur
if the analyzer had to wait for new characters to be read from the source.
3. Optimized Token Recognition:
o Input buffering ensures that the lexical analyzer can look ahead at characters to
correctly identify tokens, which is especially important for languages with complex
tokens (e.g., where a token might start in one buffer and end in another).
Example: How Buffering Helps Lexical Analysis
Consider the following input:
int a = 5 + 3;
Without input buffering, the lexical analyzer would read each character individually, possibly making
many trips to the source code and spending significant time in each one.
With double buffering, the lexical analyzer reads two blocks at a time (say, "int a =" in one buffer and "5
+ 3;" in another). This improves efficiency by reducing I/O overhead and enabling the lexer to quickly
process the input and tokenize it.
Buffering in Action in Lexical Analyzers
1. Efficient Reading: In a typical lexical analyzer, input buffering ensures that after the initial
load of characters, the analyzer spends most of its time processing tokens, not waiting for
more data.
2. Look-Ahead: Input buffering allows the lexer to efficiently perform look-ahead in
tokenization. For example, in the case of a multi-character token, such as an identifier or
keyword, the lexical analyzer can read ahead and determine where the token ends.
Input buffering is a crucial optimization in lexical analysis, enabling the lexical analyzer to efficiently
process input by reducing the number of direct I/O operations. Double buffering is particularly effective,
ensuring continuous input availability while the lexer is processing the current buffer. These techniques
improve the performance of the compiler, especially when analyzing large programs, by minimizing
delays caused by waiting for input and enabling faster token recognition.
1.3.3 SPECIFICATION OF TOKENS
In compiler design, the specification of tokens refers to the process of defining the patterns or rules that
identify the basic building blocks of a programming language. Tokens are the smallest units of meaning
in a programming language and serve as the input for the syntax analyzer (parser). These tokens can
include keywords, operators, identifiers, literals, and punctuation marks.
The process of token specification is typically done using regular expressions or context-free
grammars, which describe the pattern or structure of valid tokens.
1. What are Tokens?
A token is a sequence of characters in the source code that matches a pattern defined by the programming
language's syntax rules. Tokens are categorized into classes, each representing a specific type of construct
in the language.
Common types of tokens include:
• Keywords: Reserved words in the programming language that have a predefined meaning,
such as if, while, return, int, for, etc.
• Identifiers: Names used for variables, functions, classes, etc. An identifier typically starts
with a letter or underscore and may be followed by letters, digits, or underscores.
• Literals: Constants or fixed values in the program. These can be:
o Integer literals (e.g., 10, 0, -5).
o Floating-point literals (e.g., 3.14, 0.1, -0.56).
o String literals (e.g., "hello", "world").
o Character literals (e.g., 'a', '%').
• Operators: Symbols that represent operations, such as +, -, *, /, ==, etc.
• Delimiters: Punctuation marks used for separating different components in the source code,
such as ;, ,, (), {}, [], etc.
2. Specifying Tokens Using Regular Expressions
In lexical analysis, regular expressions (regex) are commonly used to define the patterns that tokens must
match. A regular expression describes a set of strings that belong to a particular class of tokens.
For example:
Identifiers: In many languages, an identifier consists of a letter (or underscore) followed by any number
of letters, digits, or underscores. The regular expression for an identifier might be:
[a-zA-Z_][a-zA-Z0-9_]*
This regex ensures that the token begins with a letter or underscore, followed by any number of
alphanumeric characters or underscores.
Integer Literals: An integer literal consists of one or more digits. The regex for an integer might be:
[0-9]+
This pattern matches one or more digits.
Keywords: Keywords are typically specified directly, as they are reserved words with a fixed set. For
example:
if | else | return | int | for
Operators: The regex for operators such as +, -, *, /, ==, etc., can be:
\+ | - | \* | / | == | != | < | >
String Literals: String literals are typically enclosed in double quotes and may contain escaped characters
like \n, \t, etc. The regex might look like:
\"(\\.|[^\\"])*\"
This regex matches a string literal, handling escape sequences within the string.
3. Token Specification Examples
Here’s a table of some common tokens and their corresponding regular expressions in a typical
programming language:
4. Using Finite Automata (DFA) for Token Specification
A deterministic finite automaton (DFA) is a mathematical model used by lexical analyzers to recognize
tokens. Regular expressions can be translated into a DFA, which then processes the input stream character
by character.
The process works as follows:
1. Conversion to DFA: The regular expressions for token classes are converted into a DFA.
2. Input Processing: The DFA processes the input source code character by character,
transitioning between states based on the input symbols.
3. Token Recognition: When the DFA reaches an accepting state, a token has been successfully
recognized, and the lexical analyzer outputs the token.
5. Example of Token Specification Process
Consider a simple language with the following token types:
• Keywords: if, else, int
• Identifiers: Any string starting with a letter or underscore, followed by letters, digits, or
underscores.
• Integer literals: A sequence of digits.
Here’s how token specification might work:
1. Keywords: We define keywords as a special class using direct matching (e.g., if, else, int).
2. Identifiers: We define identifiers using a regular expression, [a-zA-Z_][a-zA-Z0-9_]*.
3. Integer Literals: We define integer literals with the regular expression [0-9]+.
Sample Input: int x = 10;
The lexical analyzer would process this input and recognize the following tokens:
• Keyword: int
• Identifier: x
• Operator: =
• Integer Literal: 10
• Punctuation: ;
6. Lexical Analyzer Tools (e.g., LEX)
Tools like LEX (a lexical analyzer generator) automate the token specification process. A user writes
regular expressions for token classes in the LEX input file, and LEX generates the C code that implements
the lexical analyzer. For example:
%%
int { return KEYWORD_INT; }
[0-9]+ { return INTEGER; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
%%
This LEX specification defines tokens for the keyword int, integer literals, and identifiers, and generates
a scanner that can recognize these tokens.
The specification of tokens is an essential part of the lexical analysis phase in compiler design. By using
regular expressions or finite automata, the compiler defines patterns for different types of tokens,
including keywords, identifiers, literals, operators, and punctuation. This allows the lexical analyzer to
efficiently break down the source code into meaningful components, which can then be processed by later
phases of the compiler (like the syntax analyzer).
1.3.4 THE LEXICAL-ANALYZERGENERATOR LEX
LEX is a powerful tool used for generating lexical analyzers (also known as scanners) in compilers. It
automates the process of turning regular expressions, which specify the patterns of tokens, into a C-based
program that can identify those tokens in a given input stream. LEX is widely used to simplify the
implementation of the lexical analysis phase in a compiler.
What is LEX?
LEX is a lexical analyzer generator that takes a set of regular expressions (used to define the patterns of
tokens) as input and generates a C program that performs lexical analysis on a source code file. The
generated C code uses a finite automaton to recognize the defined patterns (tokens) and perform actions
based on what it matches.
How LEX Works
1. Input Specification: The user provides a specification file for the lexical analyzer, which
consists of a set of regular expressions for token classes and associated C code for actions
to be taken when tokens are matched.
2. Regular Expressions: LEX uses regular expressions to define patterns for different tokens.
For example, it might have regular expressions for keywords (if, else), identifiers (variable
names), integer literals, etc.
3. Action Code: Alongside each regular expression, the user provides action code (usually in
C) that is executed when a pattern is matched. The action might involve returning a token to
the parser, performing an operation, or printing an error message.
4. Finite Automaton Generation: LEX translates the regular expressions into a finite
automaton, which is a state machine that can process the input string character by character,
transitioning between states based on the input and matching tokens.
5. C Code Generation: After processing the regular expressions and actions, LEX generates a
C program that includes the logic to scan input based on the defined regular expressions.
This C program can then be compiled and used to perform lexical analysis.
Structure of a LEX Specification File
A LEX specification file generally consists of three sections:
1. Definitions Section:
o This section is used to define macros, regular expressions, or include necessary
libraries.
o It typically contains global declarations or defines for constants and regular
expressions.
2. Rules Section:
o This section is the heart of the LEX specification.
o Each rule consists of a regular expression (defining the token) followed by a block
of C code that is executed when the token is recognized.
o Rules are separated by new lines, and LEX processes them in the order they are
specified.
3. User Code Section:
o This section allows for additional C code that will be included in the generated
scanner.
o It often includes function definitions and main() function code that initializes the
scanner and calls the lexical analyzer functions.
Example LEX Specification
Here’s an example of a simple LEX specification for a lexical analyzer that recognizes integer literals,
identifiers, and a few keywords (if, else, int):
%{
/* Definitions section: include necessary libraries or macros */
#include <stdio.h>
#include <stdlib.h>
%}
/* Rules section: specify token patterns and actions */
%%
if { printf("Keyword: if\n"); }
else { printf("Keyword: else\n"); }
int { printf("Keyword: int\n"); }
[0-9]+ { printf("Integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); }
\n { /* ignore newlines */ }
[ \t] { /* ignore whitespace */ }
. { printf("Invalid character: %s\n", yytext); }
%%
/* User code section */
int main(int argc, char *argv[]) {
yylex(); /* Call the lexical analyzer */
return 0;
}
Explanation of the Example
• Definitions Section:
o This section includes the necessary libraries (stdio.h for input/output and stdlib.h for
general functions).
o There are no macros or advanced definitions in this simple example.
• Rules Section:
o if, else, and int: These are keywords in the language. When one of these keywords
is matched, the corresponding action (printf) is executed.
o [0-9]+: This regular expression matches one or more digits, representing an integer
literal. The action prints the matched number using yytext, which is a special
variable that holds the current matched token.
o [a-zA-Z_][a-zA-Z0-9_]*: This matches an identifier (a sequence starting with a
letter or underscore, followed by letters, digits, or underscores).
o [ \t]: This matches whitespace characters (spaces and tabs), which are ignored.
o \n: This matches a newline character, which is also ignored.
o .: This matches any character not matched by the previous rules and prints an error
message.
• User Code Section:
o The main() function calls yylex(), which is the function generated by LEX that
performs the lexical analysis.
o The yylex() function is the entry point for the lexical analyzer. It reads input, applies
the rules, and invokes the corresponding actions.
LEX Output
When this LEX file is processed, the lex tool generates a C file (e.g., [Link].c) that contains a lexical
analyzer. This C file is then compiled, resulting in an executable that can read an input stream and perform
lexical analysis. For example, given the input:
int x = 10;
if (x > 5) {
x = 20;
}
The output from running the generated program might be:
Keyword: int
Identifier: x
Invalid character: =
Integer: 10
Invalid character: ;
Keyword: if
Invalid character: (
Identifier: x
Invalid character: >
Integer: 5
Invalid character: )
Invalid character: {
Identifier: x
Invalid character: =
Integer: 20
Invalid character: ;
Invalid character: }
Key Features of LEX
1. Token Definition: LEX allows you to define tokens using regular expressions, which are
simple and concise.
2. Action Code: LEX allows you to specify C code that is executed when a token is recognized.
This can include returning tokens to the parser, printing information, or handling errors.
3. Efficient: LEX uses finite automata to match regular expressions efficiently. This ensures
that lexical analysis is fast even for large inputs.
4. Error Handling: LEX allows you to define actions for unmatched characters, making it
easy to handle lexical errors.
Advantages of Using LEX
1. Automation: LEX automates the creation of lexical analyzers, reducing the amount of code
that needs to be written manually.
2. Efficiency: The finite automaton approach used by LEX ensures that lexical analysis is
performed efficiently.
3. Portability: LEX generates C code, which can be compiled and run on different platforms,
making it portable across systems.
4. Flexibility: LEX allows you to define a wide range of token patterns and actions, making it
highly customizable for different programming languages.
MODULE-II
SYNTAX ANALYSIS
2.1 ThE ROLE OF THE PARSER
The parser is a critical component of a compiler's syntax analysis phase. After the lexical analyzer (lexer)
breaks the input source code into tokens, the parser takes these tokens and determines their syntactic
structure according to the grammar of the programming language. In other words, the parser checks if the
sequence of tokens forms valid expressions and statements, and builds an internal representation of the
program, typically a parse tree or abstract syntax tree (AST).
Key Responsibilities of the Parser:
1. Syntax Analysis:
o The parser ensures that the sequence of tokens provided by the lexical analyzer
conforms to the grammatical rules of the source language.
o The grammar defines the valid structure of statements and expressions in the
language. The parser checks the tokens against this grammar and verifies whether
they can be combined in valid ways.
2. Building the Abstract Syntax Tree (AST):
o Once the parser verifies the syntactic correctness of the input, it constructs a data
structure representing the hierarchical structure of the program, called the abstract
syntax tree (AST).
o The AST is a simplified, tree-like representation of the program that captures its
logical structure without worrying about unnecessary syntactic details (like
parentheses or commas).
o For example, the expression 3 + 5 * 2 would be represented in the AST as a tree
where * is a node with two children (5 and 2), and + is a parent of the * node and 3.
3. Error Detection:
o The parser is also responsible for detecting syntax errors. If the sequence of tokens
does not adhere to the grammar, the parser throws an error, providing useful feedback
about where and what went wrong in the source code.
4. Syntax Tree Construction:
o The parser typically constructs a parse tree or abstract syntax tree (AST). The
parse tree is a complete representation of the grammatical structure of the input,
whereas the AST is a more abstract, simplified version that omits unnecessary
details.
Types of Parsers:
There are two main types of parsers used in compilers:
1. Top-Down Parsers:
o These parsers begin at the root of the parse tree and work down to the leaves, trying
to match the grammar rules against the input tokens.
o Recursive Descent Parser: This is a common type of top-down parser that uses a
set of recursive procedures to process each non-terminal in the grammar. It requires
the grammar to be in LL(1) form (i.e., no ambiguity and no left recursion).
o Predictive Parsing: This is a more efficient version of recursive descent parsing,
which uses a lookahead symbol to make decisions about which rule to apply.
2. Bottom-Up Parsers:
o Bottom-up parsers start at the leaves (tokens) and work their way up to the root of
the parse tree. These parsers reduce the input tokens to non-terminals according to
the production rules of the grammar.
o LR Parsers: These are a class of bottom-up parsers that can handle a wide range of
grammars. They are more powerful than top-down parsers and can process LR(1)
grammars (i.e., grammars that can be parsed with a single lookahead token).
o Shift-Reduce Parsing: A common technique used by LR parsers. It involves shifting
tokens onto a stack and then reducing them to higher-level constructs as the parsing
process progresses.
Parsing Techniques:
1. LL Parsing (Top-Down):
o An LL parser reads input from left to right and applies leftmost derivation.
o LL parsers are usually simpler but have restrictions, such as the need for predictive
parsing and the inability to handle left-recursive grammars without transformation.
2. LR Parsing (Bottom-Up):
o An LR parser reads input from left to right and constructs a rightmost derivation
in reverse.
o LR parsers are more powerful than LL parsers and can handle a broader class of
grammars, including most programming language grammars.
o They are typically implemented using shift-reduce algorithms, where the input is
shifted onto a stack, and reductions are performed when applicable grammar rules
are matched.
Parse Tree vs Abstract Syntax Tree (AST):
• Parse Tree:
o The parse tree (also known as a concrete syntax tree) represents all the syntactic
details of the source code, adhering to the grammar of the language.
o Every grammar rule is explicitly represented in the parse tree.
o A parse tree is more detailed and is used primarily during parsing to validate the
input.
• Abstract Syntax Tree (AST):
o The AST is a more abstract, simplified version of the parse tree that omits extraneous
syntactic details (like parentheses, commas, etc.) that are not necessary for the
program's meaning.
o The AST focuses on the logical structure of the program, making it easier for
subsequent phases (such as semantic analysis and code generation) to operate on the
program's logic.
o For example, the expression 3 + 5 * 2 would be represented in the AST with + as the
root, 3 as a left child, and * as the right child with 5 and 2 as its children.
Example of Parsing:
Consider the following simple arithmetic expression:
3+5*2
Parse Tree: The parse tree for this expression (according to typical arithmetic grammar) would look like
this:
+
/\
3 *
/\
5 2
Abstract Syntax Tree (AST): The AST for this expression, omitting unnecessary syntactic elements,
would look like this:
+
/\
3 *
/\
5 2
In this case, the AST might look similar to the parse tree because the grammar for basic arithmetic is
simple. However, in more complex expressions, the AST would omit details like parentheses or
intermediate nodes.
Error Detection in the Parser:
• Syntax Errors: The parser can detect errors in the source code based on the grammar. If a
sequence of tokens does not match any valid rule, the parser will throw a syntax error, typically
indicating the position where the error occurred.
For example, an expression like:
3+*5
would be flagged by the parser as an error because the * operator is misplaced.
• Recovery: Parsers often have strategies for recovering from errors to continue parsing the rest of
the input. For example, an error recovery strategy might skip invalid tokens or try to synchronize
with the next valid construct.
The Role of the Parser in Compiler Phases:
1. Input: The parser receives tokens from the lexical analyzer.
2. Processing: It checks if the sequence of tokens follows the syntax rules of the language and
constructs a parse tree or AST.
3. Output: The parser outputs the abstract syntax tree (AST) or parse tree that will be passed
to the next phase of the compiler, such as semantic analysis or intermediate code generation.
2.2 ELIMINATING AMBIGUITY
Ambiguity in the context of programming languages and compiler design refers to situations where a
grammar can generate more than one valid parse tree for a given input string. Ambiguity can lead to
confusion and problems in both language parsing and interpretation, as it becomes unclear which structure
or meaning the string is supposed to represent.
In a compiler, eliminating ambiguity is a critical task, because an ambiguous grammar can make it
impossible to consistently and correctly parse programs. Ambiguity in a grammar can lead to multiple
interpretations of a given piece of source code, making the behavior of the program unpredictable.
Causes of Ambiguity
Ambiguity arises when the grammar allows multiple interpretations of the same string. For example, in
arithmetic expressions, the expression 3 + 4 * 5 could be interpreted in two ways:
1. (3 + 4) * 5 (First add, then multiply)
2. 3 + (4 * 5) (First multiply, then add)
The ambiguity comes from the lack of clear precedence rules between addition (+) and multiplication (*).
If the grammar allows both interpretations, then it is ambiguous.
Example of an Ambiguous Grammar
Consider the following simple grammar for arithmetic expressions:
E→E+E
E→E*E
E → (E)
E → id
This grammar is ambiguous because there is no specification for the precedence of operators like + and
*. For example, the string id + id * id could be parsed in two ways:
• (id + id) * id (where addition happens first)
• id + (id * id) (where multiplication happens first)
Both interpretations are valid, but they give different results for the same expression.
Methods to Eliminate Ambiguity
There are several techniques to eliminate ambiguity from a grammar:
1. Rewriting the Grammar (Refactor the Grammar)
A common approach to eliminating ambiguity is to rewrite the grammar to clarify precedence rules and
avoid ambiguous constructs.
For example, the ambiguous grammar for arithmetic expressions can be rewritten to make operator
precedence explicit:
E→E+T|T
T→T*F|F
F → (E) | id
In this grammar:
• E stands for an expression.
• T stands for a term (which handles multiplication).
• F stands for a factor (which could be an identifier or a subexpression).
This modified grammar clearly separates the addition (+) and multiplication (*) operators, assigning
higher precedence to multiplication (T), so that the string id + id * id is interpreted as id + (id * id), rather
than (id + id) * id.
2. Operator Precedence and Associativity Rules
Operator precedence and associativity rules can be explicitly incorporated into the grammar to ensure that
expressions are interpreted in the intended way. For example:
• Precedence: Multiplication has higher precedence than addition.
• Associativity: If two operators of the same precedence appear in an expression, the
associativity determines how they should be grouped. For most operators like + and *, the
associativity is left-to-right (left associative).
This can be done by creating separate non-terminals for different levels of precedence, as shown in the
previous example.
3. Use of Semantic Actions
While the syntax of a language can be ambiguous, semantic rules can often help resolve ambiguities. This
is done by defining the meaning of constructs that can be interpreted in different ways, based on the
context. For example:
• Context-sensitive parsing: The context of the expression could dictate whether + or *
should be parsed first, without needing to change the grammar. This approach involves
adding semantic checks during parsing (e.g., based on the types of operands).
4. Left Recursion Elimination
In some cases, ambiguity can be caused by left recursion in a grammar. Left recursion occurs when a
non-terminal symbol in a production rule references itself as the first symbol on the right-hand side.
For example, the following grammar is left-recursive and potentially ambiguous:
E→E+T
E→T
This can lead to infinite recursion in parsers that don't handle left recursion. To eliminate ambiguity, left
recursion can be removed by restructuring the grammar, as shown below:
E → T E'
E' → + T E' | ε
T → id | (E)
Here, E' handles the recursive part, and the grammar is now right-recursive, which is more easily handled
by parsers.
5. Use of Parsing Strategies (LL and LR)
Certain parsing techniques can handle ambiguity more effectively:
• LL(1) Parsers: These parsers look ahead one token and use a top-down approach. By
enforcing left-to-right parsing and leftmost derivation, these parsers can avoid ambiguity
through proper grammar design.
• LR(1) Parsers: These parsers use a bottom-up approach, which is capable of handling more
complex grammars and can deal with operator precedence and associativity more
effectively.
By adjusting the grammar and designing parsing strategies, ambiguity can often be controlled.
6. Disambiguation Using Priorities or Precedence Tables
For more complex cases of ambiguity (e.g., operator precedence), disambiguation rules such as
precedence tables or priority values can be applied. These tables or rules specify which production
should be chosen when multiple alternatives are available for the same input.
For example, a precedence table might specify that multiplication has higher precedence than addition,
ensuring that expressions like 3 + 5 * 2 are interpreted as 3 + (5 * 2).
By applying these methods, a grammar can be made unambiguous, enabling the compiler to accurately
parse and understand source code. Ambiguity elimination ensures that programs are parsed consistently
and that their meaning is clear to the compiler.
2.3 ELIMINATING OFLEFT RECURSION AND LEFT FACTORING
In the context of compiler design, left recursion and left factoring are two important concepts that can
cause issues when constructing parsers, particularly recursive descent parsers. These issues can lead to
infinite recursion or inefficient parsing. To ensure the grammar can be parsed efficiently, it is important
to eliminate left recursion and apply left factoring where necessary.
Let's go through each concept and how to eliminate them.
1. Eliminating Left Recursion
Left recursion occurs when a non-terminal symbol in a production rule refers to itself on the left side of
the rule, potentially causing infinite recursion during parsing. Left recursion is problematic for top-down
parsers (like recursive descent parsers), which process the input from left to right and expand the leftmost
non-terminal first. Left recursion can lead to infinite loops when trying to expand such non-terminals.
Example of Left Recursion
Consider the following grammar for arithmetic expressions:
E→E+T
E→T
In this grammar, the production E → E + T is left-recursive because the non-terminal E appears at the
beginning of the right-hand side. If a parser starts expanding E by applying the rule E → E + T, it will
continue expanding E indefinitely.
Elimination of Left Recursion
To eliminate left recursion, we can rewrite the grammar to make it right-recursive. The basic idea is to
introduce a new non-terminal that handles the recursive part in a rightward fashion. Here's how we can
transform the left-recursive grammar:
Original Grammar (with left recursion):
E→E+T
E→T
Rewritten Grammar (without left recursion):
E → T E'
E' → + T E' | ε
In this transformed grammar:
• E now produces T followed by E'.
• E' handles the recursion, producing + T E' (for recursive cases) or ε (for the base case).
• This is a right-recursive structure that no longer leads to infinite recursion.
General Approach for Elimination of Left Recursion
To eliminate left recursion in a grammar rule of the form:
A→Aα| β
Where:
• A is a non-terminal.
• α and β are sequences of grammar symbols (with α starting with A, causing left recursion).
We follow these steps:
1. Introduce a new non-terminal A'.
2. Rewrite the grammar as:
A → β A'
A' → α A' | ε
This effectively moves the recursive part to the right-hand side, preventing the infinite recursion problem.
2. Left Factoring
Left factoring is a technique used to eliminate ambiguity in a grammar that can lead to inefficiency in
predictive parsing (such as in recursive descent parsing). It occurs when there are two or more
alternatives that begin with the same prefix, making it difficult for the parser to decide which rule to apply
based on the first token.
Example of Left Factoring
Consider the following grammar for arithmetic expressions:
E→T+E
E→T-E
E→T
Here, the non-terminal E has two alternatives (T + E and T - E) that both start with the same token T. This
creates ambiguity because the parser will not know whether to apply T + E or T - E until it sees more
tokens.
Elimination of Left Factoring
To eliminate left factoring, we introduce a new non-terminal that captures the common prefix (T in this
case) and factors it out. Here's how we can transform the grammar:
Original Grammar (with left factoring):
E→T+E
E→T-E
E→T
Rewritten Grammar (after left factoring):
E → T E'
E' → + E | - E | ε
In this transformed grammar:
• E first produces T, followed by E'.
• E' handles the remaining part, which can be + E, - E, or nothing (represented by ε).
• This makes the decision about whether to apply the + or - operator unambiguous, as it comes
after the T.
General Approach for Left Factoring
To factor out a grammar rule of the form:
A→αβ|αγ
Where:
• A is a non-terminal.
• α is the common prefix in both alternatives.
• β and γ are the rest of the rules that follow the common prefix.
We rewrite the grammar as:
A → α A'
A' → β | γ
This allows the parser to first match the common prefix α, and then decide between β or γ.
3. Practical Example
Let’s consider a more complex grammar and eliminate both left recursion and left factoring:
Original Grammar (with both left recursion and left factoring):
E→E+T|E-T|T
T → id
• The rule E → E + T | E - T is left-recursive (because E appears on the left side).
• The alternatives E + T and E - T start with the same token E, leading to a need for left
factoring.
Step 1: Eliminate Left Recursion
First, eliminate the left recursion:
E → T E'
E' → + T E' | - T E' | ε
T → id
Now, the grammar is no longer left-recursive, but we still need to address left factoring.
Step 2: Eliminate Left Factoring
In the rewritten grammar E' → + T E' | - T E', we have alternatives that start with the same token (+ and -
). Left factoring can be done as follows:
E → T E'
E' → ( + | - ) T E' | ε
T → id
Now, the grammar is both free from left recursion and left factoring issues, making it suitable for a top-
down parser.
2.4 TOP-DOWN PARSING
Top-down parsing is a method of syntax analysis in which the parsing process starts from the start
symbol (root) of the grammar and works its way down to the leaves (the terminals). It attempts to match
the input string with the derivation of the grammar from the top to the bottom.
2.4.1 RECURSIVE DESCENT PARSING
Recursive descent parsing is one of the most straightforward and intuitive forms of top-down parsing. It
involves a set of recursive functions that correspond to the non-terminal symbols in the grammar. The
main idea is that each non-terminal has a recursive function that tries to match the corresponding
production rules of the grammar.
Key Concepts of Recursive Descent Parsing
1. Grammar: Recursive descent parsing works with context-free grammars (CFGs), which are
typically in LL(1) form. An LL(1) grammar means that it can be parsed with a left-to-right
scan of the input and a 1-token lookahead.
2. Recursive Functions: Each non-terminal in the grammar is represented by a function. The
function tries to match the corresponding production rules.
3. Backtracking: Some recursive descent parsers can backtrack when a rule doesn't work
(though this isn't always efficient). Backtracking is typically avoided in parsers for LL(1)
grammars because such grammars don't require it.
4. Lookahead: The parser uses a lookahead token to decide which rule to apply. For LL(1)
parsers, one lookahead token is enough to make a decision.
How Recursive Descent Parsing Works
Let’s consider a simple grammar for arithmetic expressions:
E→E+T|E-T|T
T → id
Step 1: Define Functions for Each Non-terminal
We define a recursive function for each non-terminal in the grammar. For example, for the non-terminal
E, we have a function parse_E(), which corresponds to the production rules for E. Similarly, we have
functions for T.
Step 2: Handle Each Rule in the Grammar
The recursive descent parsing function works by trying to match the input string with the grammar's rules.
If a match is found, it recurses on the next part of the string; if no match is found, it backtracks or signals
a parsing error.
For the grammar:
E→E+T|E-T|T
T → id
We define the following recursive functions:
1. parse_E(): This function will handle the rules for E.
2. parse_T(): This function will handle the rules for T.
Recursive Descent Parser for the Example Grammar
1. Function for E
We have the production rules:
E→E+T|E-T|T
This can be interpreted as:
• An expression E is either another expression E followed by + or - and a term T, or just a term
T.
def parse_E():
parse_T() # Start by parsing the first T
while lookahead == '+' or lookahead == '-':
consume(lookahead) # Consume the '+' or '-'
parse_T() # Parse the next term
2. Function for T
The rule for T is simpler:
T → id
This means a term is just an identifier (id), so we would define:
def parse_T():
if lookahead == 'id':
consume('id') # Consume the identifier token
else:
error() # If it's not an 'id', there's a syntax error
3. Token Consumption and Lookahead
The lookahead is the next token in the input string. The consume() function reads the current token and
moves the pointer to the next token. The lookahead is updated to the next token after each consumption.
Example of How the Recursive Descent Parser Works
Consider the input id + id - id.
1. parse_E() is called, which first calls parse_T().
2. parse_T() matches id and consumes it.
3. The lookahead token is now +, so the parser checks for the next operator in the parse_E()
function.
4. The parser consumes + and calls parse_T() again to parse the next term (id).
5. This process repeats, and eventually, the parser consumes all tokens and successfully parses
the entire input.
Advantages of Recursive Descent Parsing
1. Simplicity: Recursive descent parsing is simple and easy to implement. Each non-terminal
in the grammar corresponds to a function, making the parser structure easy to understand
and maintain.
2. Direct Mapping to Grammar: The parser closely follows the structure of the grammar.
Each production rule is translated directly into a recursive function.
3. Flexibility: Recursive descent parsers can easily handle certain types of grammar
(particularly LL(1) grammars), and they can be easily extended or modified.
Limitations of Recursive Descent Parsing
• Left Recursion: One major drawback of recursive descent parsers is that they cannot handle left-
recursive grammars. Left recursion occurs when a non-terminal calls itself on the left side of its
production (e.g., E → E + T), which leads to infinite recursion.
Example:
E→E+T|T
The parser would try to expand E indefinitely, leading to a stack overflow or infinite loop.
Solution: Left recursion can be eliminated by refactoring the grammar (as discussed previously in the
context of eliminating left recursion).
• LL(1) Restriction: Recursive descent parsers are suitable for LL(1) grammars, which are
grammars that can be parsed with a single lookahead token. They are not suitable for more
complex grammars, such as LR grammars, which require more sophisticated parsing strategies
(e.g., LR parsers).
• Backtracking: While some parsers support backtracking (attempting different parsing strategies
if one fails), backtracking is inefficient and can lead to poor performance in larger grammars or
more complex languages.
2.4.2 NON RECURSIVEPREDICTIVE PARSING
Non-recursive predictive parsing is a type of top-down parsing that constructs a parse tree for a given
input string using a predictive parser without the need for recursion. This method is particularly useful
for parsing grammars that are LL(1), meaning they can be parsed using a single token lookahead, and it's
typically implemented using a stack and table-based approach rather than function calls like recursive
descent parsing.
Key Concepts
1. Predictive Parsing: Predictive parsing is a form of LL parsing where the parser makes
predictions about which production rule to apply based on the current input token and the
current state. It uses a lookahead token (usually just one token) to decide which production
to apply.
2. Non-Recursive: Unlike recursive descent parsing, where each non-terminal has a
corresponding recursive function, non-recursive predictive parsing uses an explicit stack to
track the parsing process, which avoids the risk of stack overflow or infinite recursion.
3. LL(1) Grammar: Non-recursive predictive parsing works on LL(1) grammars, which can
be parsed with a single left-to-right scan of the input and a one-token lookahead. These
grammars must be unambiguous, and every decision about which rule to apply can be made
by inspecting the current input token.
How Non-Recursive Predictive Parsing Works
Non-recursive predictive parsing typically works using a parse table and a stack to simulate the recursive
process in a non-recursive manner. The general approach involves using a stack to keep track of the
current non-terminal being expanded and using a parsing table to determine which production rule to
apply based on the current non-terminal and lookahead token.
Steps Involved:
1. Initialize Stack: The stack is initialized with the start symbol of the grammar.
2. Start Parsing:
o The parser reads the input string from left to right.
o The stack holds the current non-terminal that needs to be expanded, starting with the
start symbol.
o The lookahead is the current token from the input string.
3. Predictive Decision:
o Based on the current non-terminal at the top of the stack and the lookahead token,
the parser consults the predictive parsing table to find which production rule to
apply.
o If the top of the stack is a non-terminal, the parser uses the appropriate production
rule from the table to expand it.
o If the top of the stack is a terminal and matches the lookahead, the terminal is popped
from the stack, and the input token is consumed (moved to the next token).
o If there is no matching entry in the table, a parsing error is raised.
4. Repeat the process until the stack is empty, indicating the input string has been successfully
parsed, or an error is encountered.
Example of Non-Recursive Predictive Parsing
Consider the following simple LL(1) grammar:
E → T E'
E' → + T E' | ε
T → id
• Start symbol: E
• Terminals: id, +
• Non-terminals: E, E', T
We need to construct a parsing table to guide the non-recursive parsing process.
Parsing Table Construction
To create the parsing table, we need to compute the FIRST and FOLLOW sets for each non-terminal:
1. FIRST sets:
o FIRST(E) = FIRST(T) = {id}
o FIRST(E') = {+, ε}
o FIRST(T) = {id}
2. FOLLOW sets:
o FOLLOW(E) = {$, +} (end of input or +)
o FOLLOW(E') = {$, +} (since E' is at the end of the production for E)
o FOLLOW(T) = {+, $} (from the productions of E and E')
Using these sets, we can now construct the parsing table.

The table entries represent the rules to apply for each non-terminal and lookahead combination.
Parsing Example
Let’s parse the input string id + id using the parsing table.
1. Initialization:
o Stack: [E] (start symbol)
o Input: id + id
o Lookahead: id
2. Step 1:
o The top of the stack is E, and the lookahead is id.
o According to the table, we apply E → T E'.
o Stack: [T, E']
3. Step 2:
o The top of the stack is T, and the lookahead is id.
o According to the table, we apply T → id.
o Stack: [E']
o Input: + id (consume id)
4. Step 3:
o The top of the stack is E', and the lookahead is +.
o According to the table, we apply E' → + T E'.
o Stack: [+, T, E']
5. Step 4:
o The top of the stack is +, and the lookahead is +.
o We match the lookahead with the top of the stack, and consume the +.
o Stack: [T, E']
o Input: id (consume +)
6. Step 5:
o The top of the stack is T, and the lookahead is id.
o According to the table, we apply T → id.
o Stack: [E']
o Input: `` (consume id)
7. Step 6:
o The top of the stack is E', and the lookahead is $ (end of input).
o According to the table, we apply E' → ε.
o Stack: [] (empty)
The stack is empty, and the input has been fully consumed, so the input string id + id has been successfully
parsed.
Advantages of Non-Recursive Predictive Parsing
1. Efficiency: Non-recursive predictive parsers avoid the overhead of recursion, making them
more efficient in terms of both memory and performance, particularly for large inputs.
2. Simple Implementation: The use of a stack and a parsing table makes the implementation
straightforward, and it's easy to understand compared to recursive descent parsers.
3. Error Detection: Predictive parsers can detect errors early by checking the parsing table for
valid production rules. If the table does not provide a valid rule for a given non-terminal and
lookahead token, an error is detected.
4. No Backtracking: Unlike recursive descent parsers, predictive parsers do not require
backtracking, which can lead to inefficiency.
Limitations of Non-Recursive Predictive Parsing
1. LL(1) Restriction: Non-recursive predictive parsers are limited to LL(1) grammars, which
are grammars that can be parsed with a single token lookahead. These grammars must be
unambiguous and should not require backtracking or multiple alternatives with the same
prefix.
2. Complexity of Grammar: For more complex grammars (i.e., those that are not LL(1)),
other parsing techniques like LR parsing or recursive descent with backtracking may be
necessary.
2.4.3 LL (1) GRAMMARS
LL(1) grammars are a subset of context-free grammars (CFGs) that can be parsed using a predictive
parser with a single lookahead token. The term LL(1) is derived as follows:
• L: Left-to-right scanning of the input.
• L: Leftmost derivation (the leftmost non-terminal is expanded first).
• 1: A single token lookahead (the parser looks at the next input token to decide the parsing
action).
An LL(1) grammar can be parsed by an LL(1) parser, which uses a single token of lookahead to decide
which rule to apply at each step of parsing. These grammars have specific properties that make them
suitable for such parsers.
Key Properties of LL(1) Grammars
1. Left-to-right: The input string is read from left to right, meaning the parser processes the
input sequentially, starting with the first token.
2. Leftmost Derivation: In a leftmost derivation, the leftmost non-terminal in the current
sentential form is always replaced first. This ensures the parser constructs the parse tree in a
left-to-right order.
3. 1-token Lookahead: At each step of parsing, the decision about which production to apply
is made using only the next token in the input. This is why LL(1) grammars are relatively
simple to parse with a single-token lookahead.
4. No Ambiguity: LL(1) grammars must be unambiguous, meaning there must be no ambiguity
in the choice of production rules when parsing any prefix of the input.
5. No Left Recursion: LL(1) grammars cannot have left recursion. This is because left
recursion can cause infinite recursion in top-down parsers (like recursive descent or
predictive parsers). Left recursion needs to be eliminated for a grammar to be LL(1).
Formal Definition of LL(1) Grammar
A grammar is LL(1) if for every non-terminal A and for every pair of distinct input tokens a and b:
• For each production rule A → X1 X2 ... Xn, the parser should be able to choose the correct
rule based solely on the next input token a.
The first condition of LL(1) grammar requires that for each non-terminal A, the sets of FIRST sets of
the productions of A must be disjoint (i.e., there must be no overlap between the FIRST sets).
• FIRST(A) is the set of terminal symbols that can appear at the beginning of a string derived
from A.
The second condition of LL(1) grammars requires that if a production A → ε (i.e., an epsilon production)
exists for a non-terminal A, then FOLLOW(A) (the set of terminal symbols that can appear immediately
to the right of A) must be disjoint with FIRST(A).
• FOLLOW(A) is the set of terminal symbols that can appear immediately to the right of A
in any derivation.
FIRST and FOLLOW Sets
The FIRST and FOLLOW sets are critical to understanding and constructing LL(1) parsers.
1. FIRST Set: The FIRST set of a non-terminal A, denoted as FIRST(A), is the set of terminals
that appear first in the derivation of A. For example:
o If A → aB is a production, then a is in FIRST(A).
o If A → ε (epsilon, the empty string) is a production, then ε is in FIRST(A).
2. FOLLOW Set: The FOLLOW set of a non-terminal A, denoted as FOLLOW(A), is the set
of terminals that can appear immediately to the right of A in some derivation of the grammar.
For example:
o If there is a production B → αAβ, then all the symbols in FIRST(β) are added to
FOLLOW(A).
o If A is the start symbol, then $ (end-of-input marker) is added to FOLLOW(A).
Constructing an LL(1) Parsing Table
An LL(1) parsing table guides the parsing process. It is a two-dimensional table where:
• The rows represent the non-terminals in the grammar.
• The columns represent the terminals (including the end-of-input marker $).
• Each entry in the table contains a production rule to apply for a given non-terminal and
lookahead terminal.
To construct the LL(1) table:
1. For each non-terminal A and terminal a, find the production rule A → X1 X2 ... Xn such
that a ∈ FIRST(X1 X2 ... Xn). Place this production rule in the table at position [A, a].
2. If A → ε is a valid production and a ∈ FOLLOW(A), place A → ε in the table at position
[A, a].
3. If multiple entries are found in any table position, the grammar is not LL(1).

Example of an LL(1) Grammar


Consider the following simple grammar:
E → T E'
E' → + T E' | ε
T → id
Step 1: Compute FIRST and FOLLOW Sets
1. FIRST(E) = FIRST(T) = {id}
2. FIRST(E') = {+, ε}
3. FIRST(T) = {id}
For the FOLLOW sets:
1. FOLLOW(E) = {$, +}
2. FOLLOW(E') = {$, +}
3. FOLLOW(T) = {+, $}
Step 2: Construct the Parsing Table

Step 3: Parsing Example


To parse the input id + id, follow these steps:
1. Start with the stack: [E] and input: id + id$.
2. The top of the stack is E, and the lookahead is id. According to the table, apply E → T E'.
3. The top of the stack is now T, and the lookahead is id. Apply T → id.
4. After consuming id, the stack is now [E'], and the lookahead is +.
5. The top of the stack is E', and the lookahead is +. Apply E' → + T E'.
6. The stack is now [T, E'] with the lookahead +.
7. The top of the stack is T, and the lookahead is id. Apply T → id.
8. After consuming id, the stack is [E'], and the lookahead is $.
9. The top of the stack is E', and the lookahead is $. Apply E' → ε.
10. The stack is empty, and the input has been fully consumed, so the parsing is successful.
2.4.4 A TRADITIONAL TOP-DOWNPARSER GENERATOR—YACC
YACC (Yet Another Compiler Compiler) is a traditional bottom-up parser generator widely used for
creating parsers in the context of compiler construction. Although YACC is mainly associated with
generating LR parsers, it's commonly mentioned in discussions about parser generators and is sometimes
considered alongside other parser generators like Bison or ANTLR, which are used for generating both
top-down and bottom-up parsers.
YACC Overview
YACC is a tool used to generate parsers based on a formal grammar. It works by taking a context-free
grammar (CFG) and generating a parser, typically of the LR (left-to-right, rightmost derivation) type,
which is capable of parsing a wide range of context-free grammars that a top-down parser (like recursive
descent) might struggle with. However, it can be used in conjunction with lexical analyzers to produce a
full compiler or interpreter.
Characteristics of YACC
1. Bottom-Up Parsing: YACC generates parsers using LR parsing, a class of bottom-up
parsers. These parsers construct the parse tree by starting from the leaves (input symbols)
and working their way up to the root (start symbol).
2. Table-Driven Approach: YACC generates parsing tables (action and goto tables) that are
used by an LL(1) or LR(1) parser to decide which production rules to apply at each step of
parsing.
3. Context-Free Grammars: YACC can generate parsers for context-free grammars (CFGs).
It is particularly suited for parsing grammars that involve more complex structures, such as
ambiguous grammars, that a simple top-down parser like recursive descent cannot handle.
4. LALR Parsing: YACC uses LALR(1) (Lookahead-LR) parsing, which is a more efficient
variant of SLR(1) and LR(1) parsing. LALR parsers combine the efficiency of LR parsers
with the reduced size of parsing tables, making them faster and more memory-efficient.
5. Integration with Lex: YACC works well in combination with Lex (or Flex, a more modern
alternative). Lex is used for lexical analysis (tokenizing the input), and YACC handles
syntactic analysis (parsing). Together, they form the core components of many compilers
and interpreters.
How YACC Works
YACC operates in two stages:
1. Grammar Specification: The user provides a formal grammar that defines the syntax of the
programming language (or data format) to be parsed. The grammar is written in a format
that resembles BNF (Backus-Naur Form) or EBNF (Extended BNF).
2. Parser Generation: YACC generates a C code file that contains the parsing tables and the
parser's logic. The generated parser can then be compiled and linked with a lexical analyzer
(produced by Lex or Flex) to create a full parser for the specified grammar.
YACC Syntax
A typical YACC input file consists of three sections:
Definitions Section: This section includes declarations for tokens (produced by Lex), as well as macros
or type definitions.
%token INTEGER PLUS MINUS
%type <int> expression
Rules Section: This section defines grammar rules. Each rule consists of a non-terminal followed by a
production and associated action. The action is written in C and specifies what should happen when a
rule is applied.
expression:
expression PLUS term { $$ = $1 + $3; }
| expression MINUS term { $$ = $1 - $3; }
| term
;
• The non-terminal is on the left side (e.g., expression).
• The production is on the right side (e.g., expression PLUS term).
• The action specifies what the parser should do when the rule is matched. In this case, it
performs addition or subtraction.
User Code Section: This section allows the inclusion of C code that is needed for initialization, error
handling, or other specific operations.
int yyerror(char *s) {
fprintf(stderr, "Error: %s\n", s);
return 0;
}
Example: Simple Calculator
Consider a simple calculator grammar that can evaluate arithmetic expressions with addition and
subtraction.
YACC Input Example
%{
#include <stdio.h>
#include <stdlib.h>
%}

%token INTEGER PLUS MINUS


%type <int> expression

%%

expression:
expression PLUS term { $$ = $1 + $3; }
| expression MINUS term { $$ = $1 - $3; }
| term
;

term:
INTEGER { $$ = $1; }
;

%%

int yyerror(char *s) {


fprintf(stderr, "Error: %s\n", s);
return 0;
}

int main(void) {
yyparse();
return 0;
}
• The grammar defines expression as an arithmetic expression involving addition and
subtraction (PLUS and MINUS), and term as a basic integer (INTEGER).
• The actions (e.g., $$ = $1 + $3;) specify how the results of parsing are calculated.
• The main() function calls yyparse(), which is the parsing function generated by YACC.
Lex Input Example
To pair with this YACC input, we also need a Lex file for tokenizing the input:
%{
#include "[Link].h"
%}

%%

[0-9]+ { yylval = atoi(yytext); return INTEGER; }


"+" { return PLUS; }
"-" { return MINUS; }
[ \t\n] { /* ignore whitespace */ }
. { return 0; }

%%
• This Lex file defines rules for recognizing integers ([0-9]+), plus (+), and minus (-).
• The yylval is set to the integer value of the token (for use by the parser).
Generating the Parser
To generate and compile the parser using YACC and Lex, you would follow these steps:
1. Run Lex to generate the lexical analyzer:
lex calc.l
2. Run YACC to generate the parser:
yacc -d calc.y
This creates two files: [Link].c (the parser code) and [Link].h (the header file with token definitions).
3. Compile the generated C code along with Lex-generated code:
gcc -o calc [Link].c [Link].c -ll -ly
4. Run the calculator:
./calc
You can then input arithmetic expressions like 3 + 4 - 2, and the calculator will evaluate them.
Advantages of YACC
1. Efficient Parsing: YACC generates LR parsers, which are very efficient for a large class
of context-free grammars. They can handle left recursion and ambiguous constructs that are
difficult for simple top-down parsers like recursive descent.
2. Error Handling: YACC provides built-in error handling mechanisms that help in detecting
syntax errors during parsing.
3. Extensibility: YACC can easily be extended with user-defined actions (written in C) for
performing semantic analysis, syntax-directed translation, or even code generation.
4. Well-Established: YACC has been in use for decades and has a vast ecosystem, with many
tutorials, documentation, and resources available.
Limitations of YACC
1. Bottom-Up Parsing: While YACC is great for bottom-up parsing (using LR parsing), it
doesn't generate top-down parsers, which are used in simpler parsing techniques like
recursive descent.
2. Complexity: YACC can be difficult to use with more complex or highly ambiguous
grammars. It also requires knowledge of LALR parsing and how to handle shift/reduce
conflicts, which can be challenging for beginners.
3. Requires Lex: YACC typically works in conjunction with Lex (or its modern variant, Flex)
for lexical analysis, adding an extra step in the compilation process.
4. Limited Error Reporting: While YACC can handle basic errors, it may not always provide
detailed or intuitive error messages, especially for more complex grammars.
2.5 BOTTOM-UP PARSING
2.5.1 SHIFT REDUCE PARSING
hift-reduce parsing is a common bottom-up parsing technique used in LR parsers. It builds the parse
tree starting from the leaves (tokens) and works its way upwards towards the root (start symbol). This
contrasts with top-down parsing, which begins with the start symbol and tries to derive the input string.
In a shift-reduce parser, the process is governed by two primary operations:
1. Shift: Move the next input token onto the stack.
2. Reduce: Replace a sequence of symbols on the stack that corresponds to a right-hand side
of a production rule with the left-hand side (the non-terminal) of that rule.
Shift-reduce parsers use a stack to store symbols and a buffer to hold the remaining input. The parser
repeatedly applies shift and reduce operations until it has successfully parsed the entire input.
Key Components of Shift-Reduce Parsing
1. Stack: Holds the symbols (both terminals and non-terminals) that have been shifted from
the input or reduced by production rules.
2. Input Buffer: Holds the remaining input tokens (starting from the first token not yet
processed).
3. Action: The parser decides whether to apply the shift operation or the reduce operation based
on the current state and the top of the stack.
4. Parsing Table: A table that helps the parser decide whether to shift or reduce based on the
current state and the lookahead token.
Shift-Reduce Parsing Steps
1. Shift Operation:
o The parser moves the next token from the input buffer onto the stack.
o This action is taken when the parser is not in a state where a reduction can be made.
2. Reduce Operation:
o The parser applies a reduction when the top of the stack matches the right-hand side
of a grammar rule.
o The sequence of symbols on the stack that matches the right-hand side of a
production is replaced with the non-terminal on the left-hand side of the production.
o This reduces the stack size, and the parser moves one step closer to the root of the
parse tree.
3. Accepting:
o The parser has successfully parsed the input if the stack contains only the start
symbol and the input buffer is empty.
4. Error Handling:
o If neither a shift nor a reduction is applicable, or if the parser encounters an
unexpected symbol, an error occurs.
Example: Shift-Reduce Parsing
Consider a simple grammar:
S →AB
A→a
B→b
Input String: a b
We will demonstrate how a shift-reduce parser processes this string using this grammar.
1. Initial State:
o Stack: [] (empty)
o Input: a b
o Action: Since the stack is empty and no reduction is possible, we perform a shift to
move a onto the stack.
2. After Shift (Move a onto the stack):
o Stack: [a]
o Input: b
o Action: No reduction possible. Perform shift to move b onto the stack.
3. After Shift (Move b onto the stack):
o Stack: [a, b]
o Input: (empty)
o Action: Now, a reduction is possible. According to the rule B → b, we reduce b to
B.
4. After Reduction (B → b):
o Stack: [a, B]
o Input: (empty)
o Action: Now, a reduction is possible again. According to the rule S → A B, we
reduce [a, B] to S.
5. After Reduction (S → A B):
o Stack: [S]
o Input: (empty)
o Action: The stack contains the start symbol S, and the input is empty, so the parser
accepts the input as valid.
Parsing Table for Shift-Reduce Parsing
A shift-reduce parser can be guided by a parsing table. In the case of LR parsers, the table typically
contains two types of entries:
1. Action Table: It tells the parser whether to shift or reduce based on the current state and
lookahead token.
o Shift means move the lookahead symbol onto the stack.
o Reduce means apply a grammar rule to reduce the symbols on the stack.
o Accept means the parsing is successful.
o Error means an error has occurred (no action is defined).
2. Goto Table: It tells the parser what state to transition to after a reduction.
Example of Parsing Table for a Simple Grammar
Let’s define a parsing table for the following grammar:
S →AB
A→a
B→b
1. States:
• State 0: The parser has just begun parsing.
• State 1: The parser has seen a and is waiting for B after A.
• State 2: The parser has seen b and can reduce using B → b.
• State 3: The parser has successfully parsed S → A B.
2. Action Table (indexed by state and lookahead token):

3. Goto Table (indexed by state and non-terminal):

Conflict in Shift-Reduce Parsing


In shift-reduce parsing, conflicts can occur, typically in the following two forms:
1. Shift/Reduce Conflict: This happens when the parser has to choose between shifting a token or
reducing using a rule, but both actions are valid for the current state and lookahead token.
Example:
S →A| B
A→a
B→a
If the input starts with a, the parser could either shift a or reduce a to A or B. This leads to a conflict.
2. Reduce/Reduce Conflict: This occurs when there are multiple possible reductions for the same portion
of the input.
Example:
S →A| B
A→a
B→a
After shifting a, the parser may encounter a choice between reducing it to A or reducing it to B.
To resolve conflicts, more sophisticated parsing techniques like LALR parsing (used by YACC) or GLR
parsing can be employed.
Advantages of Shift-Reduce Parsing
1. Efficient: Shift-reduce parsers can efficiently parse most programming languages and are
widely used in real-world compilers.
2. Handles Ambiguity: While LR parsers can handle certain ambiguities that recursive descent
parsers cannot, shift-reduce parsers are robust in handling them.
3. Memory Efficiency: Shift-reduce parsing is typically more memory-efficient than top-down
approaches like recursive descent because the parser does not need to maintain multiple
function calls or recursion depth.
2.5.2 LR PARSERS
LR Parsing is a class of bottom-up parsing techniques where the parsing process builds the parse tree
from the leaves (input tokens) up to the root (start symbol). It is particularly powerful because it can
handle a large subset of context-free grammars that other parsers, like recursive descent parsers, cannot.
In an LR parser, the "L" stands for left-to-right scanning of the input, and the "R" stands for rightmost
derivation in reverse. This means that an LR parser reads the input from left to right, and the derivation
of the input is done starting from the rightmost non-terminal.
[Link] SIMPLELR PARSER
An SLR parser is a type of LR parser that is relatively simple and efficient. It is based on the idea of
lookahead (using a single lookahead symbol) and a set of LR(0) items. The key difference between the
SLR parser and other types of LR parsers (like LALR or Canonical LR) is that it uses a simplified
method for determining the reduction rules during parsing.
How an SLR Parser Works
1. Input Buffer: The input is stored in a buffer, and the parser reads the input from left to right
one symbol at a time.
2. Stack: The parser uses a stack to store symbols, which include both non-terminals and
terminals. Initially, the stack is empty.
3. Action Table: The action table guides the parser in terms of what action to take (shift,
reduce, accept, or error) at each step. The action is determined based on the current state and
the current input symbol.
4. Goto Table: The goto table helps the parser transition from one state to another after a
reduction step. This table is crucial for managing non-terminal reductions in the parse
process.
Steps of SLR Parsing
1. Shift Operation:
o If the current symbol on the input is part of the terminal symbols (and no reduction
rule applies), the parser shifts the symbol onto the stack and moves to the next input
symbol.
2. Reduce Operation:
o If the stack contains a sequence of symbols that matches the right-hand side of a
production rule, the parser reduces the sequence of symbols to the corresponding
non-terminal.
3. Accept:
o If the parser reaches a state where the stack contains the start symbol and the input
buffer is empty, the input is successfully parsed, and the parser accepts the input.
4. Error:
o If neither a shift nor a reduce operation is valid, the parser encounters an error.
SLR Parsing Tables
The SLR parser relies on two main tables:
1. Action Table:
o This table defines the shift and reduce actions based on the current state and the
input symbol.
o Shift: If the parser should move the next token to the stack.
o Reduce: If the parser should apply a production rule to reduce the symbols on the
stack.
o Accept: When the parsing is successful.
o Error: When no valid action is defined (parsing error).
2. Goto Table:
o This table is used after a reduction to determine the next state based on the non-
terminal that was reduced.
Example of Simple LR Parsing
Consider a grammar:
S →AB
A→a
B→b
Let's generate the SLR parsing tables for this grammar and parse the input string a b.
1. States:
o The states correspond to different configurations of the parser, based on the grammar
and the current input symbol. We will compute the LR(0) items (which are sets of
production rules with a dot showing how much of the rule has been parsed).
2. LR(0) Items:
o Initially, the parser starts with a state that is derived from the start symbol S and the
production rules.
3. Action Table: The Action table tells us what to do based on the current state and the current
input symbol.
• State 0: When we see a, we shift to state 1.
• State 1: When we see b, we shift to state 2.
• State 2: When we see the end of the input ($), we can reduce using B → b.
• State 3: The parser accepts when it has successfully parsed S → A B and the input is empty.
4. Goto Table: The Goto table is used to transition between states after a reduction.

• After reducing B → b, the parser goes to state 3 and accepts the input.
• After reducing S → A B, the parser transitions to the accepting state.
Parsing the Input String a b
We begin with an empty stack and the input a b. Here’s how the parser would proceed:
1. Initial State:
o Stack: []
o Input: a b
o Action: Shift a onto the stack.
2. After Shifting a:
o Stack: [a]
o Input: b
o Action: Shift b onto the stack.
3. After Shifting b:
o Stack: [a, b]
o Input: (empty)
o Action: Reduce using the production B → b.
4. After Reducing B → b:
o Stack: [a, B]
o Input: (empty)
o Action: Reduce using the production S → A B.
5. After Reducing S → A B:
o Stack: [S]
o Input: (empty)
o Action: Accept the input.
The input a b has been successfully parsed.
Advantages of SLR Parsing
1. Simple and Efficient: SLR parsers are simpler and more efficient than other types of LR
parsers because they use a simpler method for constructing the action table and determining
when reductions should occur.
2. Bottom-Up Parsing: SLR parsing is a bottom-up approach that can handle more complex
grammars than top-down parsers (like recursive descent).
3. Large Grammar Handling: SLR parsers can handle a wide range of programming
languages, especially those with a large number of constructs.
Limitations of SLR Parsing
1. Limited Power: SLR parsers are not as powerful as other LR parsers like LALR or
Canonical LR parsers. SLR parsers can encounter shift/reduce conflicts and
reduce/reduce conflicts in certain ambiguous grammars.
2. Conflict Handling: In cases where the grammar has conflicts (like ambiguity), an SLR
parser might fail or require additional techniques (e.g., more lookahead or more advanced
LR parsing methods) to resolve them.
[Link] CANONICAL LR PARSER
A Canonical LR (CLR) parser is a more powerful and general type of LR parser that can handle a
broader range of grammars compared to simpler LR parsers, such as SLR or LALR parsers. The term
"Canonical" refers to the fact that this parser uses the most precise and detailed method for determining
what actions to take during parsing. It is capable of handling grammars with greater complexity, such as
ambiguous or context-sensitive constructs, by maintaining a larger state machine and more precise parsing
tables.
The LR in "Canonical LR" stands for:
• L: Left-to-right scanning of the input (processing the input string from left to right).
• R: Rightmost derivation in reverse (deriving the input string from right to left).
A Canonical LR parser is often considered the most powerful of the LR parsing techniques because it
can handle any context-free grammar that can be parsed by an LR parser, with the additional benefit of
being deterministic.
Key Components of a Canonical LR Parser
A Canonical LR parser uses a state machine to parse the input. The key components involved in the
parsing process are:
1. State Stack: This stack stores the states during parsing. A state corresponds to a certain point
in the parsing process and determines what actions to take next based on the current input
token.
2. Input Buffer: This buffer stores the input string that the parser will process, typically along
with a lookahead token.
3. Action Table: The action table is the key to making decisions during parsing. It provides
instructions on whether to:
o Shift (move the next input symbol onto the stack),
o Reduce (apply a production rule to reduce a sequence of symbols on the stack to a
non-terminal),
o Accept (indicating that the input has been successfully parsed),
o Error (if no valid action is defined for a given state and input symbol).
4. Goto Table: The goto table is used after a reduction to determine the next state, depending
on the non-terminal that has been reduced.
The Canonical LR parser uses both the action and goto tables to decide whether to shift, reduce, or
take other actions during parsing. The state machine transitions between states based on the stack and
input symbol.
Steps in Canonical LR Parsing
1. Initial State:
o The parser starts with an initial state, typically corresponding to the start symbol of
the grammar. The first token from the input buffer is used to decide the first action.
2. Shift Operation:
o If the top of the stack does not match a production's right-hand side, the parser shifts
the next input symbol onto the stack.
3. Reduce Operation:
o If the top of the stack matches the right-hand side of a production rule, the parser
reduces that sequence of symbols to the corresponding non-terminal on the left-hand
side of the production.
4. Acceptance:
o When the stack contains only the start symbol, and the input buffer is empty, the
parser has successfully parsed the input.
5. Error Handling:
o If the parser encounters a situation where no valid action is defined (i.e., no shift or
reduce operation is applicable), it results in a parsing error.
Constructing the Canonical LR Parser
To build a Canonical LR parser, the following steps are generally taken:
1. Constructing LR(0) Items:
The first step in creating a Canonical LR parser is to generate LR(0) items. These items are derived from
the production rules and represent the different stages in the parsing process.
Each LR(0) item consists of a production rule and a dot (.) that marks the position in the rule that has been
parsed. For example, if we have a production:
A→XYZ
An LR(0) item could be:
• A → . X Y Z (indicating that nothing from this production has been parsed yet),
• A → X . Y Z (indicating that X has been parsed),
• A → X Y . Z (indicating that X and Y have been parsed), and so on.
2. Building the DFA (Deterministic Finite Automaton):
The next step is to build a deterministic finite automaton (DFA) using the LR(0) items. This DFA
represents the state machine that will guide the parser. Each state in the DFA corresponds to a set of LR(0)
items, and transitions between states are based on input symbols (both terminals and non-terminals).
The states of the DFA are constructed by taking the closure of the LR(0) items. The closure of a set of
items includes all items that can be derived from the items in the set by applying productions.
3. Generating the Action and Goto Tables:
The action table is constructed based on the DFA. For each state and input symbol, the parser determines
whether to shift, reduce, or accept based on the transitions in the DFA and the lookahead symbol.
The goto table is constructed by determining the state that the parser should transition to after a reduction
is performed. It helps the parser manage transitions between non-terminals after reducing a sequence of
symbols.
Example: Canonical LR Parsing for a Simple Grammar
Let’s consider a simple grammar:
S →AB
A→a
B→b
Step 1: Create LR(0) Items
We generate the following LR(0) items from the production rules:
• S → .AB
• A→.a
• B→.b
Step 2: Construct the DFA
The DFA is constructed based on these LR(0) items. Each state corresponds to a set of items, and
transitions are made based on the current input symbol.
• State 0: {S → . A B, A → . a, B → . b}
o Transition on a → State 1: {A → . a}
o Transition on b → State 2: {B → . b}
• State 1: {A → a .}
o Reduce by A → a.
• State 2: {B → b .}
o Reduce by B → b.
Step 3: Build Action and Goto Tables
The action table shows what the parser should do at each state given a particular input symbol:
The goto table manages transitions after reductions, based on non-terminal symbols:

Step 4: Parse Input


Consider the input string a b. The steps to parse are:
1. Start in State 0:
o Input: a b
o Action: Shift to State 1 (shift a).
2. In State 1:
o Input: b
o Action: Reduce by A → a.
3. After reduction (replace a with A), transition to State 3 (using the goto table).
4. In State 3:
o Input: b
o Action: Shift to State 2 (shift b).
5. In State 2:
o Input: (empty)
o Action: Reduce by B → b.
6. After reduction (replace b with B), transition to State 3.
7. In State 3:
o Input: (empty)
o Action: Accept the input as valid.
Advantages of Canonical LR Parsing
1. Powerful: The Canonical LR parser is the most general and powerful type of LR parser,
capable of handling any LR(1) grammar without conflicts.
2. Deterministic: It is deterministic, meaning that there is always a well-defined action to take
for each state and input symbol. This avoids ambiguity during parsing.
3. Error Detection: It can detect syntax errors early during parsing, which is beneficial for
compiler construction.
Disadvantages of Canonical LR Parsing
1. Memory and Complexity: Canonical LR parsers tend to have larger parsing tables and a
more complex state machine compared to simpler parsers like SLR or LALR.
2. Efficiency: The size of the action and goto tables can become very large, especially for
more complex grammars. This can lead to inefficiencies in both memory usage and table
lookup.
3. Construction Time: Constructing a Canonical LR parser can be computationally expensive,
as it requires building and processing large DFA tables.
[Link] LALR PARSER
An LALR (Look-Ahead LR) parser is a more efficient variant of the Canonical LR (CLR) parser. It
aims to retain the power of an LR parser but reduce the size of the parsing tables, making it a practical
choice for most programming language grammars. LALR parsers combine the simplicity of SLR (Simple
LR) parsing with the power of Canonical LR parsing, thus providing a balance between efficiency and
the ability to handle more complex grammars.
Key Concepts of LALR Parsing
• LALR stands for Look-Ahead LR, where "LR" refers to the Left-to-right scanning of the
input and Rightmost derivation in reverse.
• LALR parsers use a single lookahead symbol (i.e., they can make decisions based on the
current state and the next input symbol) to make parsing decisions.
The difference between LALR and Canonical LR lies in the way they handle states. LALR parsers
merge similar states in the canonical LR state machine to reduce the size of the parsing tables, which
makes them more memory efficient while still retaining the ability to handle many context-free grammars
that can't be parsed with simpler parsers like SLR.
Steps in LALR Parsing
An LALR parser follows a similar process as other LR parsers but with certain optimizations:
1. State Machine Construction:
o The LALR parser first builds the Canonical LR state machine (with LR(0) items).
o It then merges certain states that have identical action entries (shift/reduce
decisions) but might differ in the lookahead symbol used for decision-making. This
merging reduces the number of states and the size of the parsing table.
2. Action Table:
o The action table guides the parser on whether to shift, reduce, accept, or error
based on the current state and the lookahead symbol.
3. Goto Table:
o The goto table directs the parser to the next state after a reduction has occurred,
based on the non-terminal that was reduced.
4. Parsing:
o The parsing proceeds similarly to a Canonical LR parser, with the parser shifting
tokens onto a stack, reducing sequences of symbols to non-terminals, and making
decisions based on the lookahead token.
o The shift and reduce operations are determined based on the action table and the
goto table.
Example of LALR Parsing Process
Consider a simple grammar:
S →AB
A→a
B→b
Step 1: Construct Canonical LR Items
The initial LR(0) items from the grammar are:
• S → .AB
• A→.a
• B→.b
Step 2: Build the Canonical LR State Machine
From the LR(0) items, we build the Canonical LR state machine. The machine has states based on the
items, transitions based on input symbols, and actions such as shift and reduce.

Step 3: Merge Similar States


The main step that differentiates LALR parsing from Canonical LR parsing is the state merging process.
In Canonical LR, each state represents a unique set of LR(0) items. However, LALR parsers merge states
that have identical shift and reduce operations, effectively reducing the number of states. This reduction
leads to smaller parsing tables.
For example, states 1 and 2 from the Canonical LR table might be merged into one state if the shift and
reduce actions are the same. The idea is to keep the parser’s capability to make decisions based on the
current state and lookahead symbol, while reducing the number of states and entries in the table.
Step 4: Construct the Action and Goto Tables
After merging similar states, the parser constructs the action table and goto table based on the merged
states. The action table will show shift/reduce actions based on the current state and lookahead symbol,
while the goto table will describe the state transitions after reductions.
Advantages of LALR Parsing
1. Efficient Table Size:
o LALR parsers reduce the size of the parsing tables compared to Canonical LR
parsers. This is achieved by merging states that have the same shift/reduce decisions,
leading to reduced memory usage.
2. More Powerful than SLR:
o LALR parsers can handle a broader set of grammars than SLR parsers while
keeping the table size manageable. It can handle many grammars that are too
complex for SLR parsing.
3. Widely Used in Compiler Construction:
o LALR parsers are commonly used in practical compilers (e.g., Yacc, Bison)
because they provide a good trade-off between parsing power and table size. They
are capable of handling most programming languages' syntaxes.
4. Deterministic Parsing:
o Like all LR parsers, LALR parsers are deterministic, meaning there is only one
action for each state and lookahead symbol combination, which eliminates ambiguity
in parsing.
Disadvantages of LALR Parsing
1. Limited Grammar Support:
o While LALR parsers handle most programming language grammars, they still
cannot handle certain complex grammars that Canonical LR parsers can, such as
those that involve more complicated lookahead situations or ambiguities.
2. State Merging Conflicts:
o During the state merging process, conflicts might arise if two states can have
different reduce actions for the same lookahead symbol. This can lead to shift/reduce
or reduce/reduce conflicts, though these conflicts are much rarer than in SLR
parsing.
3. Complexity in Grammar Design:
o Designing grammars that work well with LALR parsers requires careful attention,
as certain ambiguities or conflicts may not be detected until parsing. In some cases,
manual adjustments or grammar transformations may be required.
Example of LALR Parser Construction
For the grammar:
S →AB
A→a
B→b
Step 1: Canonical LR Items
The LR(0) items for the grammar would be:
• S → .AB
• A→.a
• B→.b
Step 2: Create Canonical LR States
We would then create the states for the Canonical LR machine, representing different positions in the
parsing process:
• State 0: {S → . A B, A → . a, B → . b}
o Transition on a → State 1
o Transition on b → State 2
• State 1: {A → a .}
o Reduce by A → a
• State 2: {B → b .}
o Reduce by B → b
Step 3: Merge States
If the transition tables for States 1 and 2 are identical in terms of shift/reduce actions, these states are
merged into one. This step is the core of LALR parsing. The merged states will reduce the number of total
states in the table.
Step 4: Construct Action and Goto Tables
The action table and goto table would be built by referring to the merged states. The action table would
include shift/reduce decisions, and the goto table would include transitions for non-terminals.
2.5.3 USING AMBIGUOUSGRAMMARS
Ambiguous grammars present a unique challenge for bottom-up parsers because such grammars can
generate multiple parse trees for the same input string. This creates uncertainty in parsing, making it
difficult for a parser to determine the correct derivation or action.
What Are Ambiguous Grammars?
A grammar is said to be ambiguous if there exists a string in the language generated by the grammar
that has more than one leftmost derivation, more than one rightmost derivation, or more than one
parse tree.
For example, consider the following grammar for simple expressions involving addition (+) and
multiplication (*):
E→E+E
E→E*E
E → id
With this grammar, the string id + id * id can be parsed in two different ways:
1. (id + id) * id (Group the first two ids together and then apply multiplication).
2. id + (id * id) (Apply multiplication first and then addition).
Ambiguity in Bottom-Up Parsing
In bottom-up parsing, the parser works by reducing the input string to the start symbol of the grammar.
At each step, the parser looks for a sequence of symbols in the input that matches the right-hand side of a
production rule and reduces it to the left-hand side (i.e., applies a production rule in reverse).
For an ambiguous grammar, the bottom-up parser can encounter situations where it has multiple valid
reductions to apply at the same point in the input. This can lead to shift/reduce conflicts or reduce/reduce
conflicts, where the parser doesn't know which reduction to apply. The result is that there could be more
than one valid parse tree, which is problematic because the parser may not know which one to choose.
Conflict Types in Ambiguous Grammars
Shift/Reduce Conflict: A shift/reduce conflict occurs when the parser can either shift (read more input)
or reduce (apply a production rule) based on the current state and the lookahead token. In an ambiguous
grammar, there may be multiple valid reductions, causing a conflict between shifting and reducing.
o Example: Consider the input string id + id * id for the ambiguous expression
grammar:
E→E+E
E→E*E
E → id
o After reading the first id, the parser has a choice:
▪ It can shift the + token, or
▪ It can reduce id to E and wait for the next token.
This results in a shift/reduce conflict because the parser could proceed either by shifting or reducing.
Reduce/Reduce Conflict: A reduce/reduce conflict occurs when the parser has more than one rule it can
apply to reduce the same sequence of symbols in the input.
Example: Consider the grammar:
A→a
A→b
1.
o If the input string is a b, the parser might encounter a point where it has to reduce
either a to A or b to A. Since both reductions lead to A, this creates a reduce/reduce
conflict.
Handling Ambiguity in Bottom-Up Parsing
Bottom-up parsers like LR parsers (including SLR, LALR, and Canonical LR) may struggle with
ambiguous grammars. To resolve ambiguity in these parsers, a few techniques can be used:
1. Grammar Transformation:
One way to handle ambiguity is by transforming the grammar to remove the ambiguity. These
transformations may include:
• Left factoring: This involves restructuring the grammar to ensure that the parser can decide
the next step without ambiguity.
• Eliminating left recursion: Left recursion can cause infinite recursion in some parsers, and
transforming the grammar to remove left recursion helps prevent ambiguity in parsing.
For example, consider the ambiguous grammar:
E→E+E
E→E*E
E → id
By introducing precedence and associativity rules, we can resolve the ambiguity and specify which
operator has higher precedence (e.g., multiplication has higher precedence than addition).
The grammar might be rewritten to reflect these rules:
E→E+T
E→T
T → T * id
T → id
Now, we specify that multiplication (*) has higher precedence over addition (+), effectively resolving the
ambiguity in the original grammar.
2. Precedence and Associativity:
In practice, many programming language parsers handle ambiguity by defining precedence and
associativity rules. These rules give the parser guidance on how to resolve conflicts when multiple
reductions or shifts are possible.
• Precedence: Defines the priority of operators (e.g., * has higher precedence than +).
• Associativity: Specifies how operators of the same precedence are grouped (e.g., + is left-
associative, meaning a + b + c is parsed as (a + b) + c).
For example, consider the following precedence and associativity definitions for arithmetic expressions:
%left + -
%left * /
This indicates that the parser should treat + and - as having the same precedence (left-associative) and that
* and / have higher precedence than + and -.
These rules ensure that the parser always knows which reduction to perform when faced with multiple
options, helping to resolve ambiguity.
3. Using More Powerful Parsers (Canonical LR):
While SLR or LALR parsers can handle many grammars, Canonical LR parsers are more powerful and
can handle a broader range of grammars, including some ambiguous ones, by explicitly managing more
states and parsing decisions. However, even Canonical LR parsers are limited when it comes to grammars
that are inherently ambiguous.
If ambiguity remains in the grammar, it may require more sophisticated methods (such as manually
intervening with parsing strategies or transforming the grammar) to ensure deterministic parsing.
4. Error Recovery:
When a bottom-up parser encounters ambiguity that it cannot resolve based on its grammar, it may need
to employ error recovery strategies. Error recovery involves making assumptions about where the parser
should backtrack or accept an error in order to continue processing the rest of the input.
• Backtracking: The parser may try different paths or production rules, and if one fails, it
backtracks and tries another.
• Error production rules: Some parsers include rules that guide the parser in dealing with
unexpected input, such as when a conflict arises.
Example of Ambiguous Grammar and Bottom-Up Parsing
Consider the ambiguous grammar for simple expressions:
E→E+E
E→E*E
E → id
For the input string id + id * id, we can derive two different parse trees:
1. (id + id) * id: Apply the E → E + E rule first, reducing id + id to E, then apply E → E * E
to reduce the result and id to E.
2. id + (id * id): Apply E → E * E to reduce id * id first, then apply E → E + E to reduce id +
E.
The bottom-up parser will encounter a conflict when it has read id + id, as it must decide whether to shift
(read the * symbol) or reduce id + id to E. Similarly, when it has read id * id, it must decide whether to
reduce id * id to E or shift the + symbol.
By defining precedence rules (giving * higher precedence than +), the ambiguity is resolved, and the
correct parse tree becomes clear.
MODULE-III
SYNTAX DIRECTED TRANSLATION AND TYPE CHECKING
3.1 SYNTAX DIRECTED DEFINITION
A Syntax-Directed Definition (SDD) is a formal framework used to define the semantics of programming
languages in a way that is directly tied to the syntactic structure of the language. It associates each
syntactic construct (i.e., production) of a context-free grammar with semantic rules, which provide
meaning to the language constructs.
In essence, an SDD is a specification of the semantics of a language based on its syntax, where the
meaning of a program is determined by the structure (syntax) of the program according to the grammar
and the semantic rules attached to the grammar’s production rules.
Key Components of Syntax-Directed Definitions
1. Context-Free Grammar (CFG): The syntax of the language is specified using a context-
free grammar. This defines the syntactic structure, such as how statements, expressions, and
other constructs in the language can be formed.
2. Attributes: These are associated with the grammar’s non-terminals and terminals. They are
used to hold information about the language constructs. There are two types of attributes:
o Synthesized Attributes: Information that is propagated up the parse tree, from
leaves to root. These attributes are computed from the children nodes (e.g., the values
of expressions or types of variables).
o Inherited Attributes: Information that is propagated down the parse tree, from the
root to the leaves. These are passed from parent to child (e.g., context-sensitive
information like scopes or the types of variables).
3. Semantic Rules: These rules define how the attributes are computed. They specify how the
values of attributes should be derived based on the syntactic structure of the language. The
rules are usually attached to the productions in the grammar.
4. Syntax Tree: A parse tree or abstract syntax tree (AST) represents the syntactic structure of
a program. The semantic rules of an SDD are used to annotate this tree with values that
correspond to the meaning of the program.
Example of Syntax-Directed Definition
Let’s define a simple language of arithmetic expressions consisting of integers, addition, and
multiplication. The grammar and corresponding syntax-directed definitions would look as follows:
Grammar
E→E+T
E→T
T→T*F
T→F
F → id
F → num
Syntax-Directed Definition
We will associate attributes with the non-terminals and terminals:
• [Link]: The value of an expression E.
• [Link]: The value of a term T.
• [Link]: The value of a factor F.
The semantic rules that define how to compute these attributes are as follows:
1. E → E + T
o Semantic rule: [Link] = [Link] + [Link]
(The value of E is the sum of the values of E and T.)
2. E → T
o Semantic rule: [Link] = [Link]
(The value of E is just the value of T.)
3. T → T * F
o Semantic rule: [Link] = [Link] * [Link]
(The value of T is the product of the values of T and F.)
4. T → F
o Semantic rule: [Link] = [Link]
(The value of T is just the value of F.)
5. F → id
o Semantic rule: [Link] = get_id_value(id)
(The value of F is the value of the identifier id.)
6. F → num
o Semantic rule: [Link] = num
(The value of F is the numerical value of the terminal num.)
How It Works
Consider the expression: id + num * id
Step 1: Parsing and Semantic Rule Application
• First, the grammar is used to parse the input into a parse tree (or abstract syntax tree). For
this expression, the parse tree looks like:
E
/\
E + T
/ / \
T T * F
/ / \
F F id
/ |
id num
Step 2: Attribute Propagation
• The leaf nodes (id and num) are assigned values. Let's assume:
o id = 5
o num = 3
• Now, starting from the leaf nodes and applying the semantic rules:
1. For F → id, the semantic rule is [Link] = get_id_value(id). So, for the first F (which is id), we
get [Link] = 5.
2. For F → num, the semantic rule is [Link] = num. So, for the second F (which is num), we get
[Link] = 3.
3. Moving up to T → T * F, the semantic rule is [Link] = [Link] * [Link]. Using the computed
values for T and F, we get [Link] = 3 * 5 = 15.
4. For E → E + T, the semantic rule is [Link] = [Link] + [Link]. Using the computed values for E
and T, we get [Link] = 5 + 15 = 20.
Final Result
The final value of the entire expression id + num * id is 20, which is the result of applying the syntax-
directed definitions to the parse tree.
Types of Syntax-Directed Definitions
1. SDDs for Attribute Grammars:
o Syntax-directed definitions are a formalization of attribute grammars, which are
grammars augmented with attributes that represent meaning.
o SDDs are closely related to attribute grammars, where attributes are associated
with symbols in the grammar, and semantic rules specify how to compute those
attributes.
2. SDDs for Compiler Phases:
o SDDs are commonly used in various phases of a compiler. For example:
▪ Lexical analysis: Tokens are assigned meaning through attributes.
▪ Syntax analysis: The syntax tree is constructed while associating semantic
information.
▪ Semantic analysis: Checking types, scopes, etc., by propagating attributes
through the parse tree.
▪ Code generation: Mapping the syntax tree to machine instructions using
semantic rules.
Advantages of Syntax-Directed Definitions
1. Formal Semantics: SDDs provide a formal way of defining the semantics of a programming
language, which makes it easier to understand and implement language features.
2. Clear Structure: By directly associating semantic rules with the grammar’s productions,
SDDs offer a clear structure for managing both syntax and semantics.
3. Modularity: SDDs make it easier to add or modify semantics without altering the overall
syntax of the language.
4. Tool Support: Many compiler construction tools like Yacc, ANTLR, or Bison support
generating parsers that can handle syntax-directed definitions, making it easier to automate
the generation of parsers and semantic analyzers.
3.2 S-ATTRIBUTED AND L-ATTRIBUTEDDEFINITIONS
S-attributed definitions are a subclass of syntax-directed definitions in which all attributes are
synthesized (i.e., computed from the attributes of the children nodes in the parse tree).
Key Characteristics of S-attributed Definitions:
1. Only Synthesized Attributes: In an S-attributed definition, every attribute is synthesized.
This means that the attribute of a non-terminal or terminal is calculated based on the
attributes of its children or terminals.
2. Attribute Flow: Since attributes are synthesized, they are propagated upwards in the parse
tree, starting from the leaves and going towards the root.
3. Evaluation Order: Since synthesized attributes are calculated based on the children, it is
easy to establish a bottom-up evaluation order. The parser can process children first and
then propagate the results upwards.
4. Efficiency: S-attributed grammars are computationally efficient since all attributes are
computed in a straightforward, bottom-up manner.
Example of S-attributed Definition
Consider a simple arithmetic expression grammar:
E→E+T
E→T
T→T*F
T→F
F → id
F → num
Let’s define the synthesized attribute val for each non-terminal. The value of an expression or term is
computed by the following semantic rules:
• E→E+T:[Link]=[Link]+[Link]
(The value of E is the sum of the values of E and T.)
• E→T:[Link]=[Link]
(The value of E is the value of T.)
• T→T*F:[Link]=[Link]*[Link]
(The value of T is the product of the values of T and F.)
• T→F:[Link]=[Link]
(The value of T is the value of F.)
• F→id:[Link]=get_id_value(id)
(The value of F is the value of the identifier id.)
• F→num:[Link]=num
(The value of F is the numerical value of num.)
All attributes (val) are synthesized in this grammar, and the computation of val propagates upwards in the
parse tree, making it an S-attributed definition.
L-attributed Definitions
L-attributed definitions are another subclass of syntax-directed definitions, but they impose a restriction
on the order in which attributes can be evaluated. Specifically, in an L-attributed definition:
1. Both Synthesized and Inherited Attributes: In an L-attributed grammar, both synthesized
and inherited attributes can be used.
o Synthesized attributes are propagated upwards in the parse tree, similar to S-
attributed definitions.
o Inherited attributes are passed downwards from the parent node to the child nodes
(or passed across siblings in the parse tree).
2. Evaluation Order: L-attributed definitions impose a restriction on the order of evaluation.
The inherited attributes of a node can depend on both the attributes of its parent node and its
siblings. However, the inherited attributes of a node must only depend on the already-
evaluated attributes of its parent or left siblings (not right siblings), which ensures that
an efficient left-to-right evaluation order is possible.
3. Efficiency: L-attributed grammars are still efficient in terms of computation because the
inherited attributes can be passed down in a left-to-right manner, and both synthesized and
inherited attributes can be evaluated in a single left-to-right pass over the input.
Example of L-attributed Definition
Consider a simple example involving variable declarations and expressions:
S → var id : T
T → int
T → bool
E → E1 + E2
E → id
Here, we define the following attributes:
• [Link]: Type of the variable or expression.
• [Link]: Value of the expression (this is synthesized).
• [Link]: The type of the identifier (inherited from the declaration).
Now, let’s define the semantic rules:
• S → var id : T:
o [Link] = [Link] (Inherited attribute: id gets the type of T from the parent.)
• T → int:
o [Link] = "int" (Synthesized attribute: The type of T is int.)
• T → bool:
o [Link] = "bool" (Synthesized attribute: The type of T is bool.)
• E → E1 + E2:
o [Link] = [Link] + [Link] (Synthesized attribute: The value of E is the sum of the values
of E1 and E2.)
• E → id:
o [Link] = get_id_value(id) (Synthesized attribute: The value of E is the value of the
identifier id.)
In this case, the type of an identifier is passed downwards from the declaration ([Link] = [Link]), which
makes this definition L-attributed.
Key Differences Between S-attributed and L-attributed Definitions

3.3 CONSTRUCTION OF SYNTAX TREES


A syntax tree, also called a parse tree, is a tree representation of the syntactic structure of a source
program according to a formal grammar. Each internal node of the tree represents a non-terminal symbol
in the grammar, while the leaf nodes represent the terminal symbols (tokens). Constructing a syntax tree
is a key part of the syntax analysis phase of a compiler or interpreter.
Purpose of Syntax Trees
• Representation of the Syntax: The syntax tree reflects the structure of the source program
as defined by the grammar of the language.
• Semantic Analysis: The tree is used in later stages (e.g., semantic analysis, code generation)
to derive meaning from the syntactic structure.
• Facilitating Optimization: In compilers, the syntax tree is often transformed into an
abstract syntax tree (AST) for further processing or optimization.
Types of Syntax Trees
1. Parse Tree (Full Syntax Tree): A parse tree is a full representation of a grammar, showing
every rule applied during the derivation of the input string.
2. Abstract Syntax Tree (AST): An abstract syntax tree simplifies the parse tree by removing
non-essential nodes (like intermediate grammatical constructs) and focusing on the actual
operations and data structures of the program.
Steps for Constructing a Syntax Tree
1. Input Token Stream: The first step in constructing a syntax tree is having an input token
stream. This stream is typically generated by the lexical analyzer (lexer) and consists of
tokens like keywords, identifiers, operators, literals, etc.
2. Parse Using Grammar: To construct the syntax tree, you need to parse the input according
to a context-free grammar (CFG). Parsing involves applying the production rules of the
grammar to match the tokens in the input.
3. Recursive Descent Parsing: This is a top-down parsing method where you start from the
start symbol of the grammar and recursively apply production rules to match parts of the
input. As you apply each production, you create nodes in the syntax tree corresponding to
the non-terminal and terminal symbols.
4. Bottom-Up Parsing: In bottom-up parsing, you begin with the input tokens (the leaves of
the tree) and try to reduce them to the start symbol by applying the production rules in
reverse. This approach is commonly used in shift-reduce parsers.
5. Handling Ambiguities: Sometimes, the grammar may be ambiguous, meaning there are
multiple ways to parse the same input. In such cases, the parser may need to use additional
strategies to select the correct parse tree (such as prioritizing certain rules, or using a
grammar disambiguation technique).
Example 1: Constructing a Parse Tree
Consider the following grammar for arithmetic expressions:
E→E+T|T
T→T*F|F
F → ( E ) | id
Let's construct the parse tree for the expression: id + id * id.
Step-by-Step Parsing
Start with the start symbol (E).
Apply the production E → E + T:
E
/\
E + T
Now, parse the left E. Apply E → T:
E
/\
E + T
/
T
Now, parse T. Apply T → F:
E
/\
E + T
/ \
T F
/
F
Now, parse F. Apply F → id:
E
/\
E + T
/ \
T F
/ \
F id
/
Id
Now, parse the right side of the expression (T → T * F). Apply the production T → T * F:
E
/\
E + T
/ \
T T
/ \
F F
/ \
id id

/ id

7. Continue parsing `F → id` for both `T` and `F`.

8. Finally, the full parse tree for `id + id * id` is:


E
/\
E + T
/ \
T T
/\ /\
F * F F
/ / / \

id id id id

---

### Example 2: Constructing an Abstract Syntax Tree (AST)

An **abstract syntax tree (AST)** simplifies the parse tree by omitting intermediate grammatical rules
that do not contribute to the actual computation or interpretation of the program. In the case of the
expression `id + id * id`, we can omit the multiplication and addition operators as separate nodes and
instead focus on the operands and their relationships.

For `id + id * id`, the AST would look like this:


+
/\

id * /
id id

Here:
- The AST focuses on the operands and operators that directly contribute to the evaluation of the
expression.
- We don't need to explicitly represent the `T → T * F` production because the multiplication operation is
naturally represented in the AST.

---

### Techniques for Constructing Syntax Trees

1. **Recursive Descent Parsing**:


- This is a top-down parsing technique where each non-terminal has a corresponding procedure or
function that recursively processes the input according to the grammar’s rules.
- The parse tree is constructed during the recursive calls by creating nodes for non-terminals and
terminals as the parser processes the input.

2. **LL(1) Parsing**:
- An **LL(1) parser** is a type of **top-down** parser that constructs a syntax tree using a
**predictive parsing** approach, where the choice of production rule to apply is determined by looking
at the next input token.
- The parser uses a stack to represent the parse tree, pushing non-terminals and popping them as it applies
rules to match the input tokens.

3. **LR Parsing**:
- An **LR parser** is a **bottom-up** parser that constructs the syntax tree by starting from the input
tokens and reducing them into higher-level constructs. In the case of **shift-reduce** parsing, the parser
"shifts" tokens onto a stack and "reduces" them by applying production rules when possible.
- This type of parsing is suitable for constructing syntax trees for languages with complex and more
powerful grammars.

4. **Shift-Reduce Parsing**:
- **Shift-reduce parsing** is a bottom-up parsing method commonly used in **LR parsers**. The
parser shifts tokens onto a stack and, when it matches a right-hand side of a production rule, it reduces the
stack by replacing the matched sequence with the non-terminal on the left-hand side of the rule.
- The syntax tree is built incrementally as reductions occur.

---

### Challenges in Syntax Tree Construction

- **Ambiguity**: If the grammar is ambiguous, the same input can have multiple valid parse trees. In this
case, the parser must either choose one or employ strategies for resolving ambiguity (e.g., using
precedence rules or rewriting the grammar).

- **Left Recursion**: Some grammars may be left-recursive, meaning they can lead to infinite recursion
in top-down parsers. To avoid this, left recursion must be eliminated, or an appropriate parsing method
(like **LR parsing**) must be used.

- **Efficiency**: Parsing can be computationally expensive, especially for large programs. Optimizations
like **memoization** or using **table-driven parsers** (e.g., LL or LR) can improve performance.
---

### Summary

Constructing a **syntax tree** is essential for representing the syntactic structure of a program. It is
created through the process of **parsing** using a context-free grammar. The process involves either top-
down (e.g., recursive descent, LL parsing) or bottom-up (e.g., shift-reduce, LR parsing) parsing
techniques. The **parse tree** represents the complete syntax, while the **abstract syntax tree (AST)**
simplifies this representation by removing unnecessary details, focusing only on the meaningful
components of the program. The construction of syntax trees is foundational in compiler design, semantic
analysis, and code generation.
3.4 TYPE CHECKING
Type checking is a crucial phase in the compilation process, where the types of variables, expressions,
and function return values are verified to ensure that they are used consistently according to the rules of
the programming language. It helps detect errors like trying to perform arithmetic on non-numeric types
or calling a function with the wrong number of arguments.
In this context, Type Expressions play an essential role in specifying and understanding the types of
various constructs in a program, such as variables, expressions, and function signatures.
3.4.1 TYPE EXPRESSIONS
1. What Are Type Expressions?
A type expression is a formal way of describing the type of an entity (like a variable, function, or
expression) in a programming language. It is essentially a formula or notation that defines the type of an
object or a value in terms of simpler types.
In the context of type checking, type expressions are used to describe the type of variables, parameters,
and the result of operations. These expressions are evaluated to determine if operations on variables are
type-safe (i.e., if they can be correctly performed according to the type rules of the language).
For example:
• An integer type might be represented as Int.
• A boolean type might be represented as Bool.
• A function type that takes an integer and returns a boolean might be written as Int -> Bool.
Type expressions can be quite complex in languages with advanced features, such as polymorphism,
higher-order functions, or type constructors.
2. Types of Type Expressions
Type expressions can be broadly classified into the following categories:
a. Basic Types
Basic or primitive types represent the simplest data types, such as:
• Integer (Int): A type for whole numbers.
• Boolean (Bool): A type for logical values, i.e., true or false.
• Character (Char): A type for individual characters.
These types can be represented directly in type expressions as:
• Int
• Bool
• Char
b. Function Types
Function types represent functions that take arguments of certain types and return values of a specific
type. A function type is expressed as a function arrow (->), which denotes that the function takes one
type as input and produces another type as output.
Example:
• A function that takes an integer and returns a boolean is written as Int -> Bool.
• A function that takes two integers and returns an integer is written as Int -> Int -> Int (or
equivalently, Int -> (Int -> Int)).
In some languages, higher-order functions (functions that take other functions as arguments or return
them as results) are supported, and function types can become more complex.
c. Product Types (Tuples)
A product type (also known as a tuple or record) combines multiple types into one composite type. A
tuple represents a fixed-size collection of heterogeneous types.
Example:
• A tuple containing an integer and a boolean is written as (Int, Bool).
• A record type in some languages might be represented as {x: Int, y: Bool}, where each field
is typed.
d. Sum Types (Variants or Discriminated Unions)
A sum type represents a value that can be one of several different types, commonly used for tagged unions
or enums. This allows values to belong to one of a finite set of types, with a tag that distinguishes them.
Example:
• A sum type that can either be an integer or a boolean might be represented as Int | Bool
(where | means "or").
• A discriminated union might represent a Shape that can be either a Circle or a Rectangle,
where Shape = Circle | Rectangle.
e. Recursive Types
Recursive types are types that refer to themselves. They are useful in representing structures that are
naturally recursive, such as linked lists or trees.
Example:
• A linked list in a functional programming language might have a type like List = Empty |
Cons of Int * List, where Cons is a constructor that holds an integer and another list, and
Empty represents an empty list.
3. Type Checking Using Type Expressions
Type checking using type expressions involves ensuring that the types of expressions are consistent with
the rules of the language. During this process, the type system will validate whether a given expression
adheres to the expected type and whether operations on values of different types are permissible.
Example 1: Type Checking a Simple Expression
Consider the expression:
x+y
Where x and y are variables. To check if this expression is valid, we must check the types of x and y. For
instance:
• If x and y are both of type Int, then the expression is valid (since the + operator is typically
defined for integers).
• If x is of type Int and y is of type Bool, then the expression is type incorrect because + is
not defined for integers and booleans.
Example 2: Type Checking a Function
Consider the following function definition in a language like Haskell or ML:
add :: Int -> Int -> Int
add x y = x + y
This type annotation states that add is a function that takes two arguments of type Int and returns a result
of type Int.
• The type expression Int -> Int -> Int tells us that the function is curried (i.e., it takes one
argument and returns a function that takes another argument).
• The function body x + y ensures that the arguments x and y are both of type Int because
addition (+) is defined for Int types.
4. Advanced Type Expressions in More Complex Languages
In more advanced programming languages, type expressions may include concepts such as:
a. Polymorphism
Polymorphic types allow functions to operate on values of different types. These types are often expressed
using type variables that act as placeholders for any type.
Example:
• In a language with parametric polymorphism, a generic function that takes a list of any type
and returns the length of the list might be written as:
length :: [a] -> Int
• Here, a is a type variable, and the function works for any type a.
b. Dependent Types
Dependent types are types that depend on values. This means that the type of an expression can depend
on the value of another expression, leading to more expressive type systems.
Example:
• A function that takes a list and returns the number of elements in the list, where the type of
the list depends on its length, can be represented as:
length :: List n -> Int
• Here, n is not just a type variable but a value that represents the length of the list.
5. Type Inference
Many modern programming languages allow for type inference, where the compiler can deduce the types
of expressions without explicit type annotations. Type inference uses the type rules of the language and
the structure of the code to automatically assign types to variables and expressions.
Example:
• In Haskell, the expression:
add x y = x + y
The compiler infers the type of add to be Int -> Int -> Int, based on the + operator.
Challenges in Type Checking
• Type Coercion: Some languages allow implicit type conversion or coercion (e.g.,
converting int to float automatically). Type checking needs to determine whether such
coercions are allowed and safe.
• Polymorphic and Higher-Order Types: Handling polymorphic types (like generics) and
higher-order types (functions that take other functions as arguments or return them) can be
complex and requires sophisticated type inference and checking techniques.
• Dependent Types: Checking dependent types requires reasoning about both the structure
and the values in the program, which can make type checking computationally expensive
and complex.
3.4.2 TYPE EQUIVALENCE
Type equivalence refers to the concept of comparing types to determine whether they are the same or
compatible within a given programming language. It plays an essential role in type checking, type
inference, and type assignment in compilers and interpreters. The goal is to ensure that operations on
variables or expressions are type-safe and that the types involved are compatible according to the rules of
the language.
There are different approaches to defining type equivalence, and understanding how two types are
compared is crucial for a programming language's type system. In general, type equivalence is used in
operations such as function argument checking, type assignment, and assignment statements in programs.
Types of Type Equivalence
The key forms of type equivalence used in programming languages are as follows:
1. Structural Equivalence (also known as Content Equivalence):
o Two types are considered structurally equivalent if they have the same internal
structure or composition, regardless of their names or where they are defined.
o This approach compares types based on their composition rather than their name.
o For example, two different type declarations, such as struct Point { int x; int y; } and
typedef struct { int x; int y; } Point;, would be considered structurally equivalent
because they represent the same structure (two integers for x and y).
Example:
typedef struct { int x; int y; } Point1;
typedef struct { int x; int y; } Point2;

// Point1 and Point2 are structurally equivalent,


// though they have different names.
2. Name Equivalence (also known as Nominal Equivalence):
• In name equivalence, two types are considered equivalent if they have the same name,
regardless of whether their structures are identical.
• This means that two types that are defined with different names but are structurally the same
are considered not equivalent in languages that use name equivalence.
• Name equivalence is used in languages that treat types based on their declared name rather
than their structure.
Example:
typedef struct { int x; int y; } Point1;
typedef struct { int x; int y; } Point2;

// Point1 and Point2 are **not** equivalent under name equivalence,


// even though their structure is the same.
3. Weak Equivalence:
• Weak equivalence allows for some flexibility, meaning that two types are considered
equivalent if their structure is sufficiently similar, even if they have different names.
• This is a middle ground between name equivalence and structural equivalence, where
certain differences (like typedefs) are ignored.
Example:
typedef struct { int x; int y; } Point1;
typedef struct { int x; int y; } Point2;

// Point1 and Point2 are equivalent under weak equivalence,


// as their structure is the same, regardless of name.
4. Intersection Equivalence:
• Intersection equivalence involves comparing types based on the intersection of their
structures. This approach is often used in type systems that support subtyping or union
types.
• Two types are considered equivalent if they share the same common properties (i.e., their
intersection is non-empty).
Examples of Type Equivalence
1. C/C++ Style (Name Equivalence)
In C and C++, type equivalence is typically name-based. For example:
typedef struct { int x; int y; } Point1;
typedef struct { int x; int y; } Point2;

// Point1 and Point2 are not considered equivalent, because they have different names.
Here, Point1 and Point2 are not equivalent because C uses name equivalence: the types must have the
same name to be considered equal, even if their structures are identical.
2. Type Equivalence in Functional Languages (Structural Equivalence)
In languages like Haskell or ML, type equivalence is often structural. For example:
type Point1 = (Int, Int)
type Point2 = (Int, Int)

-- Point1 and Point2 are structurally equivalent,


-- because they represent the same tuple structure (Int, Int).
In this case, Point1 and Point2 are structurally equivalent because both represent a tuple containing two
integers, even though they are defined with different names.
3. Type Equivalence with Records and Tuples
In record types or tuple types, equivalence can depend on how the types are constructed:
typedef struct { int x; int y; } Point1;
typedef struct { int a; int b; } Point2;

// Point1 and Point2 are considered different types under name equivalence,
// but may be equivalent under structural equivalence, since their structures are identical.
When is Type Equivalence Used?
• Function Arguments: When checking if function arguments match the expected types,
equivalence rules determine whether the argument and parameter types are compatible.
• Assignment Statements: Type equivalence is checked to ensure that the types of the left-
hand side and the right-hand side of an assignment are compatible.
• Type Inference: During type inference, a system needs to ensure that different occurrences
of the same variable or expression are compatible by checking type equivalence.
• Subtyping: In languages that support subtyping (like Java), checking for subtype
relationships often involves comparing the types of two entities and seeing if one is a subtype
of the other.
• Generics/Parametric Types: In languages with generic types (e.g., Java, C#), type
equivalence is used to check whether types in generic functions or classes are compatible.
Subtype and Supertype Equivalence
In languages with subtyping (such as those that support object-oriented programming), type equivalence
may extend to subtype and supertype relationships:
• A type T is considered a supertype of another type S if every instance of S can also be an
instance of T.
• Subtype equivalence allows one type to be considered equivalent to another if it is a
subtype, meaning it has at least the same behavior as the other type but potentially additional
capabilities (fields or methods).
For example, in object-oriented languages (like Java), a class Dog might be considered a subtype of a
class Animal if Dog inherits from Animal. Therefore, a variable of type Animal can hold an instance of
type Dog.
Challenges in Type Equivalence
1. Ambiguity in Mixed-Type Systems: Some languages have complex type systems, with
combinations of structural and name-based equivalence. Deciding which rules to apply in
such cases can be challenging.
2. Type Aliases and Typedefs: Type aliases or typedef in languages like C/C++ can make it
harder to check type equivalence since the same structure may be represented with different
names, which may or may not be considered equivalent depending on the system's rules.
3. Polymorphism and Generics: In polymorphic languages (such as C++ or Java with
templates), type equivalence becomes complex when generics or templates are involved.
Types can be instantiated with different concrete types, making it necessary to check
equivalence in a more dynamic way.
4. Subtyping: In systems with subtyping, defining when two types are equivalent is tricky
because one type could be a subtype of the other, and type equivalence might not hold even
if the types are compatible in many contexts.
3.4.3 RULES FOR TYPE CHECKING
Type checking is the process of verifying that the types of variables and expressions in a program are
used correctly according to the rules of the language. The goal is to ensure that operations on variables
are type-safe and that the program adheres to the language’s type system.
Here are the key rules for type checking that are commonly applied in most programming languages:
1. Type Compatibility Rules
• Unary and Binary Operators: Ensure that the operands of operators like +, -, *, /, ==, etc.,
have compatible types.
o For example, + can be used on numeric types (Int, Float) but not on Bool or String.
o Example: In an expression like x + y, both x and y must be of a type that supports
addition (e.g., both Int or both Float).
• Relational Operators: Ensure that the operands of relational operators like ==, <, >, <=, >=
are of types that can be compared.
o Example: x < y requires that both x and y are of types that support comparison (e.g.,
Int, Float).
• Boolean Operators: Ensure that operands of boolean operators like &&, ||, !, etc., are of
type Bool.
o Example: x && y requires both x and y to be of type Bool.
• Assignment Compatibility: In an assignment x = y, ensure that the type of y matches the
type of x or is compatible with it (considering type casting or coercion).
o Example: x = 5 assigns an Int value 5 to x if x is of type Int.
2. Variable Declaration and Type Assignment Rules
• Variable Declaration: Each variable must be declared with a specific type, and any
subsequent use of the variable must match its declared type.
o Example: If int x; is declared, then x can only be used in contexts where an integer
is expected.
• Type Consistency in Expressions: When variables are used in expressions, the expression
must adhere to the expected type.
o Example: In the expression x + y, if x is an Int, y must also be of type Int for the
expression to be valid.
• Constant and Literal Values: Constant values such as 5, true, or "hello" have inherent
types, and their types must be compatible with where they are used.
o Example: "hello" + "world" is valid for string concatenation but invalid for integer
addition.
3. Function Type Checking Rules
• Function Parameters: When a function is called, the arguments passed must match the
types of the corresponding parameters. The type of each argument must be checked against
the parameter type.
o Example: For a function int add(int x, int y), calling add(2, "3") is an error because
"3" is a string, not an integer.
• Return Type: The return type of a function must be checked for consistency with the
expected return type. If the function is expected to return an integer, returning a string or a
boolean is an error.
o Example: For a function float divide(int x, int y), returning an Int instead of a Float
would be an error.
• Function Overloading: In languages that support function overloading (like C++, Java),
the function signatures must be checked to ensure the correct function is called based on the
number and types of arguments.
o Example: Overloading the function add(a, b) where one takes int and the other takes
float requires that the correct overload is selected based on the argument types.
4. Type Checking for Complex Data Types
• Arrays/Lists: Ensure that the elements of an array or list are of the specified type. If an array
is declared as int arr[10], then all elements assigned to it must be of type Int.
o Example: arr[0] = 10; is valid if arr is an integer array, but arr[0] = "hello"; is invalid.
• Records/Structs: Each field in a record or struct must be assigned a value of the correct
type, as specified in the type definition.
o Example: For a struct Point { int x, y; }, assigning a string to x would be an error.
• Union/Variant Types: If a type can be one of several types (as in sum types or
discriminated unions), type checking must ensure that the correct type is selected from the
union and that operations are performed on that specific type.
o Example: A variable of type Shape that could be either Circle or Rectangle must
ensure that a Circle-specific operation (like calculating the area) isn't performed on
a Rectangle.
• Type Aliases: In languages that support type aliases (e.g., C, C++), an alias is treated as
equivalent to the original type. Type checking must account for this equivalence and allow
the alias to be used in place of the original type.
o Example: typedef int Integer; means that Integer is equivalent to Int, and the same
type rules apply.
5. Polymorphism and Type Checking
• Polymorphism: In languages with polymorphism (like C++ or Java), type checking must
account for subtypes (in object-oriented languages) or generic types (in functional
programming languages).
o Example: In Java, a method expecting a parameter of type Animal should accept an
argument of type Dog if Dog is a subclass of Animal.
• Generic Types: For generic functions or templates (in languages like Java or C++), type
checking ensures that the parameters passed match the types expected by the generic
definition.
o Example: List<Integer> is a list of integers, and only integer elements can be added
to it.
6. Type Checking for Expressions
• Unary Expressions: In expressions involving unary operators (e.g., -, !), type checking
ensures that the operand has a valid type for the operator.
o Example: -x is valid only if x is a numeric type (like Int or Float).
o Example: !x is valid only if x is a Bool.
• Binary Expressions: In expressions involving binary operators (e.g., +, -, *, /), both
operands must have compatible types for the operation.
o Example: x + y is valid only if x and y are both integers (Int) or both floating-point
numbers (Float).
o Example: x == y requires both x and y to be of a comparable type (Int, Float, String,
etc.).
• Type Promotions: Some languages allow for automatic type conversions or type
promotions (e.g., from Int to Float), and the type checking process must account for such
conversions.
o Example: int x = 5; float y = x; (implicitly promotes x from Int to Float).
7. Context-Sensitive Type Checking
• Scope Checking: Ensure that variables are declared before they are used in expressions.
Variables must be checked to see if they exist in the current scope, and their types should be
consistent with the declaration.
o Example: Using a variable before its declaration would result in a type error.
• Type Compatibility Across Scopes: Ensure that variables declared in different scopes (e.g.,
in different functions or blocks) have consistent types when they are referenced or passed as
arguments.
o Example: Passing a variable from one function to another should be valid if the types
match.
8. Type System Enforcement Rules
• Strong vs Weak Typing: In strongly-typed languages, types are strictly enforced, and
implicit conversions are disallowed unless explicitly defined. In weakly-typed languages,
type conversions might occur automatically.
o Example: In a strongly typed language like Haskell, trying to add a string and an
integer will result in a type error. In weakly-typed languages like JavaScript, implicit
type conversion might allow such an operation.
• Type Inference: In languages with type inference (like Haskell or Scala), the type system
infers the types of variables and expressions based on their usage.
o Example: If x = 10, the type checker will infer that x is of type Int.
3.4.4 TYPE CONVERSIONS
Type conversion (or type casting) refers to the process of converting a value from one type to another.
This is an essential part of type systems in programming languages, as it allows operations to be performed
on values of different types. Type conversions are typically categorized into two types: implicit and
explicit conversions.
1. Implicit Type Conversion (Automatic Conversion)
Implicit type conversion is done automatically by the compiler or interpreter. It occurs when the compiler
can safely convert one type to another without any loss of information, often when types are compatible
or when a smaller type is promoted to a larger type.
• Automatic type conversion happens without the need for explicit instructions from the
programmer.
• The compiler determines when it is safe to convert the types.
• This is often referred to as type coercion or type promotion.
Examples of Implicit Type Conversion
Promotion of Integer to Floating-Point: When performing an operation involving both integers and
floating-point numbers, the integer is typically promoted to a floating-point number to ensure precision is
maintained.
int x = 5;
float y = 2.5;
float result = x + y; // x is implicitly converted to float
// result will be 7.5
Smaller Integer to Larger Integer (or Float): When a smaller type is assigned to a larger type (e.g.,
from int to long or float), the compiler performs the conversion automatically.
short s = 10;
int i = s; // short is automatically converted to int
Char to Int or Int to Float: A char (character) can be implicitly converted to its integer value (ASCII
value). Similarly, an int can be converted to float automatically.
char ch = 'A'; // 'A' has an ASCII value of 65
int num = ch; // char is implicitly converted to int
Restrictions of Implicit Type Conversion
• Implicit conversions can lead to loss of information or precision loss. For example, when
converting from float to int, the fractional part is discarded, which could lead to incorrect
results.
• Implicit conversions only occur if there is no risk of data loss or semantic errors in the
conversion.
2. Explicit Type Conversion (Type Casting)
Explicit type conversion, also known as type casting, requires the programmer to explicitly specify how
one type should be converted into another. This is done using casting operators or functions, and is used
when the compiler cannot automatically perform the conversion.
• This type of conversion allows the programmer to control the conversion process.
• It is used when converting between incompatible types, or when implicit conversion would
result in loss of data or precision.
Examples of Explicit Type Conversion
C-style Cast (C/C++): In languages like C and C++, explicit conversion is done using the (type) syntax.
The programmer specifies the target type inside parentheses.
float f = 3.14;
int i = (int) f; // f is explicitly converted to an integer
// The result is i = 3, because the fractional part is discarded
Function-style Cast (C++/Java): Some languages like C++ and Java allow you to use a function-like
syntax for casting.
double d = 3.14;
int x = int(d); // Explicit conversion from double to int
Java-style Cast (Java): In Java, type casting can be done explicitly by using the cast operator (type).
double d = 3.75;
int i = (int) d; // Explicit conversion from double to int
// i = 3, the fractional part is truncated
Using Conversion Functions: Some languages provide built-in functions for explicit type conversions.
For instance, in Python, functions like int(), float(), and str() are used for type casting.
x = "10"
y = int(x) # Converts string to integer
Explicit Conversion in Object-Oriented Languages
In object-oriented programming languages (e.g., Java, C++), explicit conversions are often used for
casting objects to different classes in the inheritance hierarchy. This is especially relevant for upcasting
and downcasting.
• Upcasting: Casting a subclass object to a superclass type.
• Downcasting: Casting a superclass object to a subclass type.
class Animal { }
class Dog extends Animal { }

Animal animal = new Dog(); // Upcasting (implicitly done)


Dog dog = (Dog) animal; // Downcasting (explicitly done)
Risks of Explicit Type Conversion
• Data Loss: When converting between incompatible types (e.g., float to int), explicit
conversion may lead to truncation or loss of information.
o Example: Converting a large float value into an int could cause overflow or loss of
precision.
• ClassCastException (in Java): In Java, casting between incompatible types can lead to
exceptions at runtime if the cast is invalid.
Object obj = new Integer(10);
String str = (String) obj; // Throws ClassCastException
3. Types of Type Conversions
Type conversions can also be categorized based on the nature of conversion between different types:
1. Implicit Conversions (Type Promotion)
• Numeric Promotion: A smaller type is promoted to a larger type, typically from int to long,
float, or double.
• Boolean to Numeric: In some languages, a boolean value true or false can be promoted to
1 or 0, respectively.
Example:
int x = 10;
float y = 5.5;
float result = x + y; // x is promoted to float
2. Explicit Conversions
• Narrowing Conversion: Converting from a larger type to a smaller type. This is typically
done explicitly, because it may result in data loss.
• Type Casting: Explicit conversion between unrelated or incompatible types, for example,
from double to int.
• Pointer Conversion: In languages like C/C++, pointers to objects of one type can be
converted to pointers to objects of another type, though this requires caution.
4. Special Considerations in Type Conversion
• Floating Point Precision: When converting between float and double, or between floating-point
and integer types, the precision may be affected.
Example: Converting 3.14159 (double) to int would discard the fractional part.
• String to Numeric Conversion: Many languages allow strings to be converted to numeric types,
but this can fail if the string doesn't represent a valid number.
Example (in Python)
s = "123"
num = int(s) # Converts string to integer
• Complex Data Types (e.g., Arrays, Objects): Type conversion between complex data structures
like arrays or objects may require deep copying or constructing new instances, especially when
converting between completely different data structures.
3.5 OVERLOADING OF FUNCTIONS
Function overloading is a feature in many programming languages (such as C++, Java, and C#) that
allows multiple functions with the same name to coexist in the same scope, as long as they have different
parameters (i.e., a different number or types of arguments). This enables a function to be called in
different ways depending on the type or number of arguments passed.
Function overloading enhances the flexibility and readability of the code by allowing you to use the same
function name for different operations that are conceptually similar but work with different input types or
quantities.
1. How Function Overloading Works
Function overloading is based on the function signature, which consists of the function's name and its
parameter list. The return type is not considered part of the signature, meaning that two overloaded
functions cannot differ only by their return types.
For function overloading to be successful:
• Parameters (number, type, or order of parameters) must be different.
• Return type alone cannot be used to distinguish between overloaded functions.
• Default parameters can also affect overloading if they are used in one version of the
function.
2. Examples of Function Overloading
1. Overloading Based on Number of Parameters
In this case, the same function name can be used, but the number of parameters differs.
#include <iostream>
using namespace std;

// Function to add two integers


int add(int x, int y) {
return x + y;
}

// Overloaded function to add three integers


int add(int x, int y, int z) {
return x + y + z;
}

int main() {
cout << add(2, 3) << endl; // Calls the first version (two parameters)
cout << add(2, 3, 4) << endl; // Calls the second version (three parameters)
return 0;
}
Output:
5
9
2. Overloading Based on Different Parameter Types
Functions can also be overloaded based on the types of their parameters. The function name stays the
same, but the types of arguments differ.
#include <iostream>
using namespace std;

// Function to add two integers


int add(int x, int y) {
return x + y;
}

// Overloaded function to add two floating-point numbers


float add(float x, float y) {
return x + y;
}

int main() {
cout << add(2, 3) << endl; // Calls the integer version
cout << add(2.5f, 3.5f) << endl; // Calls the float version
return 0;
}
Output:
5
6
3. Overloading Based on Parameter Order
Overloading can also be based on the order of parameters, where the function signatures are different in
terms of the type order.
#include <iostream>
using namespace std;

// Function to multiply two numbers (int, float)


float multiply(int x, float y) {
return x * y;
}

// Overloaded function to multiply two numbers (float, int)


float multiply(float x, int y) {
return x * y;
}

int main() {
cout << multiply(2, 3.5f) << endl; // Calls the first version (int, float)
cout << multiply(2.5f, 3) << endl; // Calls the second version (float, int)
return 0;
}
Output:
7
7.5
3. Overloading in Object-Oriented Languages
In Object-Oriented Programming (OOP) languages like C++ or Java, function overloading is often
used in methods to provide different behaviors for objects based on the number or type of arguments. This
helps in creating a more intuitive and clean API for the class.
Example in C++:
#include <iostream>
using namespace std;

class Printer {
public:
// Method to print a single integer
void print(int x) {
cout << "Printing integer: " << x << endl;
}

// Overloaded method to print a double


void print(double x) {
cout << "Printing double: " << x << endl;
}

// Overloaded method to print a string


void print(string x) {
cout << "Printing string: " << x << endl;
}
};

int main() {
Printer p;
[Link](10); // Calls print(int)
[Link](3.14); // Calls print(double)
[Link]("Hello!"); // Calls print(string)
return 0;
}
Output:
Printing integer: 10
Printing double: 3.14
Printing string: Hello!
4. Overloading in Java
In Java, function overloading works similarly. Overloaded methods can be defined within the same class,
but their parameter lists must differ.
Example in Java:
class Calculator {
// Method to add two integers
int add(int a, int b) {
return a + b;
}

// Overloaded method to add three integers


int add(int a, int b, int c) {
return a + b + c;
}

// Overloaded method to add two doubles


double add(double a, double b) {
return a + b;
}
}

public class Main {


public static void main(String[] args) {
Calculator calc = new Calculator();
[Link]([Link](2, 3)); // Calls add(int, int)
[Link]([Link](2, 3, 4)); // Calls add(int, int, int)
[Link]([Link](2.5, 3.5)); // Calls add(double, double)
}
}
Output:
5
9
6.0
5. Key Points to Remember About Function Overloading
• Overloading is based on the function signature (name + parameters), but not on return
type. Functions with the same name cannot be overloaded if the only difference between
them is their return type.
• Order of parameters and parameter types can be used to distinguish overloaded functions.
• Default parameters can affect overloading in certain cases. If a function has default
arguments, the compiler may have difficulty distinguishing which function to call if there
are multiple candidates.
void print(int x = 10); // Default argument
void print(int x = 20); // This will cause ambiguity when calling print();
• Overloading can increase readability by providing multiple ways to perform the same
operation with different input types, without needing to create distinct function names.
6. Advantages of Function Overloading
1. Improved Readability: By using the same function name for different operations that are
conceptually similar, code becomes more readable and easier to maintain.
2. Avoids Duplication: Function overloading avoids the need to create multiple functions with
different names for similar operations, reducing redundancy.
3. Flexibility: Provides flexibility to call the same function in different contexts, such as
handling different data types or numbers of parameters.
7. Disadvantages of Function Overloading
1. Ambiguity: If not designed properly, overloaded functions might cause ambiguity, where
the compiler is unable to decide which version of the function to call.
2. Increased Complexity: Overloading with many versions of the same function can make it
harder for developers to understand the different variations of the function.
3. Limits in Some Languages: Not all languages support function overloading. Some
languages, like Python, don't support it directly, relying on dynamic typing and default
arguments to handle similar functionality.
3.6 OPERATORS
Operators are symbols or keywords that perform operations on variables and values. They are
fundamental to programming, enabling the manipulation of data and the execution of logic within
programs. Operators can be categorized based on their function and how they are used in expressions.
1. Types of Operators
Operators can be broadly classified into several types based on their functionality:
1.1 Arithmetic Operators
Arithmetic operators are used to perform basic mathematical operations such as addition, subtraction,
multiplication, division, and modulus.
Addition (+): Adds two operands.
int a = 5, b = 3;
int sum = a + b; // sum = 8
Subtraction (-): Subtracts the second operand from the first.
int diff = a - b; // diff = 2
Multiplication (*): Multiplies two operands.
int prod = a * b; // prod = 15
Division (/): Divides the first operand by the second (integer division in many languages).
int div = a / b; // div = 1 (integer division)
Modulus (%): Returns the remainder of division of the first operand by the second.
int mod = a % b; // mod = 2
1.2 Relational (Comparison) Operators
Relational operators are used to compare two values or expressions and return a boolean value (true or
false).
Equal to (==): Checks if two values are equal.
if (a == b) { /* true if a is equal to b */ }
Not equal to (!=): Checks if two values are not equal.
if (a != b) { /* true if a is not equal to b */ }
Greater than (>): Checks if the left operand is greater than the right operand.
if (a > b) { /* true if a is greater than b */ }
Less than (<): Checks if the left operand is less than the right operand.
if (a < b) { /* true if a is less than b */ }
Greater than or equal to (>=): Checks if the left operand is greater than or equal to the right operand.
if (a >= b) { /* true if a is greater than or equal to b */ }
Less than or equal to (<=): Checks if the left operand is less than or equal to the right operand.
if (a <= b) { /* true if a is less than or equal to b */ }
1.3 Logical Operators
Logical operators are used to combine or invert boolean values or expressions.
AND (&&): Returns true if both operands are true.
if (a > 0 && b > 0) { /* true if both a and b are greater than 0 */ }
OR (||): Returns true if at least one operand is true.
if (a > 0 || b > 0) { /* true if either a or b is greater than 0 */ }
NOT (!): Inverts the boolean value of the operand.
if (!(a > 0)) { /* true if a is not greater than 0 */ }
1.4 Assignment Operators
Assignment operators are used to assign values to variables.
Simple Assignment (=): Assigns the value of the right operand to the left operand.
int a = 5; // a is assigned the value 5
Addition Assignment (+=): Adds the right operand to the left operand and assigns the result to the left
operand.
a += 3; // equivalent to a = a + 3
Subtraction Assignment (-=): Subtracts the right operand from the left operand and assigns the result to
the left operand.
a -= 2; // equivalent to a = a – 2
Multiplication Assignment (*=): Multiplies the left operand by the right operand and assigns the result
to the left operand.
a *= 2; // equivalent to a = a * 2
Division Assignment (/=): Divides the left operand by the right operand and assigns the result to the left
operand.
a /= 2; // equivalent to a = a / 2
Modulus Assignment (%=): Takes the modulus of the left operand by the right operand and assigns the
result to the left operand.
a %= 3; // equivalent to a = a % 3
1.5 Unary Operators
Unary operators operate on a single operand.
Increment (++): Increases the value of the operand by 1.
a++; // Post-increment: first uses the value of a, then increases a by 1
++a; // Pre-increment: first increases a by 1, then uses the value of a
Decrement (--): Decreases the value of the operand by 1.
a--; // Post-decrement: first uses the value of a, then decreases a by 1
--a; // Pre-decrement: first decreases a by 1, then uses the value of a
Negation (-): Negates the value of the operand.
int x = -a; // if a is 5, x becomes -5
Logical NOT (!): Negates the boolean value of the operand.
bool result = !(a > b); // if a > b is true, result becomes false
1.6 Bitwise Operators
Bitwise operators operate on the bit-level representation of integers.
AND (&): Performs bitwise AND on the bits of two operands.
int result = a & b; // bitwise AND
OR (|): Performs bitwise OR on the bits of two operands.
int result = a | b; // bitwise OR
XOR (^): Performs bitwise XOR on the bits of two operands.
int result = a ^ b; // bitwise XOR
NOT (~): Inverts the bits of the operand.
int result = ~a; // bitwise NOT
Left shift (<<): Shifts the bits of the left operand to the left by the number of positions specified by the
right operand.
int result = a << 2; // shifts a left by 2 positions
Right shift (>>): Shifts the bits of the left operand to the right by the number of positions specified by the
right operand.
int result = a >> 2; // shifts a right by 2 positions
1.7 Conditional (Ternary) Operator
The conditional (ternary) operator is a shorthand for an if-else statement.
Ternary (?:): Evaluates a condition and returns one of two values depending on whether the condition is
true or false.
int result = (a > b) ? a : b; // returns a if a > b, otherwise returns b
1.8 Type-Casting Operators
Type casting allows you to explicitly convert between different data types.
C-style Cast ((type)): Converts one type to another.
float f = 5.7;
int i = (int) f; // cast float to int (i will be 5)
Static Cast (static_cast<type>): In C++, this is a safer alternative to C-style casting.
double d = 5.6;
int i = static_cast<int>(d); // safely cast double to int
MODULE-IV
INTERMEDIATE CODE GENERATOR AND RUN TIME ENVIRONMENTS
4.1 PREPROCESSING THE INTERMEDIATE CODE
In the context of compiler design, intermediate code is an abstraction layer between the high-level
source code and the target machine code (low-level code). It is used to make the compilation process
more modular and flexible, allowing for optimizations and easier code generation for different target
architectures.
Preprocessing the intermediate code refers to the transformation or modification of intermediate code
before further stages of optimization and code generation. This step typically involves simplifying,
optimizing, or restructuring the intermediate code to make subsequent phases of the compiler more
efficient or to target specific machine features.
Key Concepts of Preprocessing Intermediate Code
1. Intermediate Code Generation
o The compiler generates intermediate code after the parsing phase. This code is
usually platform-independent and represents an abstraction of the source code.
o The intermediate code could be represented in various forms:
▪ Three-address code (TAC): A popular form of intermediate representation
(IR) where each instruction has at most three operands.
▪ Abstract Syntax Tree (AST): A tree representation of the syntax of the
source code.
▪ Control Flow Graph (CFG): Represents the flow of control in a program.
2. Preprocessing Techniques Preprocessing intermediate code typically involves the
following techniques:
o Simplification: Reducing the complexity of the intermediate code by eliminating
unnecessary operations or by simplifying expressions.
▪ Example: Simplifying x * 1 to x, or x + 0 to x.
o Common Subexpression Elimination (CSE): Identifying expressions that are
computed multiple times and replacing them with a single computation.
▪ Example: If a + b appears multiple times in the code, replace all instances
with a single computation temp = a + b;.
o Constant Folding: Evaluating constant expressions at compile time rather than at
runtime.
▪ Example: 5 + 3 becomes 8 during preprocessing, rather than performing the
addition during execution.
o Constant Propagation: Replacing variables that have constant values with the
actual constants.
▪ Example: If x = 10 is assigned earlier, all uses of x can be replaced by 10 in
subsequent expressions.
o Dead Code Elimination: Removing parts of the intermediate code that are never
used (i.e., unreachable or redundant code).
▪ Example: If a variable x is assigned a value but never used in the code, the
assignment can be removed.
o Strength Reduction: Replacing expensive operations (like multiplication or
division) with cheaper ones (like addition or bit shifts).
▪ Example: Replacing i * 2 with i << 1 (bit shift).
o Loop Invariant Code Motion: Moving computations that do not change within a
loop outside of the loop, reducing redundant calculations.
▪ Example: If x * y is computed inside a loop but x and y do not change within
the loop, move the computation outside the loop.
3. Why Preprocess Intermediate Code?
o Optimization: The main goal is to optimize the intermediate code to ensure that the
final machine code is efficient in terms of both time and space.
o Portability: Intermediate code is often platform-independent. Preprocessing can
help make it more suitable for different target machine architectures by enabling
more aggressive optimization.
o Simplification for Backend: Preprocessing intermediate code can make the backend
(code generation and optimization phases) simpler and faster by simplifying the
intermediate code structure.
o Code Generation Efficiency: Preprocessing may improve the efficiency of the code
generation phase, since the intermediate code is often transformed into a form that is
closer to what the machine architecture needs.
Example of Intermediate Code and Preprocessing
Suppose we have the following high-level code:
int a = 5;
int b = 10;
int c = a * b + 2 * a;
The compiler generates an intermediate code, represented in Three-Address Code (TAC):
t1 = a * b // Multiply a and b
t2 = 2 * a // Multiply 2 and a
t3 = t1 + t2 // Add t1 and t2
c = t3 // Assign result to c
Preprocessing Steps
Constant Folding:
If a = 5 and b = 10 are constants, we can evaluate the expressions at compile-time:
t1 = 5 * 10 // Constant folding: t1 = 50
t2 = 2 * 5 // Constant folding: t2 = 10
t3 = 50 + 10 // t3 = 60
c = t3 // c = 60
Constant Propagation:
If a and b are constants, replace them directly in the code, and remove any intermediate variables:
c = 60 // Direct assignment after evaluation
Dead Code Elimination:
If there are unused intermediate variables (t1, t2, or t3), they can be eliminated, leaving only the necessary
operations.
After preprocessing, the final intermediate code might simply be:
c = 60
This optimized intermediate representation can then proceed to the code generation phase, where it will
be translated into machine-specific code.
4.2 PREPROCESSING OF EXPRESSIONS
Preprocessing of expressions in compiler design refers to the techniques applied to simplify or optimize
expressions in the intermediate code or the source code before further compilation phases (like
optimization or code generation). This step is essential for improving the efficiency of the compiled
program, reducing execution time, and minimizing memory usage.
Why Preprocess Expressions?
• Optimization: Preprocessing helps in optimizing the expressions before they are evaluated
during runtime, reducing the need for unnecessary computations.
• Simplification: It simplifies complex expressions into simpler ones, making the code
generation and optimization phases easier and faster.
• Improved Code Generation: It helps generate more efficient machine code by reducing
redundant operations.
• Portability: The expression preprocessing step can make code more adaptable to different
target architectures by making expressions simpler and more general.
Key Techniques in Preprocessing Expressions
Several well-known techniques are used for preprocessing expressions. These techniques can be applied
at the level of intermediate code or the source code during compilation. Let's explore some of the most
common techniques:
1. Constant Folding
Constant folding is the technique of evaluating constant expressions at compile time rather than at
runtime. This reduces unnecessary runtime computations, making the program more efficient.
Example:
Consider the following expression:
int x = 5 + 3;
At compile time, we can evaluate this expression:
int x = 8; // After constant folding
This transformation means the program doesn't need to perform the addition operation during execution,
saving processing time.
2. Constant Propagation
Constant propagation is a technique where the compiler replaces variables that have constant values
with those constant values in the expressions. This can reduce unnecessary assignments and further
simplify expressions.
Example:
Consider the following code:
int a = 5;
int b = a * 3;
Using constant propagation:
int a = 5;
int b = 5 * 3; // a is replaced with 5 directly
After constant folding, it can be further simplified to:
int b = 15;
This reduces the need for computing a * 3 at runtime since the value is known at compile time.
3. Simplification of Expressions
Simplifying expressions involves replacing expressions that are trivially reducible, often based on identity
rules. This can include removing redundant operations or simplifying mathematical expressions.
Example:
Addition of zero:
int x = a + 0; // Simplifies to int x = a;
Multiplication by one:
int x = a * 1; // Simplifies to int x = a;
Expression involving identical operands:
int x = a + (-a); // Simplifies to int x = 0;
Expression with unnecessary parentheses:
int x = (a + b) + c; // Simplifies to int x = a + b + c;
These simplifications reduce unnecessary calculations in the final compiled code.
4. Common Subexpression Elimination (CSE)
Common subexpression elimination is a technique where the compiler identifies expressions that are
computed multiple times and stores them in a temporary variable to avoid recalculating them. This can
reduce redundant computations and save time.
Example:
Consider the following code:
int x = a * b;
int y = a * b + c;
The expression a * b is computed twice, so it can be optimized by computing it once and reusing it:
int temp = a * b;
int x = temp;
int y = temp + c;
This eliminates the redundant multiplication and makes the program more efficient.
5. Dead Code Elimination
Dead code elimination involves removing parts of the expression that never affect the outcome of the
program. If a variable or expression is computed but never used, it can be eliminated.
Example:
Consider the following code:
int a = 10;
int b = 20;
int c = a + b;
int d = 30;
If d is never used in subsequent code, the assignment int d = 30; is considered dead code and can be safely
eliminated.
6. Strength Reduction
Strength reduction involves replacing expensive operations with cheaper ones, typically to improve
performance. Common examples include replacing multiplication by a constant with bit shifting or
addition.
Example:
If an expression involves multiplication by a power of 2, it can be replaced with a bitwise left shift, which
is typically faster.
int x = i * 8; // Replace with
int x = i << 3; // Left shift by 3 (since 8 = 2^3)
This technique can speed up execution by utilizing more efficient hardware operations.
7. Loop Invariant Code Motion
Loop invariant code motion is a technique where expressions that do not depend on the loop variable
(i.e., expressions whose result does not change in each iteration of the loop) are moved outside the loop.
This reduces the number of times the expression is evaluated.
Example:
Consider the following loop:
for (int i = 0; i < n; i++) {
int x = a * b; // 'a * b' is invariant and can be moved outside the loop
// other code
}
The expression a * b does not depend on i, so it can be moved outside the loop to avoid redundant
computation:
int x = a * b;
for (int i = 0; i < n; i++) {
// Use x directly without recomputing a * b
}
This reduces the number of operations performed inside the loop, improving performance.
4.3 PREPROCESSING OF IF STATEMENTS AND GOTO STATEMENTS
Preprocessing of control flow statements like if statements and goto statements is a critical part of compiler
optimization. The main goal of preprocessing these control flow structures is to simplify or optimize them,
making the intermediate code more efficient and easier to translate into machine code. This step is
performed during the intermediate code generation phase or even earlier, depending on the compiler.
Preprocessing of if Statements
An if statement is a basic control flow statement used for conditional branching. The compiler can perform
several optimizations on if statements during preprocessing to make them more efficient or easier to
handle during subsequent stages of compilation.
1. Constant Folding in if Statements
If the condition of the if statement involves constant expressions, the compiler can evaluate the condition
at compile time (constant folding). This reduces the overhead of evaluating the condition at runtime.
Example:
if (5 + 3 > 8) { // constant folding will simplify this at compile time
// do something
}
Preprocessed version:
if (8 > 8) { // constant folding results in this
// This will never be true, so the entire block can be eliminated.
}
In this case, the if condition is always false, so the body of the if statement can be completely eliminated.
2. Constant Propagation in if Statements
If a variable used in the if condition has a constant value, it can be substituted at compile time, making
the condition simpler to evaluate.
Example:
int x = 10;
if (x > 5) {
// do something
}
Preprocessed version:
if (10 > 5) {
// do something
}
The condition x > 5 is replaced with 10 > 5, which is always true. The compiler can now remove
unnecessary checks or take appropriate actions based on the simplified condition.
3. Dead Code Elimination
If the condition in an if statement is always true or false, the compiler can eliminate the entire block of
code associated with the if statement.
Example 1 (Always True):
if (1) { // Always true
// do something
}
Preprocessed version:
// do something
Example 2 (Always False):
if (0) { // Always false
// do something
}
Preprocessed version:
// Dead code can be removed
If the condition can be determined at compile time (i.e., constant expression evaluation), the if statement
can either be simplified or completely removed.
4. Simplifying Nested if Statements
If there are nested if statements with conditions that can be simplified or merged, the compiler can perform
those optimizations.
Example:
if (a > 0) {
if (b > 0) {
// do something
}
}
If both a and b are constants and their values are known, the compiler can merge these conditions into a
single check and simplify the structure.
Preprocessing of goto Statements
The goto statement provides an unconditional jump to a specific label in the code, allowing for arbitrary
control flow. Since goto can make the program harder to analyze, compilers can optimize and preprocess
goto statements to improve readability and performance.
1. Eliminating Unnecessary goto Statements
If a goto statement does not alter the flow of control (e.g., it jumps to the next statement or to itself), it
can be removed.
Example:
goto next;
next:
// some code
Preprocessed version:
// some code
Preprocessing of if Statements and goto Statements in Compiler Design
Preprocessing of control flow statements like if statements and goto statements is a critical part of compiler
optimization. The main goal of preprocessing these control flow structures is to simplify or optimize them,
making the intermediate code more efficient and easier to translate into machine code. This step is
performed during the intermediate code generation phase or even earlier, depending on the compiler.
Preprocessing of if Statements
An if statement is a basic control flow statement used for conditional branching. The compiler can perform
several optimizations on if statements during preprocessing to make them more efficient or easier to
handle during subsequent stages of compilation.
1. Constant Folding in if Statements
If the condition of the if statement involves constant expressions, the compiler can evaluate the condition
at compile time (constant folding). This reduces the overhead of evaluating the condition at runtime.
Example:
c
Copy code
if (5 + 3 > 8) { // constant folding will simplify this at compile time
// do something
}
Preprocessed version:
c
Copy code
if (8 > 8) { // constant folding results in this
// This will never be true, so the entire block can be eliminated.
}
In this case, the if condition is always false, so the body of the if statement can be completely eliminated.
2. Constant Propagation in if Statements
If a variable used in the if condition has a constant value, it can be substituted at compile time, making
the condition simpler to evaluate.
Example:
c
Copy code
int x = 10;
if (x > 5) {
// do something
}
Preprocessed version:
c
Copy code
if (10 > 5) {
// do something
}
The condition x > 5 is replaced with 10 > 5, which is always true. The compiler can now remove
unnecessary checks or take appropriate actions based on the simplified condition.
3. Dead Code Elimination
If the condition in an if statement is always true or false, the compiler can eliminate the entire block of
code associated with the if statement.
Example 1 (Always True):
c
Copy code
if (1) { // Always true
// do something
}
Preprocessed version:
c
Copy code
// do something
Example 2 (Always False):
c
Copy code
if (0) { // Always false
// do something
}
Preprocessed version:
c
Copy code
// Dead code can be removed
If the condition can be determined at compile time (i.e., constant expression evaluation), the if statement
can either be simplified or completely removed.
4. Simplifying Nested if Statements
If there are nested if statements with conditions that can be simplified or merged, the compiler can perform
those optimizations.
Example:
c
Copy code
if (a > 0) {
if (b > 0) {
// do something
}
}
If both a and b are constants and their values are known, the compiler can merge these conditions into a
single check and simplify the structure.

Preprocessing of goto Statements


The goto statement provides an unconditional jump to a specific label in the code, allowing for arbitrary
control flow. Since goto can make the program harder to analyze, compilers can optimize and preprocess
goto statements to improve readability and performance.
1. Eliminating Unnecessary goto Statements
If a goto statement does not alter the flow of control (e.g., it jumps to the next statement or to itself), it
can be removed.
Example:
c
Copy code
goto next;
next:
// some code
Preprocessed version:
c
Copy code
// some code
If goto next is followed immediately by the label next, then the goto statement can be removed as it doesn't
change the control flow.
2. Converting goto to Structured Control Flow
In some cases, compilers may try to replace goto statements with structured control flow constructs like
while, for, or if statements, which are more readable and easier to analyze.
Example (Simplification of loop with goto):
start:
// some code
if (condition) goto start;
The goto can be replaced with a while loop:
while (condition) {
// some code
}
This improves the readability and maintainability of the code while achieving the same behavior.
3. Eliminating Unreachable Code After goto
Any code that follows a goto statement that is unconditionally jumped over becomes unreachable and
should be removed. This is a form of dead code elimination.
Example:
goto skip;
int x = 5; // Dead code
skip:
// some code
Preprocessed version:
goto skip;
// int x = 5; // This line is eliminated because it is unreachable
skip:
// some code
4. Goto Hoisting and Simplification
If a goto jumps over a set of sequential statements that can be combined into a single block or loop, it can
be hoisted to simplify the structure.
Example:
goto label;
x = 5;
y = 10;
label:
z = x + y;
The above code can be simplified to:
x = 5;
y = 10;
z = x + y;
If the code after the goto can be combined in a straight sequence, it can be simplified, removing the need
for the goto statement altogether.
4.4 PREPROCESSING OF ROUTINES
In compiler design, preprocessing of routines refers to the various transformations and optimizations
performed on functions or procedures (routines) before the actual code generation and optimization
phases. The goal is to make the routine calls more efficient, simplify the code, or make it easier for later
stages of the compiler to handle. These transformations often involve simplifying parameters, managing
function calls, removing unnecessary routines, and handling inlining and recursion.
Preprocessing routines typically occurs in the intermediate code generation phase or the semantic
analysis phase, before the optimization and code generation phases. This stage ensures that the routine
calls in the intermediate representation (IR) are more efficient or simplified.
Key Techniques in Preprocessing Routines
1. Inlining of Functions
Function inlining is the process of replacing a function call with the actual body of the function. This can
improve performance by eliminating the overhead associated with calling a function (e.g., pushing
arguments onto the stack, jumping to the function address, etc.).
Inlining is most beneficial for small, frequently called functions, where the cost of the function call
outweighs the cost of copying the function body directly into the calling code.
Example:
int add(int a, int b) {
return a + b;
}

int result = add(2, 3);


After inlining:
int result = 2 + 3;
This eliminates the need for a function call at runtime and can reduce execution time.
2. Function Call Simplification
In some cases, function calls can be simplified or replaced by expressions. This typically happens when
the result of a function call is known at compile time or when the function performs a simple operation
that can be expressed directly in the caller.
Example (constant folding and propagation):
int multiply(int x, int y) {
return x * y;
}

int a = 3;
int b = 4;
int result = multiply(a, b); // At compile time, this can be simplified
After preprocessing:
int result = 12; // Simplified at compile time by replacing the function call with the result
3. Tail Call Optimization (TCO)
Tail call optimization (TCO) is a technique used when the last operation in a function is a call to another
function (or itself, in the case of recursion). The idea is that, since there’s no need to keep the current
function’s stack frame after the call returns, the current function’s stack frame can be reused, effectively
converting the function call into a jump.
For recursive functions, this optimization prevents the stack from growing too large, potentially leading
to a stack overflow.
Example (without TCO):
int factorial(int n) {
if (n == 0) return 1;
return n * factorial(n - 1);
}
Example (with TCO):
int factorial(int n, int result) {
if (n == 0) return result;
return factorial(n - 1, n * result);
}
The tail-recursive version can be optimized to avoid additional stack frames by reusing the existing one.
4. Dead Code Elimination for Routines
If a function (or routine) is never called in the program, it can be removed during preprocessing. This is a
simple form of dead code elimination (DCE). Removing unused routines can significantly reduce the
size of the final code and improve efficiency.
Example:
void unusedFunction() {
// Some code
}

int main() {
// No calls to unusedFunction
return 0;
}
After preprocessing, the unused function is eliminated, reducing the code size.
5. Parameter Passing Optimization
The way parameters are passed to functions can be optimized during preprocessing. This includes
optimizing pass-by-value or pass-by-reference mechanisms, and simplifying the handling of parameters.
• Pass-by-value optimization: If the parameter is a constant or immutable, it can be directly
replaced in the function body.
Example:
void print(int x) {
printf("%d", x);
}

print(5); // 5 is passed as a constant


After preprocessing, the code could be simplified:
printf("%d", 5); // Directly call printf with the constant
• Pass-by-reference optimization: If parameters are passed by reference but never modified
inside the function, this can be simplified to pass-by-value.
6. Recursive Routine Preprocessing
Recursive routines can sometimes be problematic in terms of performance, particularly with deep
recursion leading to excessive memory use (stack overflow). Preprocessing recursive routines may
involve:
• Converting recursion to iteration: Some recursive functions, like those that involve simple
accumulation or iteration (such as factorial or Fibonacci), can be rewritten iteratively to
avoid the overhead of recursion.
Example (recursive Fibonacci):
int fibonacci(int n) {
if (n <= 1) return n;
return fibonacci(n - 1) + fibonacci(n - 2);
}
Rewritten iteratively:
int fibonacci(int n) {
int a = 0, b = 1, temp;
for (int i = 2; i <= n; i++) {
temp = a + b;
a = b;
b = temp;
}
return b;
}
• This avoids the overhead of recursive calls, such as managing the call stack.
7. Inlining Libraries and Standard Functions
If a function from a library is small or often called, the compiler might decide to inline the library function
during preprocessing. This is similar to function inlining, where the body of the library function is inserted
directly into the code, thus eliminating the overhead of function calls.
Example:
int add(int x, int y) {
return x + y;
}

int result = add(10, 20);


The function add could be inlined, producing:
int result = 10 + 20;
4.5 VARIANTS OF SYNTAX TREES
In compiler design, syntax trees (also known as parse trees) are used to represent the syntactic structure
of the source code according to a formal grammar. They are hierarchical structures that depict the syntax
of a programming language as a tree, where each node represents a construct in the language (such as
expressions, statements, or control structures).
There are several variants of syntax trees that serve different purposes in various stages of the
compilation process. These variants optimize the representation of the program's structure to facilitate
later phases like semantic analysis, optimization, and code generation. Below are the key variants of
syntax trees used in compiler design:
1. Abstract Syntax Tree (AST)
Definition:
The Abstract Syntax Tree (AST) is a simplified and more abstract version of the syntax tree that focuses
on the logical structure of the program rather than its syntactic form. It abstracts away unnecessary
syntactic details, such as parentheses or other punctuation, and represents the program in a way that makes
it easier for later compiler phases (like semantic analysis, optimization, and code generation) to process.
Key Features:
• Simplified Representation: It eliminates unnecessary syntactic elements like parentheses,
commas, and other delimiters that do not contribute to the meaning of the program.
• Focus on Semantics: The AST captures the logical structure of the program. For example,
an arithmetic expression might be represented by nodes for addition, multiplication, and
their operands rather than the full syntactic expression with every operator.
Example:
For the expression:
a + (b * c)
The corresponding AST might look like:
+
/\
a *
/\
b c
In this AST:
• The addition (+) node has two children: a and a multiplication (*).
• The multiplication node has two children: b and c.
• Parentheses are not included in the AST, as their role is only syntactic.
Usage:
• ASTs are used in the later stages of the compiler to perform semantic analysis (e.g., type
checking) and optimization.
• They are the starting point for generating intermediate code and then converting it into the
target machine code.
2. Concrete Syntax Tree (CST)
Definition:
The Concrete Syntax Tree (CST), also known as a Parse Tree, represents the syntactic structure of the
source code according to the grammar of the programming language. It includes all the terminals and non-
terminals defined by the grammar and corresponds directly to the derivation process used to generate the
code.
Key Features:
• Detailed Representation: Unlike the AST, the CST includes all syntax elements such as
parentheses, operators, and punctuation.
• Direct Representation of Grammar: Each node in the CST corresponds directly to a
grammar rule or token in the source language.
Example:
For the expression:
a + (b * c)
The corresponding CST would include all the grammar rules, such as:
Expr
/ \
Expr +
/ / \
a ( Expr )
/ \
Expr *
/ / \
b c
In this tree:
• The root node is Expr, representing the entire expression.
• Internal nodes represent non-terminals like Expr and operators like + and *.
• The leaves represent terminal symbols (like a, b, c, and parentheses).
Usage:
• The CST is useful in the parsing phase because it represents exactly how the input conforms
to the grammar.
• It is typically used by the parser to verify that the source code is syntactically valid and to
build a structure for further analysis.
3. Syntax-Directed Translation Tree (SDT)
Definition:
A Syntax-Directed Translation Tree (SDT) is an extension of the abstract syntax tree that associates
semantic actions with the tree's nodes. The nodes in an SDT may carry additional information or actions
that correspond to a translation or transformation of the source program into another representation (e.g.,
intermediate code).
Key Features:
• Augmented with Semantic Information: Each node in the tree can be associated with an
action, such as the computation of a value, type checking, or code generation.
• Semantic Analysis: The SDT is particularly useful for syntax-directed translation where
semantic information (e.g., type information, values, etc.) is propagated through the tree.
Example:
For the expression:
a + (b * c)
An SDT might associate actions like calculating values or types with each node.
• +: Perform addition.
• *: Perform multiplication.
The SDT would look similar to an AST but with semantic actions attached to each node:
+
/\
a *
/\
b c
Actions associated with each node might include:
• a and b, c: Retrieve values or types.
• *: Multiply b and c.
• +: Add a to the result of b * c.
Usage:
• SDTs are used in semantic analysis and code generation.
• They provide a way to carry out syntax-directed translation (e.g., converting source code
into intermediate code) based on the syntactic structure of the input.
4. Control Flow Graph (CFG)
Definition:
A Control Flow Graph (CFG) is a representation of the flow of control within a program. While not a
traditional syntax tree, it can be considered a variant of a syntax tree in the context of control structures
(such as loops and conditionals).
Key Features:
• Nodes represent basic blocks: A basic block is a straight-line code sequence with no
branches in except to the entry, and no branches out except at the exit.
• Edges represent control flow: The edges between nodes represent the flow of control from
one block to another.
Example:
For a program containing an if statement and a loop:
if (x > 0) {
y = 10;
} else {
y = 20;
}
while (y > 0) {
y--;
}
The corresponding CFG might look like:
Entry
|
Condition (if x > 0)
/ \
/ \
Block1 Block2
| |
y = 10 y = 20
\ /
Exit
|
Loop Condition (y > 0)
|
Block3
|
y--;
|
Exit
Usage:
• The Control Flow Graph is used in optimization and code generation to represent how
control flows between different parts of the program.
• It is particularly useful for tasks like data flow analysis, register allocation, and optimizing
loops and branches.
5. Intermediate Representation (IR) Tree
Definition:
The Intermediate Representation (IR) Tree is a form of tree that represents the program in an
intermediate form, typically used in the optimization and code generation phases of compilation. The
structure of an IR tree is closer to the target machine architecture than an AST or CST, and it may involve
operations, variables, and memory locations directly.
Key Features:
• Simplified representation: It abstracts away most of the high-level language constructs and
represents the program closer to machine instructions.
• Abstracts low-level details: While an IR tree may still use high-level constructs like
variables, it is focused on the operations needed to produce machine code.
Example:
For an expression:
a+b*c
The corresponding IR might represent the operations as:
+
/\
a *
/\
b c
The IR structure is similar to an AST, but it could be expanded to show specific operations like loading
values into registers and performing arithmetic.
Usage:
• The IR is used during optimization to improve the performance of the code.
• It is a key intermediate step between high-level programming constructs and machine code
generation.
4.6 THREE ADDRESS CODE
Three-Address Code (TAC) is an intermediate representation used in compiler design to express
computations in a simplified form. It is called "three-address" because each instruction in TAC typically
involves at most three addresses or operands: two source operands and one destination. These operands
can represent variables, constants, or intermediate results.
Key Features of Three-Address Code:
1. Simple and Expressive: It allows complex expressions and operations to be broken down
into simple instructions that involve at most three addresses.
2. Intermediate Representation: TAC serves as an intermediate step in the compilation
process between high-level language code and machine code, helping optimizations and
target code generation.
3. Clear Operations: It simplifies the process of analyzing and optimizing code by breaking
down high-level operations into smaller, more manageable components.
Format of Three-Address Code:
Each TAC statement has the form:
x = y op z
Where:
• x is the result variable (destination).
• y and z are the operands (source).
• op is the operation (e.g., +, -, *, /, etc.).
Here, x, y, and z are usually variables, constants, or temporary variables generated during compilation.
Example of Three-Address Code:
Consider the following expression in a high-level language like C:
a=b+c*d
The corresponding TAC would be:
t1 = c * d
t2 = b + t1
a = t2
In this case:
• t1 is a temporary variable that stores the result of c * d.
• t2 stores the result of b + t1.
• Finally, the result is assigned to a.
Example 2:
For a more complex expression like:
a = (b + c) * (d - e)
The TAC representation would be:
t1 = b + c
t2 = d - e
t3 = t1 * t2
a = t3
Types of Operations in TAC:
• Arithmetic operations: +, -, *, /
• Relational operations: <, >, ==, !=, <=, >=
• Logical operations: &&, ||, !
• Assignments: =, :=
• Control flow: goto, if, while, return
• Memory accesses: Dereferencing pointers or array accesses.
Benefits of Three-Address Code:
1. Simplification: TAC simplifies the structure of the original program, making it easier for
optimizations.
2. Intermediate Representation: It acts as a bridge between high-level language and machine
code, allowing for target-specific code generation.
3. Analysis and Optimization: TAC makes it easier to apply optimization techniques like
constant folding, dead code elimination, etc.
4.7 BOOLEAN EXPRESSIONS
In the context of Three-Address Code (TAC), Boolean expressions are represented in a simplified form,
using temporary variables to hold intermediate results. Each TAC instruction typically has the format:
x = y op z
Where:
• x is a temporary variable (destination).
• y and z are operands (variables, constants, or temporary variables).
• op is the Boolean operation (AND, OR, NOT, etc.).
Example: Boolean Expression in Three-Address Code
Consider the following Boolean expression in a high-level language:
a = (b AND c) OR (d AND e)
The corresponding Three-Address Code (TAC) would be:
t1 = b AND c
t2 = d AND e
a = t1 OR t2
Here:
• t1 = b AND c: The result of b AND c is stored in temporary variable t1.
• t2 = d AND e: The result of d AND e is stored in temporary variable t2.
• a = t1 OR t2: Finally, the result of t1 OR t2 is assigned to variable a.
Example 2: More Complex Boolean Expression
Consider a more complex Boolean expression:
a = (b AND (c OR d)) AND (e OR f)
This can be broken down into TAC as:
t1 = c OR d
t2 = b AND t1
t3 = e OR f
a = t2 AND t3
Here:
• t1 = c OR d: The result of c OR d is stored in temporary variable t1.
• t2 = b AND t1: The result of b AND t1 is stored in temporary variable t2.
• t3 = e OR f: The result of e OR f is stored in temporary variable t3.
• a = t2 AND t3: Finally, the result of t2 AND t3 is assigned to variable a.
Example 3: Using NOT (Negation)
Consider the Boolean expression:
a = NOT (b AND c)
The TAC for this expression would be:
t1 = b AND c
a = NOT t1
Here:
• t1 = b AND c: The result of b AND c is stored in temporary variable t1.
• a = NOT t1: The result of negating t1 (i.e., NOT (b AND c)) is assigned to variable a.
Example 4: Boolean Expression with XOR
For a Boolean expression like:
a = (b XOR c) AND d
The corresponding TAC is:
t1 = b XOR c
a = t1 AND d
Here:
• t1 = b XOR c: The result of b XOR c is stored in temporary variable t1.
• a = t1 AND d: The result of t1 AND d is assigned to variable a.
More Examples of TAC with Boolean Expressions:
1. Expression:
a = (b AND c) OR (d AND e) AND f
TAC:
t1 = b AND c
t2 = d AND e
t3 = t2 AND f
a = t1 OR t3
2. Expression:
a = b OR (c AND d)
TAC:
t1 = c AND d
a = b OR t1
General Rules for TAC with Boolean Expressions:
AND (∧): For a Boolean expression like x AND y, the TAC will be:
t1 = x AND y
where t1 is a temporary variable holding the result.
OR (∨): For an expression like x OR y, the TAC will be:
t1 = x OR y
NOT (¬): For an expression like NOT x, the TAC will be:
t1 = NOT x
XOR (⊕): For an expression like x XOR y, the TAC will be:
t1 = x XOR y
4.8 FLOW-OF-CONTROL STATEMENTS
In compiler design, Flow-of-Control Statements are crucial for directing the execution of a program
based on certain conditions. In Three-Address Code (TAC), flow-of-control statements such as
conditional jumps, loops, and function calls are translated into instructions that manage the program's
control flow.
These flow-of-control statements allow the program to:
1. Make decisions (if, if-else).
2. Repeat sections of code (while, for).
3. Exit or transfer control to another part of the program (goto, break, return).
Common Flow-of-Control Statements in TAC
Here are the main types of flow-of-control statements in Three-Address Code:
1. Conditional Jumps (if, if-else)
2. Unconditional Jumps (goto)
3. Loops (while, for)
4. Function Calls and Returns
5. Break and Continue
1. Conditional Jump (if-else)
A conditional jump is used to control the execution based on a condition. If the condition is true, one block
of code is executed; otherwise, another block is executed.
Example:
High-level expression:
if (a > b) {
x = 10;
} else {
x = 20;
}
In Three-Address Code, this would be translated to:
if a > b goto L1 // If a > b, jump to label L1
x = 20 // Else, execute this line
goto L2 // Jump to the end
L1:
x = 10 // If a > b, execute this line
L2:
• if a > b goto L1: Check the condition a > b. If true, jump to label L1.
• x = 20: If the condition is false, assign 20 to x.
• goto L2: Skip the x = 10 part by jumping to L2.
• L1: and L2: are labels marking points in the code.
2. Unconditional Jump (goto)
An unconditional jump transfers control to another part of the program, regardless of any conditions.
Example:
High-level expression:
goto L1;
In Three-Address Code:
goto L1
This means the program will jump to the label L1.
3. Loops (while, for)
In a while loop, the loop body is executed as long as the condition is true. In Three-Address Code, this
involves checking the condition before each iteration and jumping if the condition is false.
Example:
High-level expression:
while (a < b) {
a = a + 1;
}
In Three-Address Code:
L1:
if a >= b goto L2 // If a >= b, exit loop
a=a+1 // Execute loop body
goto L1 // Go back to the start of the loop
L2:
• L1: marks the start of the loop.
• if a >= b goto L2: Check if the loop condition is false (a >= b). If true, exit the loop and jump
to L2.
• a = a + 1: The body of the loop that executes if the condition is true.
• goto L1: Jump back to the start of the loop.
4. Function Calls and Returns
When a function is called or a return statement is encountered, the control flow jumps to the function
definition or returns from the current function.
Example:
High-level expression:
y = foo(x);
In Three-Address Code:
t1 = foo(x) // Call function foo with argument x, store result in t1
y = t1 // Assign the result of foo(x) to y
Example with Return Statement:
High-level expression:
return a + b;
In Three-Address Code:
t1 = a + b // Compute a + b
return t1 // Return the result
• t1 = foo(x): The result of the function call foo(x) is stored in t1.
• return t1: Return the value of t1 to the calling function.
5. Break and Continue
In many programming languages, break is used to exit from loops prematurely, and continue is used to
skip the current iteration and move to the next one.
Example with Break (Exit Loop):
High-level expression:
while (a < b) {
if (c > 0) {
break; // Exit the loop
}
a = a + 1;
}
In Three-Address Code:
L1:
if a >= b goto L2 // If a >= b, exit loop
if c > 0 goto L3 // If c > 0, break the loop
a=a+1 // Loop body
goto L1 // Continue loop
L3:
goto L2 // Break: Exit the loop
L2:
• if c > 0 goto L3: If c > 0, jump to L3, which exits the loop.
• goto L2: When break is encountered, control is transferred to L2 (outside the loop).
Example with Continue (Skip Iteration):
High-level expression:
while (a < b) {
if (c > 0) {
continue; // Skip the rest of the loop body
}
a = a + 1;
}
In Three-Address Code:
L1:
if a >= b goto L2 // If a >= b, exit loop
if c > 0 goto L3 // If c > 0, skip the rest of the loop body
a=a+1 // Loop body
goto L1 // Continue loop
L3:
goto L1 // Continue to next iteration
L2:
• if c > 0 goto L3: If c > 0, jump to L3, which skips the loop body for this iteration.
Summary of Common Flow-of-Control TAC Instructions
Conditional Jump:
if condition goto label
Unconditional Jump:
goto label
Loop (while, for):
label1:
if condition goto label2
// loop body
goto label1
label2:
Function Call:
t1 = function(args) // Call function and store result
Return Statement:
return value // Return from function
Break:
goto label // Exit the loop
Continue:
goto label // Skip to the next iteration of the loop
4.9 CONTROL- FLOW TRANSLATION OF BOOLEAN EXPRESSIONS
Control-Flow Translation of Boolean Expressions refers to how Boolean expressions, typically
involving logical operators (such as AND, OR, NOT, etc.), are transformed into a sequence of control-
flow operations in intermediate representations, such as Three-Address Code (TAC). The idea is to
manage the flow of execution depending on the evaluation of Boolean conditions, which commonly
appear in if, while, for, or other control structures.
In this context, the translation involves breaking down Boolean expressions into simpler steps, using
conditional and unconditional jumps (or branches) to control the flow based on the truth values of the
Boolean expressions. Let's explore how Boolean expressions are typically translated into control-flow
operations in TAC.
Types of Boolean Expressions and Their Control Flow Translation
We'll cover how to translate various types of Boolean expressions into control flow in TAC, focusing on
how these expressions are used in conditionals, loops, and other control-flow statements.
1. Simple Boolean Expressions (Relational Conditions)
For basic Boolean expressions like A AND B, A OR B, or NOT A, we need to use conditional jumps that
evaluate the truth of the expression.
Example 1: A AND B
High-level expression:
if (A AND B)
In Three-Address Code:
if A == 0 goto L1 // If A is false, jump to label L1
if B == 0 goto L1 // If B is false, jump to label L1
// Both A and B are true, continue execution
L1:
• First, it checks if A is false. If so, the control jumps to label L1, effectively skipping the true
path of the AND.
• Then, it checks if B is false and performs the same logic.
• If neither is false, the execution continues, implying the AND condition is true.
Example 2: A OR B
High-level expression:
if (A OR B)
In Three-Address Code:
if A != 0 goto L1 // If A is true, jump to label L1
if B != 0 goto L1 // If B is true, jump to label L1
goto L2 // Both A and B are false, jump to label L2
L1:
// Code for true condition
L2:
• It checks if A is true. If so, control jumps to L1 because the OR expression is true.
• If A is false, it checks B. If B is true, it jumps to L1.
• If both A and B are false, control goes to L2, skipping the true path of the OR.
Example 3: NOT A
High-level expression:
if (NOT A)
In Three-Address Code:
if A != 0 goto L1 // If A is true, jump to L1 (since NOT A is false)
L2:
// Code for NOT A (i.e., A is false)
L1:
• If A is true, the condition NOT A is false, and we jump to L1.
• If A is false, control continues to L2.

2. Boolean Expressions in if-else Statements


For more complex conditional statements like if-else, the Boolean expression is evaluated to decide
whether to execute the "if" block or the "else" block.
Example: if (A AND B) { x = 1; } else { x = 0; }
In Three-Address Code:
if A == 0 goto L1 // If A is false, jump to L1
if B == 0 goto L1 // If B is false, jump to L1
x=1 // A and B are both true, assign 1 to x
goto L2 // Skip else part
L1:
x=0 // A or B is false, assign 0 to x
L2:
• The expression A AND B is evaluated first.
• If A or B is false, control jumps to L1 and assigns 0 to x.
• If both A and B are true, x = 1 is executed.
• The goto L2 ensures that the execution jumps to the end of the conditional block, skipping the else
part.
3. Boolean Expressions in while Loops
In a loop like while (A AND B), the condition is checked before each iteration. If the condition evaluates
to false, the loop terminates.
Example: while (A AND B) { a = a + 1; }
In Three-Address Code:
L1:
if A == 0 goto L2 // If A is false, exit loop
if B == 0 goto L2 // If B is false, exit loop
a=a+1 // Execute loop body
goto L1 // Repeat loop
L2:
• The loop begins by checking the condition A AND B. If either A or B is false, control jumps to L2,
exiting the loop.
• If both A and B are true, the loop body (a = a + 1) is executed.
• The loop continues by jumping back to L1 to check the condition again.
4. Short-circuiting in Boolean Expressions
In short-circuiting, the evaluation stops as soon as the result is determined. This is particularly useful
with the AND and OR operators, as further checks can be skipped if the result is already known.
Example: if (A AND B)
• In a short-circuit evaluation, if A is false, there is no need to check B, because the result will
always be false.
TAC Translation with Short-Circuit:
if A == 0 goto L1 // If A is false, skip the next check
if B == 0 goto L1 // If B is false, skip the true path
L2:
// Code for A AND B is true
L1:
• If A is false, we don’t evaluate B, as the AND condition is guaranteed to be false.
• This results in fewer instructions and more efficient execution.
4.10 RUN TIME ENVIRONMENTS
In compiler design and program execution, a runtime environment refers to the collection of resources
that are used during the execution of a program. This includes memory management, organization, and
the structures used to hold and access variables, data, and control information while the program is
running.
4.10.1 STORAGE ORGANIZATION
The storage organization of a runtime environment deals with how various types of data (variables,
arrays, etc.) are allocated and organized in memory. The organization determines how memory is used
during program execution, including stack memory, heap memory, static memory, and dynamic memory.
In simpler terms, storage organization defines how and where various program data are stored during
execution to ensure efficient access and correct program behavior.
Key Concepts of Storage Organization in Runtime Environments
1. Memory Layout During Execution
2. Static vs. Dynamic Storage Allocation
3. The Stack and the Heap
4. Global, Local, and Temporary Variables
5. Activation Records
6. Garbage Collection
1. Memory Layout During Execution
When a program is executed, its memory is typically divided into several regions. These regions are
designed to handle different kinds of data:
1. Code Segment (Text Segment): This section contains the machine code of the program (the
instructions).
2. Data Segment: This is used for static variables, constants, and global variables.
3. Stack Segment: This segment handles the function calls, local variables, and control
information.
4. Heap Segment: This is used for dynamic memory allocation (objects, arrays, etc.) that can
grow or shrink during the program's execution.
2. Static vs. Dynamic Storage Allocation
• Static Allocation: This refers to memory that is allocated at compile time and remains fixed
throughout the program's execution. Examples include global variables, constants, and static
variables. Their memory addresses are determined before execution begins.
• Dynamic Allocation: This refers to memory that is allocated during the execution of the
program, often using functions like malloc in C or new in C++. Dynamic memory is stored
in the heap and can grow or shrink at runtime depending on the program’s needs.
3. The Stack and the Heap
Memory in most modern systems is divided into two main regions that handle most of the runtime memory
allocation:
Stack Memory:
• The stack is used for local variables, function parameters, and return addresses.
• Every time a function is called, an activation record (also called a stack frame) is pushed
onto the stack.
• The stack grows and shrinks as functions are called and return.
• It follows a Last In, First Out (LIFO) order for memory allocation and deallocation.
• The stack is faster than the heap due to its simple, structured nature.
Example of stack usage:
int foo() {
int a = 10; // 'a' is allocated on the stack
return a;
}
• When foo() is called, the memory for a is pushed onto the stack.
• When foo() returns, the memory for a is popped off the stack.
Heap Memory:
• The heap is used for dynamic memory allocation. Memory can be allocated and deallocated
at any time during the program’s execution.
• It allows for variable-sized data structures like arrays, linked lists, trees, etc.
• The heap is slower than the stack due to its need to handle dynamic allocation and
deallocation.
• Memory from the heap must be explicitly managed by the programmer (e.g., with free in C
or delete in C++).
Example of heap usage:
int *ptr = malloc(sizeof(int)); // Allocate memory on the heap
*ptr = 10;
free(ptr); // Deallocate memory
4. Global, Local, and Temporary Variables
Global Variables:
• These variables are declared outside any function and can be accessed by all functions in
the program.
• They are stored in the data segment of memory.
Local Variables:
• Local variables are declared within a function and can only be accessed within that function.
• They are stored on the stack during function execution.
Temporary Variables:
• Temporary variables are often used by the compiler during intermediate computations or
expression evaluations.
• These are typically stored in registers or stack frames.
5. Activation Records (Stack Frames)
Each time a function is called, an activation record (AR), also called a stack frame, is created to manage
the function's local data, such as:
• Return Address: The memory location to which the program will return after the function
call.
• Function Parameters: Variables passed to the function.
• Local Variables: Variables defined within the function.
• Saved Registers: Registers that need to be saved and restored during the function call.
An activation record typically looks like this:
+--------------------+
| Return Address | <-- Top of the stack
+--------------------+
| Saved Registers |
+--------------------+
| Parameters |
+--------------------+
| Local Variables |
+--------------------+
When the function finishes execution, its activation record is popped off the stack, and the program
continues execution from the return address.
6. Garbage Collection
In languages with dynamic memory allocation, such as Java, C#, and Python, garbage collection is used
to automatically manage memory. This system automatically frees memory that is no longer in use,
preventing memory leaks and other issues caused by manual memory management.
• Garbage collectors periodically check for objects that are no longer referenced and free the
associated memory.
• Mark-and-sweep, reference counting, and generational garbage collection are common
algorithms for this task.
Example of Memory Layout:
Let’s take a look at an example of how memory is organized during the execution of a simple C program:
int global_var = 100; // Global variable

void function(int param) {


int local_var = 10; // Local variable
int *dynamic_var = malloc(sizeof(int)); // Dynamically allocated memory
*dynamic_var = 20;

printf("%d %d\n", local_var, *dynamic_var); // Output: 10 20


free(dynamic_var); // Free dynamically allocated memory
}

int main() {
function(5);
return 0;
}
• Global variable global_var is stored in the data segment.
• Function param is a local variable and is stored on the stack.
• Dynamic memory (dynamic_var) is allocated on the heap.
4.10.2 STACK ALLOCATION OF SPACE
Stack allocation of space refers to how memory is allocated for function calls, local variables, and control
information during program execution. The stack is a special region of memory that operates on a Last
In, First Out (LIFO) principle, where data is pushed onto the stack when a function is called and popped
off when the function exits.
In stack allocation, memory for local variables, function parameters, return addresses, and other
temporary data is allocated dynamically as functions are called, and the memory is deallocated when the
function finishes executing.
Key Features of Stack Allocation
1. Automatic Memory Management:
o Memory for local variables and function calls is automatically managed by the stack.
When a function is called, memory is allocated on the stack, and when the function
exits, the memory is freed.
2. Last In, First Out (LIFO):
o The stack operates on a LIFO principle, meaning that the most recent function call
is the first to return, and the most recently allocated memory is the first to be
deallocated.
3. Function Call and Return:
o When a function is called, an activation record (or stack frame) is pushed onto the
stack. This record stores the function's local variables, parameters, return address,
and saved registers.
o When the function exits, its activation record is popped from the stack, and the
program continues from the return address.
How Stack Allocation Works
1. Function Calls
When a function is called, an activation record (also called a stack frame) is created on the stack. This
record contains the following:
• Return Address: The location to which the program will return after the function execution
is completed.
• Function Parameters: The values passed to the function.
• Local Variables: Variables that are defined inside the function.
• Saved Registers: Registers that are preserved across the function call.
• Control Information: Additional information like the previous stack pointer and frame
pointer.
The activation record is pushed onto the stack when the function is called, and it is popped off when the
function returns.
2. Local Variables
Local variables are declared within a function and are stored in the activation record. Their memory is
allocated when the function is called and deallocated when the function exits.
Each time a function is called, a new activation record is created on the stack, and the local variables are
placed in this record. For example:
int foo(int x) {
int y = 10; // y is a local variable
return x + y;
}
In this example, when foo() is called, a new activation record is created, and memory for x and y is
allocated on the stack. Once the function returns, this memory is freed when the activation record is
popped off the stack.
3. Parameters
Function parameters are also passed using the stack. When a function is called, the arguments are pushed
onto the stack, either from left to right or right to left (depending on the calling convention used). These
parameters are placed in the activation record.
For example:
void bar(int a, int b) {
int c = a + b;
}
When bar(5, 3) is called, the parameters a and b are pushed onto the stack, and the corresponding local
variable c is allocated in the activation record.
Example of Stack Allocation
Consider the following example in C:
void funcA() {
int x = 5; // local variable
funcB(x); // call funcB with parameter x
}

void funcB(int y) {
int z = y + 1; // local variable
}
Execution Flow:
1. funcA() is called:
o An activation record is pushed onto the stack for funcA.
o The local variable x is stored in the activation record.
o funcB(x) is called.
2. funcB() is called:
o A new activation record is pushed onto the stack for funcB.
o The parameter y is pushed onto the stack (passed from x).
o The local variable z is created in funcB's activation record.
3. funcB() completes execution:
o The activation record for funcB is popped from the stack, and memory is deallocated.
4. funcA() completes execution:
o The activation record for funcA is popped from the stack, and memory is deallocated.
Activation Record Structure
An activation record contains the following sections:
1. Return Address:
o The address to return to after the function call finishes.
2. Saved Registers:
o Registers that need to be saved and restored across function calls (e.g., the frame
pointer, the stack pointer).
3. Parameters:
o The parameters passed to the function.
4. Local Variables:
o The local variables declared inside the function.
5. Control Information:
o Miscellaneous information like the previous stack pointer or frame pointer, which
helps to restore the stack to its correct state when the function returns.
The layout might look like this in memory:
+----------------------+
| Return Address | <-- Top of the stack
+----------------------+
| Saved Registers |
+----------------------+
| Parameters |
+----------------------+
| Local Variables |
+----------------------+
| Control Information |
+----------------------+
Stack Allocation vs. Heap Allocation
• Stack Allocation:
o Faster and simpler.
o Memory is allocated and deallocated automatically when functions are called and
return.
o Memory is automatically freed when the function exits.
o Suitable for small, short-lived variables (local variables, function parameters).
• Heap Allocation:
o Memory is dynamically allocated using functions like malloc in C or new in C++.
o Requires explicit deallocation (using free or delete).
o Suitable for variables that need to persist across function calls or whose size is not
known in advance.
Advantages of Stack Allocation
1. Efficiency:
o Stack memory allocation is very efficient because it uses a simple pointer arithmetic
technique (incrementing and decrementing the stack pointer) to allocate and
deallocate memory.
2. Automatic Memory Management:
o The stack is managed automatically, so the programmer does not need to manually
allocate or free memory (unlike heap memory).
3. No Fragmentation:
o Stack memory allocation does not suffer from memory fragmentation, unlike heap
allocation, because memory is allocated and deallocated in a predictable order.
Disadvantages of Stack Allocation
1. Limited Size:
o The size of the stack is limited. If too much memory is allocated on the stack (e.g.,
by making too many function calls with large local variables), a stack overflow may
occur.
2. Short-lived Variables:
o The variables allocated on the stack exist only during the execution of the function.
Once the function exits, the memory is automatically freed, which can be a limitation
for some use cases.
4.10.3 ACCESS TO NONLOCAL DATA ON THE STACK
In the context of stack-based memory management and compiler design, nonlocal data refers to data
that is not local to the currently executing function, but instead belongs to functions higher up in the call
stack or is part of the global scope.
When discussing access to nonlocal data on the stack, we are concerned with how a function can access
variables or data that are in other scopes, such as its caller’s local variables, global variables, or static
variables.
Types of Nonlocal Data
1. Nonlocal Variables in a Calling Function: These are the variables declared in the calling
function, which are not local to the currently executing function, but can be accessed by that
function.
2. Global Variables: These are variables declared outside any function, typically at the global
level of a program, and they are accessible by all functions.
3. Static Variables: These are variables whose lifetime extends beyond a single function call
but are limited to the scope of a single function. They retain their value across function calls.
Mechanisms for Accessing Nonlocal Data
When accessing nonlocal data on the stack, the access usually involves navigating through the stack
frames (or activation records) and using pointers or references to specific data in these frames.
1. Accessing Caller’s Local Variables
2. Accessing Global Variables
3. Accessing Static Variables
4. Using the Frame Pointer and Stack Pointer
1. Accessing Caller’s Local Variables
In stack-based memory, each function call creates a new stack frame (also known as an activation
record) that contains the function’s local variables and parameters. These stack frames are linked
together in a stack chain, with each frame pointing to its caller’s frame.
To access nonlocal variables from the calling function (or from a function that is not the immediate caller),
a function can navigate through the stack frames. This is usually done using the frame pointer (FP),
which points to the base of the current stack frame.
To access a caller’s local variable, the function needs to know the location of the calling function’s frame.
This can be achieved by looking at the frame pointer of the current function’s stack frame and then
offsetting it to find the caller's frame.
Example:
Consider the following C code:
void funcA() {
int x = 10; // Local variable in funcA
funcB(x); // Passing 'x' to funcB
}

void funcB(int y) {
printf("Value of x: %d\n", y); // Accessing 'x' from funcA via parameter 'y'
}
• funcA calls funcB, passing x as the argument.
• Inside funcB, the parameter y receives the value of x (which was passed from funcA).
• Here, funcB is accessing nonlocal data from funcA via the stack.
In this case, funcB accesses x indirectly through y, which was passed via the stack, allowing access to
nonlocal data.
2. Accessing Global Variables
Global variables are declared outside any function and are stored in a separate region of memory (often
in the data segment). These variables are accessible by any function within the program. Since global
variables are not placed on the stack, they can be accessed directly using their names in any function.
Example:
int global_var = 42; // Global variable

void func() {
printf("Global variable: %d\n", global_var); // Directly accessing the global variable
}
• The global variable global_var can be accessed from any function (e.g., func) because it
resides in global memory, not in the stack.
3. Accessing Static Variables
Static variables are similar to local variables in that they are declared inside a function, but they persist
across function calls. These variables are allocated in a special region of memory (often in the data
segment, not the stack). Static variables retain their values between function calls and are accessible only
within the function in which they are declared.
To access a static variable in a function, you don’t need to worry about navigating the stack. The static
variable is managed by the program’s runtime system and is retained across invocations of the function.
Example:
void func() {
static int count = 0; // Static variable
count++;
printf("Static count: %d\n", count);
}
• The variable count retains its value between calls to func because it is static. Each time the
function is called, it increments count and prints its value.
How Stack Frames Manage Nonlocal Data
In most programming languages, the stack contains an ordered sequence of activation records, each of
which represents a function call. These activation records usually consist of:
• Return address: The location where control should return after the function completes.
• Function parameters: The values passed to the function from the calling function.
• Local variables: Variables declared within the function.
• Saved registers: Registers that need to be preserved between function calls.
• Control information: Additional information needed to restore the stack frame, such as the
previous frame pointer.
For nonlocal data, there are typically two ways to access data outside the current function’s frame:
1. Using the Caller’s Frame: To access the caller's local variables, a function can use the frame
pointer (FP) to navigate to the previous activation record in the stack. This allows the
function to access the caller's local variables, parameters, or saved registers.
2. Using Global or Static Memory: Global and static variables are stored in separate memory
regions, so they don’t depend on stack frames. Global variables are accessed using their
symbolic names, and static variables are maintained in a special region of memory that can
be accessed directly.
Example: Navigating the Stack for Nonlocal Data
Here’s an example in C, demonstrating access to nonlocal data on the stack by using the caller’s local
variables:
#include <stdio.h>

void funcA() {
int a = 5; // Local variable in funcA
funcB(); // Call funcB, which accesses 'a' in funcA
}

void funcB() {
int *p = (int*)0x7fffffffe5c0; // Assume this points to 'a' in funcA's stack frame
printf("Value of 'a' in funcA: %d\n", *p); // Access nonlocal data (local variable in funcA)
}

int main() {
funcA();
return 0;
}
In this example, funcB tries to access the local variable a from funcA by navigating the stack, which is
typically done by knowing the stack frame's structure and using the correct memory addresses.
MODULE-V
CODE OPTIMIZATION AND CODE GENERATION
5.1 BASIC BLOCKS AND FLOW GRAPHS
In compiler design, Basic Blocks and Flow Graphs are essential concepts for analyzing and optimizing
control flow within a program. These concepts help in various optimization techniques like loop
optimization, dead code elimination, and instruction scheduling. Let's break down what basic blocks and
flow graphs are, and how they are used.
1. Basic Blocks
A Basic Block is a sequence of consecutive statements or instructions in a program with the following
characteristics:
• Single Entry Point: There is exactly one entry point into the basic block (no jumps or
branches except at the beginning).
• Single Exit Point: There is exactly one exit point (no jumps or branches except at the end).
• No Branching Inside: Once control enters a basic block, it runs through all the instructions
sequentially without any internal control flow (i.e., no branches or returns in the middle).
In essence, a basic block is a straight-line piece of code where the flow of control is linear.
Example of Basic Block
Consider the following code:
int x = 10;
if (x > 5) {
x = x + 2;
} else {
x = x - 2;
}
x = x * 2;
This code can be divided into several basic blocks:
Block 1:
int x = 10;
Block 2 (if condition):
if (x > 5) {
x = x + 2;
}
This is a conditional statement, so it leads to two different paths:
• If x > 5, it will enter the "then" part.
• If x <= 5, it will enter the "else" part.
Block 3:
x = x * 2;
Each block is independent in terms of execution flow, and control can only transfer between blocks using
jumps (branch instructions like if, goto, return, etc.).
2. Flow Graphs
A Flow Graph (also called Control Flow Graph or CFG) is a directed graph that represents the flow of
control through a program. It consists of nodes and edges:
• Nodes: Each node represents a basic block in the program.
• Edges: A directed edge from one node to another indicates the possible control flow from
one basic block to the next. These edges represent possible jumps in the program’s control
flow.
The primary purpose of a flow graph is to visualize how the control flows between different basic blocks
of a program. It is particularly useful for optimizing compilers, as it helps identify dead code, loops, and
unreachable code.
Example of a Flow Graph
For the program above, the flow graph would look like this:
1. Block 1 (introduction of x).
2. Block 2 (conditional check and assignment).
3. Block 3 (final multiplication).
The flow graph can be represented as:
[Block 1] → [Block 2] → [Block 3]

(else path)
In this flow graph:
• The control flow starts at Block 1, where x is initialized.
• Then, it moves to Block 2, where the condition if (x > 5) is evaluated.
• Depending on the condition, the flow either follows the "then" branch or the "else" branch
(both leading to Block 3).
• Finally, the program ends at Block 3, where x is multiplied by 2.
This flow graph captures all the potential control flow paths that can occur during the program's execution.
Key Points of Basic Blocks and Flow Graphs
1. Basic Block:
o A straight-line code segment with no internal branching.
o It has one entry point and one exit point.
o Essential for understanding control flow and for optimization techniques.
2. Control Flow Graph (CFG):
o A directed graph where:
▪ Nodes represent basic blocks.
▪ Edges represent control flow between blocks.
o Helps visualize program execution and optimize compilers.
o Used in various analyses, such as:
▪ Dominance analysis (which basic blocks must be executed before others).
▪ Reaching definitions (which variables are available at a point in the
program).
▪ Loop detection (which blocks form loops).
Example of Flow Graph Construction
Let’s consider the following C code:
int foo(int x) {
if (x > 0) {
x = x + 1;
} else {
x = x - 1;
}
return x;
}
We can break this program into basic blocks:
Block 1:
int x = 0;
if (x > 0) { ... }
Block 2 (If condition branch):
x = x + 1;
Block 3 (Else condition branch):
x = x - 1;
Block 4 (Return):
return x;
The Control Flow Graph (CFG) can be represented as follows:
[Block 1] → [Block 2] → [Block 4]

[Block 3]
Flow Graph Use in Optimization
Flow graphs are used in compilers for various optimization tasks:
1. Dead Code Elimination:
o Using a flow graph, a compiler can analyze whether any basic block or statement in
a block is never reached (i.e., no edge leads to it). Such code can be eliminated.
2. Loop Optimization:
o By identifying cycles in a flow graph (i.e., loops), a compiler can optimize loops by
performing tasks such as loop unrolling or strength reduction.
3. Control Flow Analysis:
o Flow graphs help detect unreachable code or code that will never be executed (due
to conditions or prior statements).
4. Reachability Analysis:
o Understanding which basic blocks can be reached from others is useful for many
compiler optimizations.
5.2 OPTIMIZATION OF BASIC BLOCKS
Optimization of basic blocks is a key phase in the compiler optimization process. Since basic blocks are
sequences of instructions that have a single entry and a single exit, they provide a simplified view of
control flow. The goal of optimizing basic blocks is to improve the performance of the generated machine
code, reduce execution time, and minimize resource usage while maintaining the correctness of the
program.
There are several optimization techniques that can be applied within basic blocks. These optimizations
generally focus on simplifying the instructions, removing redundant operations, and improving the usage
of registers. Below are some common types of basic block optimizations.
Types of Basic Block Optimizations
1. Constant Folding
2. Constant Propagation
3. Common Subexpression Elimination (CSE)
4. Dead Code Elimination
5. Strength Reduction
6. Loop-Invariant Code Motion
7. Register Allocation Optimizations
Let’s explore each of these optimizations.
1. Constant Folding
Constant folding involves simplifying expressions involving only constants at compile time. This reduces
the number of operations to be performed at runtime.
• Example:
int x = 5 + 3; // x is assigned the result of a constant expression
After constant folding, this can be simplified to:
int x = 8; // x is directly assigned the constant value
Why it’s important:
• Constant folding reduces the number of runtime calculations by evaluating constant
expressions at compile time.
2. Constant Propagation
Constant propagation is the process of replacing variables with constant values where possible. It
propagates known constant values through expressions and assignments, thereby simplifying the program.
• Example:
int a = 5;
int b = a + 2;
Since a is known to be 5, the expression a + 2 can be replaced with 5 + 2, simplifying the code to:
int b = 7;
Why it’s important:
• This optimization helps reduce the number of computations by making use of known
constants, improving execution speed and reducing resource usage.
3. Common Subexpression Elimination (CSE)
Common Subexpression Elimination (CSE) identifies expressions that are computed multiple times
within a basic block and eliminates redundant calculations by reusing the result of the previous
computation.
• Example:
int a = x * y;
int b = x * y + z;
The expression x * y is computed twice. CSE eliminates the redundancy:
int temp = x * y;
int a = temp;
int b = temp + z;
Why it’s important:
• CSE reduces redundant calculations and saves computational resources, especially in loops
where expressions may be recalculated repeatedly.
4. Dead Code Elimination
Dead code elimination removes instructions that do not affect the program’s outcome, either because
their results are never used or they are unreachable.
• Example:
int x = 10;
int y = 5;
x = x + 1;
y = y + 1; // This variable 'y' is never used afterward
• The statement y = y + 1 is dead code because y is never used after its assignment. This can
be eliminated.
• Why it’s important:
o Removing dead code reduces the size of the executable, optimizes memory usage,
and improves execution speed.
5. Strength Reduction
Strength reduction is a technique used to replace expensive operations (like multiplication or division)
with cheaper operations (like addition or bit shifts), particularly when the operation is done repeatedly in
loops.
• Example:
for (int i = 0; i < n; i++) {
x = 2 * i; // Multiplication is expensive
}
Instead of using multiplication (2 * i), this can be converted to a cheaper addition (i + i):
for (int i = 0; i < n; i++) {
x = i + i; // Adding instead of multiplying
}
for (int i = 0; i < n; i++) {
x = i + i; // Adding instead of multiplying
}
Why it’s important:
• Strength reduction reduces the time complexity of operations, making them more efficient,
particularly in loops.
6. Loop-Invariant Code Motion (LICM)
Loop-Invariant Code Motion (LICM) is an optimization that moves computations that do not change
across iterations of a loop out of the loop. This ensures that expensive operations are not repeated
unnecessarily in each iteration of the loop.
• Example:
for (int i = 0; i < n; i++) {
x = a * b; // a * b is constant inside the loop
y = x + i;
}
The expression a * b is invariant inside the loop because a and b do not change during the loop. It can be
moved outside the loop:
x = a * b;
for (int i = 0; i < n; i++) {
y = x + i;
}
Why it’s important:
• By moving invariant code outside the loop, the number of operations performed inside the
loop is reduced, improving performance, especially in loops with a large number of
iterations.
7. Register Allocation Optimizations
Efficient register allocation ensures that frequently used variables are stored in CPU registers rather than
memory, reducing the time required for accessing them. A register is much faster to access than memory.
Optimizations like register renaming and allocation aim to reduce the number of memory accesses by
minimizing the number of variables stored in memory.
• Example:
If there are variables a and b, and the program uses them repeatedly, the compiler may assign them to
CPU registers rather than pushing them to the stack each time.
• Why it’s important:
o Register allocation optimization improves performance by reducing memory access
overhead, which can be a significant bottleneck in modern CPUs.
General Flow of Basic Block Optimization
1. Identifying Basic Blocks:
o The compiler first splits the program into basic blocks. This is done by identifying
sequences of instructions with a single entry and exit point.
2. Applying Optimizations:
o The compiler then applies various optimizations to each basic block. These
optimizations are done in a sequence or in a combination to make the generated code
more efficient.
3. Rebuilding Flow Graphs:
o After optimizing the basic blocks, the compiler may regenerate the control flow
graph to reflect any changes, such as the removal of dead code or changes in the
order of instructions.
4. Final Code Generation:
o The final step involves using the optimized basic blocks to generate machine or
intermediate code.
Example: Optimizing a Basic Block
Consider the following code segment:
int a = 10;
int b = a * 2;
int c = b + a;
int d = c - 1;
• Step 1: Constant Folding:
o b = a * 2 can be simplified to b = 10 * 2 → b = 20.
• Step 2: Common Subexpression Elimination:
o The expression b + a can be rewritten as 20 + 10 = 30.
• Step 3: Dead Code Elimination:
o After simplifying the code, we notice that b and c are only used to compute d, so we
could directly compute d without storing intermediate values.
Optimized code:
int a = 10;
int d = (a * 2) + a - 1; // Simplified computation
This optimization reduces the number of operations and memory accesses by eliminating unnecessary
variables and simplifying expressions.
5.3 THE PRINCIPAL SOURCES OF OPTIMIZATION
In compiler design, optimization refers to the process of improving the performance of a program by
transforming the intermediate or machine code into a more efficient version. The aim is to reduce the
resource consumption (like time and space) while maintaining the correctness of the program. There are
several sources of optimization that can be targeted at different levels of the program: machine-level
optimizations, high-level optimizations, intermediate code optimizations, and data flow
optimizations. These sources are generally applied in various stages of the compilation process.
Here are the principal sources of optimization in compiler design:
1. Instruction-Level Optimization
At the instruction level, the focus is on optimizing individual machine instructions or intermediate code.
This involves transforming the code to improve performance and reduce resource usage.
Key Techniques:
• Common Subexpression Elimination (CSE):
o Identifying and eliminating redundant expressions that are calculated multiple times.
o Example: If an expression x + y is computed twice, it can be stored in a temporary
variable.
• Dead Code Elimination:
o Removing instructions or variables that do not affect the final output of the program
(i.e., their results are not used).
o Example: If a variable is assigned a value but never used, that assignment can be
removed.
• Constant Folding and Propagation:
o Constant folding evaluates expressions with constants at compile time.
o Constant propagation substitutes variables with known constant values throughout
the program.
o Example: int x = 5 + 3; can be replaced by int x = 8;.
• Strength Reduction:
o Replacing expensive operations like multiplication and division with cheaper
alternatives like addition or shifts.
o Example: Replacing i * 8 with i << 3 (bit-shift).
• Peephole Optimization:
o It involves examining a small window of code (a "peephole") to find and replace
inefficient instruction patterns with more efficient ones.
o Example: Replacing a sequence of instructions like MOV A, B; ADD A, C; with
ADD A, B, C if possible.
2. Control Flow Optimization
Control flow optimization involves improving the flow of control between various sections of the
program. This can reduce the number of branches and condition checks, improving runtime efficiency.
Key Techniques:
• Loop Optimizations:
o Loop unrolling: Reduces the overhead of loop control by increasing the number of
operations per loop iteration.
o Loop-invariant code motion (LICM): Moves calculations that don’t change across
loop iterations outside the loop to reduce redundant work.
o Loop fusion: Combines two loops into one when they iterate over the same range,
which can reduce overhead and improve cache performance.
• Branch Prediction Optimization:
o Rearranging code to improve branch prediction accuracy, thus reducing the penalty
from mispredicted branches.
o Example: Reorganizing if-else blocks so that the most likely branch is checked first.
• Inlining Functions:
o Replacing function calls with the body of the function itself to eliminate the overhead
of calling the function. This is particularly useful for small, frequently-called
functions.
o Example: Instead of calling a simple function int add(int x, int y) { return x + y; },
the code can be replaced by x + y wherever the function is used.
3. Data Flow Optimization
Data flow optimization focuses on improving the flow of data through a program. It aims to reduce
unnecessary computations, minimize memory accesses, and improve data locality.
Key Techniques:
• Register Allocation:
o Efficient allocation of variables to registers can significantly improve performance
since registers are much faster than memory.
o Techniques like graph coloring are used to allocate registers in a way that minimizes
conflicts and maximizes the usage of available registers.
• Memory Access Optimization:
o Locality optimization: Reorganizing data structures to improve cache usage (spatial
locality and temporal locality).
o Array access optimization: Changing the order of array accesses to ensure better
cache utilization.
• Data Dependency Analysis:
o Detecting and optimizing data dependencies between instructions. For example,
loop-carried dependencies can be minimized or removed to enable parallel
execution.
• Loop-Carried Dependence Elimination:
o Identifying loop-carried dependencies and eliminating them to allow parallelism.
This might include techniques like loop interchange or loop splitting.
4. High-Level Optimization (Intermediate Code Optimization)
At the intermediate code level, optimizations focus on making the program more efficient before it is
translated into machine code. These optimizations are independent of the target architecture and are more
general.
Key Techniques:
• Dead Code Elimination (DCE):
o Removing instructions that have no effect on the program’s final output. This can
apply to variables that are never used or calculations whose results are never
referenced.
• Function Inlining:
o At the intermediate code level, inlining functions can reduce function call overhead
and allow further optimizations like constant folding and propagation.
• Loop Optimization:
o Transforming loops to reduce the number of iterations, the complexity of
computations, and redundant memory accesses.
• Redundant Load Elimination:
o If a value is loaded from memory multiple times but doesn’t change between loads,
those repeated memory accesses can be eliminated.
5. Interprocedural Optimization
Interprocedural optimization is performed across function boundaries, considering the entire program’s
structure rather than individual functions. This type of optimization can significantly reduce overhead and
improve overall performance.
Key Techniques:
• Function Cloning:
o Creating multiple versions of a function with different argument types to optimize
for different use cases, especially when functions are called in different contexts.
• Global Value Numbering (GVN):
o This optimization tracks the values of variables globally and removes redundant
computations across different functions.
• Inlining across Functions:
o Inlining functions not only within a single function but also across function
boundaries to eliminate call overhead and improve optimizations across the entire
program.
6. Parallelism and Vectorization
In modern compilers, parallelism and vectorization are essential for optimizing programs on multi-core
processors and SIMD (Single Instruction, Multiple Data) architectures.
Key Techniques:
• Loop Parallelization:
o Identifying independent iterations in loops and parallelizing them so that they can be
executed simultaneously across multiple CPU cores.
• SIMD Vectorization:
o Converting scalar operations into vector operations that can be performed on
multiple data elements in parallel.
• Task Parallelism:
o Identifying independent tasks or functions that can be executed concurrently and
parallelizing them.
7. Platform-Specific Optimization
These optimizations are specific to the target platform (e.g., processor architecture, cache structure, and
instruction set). The goal is to generate code that makes the best use of the available hardware.
Key Techniques:
• Instruction Scheduling:
o Rearranging instructions to avoid pipeline stalls or to exploit processor features like
pipelining and out-of-order execution.
• Cache Optimization:
o Optimizing memory accesses to improve cache hit rates, such as through blocking
or tiling techniques to enhance spatial locality.
• Branch Prediction Optimizations:
o Arranging branches in a way that maximizes the effectiveness of the CPU’s branch
prediction mechanisms.
8. Code Size Reduction
In certain applications (like embedded systems or mobile devices), optimizing for code size becomes
crucial. These techniques focus on reducing the size of the final executable without sacrificing too much
performance.
Key Techniques:
• Code Compression:
o Reducing the size of the generated code using various compression techniques
without sacrificing performance.
• Inlining and Dead Function Elimination:
o Removing unused functions and replacing function calls with the actual code to
reduce function call overhead.
5.4 INTRODUCTION TO DATA FLOW ANALYSIS
Data flow analysis is a crucial technique used in compiler optimization and program analysis. It
involves tracking how data moves through a program to detect potential optimizations, identify errors,
and improve overall efficiency. The goal is to analyze the flow of data between variables or program
elements (like expressions) in a program during its execution. By understanding how data is propagated
and modified, the compiler can make decisions about how to optimize the program, such as eliminating
redundant computations or improving memory usage.
In a compiler, data flow analysis typically operates on intermediate code representations (like Control
Flow Graphs, CFGs) of the program, which makes it easier to reason about the flow of data, regardless of
the original high-level language.
Why is Data Flow Analysis Important?
1. Optimization: It helps identify opportunities for optimization, such as constant
propagation, dead code elimination, and common subexpression elimination.
2. Error Detection: It can detect possible errors such as uninitialized variables or unused
variables.
3. Parallelism: Data flow analysis can be used to identify independent computations that can
be parallelized.
4. Control Flow Analysis: It aids in understanding how data changes across different paths in
control flow, which is essential for accurate program transformation and optimization.
Key Concepts in Data Flow Analysis
Data flow analysis revolves around tracking values of variables and understanding how those values
propagate through the program during its execution. Below are some essential concepts related to data
flow analysis:
1. Control Flow Graph (CFG)
• A Control Flow Graph (CFG) is a directed graph where each node represents a basic block
of the program (a sequence of instructions with no control flow interruptions), and each edge
represents a control flow from one block to another.
• The CFG helps in visualizing the flow of control during program execution, which is
essential for data flow analysis.
2. Definitions and Uses
• Definition: A statement that assigns a value to a variable.
• Use: A statement that reads or uses the value of a variable.
• Data flow analysis identifies where variables are defined and where they are used, helping
to track the flow of data through the program.
3. Transfer Functions
• A transfer function describes how a particular statement in the program modifies the state
of data. For example, in the case of constant propagation, the transfer function would
update the state of a variable to reflect the constant value it holds.
• These functions help in understanding how different program elements (like assignments or
conditional statements) affect the flow of data.
4. Meet Operator
• The meet operator is used to combine the results of data flow analysis from different paths
in the control flow graph. This is particularly important when multiple paths converge at a
point (e.g., at a branch in a program).
• The meet operator is typically an intersection (for problems like live variable analysis) or a
union (for problems like reaching definitions).
Types of Data Flow Analysis Problems
Different types of data flow analysis focus on different aspects of the program's behavior. Below are some
of the most common data flow problems:
1. Reaching Definitions
• A reaching definition is a definition of a variable that can potentially reach a particular
point in the program's control flow.
• Goal: To find out which definitions of a variable can reach a certain point in the code.
• Example: If a variable x is assigned a value in multiple places, we need to know which
assignments to x can affect the program state at a given point.
2. Live Variable Analysis
• A variable is live at a particular point if its value is used later in the program.
• Goal: To find out which variables are live (i.e., used) at each point in the program and which
ones are not.
• Example: In the code:
int x = 10;
int y = x + 1;
• After the statement y = x + 1, the variable x is still live since its value is used to compute y.
3. Constant Propagation
• Constant propagation tracks the values of variables that are constants and propagates them
through the program.
• Goal: To replace variables that hold constant values with the actual constant wherever
possible.
• Example:
int x = 5;
int y = x + 3;
• The value of x is known to be 5, so y can be simplified to y = 8.
4. Available Expressions
• An available expression is an expression that has already been computed earlier in the
program and can be reused.
• Goal: To identify expressions that are available for reuse, thereby eliminating redundant
computations.
• Example: If the expression a + b appears multiple times in the program, it can be computed
once and reused.
5. Uninitialized Variable Analysis
• This analysis checks for variables that are used before they are initialized with a value.
• Goal: To detect errors where a variable might be used without being assigned a value.
• Example:
int x;
printf("%d", x); // Error: x is used without initialization.
Data Flow Analysis Framework
The process of performing data flow analysis typically follows these steps:
1. Program Representation:
o Represent the program using a Control Flow Graph (CFG). The nodes in the graph
represent basic blocks, and the edges represent control flow between blocks.
2. Data Flow Equations:
o For each analysis problem (e.g., reaching definitions, live variable analysis), define
the data flow equations that describe how data moves through the program.
o The equations describe how data is propagated from one block to another, using
transfer functions and the meet operator to combine the data.
3. Initialization:
o Initialize the data flow values for the entry and exit points of the program.
o For example, in live variable analysis, initialize the entry point to assume that no
variable is live and work backward to find which variables are live.
4. Propagation:
o Propagate the data flow through the program’s control flow graph, updating the data
flow values iteratively until the process reaches a stable state (i.e., no further changes
are made).
5. Convergence:
o The algorithm iterates through the graph until it converges to a fixed point, meaning
no further updates are necessary.
Example: Reaching Definitions

Let’s consider an example of reaching definitions analysis:

1. int a = 5;
2. int b = a + 3;
3. a = b * 2;
4. int c = a + 1;
In the above example, we want to track which definitions of a (and other variables) can reach different
parts of the code.
1. Initialization:
o Initially, no definition reaches any instruction.
2. Propagation:
o At line 2, the definition a = 5 reaches the statement b = a + 3.
o At line 3, the definition a = 5 no longer reaches since a is redefined as b * 2. So, the
definition a = 5 does not reach line 4, but a = b * 2 does.
3. Fixed Point:
o After iterating through all the instructions, we reach a fixed point where no further
updates are necessary.
5.5 CODE GENERATION
Code generation is one of the most crucial phases in the compilation process, where the intermediate
representation (IR) of the program is translated into the target machine code (or assembly code). This
phase is responsible for generating the final executable code that will run on a specific machine or
platform. Designing a code generator involves addressing several key issues to ensure that the generated
code is efficient, correct, and suitable for the target hardware.
5.5.1 ISSUES IN THE DESIGN OF A CODE GENERATOR
Here, we discuss the issues in the design of a code generator and the challenges that arise in generating
high-quality, efficient machine code.
1. Target Architecture and Instruction Set
One of the most important aspects of code generation is target architecture. The design of the code
generator must consider the instruction set architecture (ISA) of the target machine. Different machines
may have vastly different ISAs, meaning that the compiler must handle the translation of intermediate
code into machine-specific instructions.
Challenges:
• Different Instruction Formats: Different target architectures may have different formats
for instructions, including the number of operands, addressing modes, and instruction length.
• Instruction Set: Some architectures may have more complex instructions (e.g., vector
instructions, floating-point operations), while others may have simpler, less powerful sets.
• Register Set: The number of available registers, their roles, and the efficiency of register
use (e.g., general-purpose registers, floating-point registers, etc.) vary between architectures.
Solution:
• The code generator must generate machine-specific code by converting the intermediate
code into instructions that are understood by the target machine's architecture.
• It must handle instruction selection, register allocation, and instruction scheduling, which
depend on the ISA.
2. Register Allocation
Register allocation refers to the process of deciding which variables and temporary values should be
stored in the processor’s registers and which should be stored in memory. Efficient use of registers is
crucial for generating fast and compact code.
Challenges:
• Limited Number of Registers: Many machines have a limited number of registers, so the
code generator must choose which variables to store in registers and which to keep in
memory.
• Spilling: When there are more live variables than available registers, the compiler may have
to spill some variables to memory, which can result in slower code due to memory accesses.
• Register Renaming: In some architectures, registers have to be renamed to avoid conflicts
between different variables using the same register.
Solution:
• Use algorithms like graph coloring or linear scan allocation to efficiently assign registers
to variables.
• Minimize spilling by choosing the most critical variables for registers.
• Optimize for register usage to improve runtime efficiency and reduce memory access time.
3. Instruction Selection
Instruction selection involves mapping the intermediate representation (IR) of the program into actual
machine instructions. This is one of the most fundamental tasks of the code generator.
Challenges:
• Complex Intermediate Code: The IR may contain high-level constructs that don't map
directly to machine instructions. The code generator must select appropriate machine-level
instructions to implement these high-level constructs.
• Instruction Constraints: Some target machines have complex instruction formats with
specific constraints (e.g., operand types, operand order). For example, an instruction might
only support immediate values or require certain operands to be in specific registers.
• Suboptimal Instruction Sequences: Sometimes the compiler has to choose between
different possible sequences of instructions. Some sequences might be more efficient than
others, and the code generator needs to make this decision.
Solution:
• Use pattern matching or instruction templates to select the most efficient instructions for
each construct.
• Implement peephole optimization to replace suboptimal sequences of instructions with
more efficient ones.
4. Instruction Scheduling
Instruction scheduling is the process of reordering instructions to improve the efficiency of the generated
code. This typically involves minimizing the number of pipeline stalls, reducing execution time, and
improving instruction-level parallelism (ILP).
Challenges:
• Pipeline Stalls: Modern processors have pipelined execution units, and certain instructions
(e.g., memory access instructions) may cause delays in subsequent instructions (called
pipeline stalls).
• Instruction Dependencies: Some instructions depend on the results of previous
instructions, which limits the reordering that can be done.
• Parallelism: The code generator needs to identify independent instructions that can be
executed in parallel.
Solution:
• Instruction reordering: The code generator can reorder independent instructions to avoid
pipeline stalls and maximize the use of available execution units.
• Out-of-order execution: In some cases, the code generator may exploit the target
processor’s ability to execute instructions out of order.
• Use algorithms that balance between data hazards and instruction-level parallelism to
minimize execution time.
5. Handling Control Flow
Generating code for control flow (like branches, loops, and function calls) is a significant challenge in
code generation. Control flow can significantly impact performance, especially when dealing with
conditional branches or jump instructions.
Challenges:
• Branch Prediction: Modern processors use branch prediction to reduce the performance hit
caused by branches. Poor predictions can result in pipeline flushes and slowdowns.
• Function Calls: Handling function calls efficiently can be tricky, especially with recursion
or variadic functions. The code generator must ensure that the correct function prologues
and epilogues are generated, and registers are properly saved and restored.
Solution:
• Use branch prediction hints or loop unrolling to improve branch prediction performance.
• For function calls, generate efficient calling conventions (such as managing the stack,
saving registers, and passing parameters) to minimize overhead.
6. Optimization
Optimization at the code generation phase focuses on producing the most efficient machine code possible,
considering the constraints of the target architecture and the needs of the program.
Challenges:
• Trade-offs between speed and size: The generated code should be both fast (low runtime)
and compact (low memory usage). Optimizing for one often sacrifices the other.
• Complexity of Optimizations: Some optimizations, such as loop unrolling, inlining, or
constant folding, require complex analysis and decision-making.
Solution:
• Perform peephole optimization at the code generation level to replace inefficient code
patterns with better alternatives.
• Use global optimization strategies like instruction combining, strength reduction, and
constant propagation to make the code as efficient as possible.
7. Debugging and Error Handling
Compilers must generate debuggable code to facilitate debugging of the program by developers. This
involves generating the necessary debugging information (such as line numbers, variable names, and
source positions).
Challenges:
• Debug Information: Generating debug information that allows debugging tools to map the
machine code back to the original source code is essential.
• Error Detection: If the compiler generates incorrect code, it must be able to detect errors
early in the process and provide meaningful error messages.
Solution:
• Include debug symbols in the generated code that link back to the original source code (e.g.,
through a symbol table).
• Ensure error detection mechanisms are in place for issues like undefined symbols,
uninitialized variables, and incorrect register usage.
8. Target-Specific Issues
Different targets may impose specific restrictions or features that must be addressed during code
generation.
Challenges:
• Memory Layout: The target machine might have unique requirements for data alignment
and memory layout that must be respected during code generation.
• Cross-Compilation: If the compiler is generating code for a different architecture (cross-
compilation), it needs to consider the differences in instruction sets and available resources.
Solution:
• Implement target-specific code generation strategies that consider the target’s unique
features (e.g., memory model, instruction constraints, etc.).
• Ensure the compiler can cross-compile by separating architecture-specific code generation
tasks from general ones.
5.5.2 THE TARGET LANGUAGE
In the context of compiler design, the target language refers to the machine-readable language (typically
machine code or assembly language) into which the source program (written in a high-level programming
language like C, Java, or Python) is eventually translated. The process of code generation takes place in
the final phase of a compiler, where the intermediate code (IR) generated during earlier phases is converted
into the target language.
The target language is influenced by the hardware architecture (the machine's Instruction Set
Architecture or ISA) and may include assembly language instructions, machine-level instructions, or
bytecode (for virtual machines like Java's JVM).
Here’s an overview of the target language and the various aspects of code generation that interact with
it:
1. Types of Target Languages
There are different types of target languages that a compiler can generate depending on the nature of the
system being compiled for:
a. Machine Code
• Machine code is the lowest-level language, directly executable by the CPU. It consists of
binary instructions tailored to the target machine’s Instruction Set Architecture (ISA).
• Example: The Intel x86 or ARM instruction sets define the machine code format for those
processors.
• Characteristics:
o Directly executed by the hardware.
o Very efficient but machine-specific.
o Requires the code generator to be tightly coupled with the target machine's
architecture.
b. Assembly Language
• Assembly language is a low-level human-readable representation of machine code. It
consists of mnemonics for operations and human-readable labels for memory addresses.
• Example: An assembly instruction like MOV AX, 5 would represent the machine-level
instruction that moves the value 5 into register AX.
• Characteristics:
o Easier for humans to understand and debug compared to machine code.
o Requires an assembler to convert to machine code.
o Still machine-specific, though slightly higher level than machine code.
c. Intermediate Code for Virtual Machines
• In some compilers, particularly those for virtual machines (like the Java Virtual Machine
(JVM) or .NET's Common Language Runtime (CLR)), the target language is a platform-
independent intermediate code (bytecode) that runs on a virtual machine.
• Example: Java code is compiled into bytecode that can be run on any machine with a JVM.
• Characteristics:
o Target machine-independent.
o Not directly executed by the CPU but by a virtual machine.
o Offers portability across different hardware platforms.
d. High-Level Intermediate Code (For Cross-Compilation)
• Sometimes, instead of targeting machine code or assembly code directly, compilers may
generate a high-level intermediate representation that can be further compiled to different
machine codes for various platforms.
• Example: LLVM IR is a lower-level intermediate language used by the LLVM compiler
infrastructure, which can be further compiled to different target languages.
• Characteristics:
o Used in cross-compilation scenarios.
o More abstract than machine code but closer to hardware than high-level languages.
2. Features of the Target Language
The design of the target language and its structure affects several aspects of code generation:
a. Instruction Set Architecture (ISA)
The target language is heavily determined by the machine’s ISA, which defines the available instructions,
their formats, and how the CPU interacts with memory. Different CPUs (e.g., ARM, x86) have different
ISAs, and the code generator must tailor the code to suit these specifications.
• CISC (Complex Instruction Set Computing): CPUs with a large and varied set of
instructions, where individual instructions can perform complex operations (e.g., x86
architecture).
• RISC (Reduced Instruction Set Computing): CPUs with a smaller, more simplified set of
instructions designed for high performance with simpler operations (e.g., ARM
architecture).
b. Data Representation
The target language must accommodate the way data is represented and manipulated in memory. This
includes:
• Data Types: The handling of integers, floating-point numbers, characters, and arrays.
• Memory Layout: The alignment and packing of data types, which may vary between target
architectures.
• Addressing Modes: The methods by which memory locations are accessed (e.g., direct
addressing, indirect addressing, indexed addressing).
c. Control Flow
Control flow instructions (e.g., branches, loops, function calls) are essential in any target language. The
target machine may have specific instructions for:
• Branching: Conditional jumps, unconditional jumps, and return addresses for function
calls.
• Stack Operations: For managing function calls and local variables.
• Function Calling Conventions: The standard way to pass arguments and return values
between functions, which varies between systems (e.g., C calling convention).
d. Optimization Constraints
The target language might impose limitations or provide opportunities for optimization:
• Register Allocation: The number of available registers for storing data affects how the code
generator manages variables and temporary values.
• Instruction Scheduling: Some ISAs allow instructions to be scheduled for parallel
execution, while others have more rigid constraints that the code generator must work
around.
e. Debugging Information
For the generated code to be useful for debugging, it needs to include debugging symbols that map
machine code back to the high-level source code. This often involves adding debugging metadata to the
target language.
3. Issues in the Target Language Design
When designing a code generator, various challenges arise from the target language's characteristics and
requirements. Some of the key issues include:
a. Instruction Selection
• The process of translating intermediate code to actual machine or assembly instructions is
called instruction selection. This requires selecting the most appropriate machine
instructions for each intermediate operation, ensuring that the generated code is efficient and
meets the constraints of the target language.
• Challenge: Mapping high-level operations (e.g., arithmetic, logical operations) to a target
machine’s specific instruction set can be complex.
b. Register Allocation
• Most machines have a limited number of registers for storing data. The target language must
accommodate efficient usage of these registers to avoid excessive memory access (which is
slower than register access).
• Challenge: Deciding which variables should be stored in registers and which should be
spilled into memory can significantly impact performance.
c. Instruction Scheduling and Optimization
• Instruction scheduling ensures that instructions are executed in the most efficient order,
taking advantage of the CPU’s execution pipeline.
• Challenge: Some instructions may depend on previous instructions’ results, which limits
reordering. Optimizing for CPU pipelines (to reduce delays) is a non-trivial task.
d. Target-Specific Features
• Different hardware platforms may have unique features, such as SIMD (Single Instruction,
Multiple Data) instructions, memory hierarchies (L1, L2 caches), or multi-core processing.
The target language and the code generator must account for these to maximize performance.
• Challenge: Generating code that takes full advantage of these features while remaining
correct across different hardware platforms can be difficult.
e. Portability
• When targeting a platform-independent intermediate language (e.g., Java bytecode, LLVM
IR), the code generator must produce output that works across multiple hardware
architectures.
• Challenge: Ensuring that the generated code can run on different platforms without needing
to be recompiled for each architecture.
4. Target Language Design for Cross-Compilers
In some cases, a cross-compiler is used to generate code for a different target machine than the one on
which the compiler is running. This requires careful consideration of both the host and target
environments.
• The code generator must produce target-specific code while running on a different machine,
often using an intermediate language to separate the code generation from the specifics of
the target architecture.
• Example: An ARM cross-compiler running on an x86 machine might generate ARM
assembly code that can be assembled and run on an ARM processor.
5.5.3 SIMPLE CODE GENERATOR
A code generator is the component of a compiler responsible for translating an intermediate
representation (IR) of a program into the target language (machine code or assembly code). The design of
a simple code generator involves converting intermediate code (which is often in a high-level form) into
lower-level representations that are specific to the target architecture.
In this section, we'll explain the steps involved in creating a simple code generator and provide a basic
example of how it works.
Overview of a Simple Code Generator
A simple code generator typically performs the following tasks:
1. Intermediate Code Representation (IR) Handling: The code generator takes the
intermediate code produced by the earlier stages of the compiler and processes it to generate
target-specific code.
2. Instruction Selection: The code generator maps the intermediate code operations to the
target architecture’s machine or assembly instructions.
3. Register Allocation: The code generator determines how to allocate registers for variables,
as registers are a limited resource.
4. Code Emission: The generator produces the actual machine code or assembly instructions
that can be executed by the target machine.
5. Optimization (optional): Although simple code generators may skip complex
optimizations, certain basic optimizations can be incorporated into the code generation
process.
Components of a Simple Code Generator
1. Input Intermediate Code (IR): The code generator processes intermediate representations,
which may be in the form of an abstract syntax tree (AST) or intermediate representations
like Three Address Code (TAC).
2. Instruction Selection: Given a piece of IR, the code generator chooses the correct machine
instruction or assembly instruction.
3. Register Allocation: The generator decides where to store variables (in memory or
registers). This can be done using basic techniques like simple register assignment.
4. Emit Target Code: The code generator emits the final instructions in the target language,
either as machine code or assembly code.
Example: A Simple Code Generator
Step 1: Intermediate Code (IR) Example
Let’s assume we have an intermediate representation (IR) for a simple program like this:
t1 = a + b
t2 = t1 * c
t3 = t2 – d
In Three Address Code (TAC) form, the code would look like this:
1. t1 = a + b
2. t2 = t1 * c
3. t3 = t2 – d
This IR represents three operations: addition, multiplication, and subtraction.
Step 2: Instruction Selection
The next step is to translate each of these intermediate operations into assembly instructions. Suppose the
target architecture is a simple assembly language that uses the following instructions:
• LOAD R, var — Load the value of var into register R.
• ADD R1, R2, R3 — Add the values in R2 and R3 and store the result in R1.
• MUL R1, R2, R3 — Multiply the values in R2 and R3 and store the result in R1.
• SUB R1, R2, R3 — Subtract R3 from R2 and store the result in R1.
• STORE R, var — Store the value of R in var.
Step 3: Register Allocation
We’ll assume we have the following registers available: R1, R2, R3, etc.
• t1, t2, and t3 will be assigned to registers for computation.
• Variables a, b, c, and d will be loaded from memory.
Step 4: Generate Assembly Code
Now, let’s translate the TAC into assembly code, assuming the variables a, b, c, and d are stored in
memory.
1. LOAD R1, a ; Load the value of 'a' into R1
2. LOAD R2, b ; Load the value of 'b' into R2
3. ADD R3, R1, R2 ; t1 = a + b, store result in R3 (register for t1)
4. LOAD R4, c ; Load the value of 'c' into R4
5. MUL R5, R3, R4 ; t2 = t1 * c, store result in R5 (register for t2)
6. LOAD R6, d ; Load the value of 'd' into R6
7. SUB R7, R5, R6 ; t3 = t2 - d, store result in R7 (register for t3)
8. STORE R7, result ; Store the value of t3 in 'result'
This assembly code now represents the sequence of operations needed to perform the computation
described in the IR.
Step 5: Emit Target Code
The assembly code generated in the previous step can now be assembled into machine code specific to
the target architecture. For example, on a hypothetical machine, the instructions might map to binary
values as follows:
• LOAD R1, a might translate to 0001 a
• ADD R3, R1, R2 might translate to 0010 R3 R1 R2
• And so on…
Key Points in Simple Code Generation
1. Instruction Selection: The primary task is selecting the right machine instruction for each
intermediate operation.
2. Register Allocation: In simple code generation, this is often done in a straightforward
manner, either by assigning each temporary variable to a register or spilling them to memory
if needed.
3. Code Emission: The code generator outputs the assembly or machine code corresponding
to the intermediate code operations.
4. Simplicity: A simple code generator focuses on translating intermediate code directly to
target code without sophisticated optimizations.
5.5.4 PEEPHOLE OPTIMIZATION
Peephole optimization is a technique in compiler design that focuses on improving the efficiency of the
generated machine or intermediate code by examining a small "window" or "peephole" of a few
instructions at a time. This technique looks for patterns or opportunities to replace sequences of
instructions with more efficient ones.
Peephole optimization is a form of local optimization that operates on short sequences of code, typically
consisting of just a few instructions. It aims to eliminate redundant operations, simplify code, or exploit
patterns that the target machine architecture can handle more efficiently.
How Peephole Optimization Works
In the context of code generation, peephole optimization works by scanning the generated assembly or
intermediate code for small patterns of instructions and then replacing these patterns with optimized,
simpler, or more efficient alternatives.
The "peephole" refers to a small, localized view of the code (usually only a few consecutive instructions),
which is analyzed for opportunities to reduce or optimize the instructions.
Common Peephole Optimizations
Here are some of the common peephole optimization techniques:
Constant Folding
• Description: If a sequence of operations involves constants (e.g., adding or multiplying
constant values), it can be simplified at compile time rather than at runtime.
• Example:
MOV R1, 5 ; Load 5 into R1
ADD R1, 3 ; Add 3 to R1
After peephole optimization, it can be replaced with:
MOV R1, 8 ; Directly load the result of 5 + 3
Constant Propagation
• Description: If a value is assigned a constant, all subsequent references to that value can be
replaced with the constant itself.
• Example:
MOV R1, 10 ; Assign 10 to R1
ADD R2, R1, 5 ; Add the value of R1 (which is 10) to 5
After peephole optimization, it can be simplified to:
ADD R2, 10, 5 ; Directly add the constants 10 and 5
Dead Code Elimination
• Description: If a particular instruction does not contribute to the final result (i.e., its result
is never used), it can be removed.
• Example:
MOV R1, 10 ; Load 10 into R1
ADD R2, R1, 5 ; Add 10 and 5 into R2
MOV R3, R2 ; Store the result in R3 (but R3 is never used)
After peephole optimization:
ADD R2, 10, 5 ; Remove the unnecessary MOV instruction
4. Redundant Load Elimination
• Description: If a register is loaded with the same value multiple times without being
modified in between, the redundant load can be eliminated.
• Example:
MOV R1, 5 ; Load 5 into R1
MOV R2, 5 ; Load 5 into R2
After peephole optimization:
MOV R1, 5 ; Only load 5 into R1, remove redundant load for R2
5. Jump Elimination and Simplification
• Description: If a jump or branch instruction is followed by another unconditional jump, the
first jump is unnecessary and can be removed. This is often seen in the case of simple control
flow optimizations.
• Example:
JMP Label1
JMP Label2 ; This jump is redundant
After peephole optimization:
JMP Label2 ; Only the second jump is needed
6. Combining Adjacent Operations
• Description: If two or more adjacent operations can be combined into a single operation,
this can help reduce the number of instructions.
• Example:
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R1, R1, R4 ; R1 = R1 - R4
After peephole optimization:
ADD R1, R2, R3 ; Combine the two operations: R1 = (R2 + R3) - R4
SUB R1, R1, R4 ; becomes ADD R1, R2, R3 - R4
7. Strength Reduction
• Description: This optimization replaces expensive operations with cheaper equivalents.
This is particularly common with loop optimizations, like replacing multiplication with
addition.
• Example:
MUL R1, R2, 4 ; Multiply R2 by 4
After peephole optimization:
ADD R1, R2, R2 ; Replace multiplication by 4 with addition (R2 + R2 + R2 + R2)
8. Merging Identical Instructions
• Description: If multiple instructions have the same effect or identical operations, they can
be combined into one.
• Example:
MOV R1, 10 ; Load 10 into R1
MOV R2, 10 ; Load 10 into R2
After peephole optimization:
MOV R1, 10 ; Only one load is needed
Example of Peephole Optimization
Given the following sequence of assembly instructions:
MOV R1, 5 ; R1 = 5
MOV R2, 5 ; R2 = 5
ADD R3, R1, R2; R3 = R1 + R2 = 5 + 5
MOV R4, R3 ; R4 = R3
The peephole optimization would perform the following transformations:
Constant Folding:
MOV R1, 5
MOV R2, 5
ADD R3, 5, 5 ; Directly add constants
MOV R4, R3
Redundant Load Elimination: The second MOV R2, 5 is unnecessary, so it is removed.
MOV R1, 5
ADD R3, 5, 5
MOV R4, R3
The optimized version would now look like this:
MOV R3, 10 ; Result after addition
MOV R4, R3 ; Store result
Thus, we have eliminated unnecessary operations and reduced the total number of instructions.
Benefits of Peephole Optimization
• Improves performance: By removing redundant operations and simplifying instructions,
the resulting code can be executed more efficiently.
• Reduces code size: Eliminating unnecessary instructions helps in reducing the overall size
of the compiled program.
• Faster execution: By eliminating redundant or inefficient code, the program executes faster,
which is crucial for performance-critical applications.
• Simple to implement: Since it focuses on small patterns and short sequences of instructions,
peephole optimization is relatively easy to implement and doesn't require complex global
analysis.
Limitations of Peephole Optimization
• Local nature: Peephole optimization only works on small sections of code. It cannot
perform global optimizations that require information from a broader context, such as loop
optimizations or cross-procedural optimizations.
• Limited impact: In many cases, peephole optimization can only improve code so much.
More significant performance improvements often require advanced global optimizations.
• Time-consuming for large codebases: While peephole optimization is efficient for small
code snippets, applying it repeatedly to a large codebase can still be time-consuming.
5.5.5 REGISTER ALLOCATION AND ASSIGNMENT
Register allocation is a critical phase in compiler design where the compiler decides how to allocate the
limited set of CPU registers to variables or temporaries (such as those created by intermediate code) during
program execution. Since most modern CPUs have a relatively small number of registers, effective
register allocation is crucial to optimizing performance and minimizing the need for slower memory
accesses.
Register assignment refers to the actual assignment of variables or temporaries to specific physical
registers. Together, these concepts form a key component of optimizing the runtime performance of a
program.
Key Concepts in Register Allocation
1. Registers in a CPU:
o A register is a small, fast storage location within the CPU. CPUs have a small
number of registers, typically between 8 and 32 in modern processors.
o Registers are used for storing variables and intermediate results of computations.
Access to registers is much faster than accessing data stored in main memory (RAM).
2. Why Register Allocation is Important:
o Efficient register allocation helps minimize the use of memory and improves
performance because operations on registers are significantly faster than on memory.
o Register allocation is particularly important for high-performance applications or in
embedded systems where every cycle counts.
Steps in Register Allocation
Register allocation typically follows these steps:
1. Live Variable Analysis
• Live variable analysis determines which variables are "live" (i.e., in use) at each point in
the program.
• A variable is "live" if its value is used at some point in the future, before it is redefined.
• The goal is to identify which variables need to be stored in registers at any given time.
Example:
x = a + b ; x is live after this instruction
y = x + c ; y and x are live after this instruction
z = y + d ; z, y, and x are live after this instruction
2. Interference Graph Construction
• An interference graph is constructed where each node represents a variable or temporary,
and an edge between two nodes indicates that the corresponding variables cannot share a
register because they are live at the same time.
• For example, if two variables are used in overlapping portions of the code (e.g., one is
assigned and the other is used without intervening redefinitions), they "interfere" with each
other and cannot share the same register.
Example: If x and y are live at the same time, they interfere and must be assigned to different registers
x -- y (x and y interfere)
3. Register Allocation Using Graph Coloring
• The most common technique for register allocation is graph coloring, which is a way of
assigning registers to variables using the interference graph.
• In graph coloring, each node (variable) must be assigned a color (register) such that no two
adjacent nodes (interfering variables) share the same color.
• The number of colors required is limited by the number of available registers. If the graph
cannot be colored with the available registers, spilling (storing variables in memory) may be
necessary.
4. Spilling
• If there are not enough registers to hold all live variables at once, spilling occurs. This means
that some variables must be moved to memory temporarily.
• When a variable is spilled, it is stored in memory, and the register is freed up for another
variable. Accessing spilled variables is slower than accessing variables in registers, so
spilling reduces performance.
Register Assignment
Once the variables have been allocated registers, the compiler must assign the actual physical registers to
each variable. This is called register assignment.
1. Simple Register Assignment
• In simple register assignment, variables are directly mapped to available registers. A
straightforward mapping ensures that each variable gets a distinct register, as long as the
interference graph allows it.
• This is generally used when the number of variables is smaller than the number of available
registers.
2. Efficient Register Assignment (Using Register Classes)
• More advanced register assignment can make use of register classes, which group similar
types of registers (such as integer registers, floating-point registers, etc.).
• The compiler can assign a variable to any register within the appropriate register class. For
example, if a register class for integers is available, the compiler can choose any integer
register rather than a specific one.
3. Register Renaming
• In some cases, the compiler can also use register renaming, which involves assigning a
different register to a variable each time it is used, helping to avoid conflicts with other
variables.
• This is useful in highly optimized or out-of-order execution models (e.g., in some CPU
architectures).
Optimization Strategies in Register Allocation
Several optimization techniques can be employed during register allocation to improve performance:
1. Minimal Spilling
• Spilling should be minimized as it leads to slower performance due to memory access.
Modern compilers use advanced strategies to avoid spilling, such as lookahead techniques
and priority-based allocation to determine which variables should be spilled.
2. Register Coalescing
• Register coalescing combines two variables that do not interfere with each other into a
single register. This reduces the number of registers required and improves performance.
3. Pre-colored Registers
• Some registers, like those used for function arguments or return values, may already have a
designated role in the calling convention. These can be pre-colored (pre-assigned to specific
registers), making the allocation easier and more efficient.
4. Optimizing for Calling Conventions
• The compiler may need to ensure that certain registers are preserved across function calls
according to the platform's calling convention. These registers may need to be allocated
before a function call, and any register assigned to such variables must be saved before the
call and restored afterward.
Example of Register Allocation and Assignment
Consider the following intermediate code (in TAC form):
t1 = a + b
t2 = t1 * c
t3 = t2 – d
Step 1: Determine live variables.
• a, b, c, and d are input variables, and t1, t2, and t3 are temporary variables.
• From the code, we can deduce that t1 is live before t2, t2 is live before t3, and t3 is live at
the end of the sequence.
Step 2: Build the interference graph.
• t1 and t2 interfere because they are live at the same time.
• t2 and t3 also interfere.
Step 3: Perform graph coloring to assign registers. Suppose there are 2 registers available (R1 and R2).
• Assign t1 to R1.
• Assign t2 to R2 (since t2 interferes with t1).
• Assign t3 to R1 (it can reuse R1 after t1 is no longer live).
Step 4: Emit code with assigned registers:
MOV R1, a ; Load a into R1 (t1 = a + b)
ADD R1, b ; Add b to R1 (t1 = a + b)
MOV R2, R1 ; Move t1 to R2 (t2 = t1 * c)
MUL R2, c ; Multiply t1 by c (t2 = t1 * c)
MOV R1, R2 ; Move t2 to R1 (t3 = t2 - d)
SUB R1, d ; Subtract d from t2 (t3 = t2 - d)

You might also like