0% found this document useful (0 votes)
8 views25 pages

Compiler Design: Key Concepts Explained

The document provides an introduction to compiler design, defining a compiler's role in translating high-level programming languages into machine code through various stages such as lexical analysis, syntax analysis, and code generation. It explains the components involved in the language processing system, including preprocessors, interpreters, assemblers, linkers, and loaders, as well as the architecture of compilers, which consists of analysis and synthesis phases. Additionally, it discusses the phases of compilation, focusing on lexical analysis, tokenization, and regular expressions used for pattern matching in programming languages.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views25 pages

Compiler Design: Key Concepts Explained

The document provides an introduction to compiler design, defining a compiler's role in translating high-level programming languages into machine code through various stages such as lexical analysis, syntax analysis, and code generation. It explains the components involved in the language processing system, including preprocessors, interpreters, assemblers, linkers, and loaders, as well as the architecture of compilers, which consists of analysis and synthesis phases. Additionally, it discusses the phases of compilation, focusing on lexical analysis, tokenization, and regular expressions used for pattern matching in programming languages.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unit1:Introduction to Compiler Design

I-Definition and role of compilers

A compiler is a software tool that translates source code written in a high-level programming
language into a lower-level representation, typically machine code or bytecode, which can be
executed directly by a computer's hardware or a virtual machine. The process of compilation
involves several stages, including lexical analysis, syntax analysis, semantic analysis, code
optimization, and code generation.
The primary role of a compiler is to bridge the gap between human-readable high-level
programming languages and the machine language understood by computers. It takes the entire
source code as input and produces an executable program or an intermediate representation that can
be further processed or executed. Overall, compilers play a crucial role in the software development
process, enabling programmers to write code in high-level languages and translating it into
executable or runnable programs that can be executed on a computer. They help in improving code
efficiency, portability, and maintainability by performing various optimizations and transformations
during the compilation process.
II.1-Language processing system
Any computer system is made of hardware and software. The hardware understands a language,
which humans cannot understand. So we write programs in high-level language, which is easier for
us to understand and remember. These programs are then fed into a series of tools and OS
components to get the desired code that can be used by the machine. This is known as Language
Processing System.

The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.

1
Let us first understand how a program, using C compiler, is executed on a host machine.
• User writes a program in C language (high-level language).
• The C compiler, compiles the program and translates it to assembly program (low-level
language).
• An assembler then translates the assembly program into machine code (object).
• A linker tool is used to link all the parts of the program together for execution (executable
machine code).
• A loader loads all of them into memory and then the program is executed.

Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.

Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The
difference lies in the way they read the source code or input. A compiler reads the whole source
code at once, creates tokens, checks semantics, generates intermediate code, executes the whole
program and may involve many passes. In contrast, an interpreter reads a statement from the input,
converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error
occurs, an interpreter stops execution and reports it. whereas a compiler reads the whole program
even if it encounters several errors.

Assembler
An assembler translates assembly language programs into machine [Link] output of an assembler
is called an object file, which contains a combination of machine instructions as well as the data
required to place these instructions in memory.

Linker
Linker is a computer program that links and merges various object files together in order to make an
executable file. All these files might have been compiled by separate assemblers. The major task of
a linker is to search and locate referenced module/routines in a program and to determine the
memory location where these codes will be loaded, making the program instruction to have absolute
references.

Loader
Loader is a part of operating system and is responsible for loading executable files into memory and
execute them. It calculates the size of a program (instructions and data) and creates memory space
for it. It initializes various registers to initiate execution.

2
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform (B)
is called a cross-compiler.

Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the source
code of another programming language is called a source-to-source compiler.
I.2-Compiler architecture
Compiler architecture refers to the overall structure and organization of a compiler. It encompasses
the various components and their interactions during the compilation process. A compiler can
broadly be divided into two phases based on the way they compile.

Analysis Phase
Known as the front-end of the compiler, the analysis phase of the compiler reads the source
program, divides it into core parts and then checks for lexical, grammar and syntax [Link]
analysis phase generates an intermediate representation of the source program and symbol table,
which should be fed to the Synthesis phase as input. The Front End includes:
• Lexical Analyzer: This component, also known as a scanner or tokenizer, reads the source
code and breaks it down into tokens.
• Syntax Analyzer: The syntax analyzer, also called a parser, checks the sequence of tokens
against the grammar of the programming language to ensure syntactic correctness.
• Semantic Analyzer: The semantic analyzer performs semantic analysis, checking the
meaning and correctness of the program's constructs. It includes tasks such as type checking,
symbol table management, and semantic rule enforcement.
• Intermediate Representation (IR) Generator: The IR generator constructs an intermediate
representation of the program that captures its essential semantics. This representation is
often in the form of an abstract syntax tree (AST) or an intermediate language.
The second component includes the Optimizer:
• Code Optimizer: The code optimizer analyzes the intermediate representation and applies
various transformations and optimizations to improve the efficiency and performance of the
generated code. These optimizations can include constant folding, dead code elimination,
loop optimization, and register allocation.

3
Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with the
help of intermediate source code representation and symbol table. A compiler can have many phases
and passes.

• Pass : A pass refers to the traversal of a compiler through the entire program.
• Phase : A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can
have more than one phase.
The Back End includes:
• Code Generator: The code generator translates the optimized intermediate representation
into the target language, which can be machine code for a specific hardware architecture or
bytecode for a virtual machine. It maps the program constructs to the corresponding
instructions or bytecode instructions of the target platform.
• Linker: In some cases, a linker is employed to resolve external references and combine
multiple object files into a single executable or library.
Symbol Table:
• The symbol table is a data structure used to store information about identifiers (e.g.,
variables, functions, classes) used in the program. It tracks their names, types,
scopes, and memory locations. The symbol table is used by various compiler
components for name resolution, type checking, and code generation.
Error handling:
• The error handling component detects and reports errors encountered during the
compilation process. It provides informative error messages or diagnostics to aid the
programmer in identifying and resolving issues in the source code.
Optimiation framework:
• Some compilers include a separate optimization framework that allows for advanced
and customizable optimization strategies. This framework provides a platform for
implementing and integrating various optimization algorithms and techniques.
The above components interact with each other, passing data and control flow through the different
stages of the compilation process. The front end analyzes and transforms the source code into an
intermediate representation, the optimizer applies optimizations, and the back end generates the
target code. The symbol table and error handling components provide support and information
throughout the compilation process.
II-Phases of a compiler

A compiler typically goes through several distinct phases or stages during the compilation process.
These phases work together to transform the source code into an executable or runnable form. The
most common phases of a compiler are as follows:

4
1. Lexical Analysis (Scanning): In this phase, the source code is divided into a sequence of
tokens. It involves scanning the source code character by character and grouping them into
meaningful units such as keywords, identifiers, literals, and operators. The output of this
phase is a stream of tokens.
2. Syntax Analysis (Parsing): The purpose of this phase is to analyze the structure of the source
code and determine whether it adheres to the grammar of the programming language. It
involves creating a parse tree or an abstract syntax tree (AST) that represents the syntactic
structure of the program. This phase also performs error checking to identify syntax errors in
the code.
3. Semantic Analysis: This phase focuses on the meaning of the program. It checks the validity
of the program's semantics, including type checking, variable declarations, scoping rules,
and adherence to language-specific rules. It ensures that the code makes sense from a
semantic perspective and generates appropriate error messages if any semantic errors are
detected.
4. Intermediate Code Generation: In this phase, the compiler generates an intermediate
representation of the source code. The intermediate code is a lower-level representation that
is closer to the target machine or virtual machine. It may be in the form of an intermediate
language, such as three-address code or bytecode.
5. Code Optimization: The compiler performs various transformations and optimizations on the
intermediate code to improve the efficiency and performance of the generated code. This
phase includes techniques such as constant folding, dead code elimination, loop
optimization, and register allocation. The goal is to produce optimized code that executes
faster and uses fewer resources.
6. Code Generation: In the final phase, the compiler translates the optimized intermediate code
into the target language, which can be machine code for a specific hardware architecture or
bytecode for a virtual machine. This phase involves mapping the intermediate code
constructs to the corresponding instructions or bytecode instructions of the target platform.
7. Symbol Table Management: Throughout the compilation process, the compiler maintains a
symbol table that stores information about identifiers, such as variables, functions, and
classes, used in the source code. The symbol table is used for name resolution, type
checking, and code generation.
These phases are typically executed sequentially, with each phase taking the output of the previous
phase as input. However, some modern compilers may employ techniques such as interprocedural
analysis and just-in-time compilation, which can blur the boundaries between these phases or
introduce additional optimizations.

5
Unit2 Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors that are written in the form of sentences. The lexical analyzer breaks these syntaxes
into a series of tokens, by removing any whitespace or comments in the source code. If the lexical
analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the
syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes
the data to the syntax analyzer when it demands.

I-Tokenization and lexemes

Tokenization is the process of breaking down a sequence of characters, such as source code or
natural language text, into smaller units called tokens. Tokens are the basic building blocks of a
programming language or natural language, representing meaningful units of information.
A lexeme, on the other hand, refers to the sequence of characters in the source code that matches a
specific token pattern defined by the language's lexical rules. In other words, a lexeme is the actual
text that represents a particular token.

6
Let's take an example to illustrate the concept. Consider the following line of code in the C
programming language:
int x = 10;
In this code, we can identify the following tokens:
1. Token: int

• Lexeme: int
2. Token: x

• Lexeme: x
3. Token: =

• Lexeme: =
4. Token: 10

• Lexeme: 10
5. Token: ;

• Lexeme: ;

During the tokenization process, the lexical analyzer (or tokenizer) scans the source code character
by character and identifies these tokens. Each token is associated with its corresponding lexeme,
which represents the actual text that matches the token pattern.
The tokenizer uses lexical rules defined by the programming language to determine the token
patterns. These rules define the syntax of the language and specify how different tokens can be
formed. For example, in C, an identifier token can consist of letters, digits, and underscores, and
must start with a letter or underscore. The tokenizer applies these rules to identify the lexemes that
match specific token patterns, such as identifiers, keywords, operators, literals, and punctuation
symbols. Tokenization is an essential initial step in the compilation process. It breaks down the
source code into manageable units that can be further processed and analyzed by subsequent phases
of the compiler, such as the syntax analyzer and semantic analyzer.
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are
defined by means of regular expressions. In programming language, keywords, constants,
identifiers, strings, numbers, operators and punctuations symbols can be considered as tokens. For
example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Specifications of Tokens
Let us understand how the language theory undertakes the following terms:

7
Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a
set of Hexadecimal alphabets, {a-z, A-Z} is a set of English language alphabets.

Strings
Any finite sequence of alphabets (characters) is called a string. Length of the string is the total
number of occurrence of alphabets, e.g., the length of the string tutorialspoint is 14 and is denoted
by |tutorialspoint| = 14. A string having no alphabets, i.e. a string of zero length is known as an
empty string and is denoted by ε (epsilon).

Special Symbols
A typical high-level language contains the following symbols:

Language
A language is considered as a finite set of strings over some finite set of alphabets. Computer
languages are considered as finite sets, and mathematically set operations can be performed on
them. Finite languages can be described by means of regular expressions.
II-Regular expressions and finite automata
Regular expressions and finite automata are closely related concepts in the field of theoretical
computer science and formal language theory. They are used to describe and recognize regular
languages, which are a type of formal language with simple and regular patterns.
II.1- Regular expressions
A regular expression is a compact and expressive notation used to describe patterns in strings. It is a
sequence of characters that defines a search pattern. Regular expressions can represent a wide range
of patterns, including sequences of characters, alternatives, repetitions, and more. They are
commonly used in text processing, pattern matching, and lexical analysis.
The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme that
belong to the language in hand. It searches for the pattern defined by the language rules.

8
Regular expressions have the capability to express finite languages by defining a pattern for finite
strings of symbols. The grammar defined by regular expressions is known as regular grammar.
The language defined by regular grammar is known as regular language. Regular expression is an
important notation for specifying patterns. Each pattern matches a set of strings, so regular
expressions serve as names for a set of strings. Programming language tokens can be described by
regular languages. The specification of regular expressions is an example of a recursive definition.
Regular languages are easy to understand and have efficient implementation.
Regular expressions consist of several elements, including:
1. Literal characters: Matches the exact characters as they appear in the input text.
Example: The regular expression cat matches the string "cat" in the input text.

2. Metacharacters: Special characters that have a predefined meaning in regular expressions.


Example: The dot (.) matches any single character, and the asterisk (*) denotes zero or more
repetitions of the preceding element.
3. Character classes: Sets of characters enclosed in square brackets ([ ]).
Example: The regular expression [aeiou] matches any vowel character.

4. Quantifiers: Specify the number of repetitions or occurrences of the preceding element.


Example: The regular expression ab* matches "a" followed by zero or more occurrences of
"b".
There are a number of algebraic laws that are obeyed by regular expressions, which can be used to
manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
• Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
• Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
• The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union : (r)|(s) is a regular expression denoting L(r) U L(s)
• Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure : (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)

9
Precedence and associativity
• *, concatenation (.), and | (pipe sign) are left associative
• * has the highest precedence
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expression


If x is a regular expression, then:
• x* means zero or more occurrence of x.
i.e., it can generate { e, x, xx, xxx, xxxx, … }
• x+ means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x*
• x? means at most one occurrence of x
i.e., it can generate either {x} or {e}.
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.

Representing occurrence of symbols using regular expressions


letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]

Representing language tokens using regular expressions


Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular expression
used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite
automata for verification.
III-Finite automata
Finite automata is a state machine that takes a string of symbols as input and changes its state
accordingly. Finite automata is a recognizer for regular expressions. When a regular expression
string is fed into finite automata, it changes its state for each literal. If the input string is
successfully processed and the automata reaches its final state, it is accepted, i.e., the string just fed
was said to be a valid token of the language in hand.

10
The mathematical model of finite automata consists of:
• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One Start state (q0)
• Set of final states (qf)
• Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a finite set of input symbols (Σ), Q × Σ
➔Q

Finite Automata Construction


A finite automaton, also known as a finite state machine, is a mathematical model used to describe
the behavior of systems that can be in a finite number of states and transition between them based
on inputs. In the context of regular languages, finite automata are used to recognize or accept strings
that conform to a given regular expression or pattern.
A finite automaton consists of the following components:
1. States: Represent different configurations or conditions of the system. The automaton can be
in one state at a time.
2. Transitions: Define the rules for moving between states based on the input symbols. Each
transition specifies the current state, an input symbol, and the next state.
3. Start state: Specifies the initial state from which the automaton begins its operation.
4. Accepting states: Indicate the states in which the automaton considers the input string to be
accepted or recognized.
Finite automata can be deterministic (DFA) or non-deterministic (NFA) based on the rules for
transitioning between states. DFAs have a unique next state for each combination of current state
and input symbol, while NFAs can have multiple possible next states. The relationship between
regular expressions and finite automata is significant. Regular expressions can be transformed into
equivalent finite automata, and vice versa. This relationship is formalized by the Kleene's theorem,
which states that regular languages can be described by either regular expressions or finite
automata. Finite automata are often used in the implementation of regular expression matching
engines, as they provide an efficient mechanism to recognize and process strings that match a given
regular expression.
Let L(r) be a regular language recognized by some finite automata (FA).
• States : States of FA are represented by circles. State names are written inside circles.
• Start state : The state from where the automata starts, is known as the start state. Start state
has an arrow pointed towards it.
• Intermediate states : All intermediate states have at least two arrows; one pointing to and
another pointing out from them.

11
• Final state : If the input string is successfully parsed, the automata is expected to be in this
state. Final state is represented by double circles. It may have any odd number of arrows
pointing to it and even number of arrows pointing out from it. The number of odd arrows are
one greater than even, i.e. odd = even+1.
• Transition : The transition from one state to another state happens when a desired symbol in
the input is found. Upon transition, automata can either move to the next state or stay in the
same state. Movement from one state to another is shown as a directed arrow, where the
arrows points to the destination state. If automata stays on the same state, an arrow pointing
from a state to itself is drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1. FA = {Q(q 0, qf),
Σ(0,1), q0, qf, δ}

Longest Match Rule


When the lexical analyzer read the source-code, it scans the code letter by letter; and when it
encounters a whitespace, operator symbol, or special symbols, it decides that a word is completed.
For example: int intvalue;
While scanning both lexemes till ‘int’, the lexical analyzer cannot determine whether it is a
keyword int or the initials of identifier int value. The Longest Match Rule states that the lexeme
scanned should be determined based on the longest match among all the tokens available. The
lexical analyzer also follows rule priority where a reserved word, e.g., a keyword, of a language is
given priority over user input. That is, if the lexical analyzer finds a lexeme that matches with any
existing reserved word, it should generate an error.
Unit3 Syntax Analysis

Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic
concepts used in the construction of a parser. We have seen that a lexical analyzer can identify
tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check
the syntax of a given sentence due to the limitations of the regular expressions. Regular expressions
cannot check balancing tokens, such as parenthesis. Therefore, this phase uses context-free
grammar (CFG), which is recognized by push-down automata. CFG, on the other hand, is a superset
of Regular Grammar, as depicted below:

12
It implies that every Regular Grammar is also context-free, but there exists some problems, which
are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of
programming languages.
I-Context-free grammars
Context-free grammars (CFGs) are a formalism used to describe the syntax or structure of context-
free languages. They are widely used in the field of formal language theory, compiler design, and
natural language processing. CFGs are less restrictive than regular grammars and can describe more
complex languages.
A context-free grammar consists of the following components:
1. Terminals: These are the basic symbols or atomic units of the language. Terminals represent
the individual elements or tokens in the language. Examples of terminals in a programming
language could be keywords, operators, identifiers, or literals.
2. Non-terminals: These are symbols that represent groups or categories of language elements.
Non-terminals are placeholders for sequences of terminals and/or other non-terminals. They
help define the structure and rules of the language. Non-terminals are often represented by
uppercase letters.
3. Productions or Rules: Productions define the relationships between non-terminals and
terminals/non-terminals in the language. They specify how non-terminals can be expanded
or replaced by sequences of terminals and non-terminals. Productions are typically
represented in the form "A -> α", where A is a non-terminal and α is a sequence of terminals
and/or non-terminals.
4. Start Symbol: The start symbol represents the initial non-terminal from which the derivation
of the language starts. It is often denoted by 'S'.
The language generated by a context-free grammar is the set of all strings that can be derived from
the start symbol using the productions according to the rules of the grammar. The process of
deriving a string from a grammar involves repeatedly applying productions until no non-terminals
remain, resulting in a string composed entirely of terminals.
For example, consider a simple context-free grammar for arithmetic expressions:
1. Terminals: +, -, *, /, (, ), numbers
2. Non-terminals: E (expression), T (term), F (factor)
3. Productions:
• E -> E + T

13
• E -> E - T
• E -> T
• T -> T * F
• T -> T / F
• T -> F
• F -> (E)
• F -> number
Using this grammar, we can generate valid arithmetic expressions like "3 + (4 * 2)" or "5 * (7 - 2)".
Context-free grammars provide a formal framework to describe the syntax of languages. They are
used in compiler design to define the syntax of programming languages, in natural language
processing to model the structure of sentences, and in various other applications involving the
analysis and generation of structured data.
A context-free grammar has four components:
• A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the
grammar.
• A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from
which strings are formed.
• A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of a
non-terminal called the left side of the production, an arrow, and a sequence of tokens
and/or on- terminals, called the right side of the production.
• One of the non-terminals is designated as the start symbol (S); from where the production
begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production, for that non-terminal.

Example
We take the problem of palindrome language, which cannot be described by means of Regular
Expression. That is, L = { w | w = w R } is not a regular language. But it can be described by means
of CFG, as illustrated below:
G = ( V, Σ, P, S )

Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S = { Q }

This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111,
etc.

14
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in
the code. The output of this phase is a parse tree.

This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase. Parsers are expected to parse the whole code even
if some errors exist in the program.
Derivation
A derivation is basically a sequence of production rules, in order to get the input string. During
parsing, we take two decisions for some sentential form of input:
• Deciding the non-terminal which is to be replaced.
• Deciding the production rule, by which, the non-terminal will be replaced.
To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.
Example
Production rules:
E → E + E
E → E * E
E → id

Input string: id + id * id
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E

15
E → id + id * id

Notice that the left-most side non-terminal is always processed first.


The right-most derivation is:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id

Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived
from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.
We take the left-most derivation of a + b * c
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

Step 1:

E→E*E
Step 2:

E→E+E*E

Step 3:

E → id + E * E

16
Step 4:

E → id + id * E

Step 5:

E → id + id * id

In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed
first, therefore the operator in that sub-tree gets precedence over the operator which is in the parent
nodes.

Types of Parsing
Syntax analyzers follow production rules defined by means of context-free grammar. The way the
production rules are implemented (derivation) divides parsing into two types : top-down parsing
and bottom-up parsing.

Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to transform
the start symbol to the input, it is called top-down parsing.

17
• Recursive descent parsing : It is a common form of top-down parsing. It is called recursive
as it uses recursive procedures to process the input. Recursive descent parsing suffers from
backtracking.
• Backtracking : It means, if one derivation of a production fails, the syntax analyzer restarts
the process using different rules of same production. This technique may process the input
string more than once to determine the right production.

Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the
parse tree up to the start symbol.
Example:
Input string : a + b * c
Production rules:
S → E
E → E + T
E → E * T
E → T
T → id

Let us start bottom-up parsing


a + b * c

Read the input and check if any production matches with the input:
a + b * c
T + b * c
E + b * c
E + T * c
E * c
E * T
E
S

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for
at least one string.
Example
E → E + E
E → E – E
E → id

For the string id + id – id, the above grammar generates two parse trees:

18
The language generated by an ambiguous grammar is said to be inherently ambiguous. Ambiguity
in grammar is not good for a compiler construction. No method can detect and remove ambiguity
automatically, but it can be removed by either re-writing the whole grammar without ambiguity, or
by setting and following associativity and precedence constraints.

Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, then the operand
will be taken by the left operator or if the operation is right-associative, the right operator will take
the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:
id op id op id

it will be evaluated as:


(id op id) op id

For example, (id + id) + id


Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:
id op (id op id)

For example, id ^ (id ^ id)

Precedence
If two different operators share a common operand, the precedence of operators decides which will
take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4
and another corresponding to 2+(3*4). By setting precedence among operators, this problem can be

19
easily removed. As in the previous example, mathematically * (multiplication) has precedence over
+ (addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself
as the left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-
down parsers. Top-down parsers start parsing from the Start symbol, which in itself is non-terminal.
So, when the parser encounters the same non-terminal in its derivation, it becomes hard for it to
judge when to stop parsing the left non-terminal and it goes into an infinite loop.
Example:
(1) A => Aα | β

(2) S => Aα | β
A => Sd

(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents
a string of non-terminals.
(2) is an example of indirect-left recursion.

A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and
the parser may go into a loop forever.

Removal of Left Recursion


One way to remove left recursion is to use the following technique:
The production
A => Aα | β

is converted into following productions


A => βA’
A => αA’ | ε

20
This does not impact the strings derived from the grammar, but it removes immediate left recursion.
Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.
Algorithm
START
Arrange non-terminals in some order like A1, A2, A3,…, An
for each i from 1 to n
{
for each j from 1 to i-1
{
replace each production of form Ai⟹Aj𝜸
with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 |…| 𝜸
where Aj ⟹ δ1 | δ2|…| δn are current Aj productions
}
}
eliminate immediate left-recursion
END

Example
The production set
S => Aα | β
A => Sd

after applying the above algorithm, should become


S => Aα | β
A => Aαd | βd

and then, remove immediate left recursion using the first technique.
A => βdA’
A => αdA’ | ε

Now none of the production has either direct or indirect left recursion.

Left Factoring
If more than one grammar production rules has a common prefix string, then the top-down parser
cannot make a choice as to which of the production it should take to parse the string in hand.
Example
If a top-down parser encounters a production like
A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string as both productions are
starting from the same terminal (or non-terminal). To remove this confusion, we use a technique
called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we
make one production for each common prefixes and the rest of the derivation is added by new
productions.

21
Example
The above productions can be written as
A => αA’
A’=> β | 𝜸 | …

Now the parser has only one production per prefix which makes it easier to take decisions.

First and Follow Sets


An important part of parser table construction is to create first and follow sets. These sets can
provide the actual position of any terminal in the derivation. This is done to create the parsing table
where the decision of replacing T[A, t] = α with some production rule.

First Set
This set is created to know what terminal symbol is derived in the first position by a non-terminal.
For example,
α → t β

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set


Look at the definition of FIRST(α) set:
• if α is a terminal, then FIRST(α) = { α }.
• if α is a non-terminal and α → ℇ is a production, then FIRST(α) = { ℇ }.
• if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n and any FIRST(𝜸) contains t then t is in
FIRST(α).

First set can be seen as: FIRST(α) = { t | α →* t β } ∪ { ℇ | α →* ε}

Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can generate but instead, we see what would be the
next terminal symbol that follows the productions of a non-terminal.

Algorithm for calculating Follow set:


• if α is a start symbol, then FOLLOW() = $
• if α is a non-terminal and has a production α → AB, then FIRST(B) is in FOLLOW(A)
except ℇ.
• if α is a non-terminal and has a production α → AB, where B ℇ, then FOLLOW(A) is in
FOLLOW(α).
Follow set can be seen as: FOLLOW(α) = { t | S *αt*}

22
Error-recovery Strategies
A parser should be able to detect and report any error in the program. It is expected that when an
error is encountered, the parser should be able to handle it and carry on parsing the rest of the input.
Mostly it is expected from the parser to check for errors but errors may be encountered at various
stages of the compilation process. A program may have the following kinds of errors at various
stages:
• Lexical : name of some identifier typed incorrectly
• Syntactical : missing semicolon or unbalanced parenthesis
• Semantical : incompatible value assignment
• Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the parser to deal with
errors in the code.

Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by
not processing input from erroneous input to delimiter, such as semi-colon. This is the easiest way
of error-recovery and also, it prevents the parser from developing infinite loops.

Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of inputs of
statement allow the parser to parse ahead. For example, inserting a missing semicolon, replacing
comma with a semicolon etc. Parser designers have to be careful here because one wrong correction
may lead to an infinite loop.

Error productions
Some common errors are known to the compiler designers that may occur in the code. In addition,
the designers can create augmented grammar to be used, as productions that generate erroneous
constructs when these errors are encountered.

Global correction
The parser considers the program in hand as a whole and tries to figure out what the program is
intended to do and tries to find out a closest match for it, which is error-free. When an erroneous
input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may
allow the parser to make minimal changes in the source code, but due to the complexity (time and
space) of this strategy, it has not been implemented in practice yet.

Abstract Syntax Trees


Parse tree representations are not easy to be parsed by the compiler, as they contain more details
than actually needed. Take the following parse tree as an example:

23
If watched closely, we find most of the leaf nodes are single child to their parent nodes. This
information can be eliminated before feeding it to the next phase. By hiding extra information, we
can obtain a tree as shown below:

Abstract tree can be represented as:

ASTs are important data structures in a compiler with least unnecessary information. ASTs are more
compact than a parse tree and can be easily used by a compiler.

24
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical
analyzers are responsible for the validity of a token supplied by the syntax analyzer. Syntax
analyzers have the following drawbacks:
• it cannot determine if a token is valid,
• it cannot determine if a token is declared before it is being used,
• it cannot determine if a token is initialized before it is being used,
• it cannot determine if an operation performed on a token type is valid or not.
These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.

25

You might also like