0% found this document useful (0 votes)
5 views68 pages

ATCD-Unit 4 Compiler Design

This document provides an introduction to compilers, detailing their definition, functions, and phases including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It explains the roles of lexical analyzers, input buffering techniques, and the specification and recognition of tokens using regular expressions and finite automata. Additionally, it discusses the differences between compilers and interpreters, as well as the concept of bootstrapping in compiler development.

Uploaded by

sararakesh21
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views68 pages

ATCD-Unit 4 Compiler Design

This document provides an introduction to compilers, detailing their definition, functions, and phases including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. It explains the roles of lexical analyzers, input buffering techniques, and the specification and recognition of tokens using regular expressions and finite automata. Additionally, it discusses the differences between compilers and interpreters, as well as the concept of bootstrapping in compiler development.

Uploaded by

sararakesh21
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT –IV

Introduction to Compiler: Introduction to Compilers: Definition of


compiler – Interpreter-phases of compiler.
Lexical Analysis: Roles of Lexical analyzer –Input buffering – specification of tokens –

Recognition of Tokens – A language for specifying lexical analyzers – design of a Lexical analyzer.

 Introduction to Compiler:

 A compiler is a translator that converts the high-level language into the machine
language.

 High-level language is written by a developer and machine language can be understood


by the processor.

 Compiler is used to show errors to the programmer.

 The main purpose of compiler is to change the code written in one language without
changing the meaning of the program.

 When you execute a program which is written in HLL programming language then it
executes into two parts.

 In the first part, the source program compiled and translated into the object program
(low level language).

 In the second part, object program translated into the target program through the
assembler.

Fig: Execution process of source program in Compiler

Definition of compiler – Interpreter:

What is a Compiler?

A compiler is a translator that produces an output of low-level language (like an assembly or machine
language) by taking an input of high-level language. It is basically a computer program used to
transform codes written in a programming language into machine code (human-readable code to a
binary 0 and 1 bits language for a computer processor to understand). The computer then processes
the machine code for performing the corresponding tasks.

 Compilers check all types of errors, limits, and ranges. Thus, it’s more intelligent.
 The run time of its program is longer, and it occupies more memory.

What Is an Interpreter?

It is a program that functions for the translation of a programming language into a comprehensible
one. It is a computer program used for converting high-level program statements into machine
codes. It includes pre-compiled code, source code, and scripts.

An interpreter translates only one statement at a time of the program.

They create an exe of the programming language before the program runs.

Compiler Passes

Pass is a complete traversal of the source program. Compiler has two passes to traverse the source
program.

Multi-pass Compiler

 Multi pass compiler is used to process the source code of a program several times.

 In the first pass, compiler can read the source program, scan it, extract the tokens and
store the result in an output file.

 In the second pass, compiler can read the output file produced by first pass, build the
syntactic tree and perform the syntactical analysis. The output of this phase is a file that
contains the syntactical tree.

 In the third pass, compiler can read the output file produced by second pass and check
that the tree follows the rules of language or not. The output of semantic analysis phase
is the annotated tree syntax.

 This pass is going on, until the target output is produced.

One-pass Compiler

 One-pass compiler is used to traverse the program only once. The one-pass compiler passes
only once through the parts of each compilation unit. It translates each part into its final
machine code.

 In the one pass compiler, when the line source is processed, it is scanned and the token is
extracted.

 Then the syntax of each line is analyzed and the tree structure is build. After the semantic
part, the code is generated.

 The same process is repeated for each line of code until the entire program is compiled.

Difference Between Compiler and Interpreter


Bootstrapping:

 Bootstrapping is widely used in the compilation development.

 Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type


of compiler that can compile its own source code.

 Bootstrap compiler is used to compile the compiler and then you can use this compiled
compiler to compile everything else as well as future versions of itself.

A compiler can be characterized by three languages:

1. Source Language

2. Target Language

3. Implementation Language

The T- diagram shows a compiler SCIT for Source S, Target T, implemented in I.

Follow some steps to produce a new language L for machine A:

1. Create a compiler SCAA for subset, S of the desired language, L using language "A" and that
compiler runs on machine A.
2. Create a compiler LCSA for language L written in a subset of L.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which runs on
machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.

Compiler Phases/ Structure of a Compiler


The compilation process contains the sequence of various phases. Each phase takes source program
in one representation and produces output in another representation. Each phase takes input from
its previous stage.

There are the various phases of compiler:


Fig: phases of compiler

Lexical Analysis:

Lexical analyzer phase is the first phase of compilation process. It takes source code as input. It reads
the source program one character at a time and converts it into meaningful lexemes. Lexical analyzer
represents these lexemes in the form of tokens.

Syntax Analysis

Syntax analysis is the second phase of compilation process. It takes tokens as input and generates a
parse tree as output. In syntax analysis phase, the parser checks that the expression made by the
tokens is syntactically correct or not.

Semantic Analysis:Semantic analysis is the third phase of compilation process. It checks whether the
parse tree follows the rules of language. Semantic analyzer keeps track of identifiers, their types and
expressions. The output of semantic analysis phase is the annotated tree syntax.

Intermediate Code Generation


In the intermediate code generation, compiler generates the source code into the intermediate code.
Intermediate code is generated between the high-level language and the machine language. The
intermediate code should be generated in such a way that you can easily translate it into the target
machine code.

Code Optimization

Code optimization is an optional phase. It is used to improve the intermediate code so that the
output of the program could run faster and take less space. It removes the unnecessary lines of the
code and arranges the sequence of statements in order to speed up the program execution.

Code Generation

Code generation is the final stage of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. Code generator translates the
intermediate code into the machine code of the specified computer.

Example:

Roles of Lexical analyzer


The lexical analysis is the first phase of the compiler where a lexical analyser operate as an interface
between the source code and the rest of the phases of a compiler. It reads the input characters of
the source program, groups them into lexemes, and produces a sequence of tokens for each lexeme.
The tokens are sent to the parser for syntax analysis.

If the lexical analyzer is located as a separate pass in the compiler it can need an intermediate file to
locate its output, from which the parser would then takes its input. It can eliminate the need for the
intermediate file, the lexical analyzer and the syntactic analyser (parser) are often grouped into the
same pass where the lexical analyser operates either under the control of the parser or as a
subroutine with the parser.

The lexical analyzer also interacts with the symbol table while passing tokens to the parser.
Whenever a token is discovered, the lexical analyzer returns a representation for that token to the
parser. If the token is a simple construct including parenthesis, comma, or a colon, then it returns an
integer program. If the token is a more complex items including an identifier or another token with a
value, the value is also passed to the parser.

Lexical analyzer separates the characters of the source language into groups that logically belong
together, called tokens. It includes the token name which is an abstract symbol that define a type of
lexical unit and an optional attribute value called token values. Tokens can be identifiers, keywords,
constants, operators, and punctuation symbols including commas and parenthesis. A rule that
represent a group of input strings for which the equal token is make as output is called the pattern.

Regular expression plays an essential role in specifying patterns. If a keyword is treated as a token,
the pattern is only a sequence of characters. For identifiers and various tokens, patterns form a
difficult structure.

The lexical analyzer also handles issues including stripping out the comments and whitespace (tab,
newline, blank, and other characters that are used to separate tokens in the input). The correlating
error messages that are generated by the compiler during lexical analyzer with the source program.

For example, it can maintain track of all newline characters so that it can relate an ambiguous
statement line number with each error message. It can be implementing the expansion of macros, in
the case of macro, pre-processors are used in the source program.

Input Buffering in Compiler Design:


Lexical Analysis has to access secondary memory each time to identify tokens. It is time-consuming
and costly. So, the input strings are stored into a buffer and then scanned by Lexical Analysis.

Lexical Analysis scans input string from left to right one character at a time to identify tokens. It uses
two pointers to scan tokens −

 Begin Pointer (bptr) − It points to the beginning of the string to be read.

 Look Ahead Pointer (lptr) − It moves ahead to search for the end of the token.

Example − For statement int a, b;

 Both pointers start at the beginning of the string, which is stored in the buffer.
 Look Ahead Pointer scans buffer until the token is found.

 The character ("blank space") beyond the token ("int") have to be examined before the token
("int") will be determined.

 After processing token ("int") both pointers will set to the next token ('a'), & this process will
be repeated for the whole program.

A buffer can be divided into two halves. If the look Ahead pointer moves towards halfway in First
Half, the second half is filled with new characters to be read. If the look Ahead pointer moves
towards the right end of the buffer of the second half, the first half will be filled with new characters,
and it goes on.

Line Input Buffer: This input buffer reads input characters one line at a time and returns a token
when a complete line has been read. This type of input buffer is commonly used in scripting
languages where each line of code is treated as a separate command.

Block Input Buffer: This input buffer reads input characters in blocks or chunks of fixed size,
typically specified by the programmer. Input buffering in compiler design This type of input buffer is
used in languages that support block structures, such as C or Pascal.

The advantages of using input buffering include:

Reduced I/O operations: Reading a large block of input data into the buffer can significantly reduce
the number of I/O operations required to process the input. This can result in faster execution times
and better performance.

Improved memory usage: Input buffering allows the compiler to use a fixed-size buffer to store the
input data, Input buffering in compiler design, which can help prevent memory allocation errors and
improve overall memory usage.

Simplified code: Input buffering can simplify the code required to read and process input data, which
can make the lexer easier to implement and maintain.

Better error handling: Input buffering can help the lexer identify and report errors more efficiently,
Input buffering in compiler design, as it can process the input data in larger chunks and identify
errors earlier in the process.

Overall, input buffering is a useful technique for improving the efficiency and effectiveness of the
lexical analysis phase in compiler design.
Sentinels − Sentinels are used to making a check, each time when the forward pointer is converted, a
check is completed to provide that one half of the buffer has not converted off. If it is completed,
then the other half should be reloaded.

Buffer Pairs − A specialized buffering technique can decrease the amount of overhead, which is
needed to process an input character in transferring characters. It includes two buffers, each includes
N-character size which is reloaded alternatively.

There are two pointers such as the lexeme Begin and forward are supported. Lexeme Begin points to
the starting of the current lexeme which is discovered. Forward scans ahead before a match for a
pattern are discovered. Before a lexeme is initiated, lexeme begin is set to the character directly after
the lexeme which is only constructed, and forward is set to the character at its right end.

Preliminary Scanning − Certain processes are best performed as characters are moved from the
source file to the buffer. For example, it can delete comments. Languages like FORTRAN which
ignores blank can delete them from the character stream. It can also collapse strings of several
blanks into one blank. Pre-processing the character stream being subjected to lexical analysis saves
the trouble of moving the look ahead pointer back and forth over a string of blanks.

Specification of Tokens

Specification of tokens depends on the pattern of the lexeme. Here we will be using regular
expressions to specify the different types of patterns that can actually form tokens.

Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it
reveals almost all types of pattern that forms a token.

There are 3 specifications of tokens:

1. String

2. Language

3. Regular Expression

Recognition of Tokens
 Tokens obtained during lexical analysis are recognized by Finite Automata.

 Finite Automata (FA) is a simple idealized machine that can be used to recognize patterns
within input taken from a character set or alphabet (denoted as C). The primary task of an FA
is to accept or reject an input based on whether the defined pattern occurs within the input.

 There are two notations for representing Finite Automata. They are:

1. Transition Table

2. Transition Diagram

1. Transition Table

It is a tabular representation that lists all possible transitions for each state and input symbol
combination.

EXAMPLE
Assume the following grammar fragment to generate a specific language:
where the terminals if, then, else, relop, id and num generates sets of strings given by following
regular definitions.

Specification of Tokens

Specification of tokens depends on the pattern of the lexeme. Here we will be using regular
expressions to specify the different types of patterns that can actually form tokens.

Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it
reveals almost all types of pattern that forms a token.

There are 3 specifications of tokens:

1. String

2. Language

3. Regular Expression

1. String

 An alphabet or character class is a finite set of symbols.

 A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

 In language theory, the term "word" is often used as synonyms for "string".

 The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, "banana" is a string of length six.

 The empty string, denoted ε, is the string of length zero.

Operations on String:

1. Prefix of String
The prefix of the string is the preceding symbols present in the string and the string(s) itself.

For Example: s = abcd

The prefix of the string abcd: ∈, a, ab, abc, abcd

2. Suffix of String

Suffix of the string is the ending symbols of the string and the string(s) itself.

For Example: s = abcd

Suffix of the string abcd: ∈, d, cd, bcd, abcd

3. Proper Prefix of String

The proper prefix of the string includes all the prefixes of the string excluding ∈ and the string(s)
itself.

Proper Prefix of the string abcd: a, ab, abc

The proper suffix of the string includes all the suffixes excluding ∈ and the string(s) itself.
4. Proper Suffix of String

Proper Suffix of the string abcd: d, cd, bcd

5. Substring of String

You may also like...

 Regular Expression in Compiler Design

 Input Buffering in Compiler Design

 Operator Precedence Parsing in Compiler Design

The substring of a string s is obtained by deleting any prefix or suffix from the string.

Substring of the string abcd: ∈, abcd, bcd, abc, …

The proper substring of a string s includes all the substrings of s excluding ∈ and the string(s) itself.
6. Proper Substring of String

Proper Substring of the string abcd: bcd, abc, cd, ab…

7. Subsequence of String
The subsequence of the string is obtained by eliminating zero or more (not necessarily consecutive)
symbols from the string.

A subsequence of the string abcd: abd, bcd, bd, …

8. Concatenation of String
If s and t are two strings, then st denotes concatenation.

s = abc t = def

Concatenation of string s and t i.e. st = abcdef

2. Language

A language is any countable set of strings over some fixed alphabet.

Operation on Language

As we have learnt language is a set of strings that are constructed over some fixed alphabets. Now
the operation that can be performed on languages are:

1. Union
Union is the most common set operation. Consider the two languages L and M. Then the union of
these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be from language L or from
language M.

If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}

2. Concatenation
Concatenation links the string from one language to the string of another language in a series in all
possible ways. The concatenation of two different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

3. Kleene Closure
Kleene closure of a language L provides you with a set of strings. This set of strings is obtained by
concatenating L zero or more time. The Kleene closure of the language L is denoted by:

If L = {a, b} then L* = {∈, a, b, aa, bb, aaa, bbb, …}

4. Positive Closure
The positive closure on a language L provides a set of strings. This set of strings is obtained by
concatenating ‘L’ one or more times. It is denoted by:
It is similar to the Kleene closure. Except for the term L0, i.e. L+ excludes ∈ until it is in L itself.

If L = {a, b} then L+ = {a, b, aa, bb, aaa, bbb, …}

So, these are the four operations that can be performed on the languages in the lexical analysis
phase.

3. Regular Expression

A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression
is helpful in describing the languages that can be built using operators such as union, concatenation,
and closure over the symbols.

A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular
expression using the rules given below.

The following rules define the regular expression over some alphabet Σ and the languages denoted
by these regular expressions.

1. ∈ is a regular expression that denotes a language L(∈). The language L(∈) has a set of
strings {∈} which means that this language has a single empty string.

2. If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a language L(a). The
language L(a) = {a} i.e. the language has only one string of length one and the string holds ‘a’
in the first position.

3. Consider the two regular expressions r and s then:

 r|s denotes the language L(r) ∪ L(s).

 (r) (s) denotes the language L(r) ⋅ L(s).

 (r)* denotes the language (L(r))*.

 (r)+ denotes the language L(r).

Regular Definitions:

Giving names to regular expressions is referred to as a Regular definition. If Σ is an alphabet of basic


symbols, then a regular definition is a sequence of definitions of the form.

d1 → r1
d2 → r2
………

dn → rn

 Each di is a distinct name.

 Each ri is a regular expression over the alphabet Σ U {d1, d2,. . . , di-1}.


Example: Identifiers is the set of strings of letters and string beginning with a letter. Regular definition
for this set:

letter → A | B | …. | Z | a | b | …. | z | digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *.

Peephole Optimization

Peephole optimization involves analyzing small sections of code, known as peepholes or windows,
and optimizing them by replacing inefficient or unnecessary instructions with smaller, faster
equivalents. This technique works on the principle of code replacement while preserving the
program output. By examining local parts of the code, compilers can make targeted optimizations
that have a significant impact on overall performance.

How Peephole Optimization Works

Here's a detailed explanation of how peephole optimization works:

1. Identifying Patterns: Peephole optimization begins by scanning through the generated machine
code and identifying common patterns or sequences of instructions that can be optimized. These
patterns may involve unnecessary instructions, redundant computations, or opportunities for more
efficient instruction sequences.

2. Defining Peepholes: A peephole is a small window that moves through the generated code,
typically spanning a fixed number of consecutive instructions. The size of the peephole window is
determined based on the specific optimization goals and the complexity of the patterns targeted for
optimization. Common sizes range from 2 to 10 instructions.

3. Analyzing Peephole Contents: As the peephole window slides through the code, the compiler
analyzes the instructions contained within it. This analysis involves examining the sequence of
instructions, their operands, and their effects on program state (e.g., register values, memory
accesses).

4. Detecting Optimization Opportunities: Within each peephole window, the compiler looks for
opportunities to optimize the code. This may involve:

 Eliminating redundant instructions: Identifying and removing instructions that have no


effect on program behavior or whose results are overwritten by subsequent instructions.

 Replacing expensive operations with cheaper alternatives: Substituting sequences of


instructions with more efficient equivalents, such as replacing a series of arithmetic
operations with a single optimized instruction or using specialized CPU instructions for
common operations like multiplication or division.

 Exploiting instruction scheduling opportunities: Reordering instructions within the


peephole window to take advantage of pipelining or other CPU execution characteristics,
reducing stalls and improving overall performance.
 Folding constants: Evaluating constant expressions at compile-time and replacing them with
their computed values, reducing runtime overhead.

5. Applying Optimizations: When an optimization opportunity is identified, the compiler modifies


the code within the peephole window to apply the optimization. This may involve removing,
replacing, or reordering instructions to achieve the desired effect.

6. Moving the Peephole Window: After optimizing the instructions within a peephole window, the
window slides to the next position in the code, and the process repeats. Peephole optimization
continues until the entire generated code has been scanned and optimized, or until a predefined
limit on optimization iterations is reached.

7. Considering Constraints: Peephole optimization must operate within certain constraints to ensure
that optimizations do not inadvertently change program behavior or violate correctness properties.
These constraints may include preserving the semantics of the original program, adhering to CPU
architecture-specific constraints, and avoiding optimizations that could introduce new performance
bottlenecks or increase code size excessively.

8. Finalization and Code Emission: Once peephole optimization is complete, the compiler finalizes
the optimized machine code and emits it as the output of the compilation process. The optimized
code is then ready for further processing, such as linking and execution on the target platform.

Objectives of Peephole Optimization

The primary objectives of peephole optimization include:

1. Improving Performance: By eliminating redundant operations and optimizing instruction


sequences, peephole optimization enhances code execution speed.

2. Reducing Memory Footprint: Optimized code consumes fewer resources, leading to efficient
memory utilization.

3. Minimizing Code Size: Peephole optimization aims to produce compact code, reducing
storage requirements and facilitating faster program loading.

Peephole Optimization Techniques

Let's explore some common peephole optimization techniques with examples:

You may also like...

 Regular Expression in Compiler Design

 Code Optimization in Compiler Design - BtechVibes

 Type Checking in Compiler Design

1. Redundant Load and Store Elimination: Eliminates redundancy in memory operations, optimizing
code like eliminating unnecessary load and store instructions.
Example:
// Initial code
y = x + 5;

i = y;

z = i;

w = z * 3;

// Optimized code

y = x + 5;

i = y;

w = y * 3;

2. Constant Folding: Simplifies expressions containing constants at compile-time.


Example:
// Initial code

x = 2 * 3;

// Optimized code

x = 6;

3. Strength Reduction: Replaces expensive operations with cheaper alternatives.


Example:
// Initial code

y = x * 2;

// Optimized code

y = x + x; // or y = x << 1;

// Initial code

y = x / 2;

// Optimized code

y = x >> 1;

4. Null Sequences: Eliminates useless operations from the code.


5. Combine Operations: Merges multiple operations into a single equivalent operation.
Conclusion

Overall, peephole optimization plays a crucial role in improving the efficiency and performance of
compiled code by identifying and eliminating small-scale inefficiencies and redundancies in
generated machine code. While peephole optimization alone may not yield significant performance
gains, it complements other optimization techniques and contributes to overall code quality and
execution speed.

Operator Precedence Parsing


Operator Precedence Parsing is a parsing technique used to parse expressions and resolve
ambiguities in programming languages. It relies on the precedence levels assigned to operators to
guide the parsing process and construct a parse tree. By analyzing the tokens in an expression based
on their precedence, the parser can make informed decisions on how to combine and group them
correctly.

To understand operator precedence parsing better, let's consider the expression: 3 + 4 * 2. The
operators here are + and *, and they have different precedence levels. In this case, the * operator
has higher precedence than the + operator. So, the parsing process will give preference to the *
operator and construct the parse tree accordingly.

How Does Operator Precedence Parsing Work?

1. Shifting and Reducing Operations

In operator precedence parsing, the parser examines the input expression one token at a time,
moving from left to right. It uses a stack data structure to keep track of the tokens and their
precedence levels as it processes them.

The two primary actions performed by the parser are shifting and reducing:

 Shifting: As the parser reads each token from the input expression, it "shifts" or places it
onto the top of the stack. The stack holds a sequence of tokens encountered so far, forming a
partial expression.

 Reducing: After shifting a token onto the stack, the parser checks if the token's operator
precedence is higher than that of the operator currently on top of the stack. If it is, the
parser "reduces" or combines the operands using the higher precedence operator. It then
replaces the operands and the operator on top of the stack with a new non-terminal symbol,
representing the simplified sub-expression.

The parser repeats these steps until the entire expression is parsed and reduced to a single non-
terminal symbol, which represents the fully simplified expression.

Let's illustrate this process with the expression 3 + 4 * 2:

1. Read '3': Shift it onto the stack.

Stack: [3]

You may also like...

 Type Checking in Compiler Design

 Three Address Code in Compiler Design

 Peephole Optimization in Compiler Design: Techniques & Explanation

2. Read '+': Shift it onto the stack.

Stack: [3, +]

3. Read '4': Shift it onto the stack.

Stack: [3, +, 4]
4. Read '*': Since '*' has higher precedence than '+', reduce the stack. Apply the '*' operator to the
top two elements of the stack, i.e., 4 * 2. Replace the top three elements on the stack with the result
of the reduction.

Stack: [3, +, 8]

5. The expression is fully parsed, and the final result(8) is on the top of the stack.
2. Operator Precedence Table

The core of operator precedence parsing lies in the operator precedence table. This table contains all
the operators in the language along with their precedence levels. The parser refers to this table to
determine the order in which operators should be applied during reduction. Operators with higher
precedence appear higher in the table. In case of equal precedence, associativity rules are used to
decide the order of operations.
Consider the operator precedence table for a simple arithmetic language:

Operator Precedence

* High

/ High

+ Low

- Low

For example, with the expression 2 + 3 * 4, the parser will first shift 2 onto the stack, then +. When it
encounters 3, it compares the precedence of '*' (higher) with '+' (lower) in the table. Since '*' has
higher precedence, it reduces the stack by applying the '*' operation first before proceeding with '+',
resulting in the correct parse tree.

Example: Mixed Arithmetic Expression

Expression: 5 * 3 + 6 / 2

Stack Input Action

[5] *3+6/2 Shift

[5, *] 3+6/2 Shift

[5, *, 3] +6/2 Reduce (5 * 3 -> 15)

[15, +] 6/2 Shift

[15, +, 6] /2 Reduce (6 / 2 -> 3)

[15, +, 3] Reduce (15 + 3 -> 18)

Result: The parse tree represents the expression (5 * 3) + (6 / 2), and the final result is 18.
Advantages of Operator Precedence Parsing

 Simplicity: Operator precedence parsing is relatively easy to implement compared to other


parsing techniques like LR or LALR parsing.

 Efficiency: It requires only one pass through the input expression, making it more efficient
for simple expressions.

 Deterministic Parsing: Operator precedence parsing eliminates the need for backtracking,
resulting in deterministic parsing.

Disadvantages of Operator Precedence Parsing

 Limited Grammar Support: Operator precedence parsing may not handle all types of
grammars, especially those with left-recursive or ambiguous rules.

 Lack of Context Sensitivity: The parser only considers the precedence of operators and may
miss certain context-specific language constructs.

Conclusion

Operator precedence parsing is an efficient method for parsing expressions in programming


languages. By utilizing operator precedence tables and shifting-reducing operations, this technique
helps compilers correctly interpret expressions without excessive backtracking. However, it is crucial
to be aware of its limitations and potential conflicts. By understanding the nuances of operator
precedence parsing, compiler designers can implement this parsing technique effectively and create
robust and efficient compilers for various programming languages.

 where letter and digits are defined as - (letter → [A-Z a-z] & digit → [0-9])

 For this language, the lexical analyzer will recognize the keywords if, then, and else, as well
as lexemes that match the patterns for relop, id, and number.

 To simplify matters, we make the common assumption that keywords are also reserved
words: that is they cannot be used as identifiers.

 The num represents the unsigned integer and real numbers of Pascal.

 In addition, we assume lexemes are separated by white space, consisting of nonnull


sequences of blanks, tabs, and newlines.

 Our lexical analyzer will strip out white space. It will do so by comparing a string against the
regular definition ws, below.

 If a match for ws is found, the lexical analyzer does not return a token to the parser.

 It is the following token that gets returned to the parser.


2. Transition Diagram

It is a directed labeled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.

Components of Transition Diagram

1. One state is labelled the Start State. It is the initial state of transition diagram where control
resides when we begin to recognize a token.

2. Position is a transition diagram are drawn as circles and are called states.

3. The states are connected by Arrows called edges. Labels on edges are indicating the input
characters

4. Zero or more final states or Accepting states are represented by double circle in which the
tokens has been found.

5. Example:
 Where state "1" is initial state and state 3 is final state.

Here is the transition diagram of Finite Automata that recognizes the lexemes matching the

token relop.
Here is the Finite Automata Transition Diagram for the Identifiers and Keywords.

Here is the Finite Automata Transition Diagram for recognizing white spaces.

Note:

These Finite Automata can be constructed using either the transition diagram or the transition table
representation. Both transition diagrams and transition tables serve the same purpose of defining
and representing the behavior of an FA. They provide different visual and structural representations,
allowing designers to choose the format that best suits their preferences or requirements.

A Language For Specifying Lexical Analyzer:


There is a wide range of tools for constructing lexical analyzers.

Lex

YACC

Lex is a computer program that generates lexical analyzers. Lex is commonly used with the yacc
parser generator.

Creating a lexical analyzer

• First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a C program [Link].c.

• Finally, [Link].c is run through the C compiler to produce an object progra m [Link], which is the
lexical analyzer that transforms an input stream into a sequence of tokens.

Fig1.11 Creating a lexical analyzer with lex

Lex Specification

A Lex program consists of three parts:

{ definitions }

%%

{ rules }

%%

{ user subroutines }

Definitions include declarations of variables, constants, and regular definitions


Ø Rules are statements of the form

p1 {action1}

p2 {action2}

where pi is regular expression and actioni describes what action the lexical analyzer should take

when pattern pi matches a lexeme. Actions are written in C code.

 User subroutines are auxiliary procedures needed by the actions. These can be compiled
separately and loaded with the lexical analyzer.

YACC- YET ANOTHER COMPILER-COMPILER

Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies
the structures of his input, together with code to be invoked as each such structure is recognized.

Yacc turns such a specification into a subroutine that handles the input process; frequently, it is
convenient and appropriate to have most of the flow of control in the user's application handled by
this subroutine.

Finite Automata

Finite Automata is one of the mathematical models that consist of a number of states and edges. It
is a transition diagram that recognizes a regular expression or grammar.

There are tow types of Finite Automata :

· Non-deterministic Finite Automata (NFA)

· Deterministic Finite Automata (DFA)

Dpn {actionn}

Deterministic Finite Automata

DFA is a special case of a NFA in which

i) no state has an ε-transition.

ii) there is at most one transition from each state on any input.

DFA has five tuples denoted by

M = {Qd, Ʃ, δ, q0, fd}


Qd – finite set of states

Ʃ – finite set of input symbols

Construction of DFA from regular expression

The following steps are involved in the construction of DFA from regular expression:

Convert RE to NFA using Thomson’s rules

Convert NFA to DFA

Construct minimized DFA

The role of the parser:

The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar for the source language.

 It reports any syntax errors in the program. It also recovers from commonly occurring errors
so that it can continue processing its input.

Role of the Parser:

 Parser builds the parse tree.

 Parser verifies the structure generated by the tokens based on the grammar (performs
context free syntax analysis).

 Parser helps to construct intermediate code.

 Parser produces appropriate error messages.

 Parser performs error recovery.

Issues:

 Parser cannot detect errors such as:

 Variable re-declaration

 Variable initialization before use


 Data type mismatch for an operation

GRAMMARS

 Constructs that begin with keywords like while or int, are relatively easy to parse, because
the keyword guides the choice of the grammar production that must be applied to match the
input. Consider the grammar for expressions, which present more of challenge, because of
the associativity and precedence of operators.

 In the following expression grammar E represents expressions consisting of terms separated


by + signs, T represents terms consisting of factors separated by * signs, and F represents
factors that can be either parenthesized expressions or identifiers.

Error Handling:

Programs can contain errors at many different levels.

For example:

 Lexical, such as misspelling an identifier, keyword or operators

 Syntactic, such as misplaced semicolons or extra or missing braces, a case statement


without an enclosing switch statement.

 Semantic, such as type mismatches between operators and operands.

 Logical, such as an assignment operator = instead of the comparison operator ==,


infinitely recursive call.

Functions of error handler:

 It should report the presence of errors clearly and accurately.

 It should recover from each error quickly enough to be able to detect subsequent errors.

 It should not significantly slow down the processing of correct programs.

Error-Recovery Strategies:

Once an error is detected, the parser should recover from the error.

The following recovery strategies are

 Panic-mode: On discovering an error, the parser discards input symbols one at a time until
one of a designated set of synchronizing tokens is found. The synchronizing tokens are
usually delimiters, such as semicolon or}, whose role in the source program is clear and
unambiguous

 Phrase-level: When an error is identified, a parser may perform local correction on the
remaining input;that is, it may replace a prefix of the remaining input by some string that
allows the parser to continue. A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a missing semicolon. Phrase-level
replacement has been used in several error-repairing compilers, as it can correct any input
string. Its major drawback is the difficulty it has in coping with situations in which the actual
error has occurred before the point of detection

 Error-productions: The common errors that might be encountered are anticipated and
augment the grammar for the language at hand with productions that generate the
erroneous constructs.

 Global-correction: A compiler need to make few changes as possible in processing an


incorrect input string. There are algorithms for choosing a minimal sequence of changes to
obtain a globally least-cost correction. Given an incorrect input string x and grammar G,
these algorithms will find a parse tree for a related string y, such that the number of
insertions, deletions, and changes of tokens required to transform x into y is as small as
possible. These methods are in general too costly to implement in terms of time and space.

Parser

Parser is a compiler that is used to break the data into smaller elements coming from lexical analysis
phase.

A parser takes input in the form of sequence of tokens and produces output in the form of parse
tree.

Parsing is of two types: top down parsing and bottom up parsing.

Top down paring

 The top down parsing is known as recursive parsing or predictive parsing.

 Bottom up parsing is used to construct a parse tree for an input string.

 In the top down parsing, the parsing starts from the start symbol and transform it into the
input symbol.

Parse Tree representation of input string "acdb" is as follows:


Bottom up parsing

 Bottom up parsing is also known as shift-reduce parsing.

 Bottom up parsing is used to construct a parse tree for an input string.

 In the bottom up parsing, the parsing starts with the input symbol and construct the parse
tree up to the start symbol by tracing out the rightmost derivations of string in reverse.

Example

Production

E→T

T→T*F

T → id

F→T

F → id

Parse Tree representation of input string "id * id" is as follows:


Bottom up parsing is classified in to various parsing. These are as follows:

1. Shift-Reduce Parsing

2. Operator Precedence Parsing

3. Table Driven LR Parsing

a. LR( 1 )

b. SLR( 1 )

c. CLR ( 1 )

d. LALR( 1 )

Classification of Top Down Parsers:

Parsing is classified into two categories, i.e. Top-Down Parsing, and Bottom-Up Parsing. Top-Down
Parsing is based on Left Most Derivation whereas Bottom-Up Parsing is dependent on Reverse Right
Most Derivation.

The process of constructing the parse tree which starts from the root and goes down to the leaf is
Top-Down Parsing.

1. Top-Down Parsers constructs from the Grammar which is free from ambiguity and left
recursion.

2. Top-Down Parsers use leftmost derivation to construct a parse tree.

3. It does not allow Grammar With Common Prefixes.

Classification of Top-Down Parsing

1. With Backtracking: Brute Force Technique


2. Without Backtracking:

1. Recursive Descent Parsing

2. Predictive Parsing or Non-Recursive Parsing or LL(1) Parsing or Table Driver Parsing.

Recursive Descent Parsing

1. Whenever a Non-terminal spends the first time then go with the first alternative and
compare it with the given I/P String

2. If matching doesn’t occur then go with the second alternative and compare it with the given
I/P String.

3. If matching is not found again then go with the alternative and so on.

4. Moreover, If matching occurs for at least one alternative, then the I/P string is parsed
successfully.

Recursive Descent Parsing

S()
{ Choose any S production, S ->X1X2…..Xk;
for (i = 1 to k)
{
If ( Xi is a non-terminal)
Call procedure Xi();
else if ( Xi equals the current input, increment input)
Else /* error has occurred, backtrack and try another possibility */
}
}

Let’s understand it better with an example:


A recursive descent parsing program consists of a set of procedures, one for each nonterminal.
Execution begins with the procedure for the start symbol which halts if its procedure body scans the
entire input string.

Non-Recursive Predictive Parsing: This type of parsing does not require backtracking. Predictive
parsers can be constructed for LL(1) grammar, the first ‘L’ stands for scanning the input from left to
right, the second ‘L’ stands for leftmost derivation, and ‘1’ for using one input symbol lookahead at
each step to make parsing action decisions.

LL(1) or Table Driver or Predictive Parser

1. In LL1, the first L stands for Left to Right, and the second L stands for Left-most Derivation. 1
stands for the number of Looks Ahead tokens used by the parser while parsing a sentence.

2. LL(1) parsing is constructed from the grammar which is free from left recursion, common
prefix, and ambiguity.

3. LL(1) parser depends on 1 look ahead symbol to predict the production to expand the parse
tree.

4. This parser is Non-Recursive

Construction of LL(1)predictive parsing table

 For each production A -> α repeat the following steps:

 Add A -> α under M[A, b] for all b in FIRST(α)

 If FIRST(α) contains ε then add A -> α under M[A,c] for all c in FOLLOW(A).

 Size of parsing table = (No. of terminals + 1) * #variables

 Example:

 Consider the grammar


S → (L) | a
L → SL’
L’ → ε | SL’
For any grammar, if M has multiple entries then it is not LL(1) grammar.

Example:
S → iEtSS’/a
S’ →eS/ε
E→b
Important Notes

If a grammar contains left factoring then it can not be LL(1).

Eg - S -> aS | a
---- both productions go in a

If a grammar contains left recursion it can not be LL(1)

Eg - S -> Sa | b
S -> Sa goes to FIRST(S) = b
S -> b goes to b, thus b has 2 entries hence not LL(1)

If the grammar is ambiguous then it can not be LL(1).

Every regular grammar need not be LL(1) because regular grammar may contain left factoring, left
recursion or ambiguity.
Disadvantages of Backtracking

 Performance Impact: Backtracking can lead to inefficient parsing, particularly in cases where
there are numerous backtracking points or long ambiguous sequences in the input. In such
scenarios, alternative parsing techniques may be more efficient.

 Complexity: Managing backtracking points and tracking alternative choices can introduce
additional complexity to the parsing algorithm, requiring careful implementation and
optimization.

Conclusion

Backtracking is a fundamental technique in top-down parsing that enables the handling of ambiguity
and the exploration of alternative choices. It allows parsers to systematically backtrack to previous
decision points and try different production rules when a chosen path fails. Although backtracking
provides the necessary flexibility to handle ambiguous grammars, its usage should be carefully
considered due to potential performance implications. Understanding and effectively implementing
backtracking in top-down parsing contributes to more robust parsing algorithms and enhances the
accuracy of language processing systems.

EXAMPLE :

Top-Down Parsing with Backtracking, Parser will attempt multiple rules or production to identify the
match for input string by backtracking at every step of derivation. So, if the applied production does
not give the input string as needed, or it does not match with the needed string, then it can undo
that shift.

Example1 − Consider the Grammar

S→aAd

A→bc|b

Make parse tree for the string a bd. Also, show parse Tree when Backtracking is required when the
wrong alternative is chosen.

Solution

The derivation for the string abd will be −

S ⇒ a A d ⇒ abd (Required String)

If bc is substituted in place of non-terminal A then the string obtained will be abcd.

S ⇒ a A d ⇒ abcd (Wrong Alternative)

Figure (a) represents S → aAd


Figure (b) represents an expansion of tree with production A → bc which gives string abcd which
does not match with string abd. So, it backtracks & chooses another alternative, i.e., A → b in figure
(c) which matches with abd.

Algorithm for Backtracking

Example2 − Write an Algorithm for Top-Down Parsing with Backtracking for the following Grammar.

S→aAd

A → bc| b

Solution

In the following Algorithm, there is a procedure for each Non-terminal S & A. Terminal symbols a, b,
c, d in Grammar are input symbols or Look ahead symbols to be read by Parser.

advance( ) − It is a function used to move the input pointer to the next input symbol.
Limitations of Top-Down Parsing with Backtracking

Following are some problems, which occur in Top-Down Parsing with Backtracking.

 Backtracking − Backtracking looks very simple and easy to implement but choosing a
different alternative causes lot of problems which are given below −

 Undoing semantic actions requires a lot of overhead.

 Entries made in the symbol table during parsing have to be removed while
backtracking.

Due to these reasons, backtracking is not used for practical compilers.

Left Recursion − A grammar is left recursive if it has the production of form.

A → Aα|β

Which causes the parser to enter into an infinite loop.

 Choosing Right Production − The order in which alternatives are tested can change the
language accepted.

 Difficult to locate error − When failure occurs, it is difficult to know where the error
occurred.

Recursive Decent Parsing:

A recursive descent parser is a specific parsing technique. It is commonly employed in computer


science and compiler design to dissect the syntax and structure of a program or language, guided by
a predefined grammar. This parsing approach is termed "top-down" because it initiates its analysis
from the highest-level constructs in the grammar, such as the start symbol or main production rule.
As it processes the input, the parser recursively descends through the text, continually attempting to
align it with the production rules stipulated by the grammar. The form of recursive descent parser
that does not require backtracking is also called predictive parsing.

Backtracking: Making repeated input scans until we find a correct path.

To implement a recursive descent parser, the grammar must hold the following properties:

1. It should not be left recursive.

2. It should be left-factored. (Alternates should not have common prefixes).

3. Language should have a recursion facility.

If the grammar is not left-factored, the recursive descent parser will have to use backtracking.

Example

Before removing left recursion After removing left recursion

E –> T E’

E’ –> + T E’ | e

E –> E + T | T T –> F T’

T –> T * F | F T’ –> * F T’ | e

F –> ( E ) | id F –> ( E ) | id

**Here e is Epsilon

Example:

In the following Grammar, first symbol, i.e., if, while & begin uniquely determine, which of the
alternative to choose.

As Statement starting with if will be a conditional statement & statement starting with while will be
an Iterative Statement.

Stmt → If condition then Stmt else Stmt

| While condition do Stmt

| begin Stmt end.

Example − Write down the algorithm using Recursive procedures to implement the following
Grammar.

E → TE′

E′ → +TE′

T → FT′
T′ →∗ FT′|ε

F → (E)|id

One of major drawback or recursive-descent parsing is that it can be implemented only for those
languages which support recursive procedure calls and it suffers from the problem of left-recursion.

Recursive descent parsing has the following benefits:

1. Ease of use: Because recursive descent parsing closely mimics the grammar rules of the
language being parsed, it is simple to comprehend and use.

2. Readability: The parsing code is usually set up in a structured and modular way, which makes
it easier to read and maintain.

3. Recursive descent parsers can produce descriptive error messages, which make it simpler to
find and detect syntax mistakes in the input. 3. Error reporting.

4. Predictability: The predictable behavior of recursive descent parsers makes the parsing
process deterministic and clear.

Recursive descent parsing, however, also has certain drawbacks:

1. Recursive descent parsers encounter difficulties with left-recursive grammar rules since they
can result in unbounded recursion. To effectively handle left recursion, care must be made to
avoid it or employ methods like memoization.
2. Recursive descent parsers rely on backtracking when internal alternatives to a grammar rule
are unsuccessful. This could result in inefficiencies, especially if the grammar contains a lot of
ambiguity or options.

3. Recursive descent parsers frequently adhere to the LL(1) constraint, which requires that they
only use one token of lookahead. The grammar's expressiveness is constrained by this
restriction because it is unable to handle some ambiguous or context-sensitive languages.

Predictive Parser is also another method that implements the technique of Top- Down parsing
without Backtracking. A predictive parser is an effective technique of executing recursive-descent
parsing by managing the stack of activation records, particularly.

Predictive Parsers has the following components −

 Input Buffer − The input buffer includes the string to be parsed followed by an end

marker $ to denote the end of the string.

Here a, +, b are terminal symbols.

 Stack − It contains a combination of grammar symbols with $ on the bottom of the stack. At
the start of Parsing, the stack contains the start symbol of Grammar followed by $.

 Parsing Table − It is a two-dimensional array or Matrix M [A, a] where A is nonterminal and


'a' is a terminal symbol.

 Predictive parsing table construction:

The construction of a predictive parser is aided by two functions associated with a grammar G :

1. FIRST

2. FOLLOW
Rules for first( ):

1. If X is terminal, then FIRST(X) is{X}.

2. If X → ε is a production, then add ε toFIRST(X).

3. If X is non-terminal and X → aα is a production then add a to FIRST(X).

4. If X is non-terminal and X → Y1 Y2…Yk is a production, then place a in FIRST(X)

if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1),…,FIRST(Yi-1);that is, Y1,….Yi-1=> ε. If ε is in


FIRST(Yj) for all j=1,2,..,k, then add ε to FIRST(X).

Rules for follow( ):

1. If S is a start symbol, then FOLLOW(S) contains $.

2. If there is a production A →αBβ, then every thing in FIRST(β) except ε is placed in

follow(B).

3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,

then everything in FOLLOW(A) is in FOLLOW(B).

All the terminals are written column-wise, and all the Non-terminals are written row wise.

Algorithm for construction of predictive parsing table:

Input :

Grammar G Output : Parsing table M Method :

1. For each production A → α of the grammar, do steps 2 and3.

2. For each terminal a in FIRST(α), add A → α to M[A, a].

3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in

FIRST(α) and $ is in FOLLOW(A) , add A → α to M[A,$].

4. Make each undefined entry of M be error.

 Parsing Program − The parsing program performs some action by comparing the symbol on
top of the stack and the current input symbol to be read on the input buffer.
 Actions − Parsing program takes various actions depending upon the symbol on the top of
the stack and the current input symbol. Various Actions taken are given below −

Algorithm to construct Predictive Parsing Table

Input − Context-Free Grammar G

Output − Predictive Parsing Table M

Method − For the production A → α of Grammar G.

 For each terminal, a in FIRST (𝛼) add A → α to M [A, a].

 If ε is in FIRST (α), and b is in FOLLOW (A), then add A → α to M[A, b].

 If ε is in FIRST (α), and $ is in FOLLOW (A), then add A → α to M[A, $].


 All remaining entries in Table M are errors.

Following are the steps to perform Predictive Parsing

 Elimination of Left Recursion

 Left Factoring

 Computation of FIRST & FOLLOW

 Construction of Predictive Parsing Table

 Parse the Input String

Example:

Consider the following grammar :

E→E+T|T

T→T*F|F

F→(E)|id

After eliminating left-recursion the grammar is

E →TE’ E’ → +TE’ | ε

T →FT’

T’ → *FT’ | ε

F → (E)|id

First( ) :

FIRST(E) = { ( ,id} FIRST(E’) ={+ , ε}

FIRST(T) = { ( ,id} FIRST(T’) = {*, ε}

FIRST(F) = { ( , id }

Follow( ):
FOLLOW(E) = { $, ) }

FOLLOW(E’) = { $, ) }

FOLLOW(T) = { +, $, ) }

FOLLOW(T’) = { +, $, ) }

FOLLOW(F) = {+, * , $ , ) }

Predictive parsing Table

LL(1) grammar:

The parsing table entries are single entries. So each location has not more than one entry.

This type of grammar is called LL(1)grammar.

LL(1) Parsing: Here the 1st L represents that the scanning of the Input will be done from the Left to
Right manner and the second L shows that in this parsing technique, we are going to use the Left
most Derivation Tree. And finally, the 1 represents the number of look-ahead, which means how
many symbols are you going to see when you want to make a decision.
Essential conditions to check first are as follows:

1. The grammar is free from left recursion.

2. The grammar should not be ambiguous.

3. The grammar has to be left factored in so that the grammar is deterministic grammar.

Consider this following grammar:

S→iEtS | iEtSeS| a

E→b

After eliminating left factoring, we have

S→iEtSS’|a

S’→ eS | ε

E→b

To construct a parsing table, we need FIRST() and FOLLOW() for all the non terminals.

FIRST(S) = { i, a }

FIRST(S’) = {e, ε }

FIRST(E) = { b}

FOLLOW(S) = { $ ,e }

FOLLOW(S’) = { $ ,e }

FOLLOW(E) = {t}

Since there are more than one production, the grammar is not LL(1) grammar. Actions performed in
predictive parsing:

1. Shift

2. Reduce

3. Accept

4. Error

Implementation of predictive parser:


1. Elimination of left recursion, left factoring and ambiguous grammar.

2. Construct FIRST() and FOLLOW() for all non-terminals.

3. Construct predictive parsing table.

4. Parse the given input string using stack and parsing table.

Example 1: Consider the Grammar:

E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)

*ε denotes epsilon
Step 1: The grammar satisfies all properties in step 1.

Step 2: Calculate first() and follow().

Find their First and Follow sets:

Step 3: Make a parser table.

Now, the LL(1) Parsing Table is:


As you can see that all the null productions are put under the Follow set of that symbol and all the
remaining productions lie under the First of that symbol.

Note: Every grammar is not feasible for LL(1) Parsing table. It may be possible that one cell may
contain more than one production.

Shift reduce parsing:

o Shift reduce parsing is a process of reducing a string to the start symbol of a grammar.

o Shift reduce parsing uses a stack to hold the grammar and an input tape to hold the string.

 Sift reduce parsing performs the two actions: shift and reduce. That's why it is known as shift
reduces parsing.

 At the shift action, the current symbol in the input string is pushed to a stack.

 At each reduction, the symbols will replaced by the non-terminals. The symbol is the right
side of the production and non-terminal is the left side of the production.

Example:

Grammar: S → S+S / S-S / (S) / a

Input string:a1-(a2+a3)

Parsing table:

Operator precedence parsing

Operator precedence grammar is kinds of shift reduce parsing method. It is applied to a small class of
operator grammars.

A grammar is said to be operator precedence grammar if it has two properties:

 No R.H.S. of any production has a∈.

 No two non-terminals are adjacent.


Operator precedence can only established between the terminals of the grammar. It ignores the non-
terminal.

There are the three operator precedence relations:

a ⋗ b means that terminal "a" has the higher precedence than terminal "b".

a ⋖ b means that terminal "a" has the lower precedence than terminal "b".

a ≐ b means that the terminal "a" and "b" both have same precedence.

Precedence table:

Parsing Action

 Both end of the given input string, add the $ symbol.

 Now scan the input string from left right until the ⋗ is encountered.

 Scan towards left over all the equal precedence until the first left most ⋖ is encountered.

 Everything between left most ⋖ and right most ⋗ is a handle.

 $ on $ means parsing is successful.

Example

Grammar:

1. E → E+T/T

2. T → T*F/F

3. F → id

Given string:

1. w = id + id * id

Let us consider a parse tree for it as follows:


On the basis of above tree, we can design following operator precedence table:

Now let us process the string with the help of the above precedence table:
There are two main categories of shift reduce parsing as follows:

1. Operator-Precedence Parsing

2. LR-Parser

LR Parser

LR parsing is one type of bottom up parsing. It is used to parse the large class of grammars.

In the LR parsing, "L" stands for left-to-right scanning of the input.

"R" stands for constructing a right most derivation in reverse.

"K" is the number of input symbols of the look ahead used to make number of parsing decision.

LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and LALR parsing.
LR algorithm:

The LR algorithm requires stack, input, output and parsing table. In all type of LR parsing, input,
output and stack are same but parsing table is different.

Fig: Block diagram of LR parser

Input buffer is used to indicate end of input and it contains the string to be parsed followed by a $
Symbol.

A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the stack.

Parsing table is a two dimensional array. It contains two parts: Action part and Go To part.

LR (1) Parsing

Various steps involved in the LR (1) Parsing:

 For the given input string write a context free grammar.

 Check the ambiguity of the grammar.

 Add Augment production in the given grammar.

 Create Canonical collection of LR (0) items.

 Draw a data flow diagram (DFA).


 Construct a LR (1) parsing table.

Augment Grammar

Augmented grammar G` will be generated if we add one more production in the given grammar G. It
helps the parser to identify when to stop the parsing and announce the acceptance of the input.

Example

Given grammar

S → AA

A → aA | b

The Augment grammar G` is represented by

S`→ S

S → AA

A → aA | b

Canonical Collection of LR(0) items

An LR (0) item is a production G with dot at some position on the right side of the production.

LR(0) items is useful to indicate that how much of the input has been scanned up to a given point in
the process of parsing.

In the LR (0), we place the reduce node in the entire row.

Example

Given grammar:

S → AA

A → aA | b

Add Augment Production and insert '•' symbol at the first position for every production in G

S` → •S

S → •AA

A → •aA

A → •b

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •S
S → •AA

Add all productions starting with "A" in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.

I0= S` → •S
S → •AA
A → •aA
A → •b

I1= Go to (I0, S) = closure (S` → S•) = S` → S•

Here, the Production is reduced so close the State.

I1= S` → S•

I2= Go to (I0, A) = closure (S → A•A)

Add all productions starting with A in to I2 State because "•" is followed by the non-terminal. So, the
I2 State becomes

I2 =S→A•A
A → •aA
A → •b

Go to (I2,a) = Closure (A → a•A) = (same as I3)

Go to (I2, b) = Closure (A → b•) = (same as I4)

I3= Go to (I0,a) = Closure (A → a•A)

Add productions starting with A in I3.

A → a•A
A → •aA
A → •b

Go to (I3, a) = Closure (A → a•A) = (same as I3)


Go to (I3, b) = Closure (A → b•) = (same as I4)

I4= Go to (I0, b) = closure (A → b•) = A → b•


I5= Go to (I2, A) = Closure (S → AA•) = SA → A•
I6= Go to (I3, A) = Closure (A → aA•) = A → aA•

Drawing DFA:

The DFA contains the 7 states I0 to I6.


LR(0) Table

 If a state is going to some other state on a terminal then it correspond to a shift move.

 If a state is going to some other state on a variable then it correspond to go to move.

 If a state contain the final item in the particular row then write the reduce node completely.

Explanation:

 I0 on S is going to I1 so write it as 1.

 I0 on A is going to I2 so write it as 2.

 I2 on A is going to I5 so write it as 5.

 I3 on A is going to I6 so write it as 6.

 I0, I2and I3on a are going to I3 so write it as S3 which means that shift 3.

 I0, I2 and I3 on b are going to I4 so write it as S4 which means that shift 4.


 I4, I5 and I6 all states contains the final item because they contain • in the right most end. So
rate the production as production number.

Productions are numbered as follows:

1. S → AA ... (1)

2. A → aA ... (2)

3. A → b ... (3)

 I1 contains the final item which drives(S` → S•), so action {I1, $} = Accept.

 I4 contains the final item which drives A → b• and that production corresponds to the
production number 3 so write it as r3 in the entire row.

 I5 contains the final item which drives S → AA• and that production corresponds to the
production number 1 so write it as r1 in the entire row.

 I6 contains the final item which drives A → aA• and that production corresponds to the
production number 2 so write it as r2 in the entire row.

SLR (1) Parsing

SLR (1) refers to simple LR Parsing. It is same as LR(0) parsing. The only difference is in the parsing
[Link] construct SLR (1) parsing table, we use canonical collection of LR (0) item.

In the SLR (1) parsing, we place the reduce move only in the follow of left hand side.

Various steps involved in the SLR (1) Parsing:

 For the given input string write a context free grammar

 Check the ambiguity of the grammar

 Add Augment production in the given grammar

 Create Canonical collection of LR (0) items

 Draw a data flow diagram (DFA)

 Construct a SLR (1) parsing table

SLR (1) Table Construction

The steps which use to construct SLR (1) Table is given below:

If a state (Ii) is going to some other state (Ij) on a terminal then it corresponds to a shift move in the
action part.
If a state (Ii) is going to some other state (Ij) on a variable then it correspond to go to move in the Go
to part.

If a state (Ii) contains the final item like A → ab• which has no transitions to the next state then the
production is known as reduce production. For all terminals X in FOLLOW (A), write the reduce entry
along with their production numbers.

Example

1. S -> •Aa

2. A->αβ•

1. Follow(S) = {$}

2. Follow (A) = {a}


SLR ( 1 ) Grammar

S→E
E→E+T|T
T→T*F|F
F → id

Add Augment Production and insert '•' symbol at the first position for every production in G

S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •E)

Add all productions starting with E in to I0 State because "." is followed by the non-terminal. So, the
I0 State becomes

I0 = S` → •E
E → •E + T
E → •T

Add all productions starting with T and F in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.

I0= S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id

I1= Go to (I0, E) = closure (S` → E•, E → E• + T)


I2= Go to (I0, T) = closure (E → T•T, T• → * F)
I3= Go to (I0, F) = Closure ( T → F• ) = T → F•
I4= Go to (I0, id) = closure ( F → id•) = F → id•
I5= Go to (I1, +) = Closure (E → E +•T)

Add all productions starting with T and F in I5 State because "." is followed by the non-terminal. So,
the I5 State becomes

I5 = E → E +•T
T → •T * F
T → •F
F → •id
Go to (I5, F) = Closure (T → F•) = (same as I3)
Go to (I5, id) = Closure (F → id•) = (same as I4)

I6= Go to (I2, *) = Closure (T → T * •F)

Add all productions starting with F in I6 State because "." is followed by the non-terminal. So, the I6
State becomes

I6 = T → T * •F
F → •id

Go to (I6, id) = Closure (F → id•) = (same as I4)

I7= Go to (I5, T) = Closure (E → E + T•) = E → E + T•


I8= Go to (I6, F) = Closure (T → T * F•) = T → T * F•

Drawing DFA:

SLR (1) Table

Explanation:

First (E) = First (E + T) ∪ First (T)


First (T) = First (T * F) ∪ First (F)
First (F) = {id}
First (T) = {id}

Follow (E) = First (+T) ∪ {$} = {+, $}


First (E) = {id}

Follow (T) = First (*F) ∪ First (F)


= {*, +, $}
Follow (F) = {*, +, $}

 I1 contains the final item which drives S → E• and follow (S) = {$}, so action {I1, $} = Accept

 I2 contains the final item which drives E → T• and follow (E) = {+, $}, so action {I2, +} = R2,
action {I2, $} = R2

 I3 contains the final item which drives T → F• and follow (T) = {+, *, $}, so action {I3, +} = R4,
action {I3, *} = R4, action {I3, $} = R4

 I4 contains the final item which drives F → id• and follow (F) = {+, *, $}, so action {I4, +} = R5,
action {I4, *} = R5, action {I4, $} = R5

 I7 contains the final item which drives E → E + T• and follow (E) = {+, $}, so action {I7, +} = R1,
action {I7, $} = R1

 I8 contains the final item which drives T → T * F• and follow (T) = {+, *, $}, so action {I8, +} =
R3, action {I8, *} = R3, action {I8, $} = R3.

CLR (1) Parsing

CLR refers to canonical lookahead. CLR parsing use the canonical collection of LR (1) items to build
the CLR (1) parsing table. CLR (1) parsing table produces the more number of states as compare to
the SLR (1) parsing.

In the CLR (1), we place the reduce node only in the lookahead symbols.

Various steps involved in the CLR (1) Parsing:

 For the given input string write a context free grammar

 Check the ambiguity of the grammar

 Add Augment production in the given grammar

 Create Canonical collection of LR (0) items

 Draw a data flow diagram (DFA)

 Construct a CLR (1) parsing table

LR (1) item is a collection of LR (0) items and a look ahead symbol.

LR (1) item = LR (0) item + look ahead

The look ahead is used to determine that where we place the final item.

The look ahead always add $ symbol for the argument production.

Example

CLR ( 1 ) Grammar
S → AA

A → aA

A→b

Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the lookahead.

S` → •S, $

S → •AA, $

A → •aA, a/b

A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "." is followed by the non-terminal. So, the
I0 State becomes

I0 = S` → •S, $
S → •AA, $

Add all productions starting with A in modified I0 State because "." is followed by the non-terminal.
So, the I0 State becomes.

I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b

I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $


I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "." is followed by the non-terminal. So, the I2
State becomes

I2= S → A•A, $
A → •aA, $
A → •b, $

I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "." is followed by the non-terminal. So, the I3
State becomes

I3= A → a•A, a/b


A → •aA, a/b
A → •b, a/b
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b


I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "." is followed by the non-terminal. So, the I6
State becomes

I6 = A → a•A, $
A → •aA, $
A → •b, $

Go to (I6, a) = Closure (A → a•A, $) = (same as I6)


Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $


I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $

Drawing DFA:

CLR (1) Parsing table:


Productions are numbered as follows:

1. S → AA ... (1)

2. A → aA ....(2)

3. A → b ... (3)

The placement of shift node in CLR (1) parsing table is same as the SLR (1) parsing table. Only
difference in the placement of reduce node.

I4 contains the final item which drives ( A → b•, a/b), so action {I4, a} = R3, action {I4, b} = R3.
I5 contains the final item which drives ( S → AA•, $), so action {I5, $} = R1.
I7 contains the final item which drives ( A → b•,$), so action {I7, $} = R3.
I8 contains the final item which drives ( A → aA•, a/b), so action {I8, a} = R2, action {I8, b} = R2.
I9 contains the final item which drives ( A → aA•, $), so action {I9, $} = R2.

LALR (1) Parsing:

LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the canonical
collection of LR (1) items.

In the LALR (1) parsing, the LR (1) items which have same productions but different look ahead are
combined to form a single set of items

LALR (1) parsing is same as the CLR (1) parsing, only difference in the parsing table.

Example

LALR ( 1 ) Grammar

S → AA

A → aA

A→b

Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the look ahead.

S` → •S, $

S → •AA, $
A → •aA, a/b

A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the ClosureL

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So, the
I0 State becomes

I0 = S` → •S, $
S → •AA, $

Add all productions starting with A in modified I0 State because "•" is followed by the non-terminal.
So, the I0 State becomes.

I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b

I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $


I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "•" is followed by the non-terminal. So, the I2
State becomes

I2= S → A•A, $
A → •aA, $
A → •b, $

I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "•" is followed by the non-terminal. So, the I3
State becomes

I3= A → a•A, a/b


A → •aA, a/b
A → •b, a/b

Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)


Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b


I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "•" is followed by the non-terminal. So, the I6
State becomes

I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $


I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) A → aA•, $

If we analyze then LR (0) items of I3 and I6 are same but they differ only in their lookahead.

I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}

I6= { A → a•A, $
A → •aA, $
A → •b, $
}

Clearly I3 and I6 are same in their LR (0) items but differ in their lookahead, so we can combine them
and called as I36.

I36 = { A → a•A, a/b/$


A → •aA, a/b/$
A → •b, a/b/$
}

The I4 and I7 are same but they differ only in their look ahead, so we can combine them and called
as I47.

I47 = {A → b•, a/b/$}

The I8 and I9 are same but they differ only in their look ahead, so we can combine them and called
as I89.

I89 = {A → aA•, a/b/$}

Drawing DFA:
LALR (1) Parsing table:

Automatic Parser Generator

YACC is an automatic tool that generates the parser program.

As we have discussed YACC in the first unit of this tutorial so you can go through the concepts again
to make things more clear.

YACC

 YACC stands for Yet Another Compiler Compiler.

 YACC provides a tool to produce a parser for a given grammar.

 YACC is a program designed to compile a LALR (1) grammar.

 It is used to produce the source code of the syntactic analyzer of the language produced by
LALR (1) grammar.

 The input of YACC is the rule or grammar and the output is a C program.

These are some points about YACC:

Input: A CFG- file.y

Output: A parser [Link].c (yacc)

o The output file "[Link]" contains the parsing tables.

o The file "[Link].h" contains declarations.

o The parser called the yyparse ().

o Parser expects to use a function called yylex () to get tokens.

The basic operational sequence is as follows:


This file contains the desired grammar in YACC format.

It shows the YACC program.

It is the c source program created by YACC.

C Compiler

Executable file that will parse grammar given in gram.Y

You might also like