ATCD-Unit 4 Compiler Design
ATCD-Unit 4 Compiler Design
Recognition of Tokens – A language for specifying lexical analyzers – design of a Lexical analyzer.
Introduction to Compiler:
A compiler is a translator that converts the high-level language into the machine
language.
The main purpose of compiler is to change the code written in one language without
changing the meaning of the program.
When you execute a program which is written in HLL programming language then it
executes into two parts.
In the first part, the source program compiled and translated into the object program
(low level language).
In the second part, object program translated into the target program through the
assembler.
What is a Compiler?
A compiler is a translator that produces an output of low-level language (like an assembly or machine
language) by taking an input of high-level language. It is basically a computer program used to
transform codes written in a programming language into machine code (human-readable code to a
binary 0 and 1 bits language for a computer processor to understand). The computer then processes
the machine code for performing the corresponding tasks.
Compilers check all types of errors, limits, and ranges. Thus, it’s more intelligent.
The run time of its program is longer, and it occupies more memory.
What Is an Interpreter?
It is a program that functions for the translation of a programming language into a comprehensible
one. It is a computer program used for converting high-level program statements into machine
codes. It includes pre-compiled code, source code, and scripts.
They create an exe of the programming language before the program runs.
Compiler Passes
Pass is a complete traversal of the source program. Compiler has two passes to traverse the source
program.
Multi-pass Compiler
Multi pass compiler is used to process the source code of a program several times.
In the first pass, compiler can read the source program, scan it, extract the tokens and
store the result in an output file.
In the second pass, compiler can read the output file produced by first pass, build the
syntactic tree and perform the syntactical analysis. The output of this phase is a file that
contains the syntactical tree.
In the third pass, compiler can read the output file produced by second pass and check
that the tree follows the rules of language or not. The output of semantic analysis phase
is the annotated tree syntax.
One-pass Compiler
One-pass compiler is used to traverse the program only once. The one-pass compiler passes
only once through the parts of each compilation unit. It translates each part into its final
machine code.
In the one pass compiler, when the line source is processed, it is scanned and the token is
extracted.
Then the syntax of each line is analyzed and the tree structure is build. After the semantic
part, the code is generated.
The same process is repeated for each line of code until the entire program is compiled.
Bootstrap compiler is used to compile the compiler and then you can use this compiled
compiler to compile everything else as well as future versions of itself.
1. Source Language
2. Target Language
3. Implementation Language
1. Create a compiler SCAA for subset, S of the desired language, L using language "A" and that
compiler runs on machine A.
2. Create a compiler LCSA for language L written in a subset of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which runs on
machine A and produces code for machine A.
Lexical Analysis:
Lexical analyzer phase is the first phase of compilation process. It takes source code as input. It reads
the source program one character at a time and converts it into meaningful lexemes. Lexical analyzer
represents these lexemes in the form of tokens.
Syntax Analysis
Syntax analysis is the second phase of compilation process. It takes tokens as input and generates a
parse tree as output. In syntax analysis phase, the parser checks that the expression made by the
tokens is syntactically correct or not.
Semantic Analysis:Semantic analysis is the third phase of compilation process. It checks whether the
parse tree follows the rules of language. Semantic analyzer keeps track of identifiers, their types and
expressions. The output of semantic analysis phase is the annotated tree syntax.
Code Optimization
Code optimization is an optional phase. It is used to improve the intermediate code so that the
output of the program could run faster and take less space. It removes the unnecessary lines of the
code and arranges the sequence of statements in order to speed up the program execution.
Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. Code generator translates the
intermediate code into the machine code of the specified computer.
Example:
If the lexical analyzer is located as a separate pass in the compiler it can need an intermediate file to
locate its output, from which the parser would then takes its input. It can eliminate the need for the
intermediate file, the lexical analyzer and the syntactic analyser (parser) are often grouped into the
same pass where the lexical analyser operates either under the control of the parser or as a
subroutine with the parser.
The lexical analyzer also interacts with the symbol table while passing tokens to the parser.
Whenever a token is discovered, the lexical analyzer returns a representation for that token to the
parser. If the token is a simple construct including parenthesis, comma, or a colon, then it returns an
integer program. If the token is a more complex items including an identifier or another token with a
value, the value is also passed to the parser.
Lexical analyzer separates the characters of the source language into groups that logically belong
together, called tokens. It includes the token name which is an abstract symbol that define a type of
lexical unit and an optional attribute value called token values. Tokens can be identifiers, keywords,
constants, operators, and punctuation symbols including commas and parenthesis. A rule that
represent a group of input strings for which the equal token is make as output is called the pattern.
Regular expression plays an essential role in specifying patterns. If a keyword is treated as a token,
the pattern is only a sequence of characters. For identifiers and various tokens, patterns form a
difficult structure.
The lexical analyzer also handles issues including stripping out the comments and whitespace (tab,
newline, blank, and other characters that are used to separate tokens in the input). The correlating
error messages that are generated by the compiler during lexical analyzer with the source program.
For example, it can maintain track of all newline characters so that it can relate an ambiguous
statement line number with each error message. It can be implementing the expansion of macros, in
the case of macro, pre-processors are used in the source program.
Lexical Analysis scans input string from left to right one character at a time to identify tokens. It uses
two pointers to scan tokens −
Look Ahead Pointer (lptr) − It moves ahead to search for the end of the token.
Both pointers start at the beginning of the string, which is stored in the buffer.
Look Ahead Pointer scans buffer until the token is found.
The character ("blank space") beyond the token ("int") have to be examined before the token
("int") will be determined.
After processing token ("int") both pointers will set to the next token ('a'), & this process will
be repeated for the whole program.
A buffer can be divided into two halves. If the look Ahead pointer moves towards halfway in First
Half, the second half is filled with new characters to be read. If the look Ahead pointer moves
towards the right end of the buffer of the second half, the first half will be filled with new characters,
and it goes on.
Line Input Buffer: This input buffer reads input characters one line at a time and returns a token
when a complete line has been read. This type of input buffer is commonly used in scripting
languages where each line of code is treated as a separate command.
Block Input Buffer: This input buffer reads input characters in blocks or chunks of fixed size,
typically specified by the programmer. Input buffering in compiler design This type of input buffer is
used in languages that support block structures, such as C or Pascal.
Reduced I/O operations: Reading a large block of input data into the buffer can significantly reduce
the number of I/O operations required to process the input. This can result in faster execution times
and better performance.
Improved memory usage: Input buffering allows the compiler to use a fixed-size buffer to store the
input data, Input buffering in compiler design, which can help prevent memory allocation errors and
improve overall memory usage.
Simplified code: Input buffering can simplify the code required to read and process input data, which
can make the lexer easier to implement and maintain.
Better error handling: Input buffering can help the lexer identify and report errors more efficiently,
Input buffering in compiler design, as it can process the input data in larger chunks and identify
errors earlier in the process.
Overall, input buffering is a useful technique for improving the efficiency and effectiveness of the
lexical analysis phase in compiler design.
Sentinels − Sentinels are used to making a check, each time when the forward pointer is converted, a
check is completed to provide that one half of the buffer has not converted off. If it is completed,
then the other half should be reloaded.
Buffer Pairs − A specialized buffering technique can decrease the amount of overhead, which is
needed to process an input character in transferring characters. It includes two buffers, each includes
N-character size which is reloaded alternatively.
There are two pointers such as the lexeme Begin and forward are supported. Lexeme Begin points to
the starting of the current lexeme which is discovered. Forward scans ahead before a match for a
pattern are discovered. Before a lexeme is initiated, lexeme begin is set to the character directly after
the lexeme which is only constructed, and forward is set to the character at its right end.
Preliminary Scanning − Certain processes are best performed as characters are moved from the
source file to the buffer. For example, it can delete comments. Languages like FORTRAN which
ignores blank can delete them from the character stream. It can also collapse strings of several
blanks into one blank. Pre-processing the character stream being subjected to lexical analysis saves
the trouble of moving the look ahead pointer back and forth over a string of blanks.
Specification of Tokens
Specification of tokens depends on the pattern of the lexeme. Here we will be using regular
expressions to specify the different types of patterns that can actually form tokens.
Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it
reveals almost all types of pattern that forms a token.
1. String
2. Language
3. Regular Expression
Recognition of Tokens
Tokens obtained during lexical analysis are recognized by Finite Automata.
Finite Automata (FA) is a simple idealized machine that can be used to recognize patterns
within input taken from a character set or alphabet (denoted as C). The primary task of an FA
is to accept or reject an input based on whether the defined pattern occurs within the input.
There are two notations for representing Finite Automata. They are:
1. Transition Table
2. Transition Diagram
1. Transition Table
It is a tabular representation that lists all possible transitions for each state and input symbol
combination.
EXAMPLE
Assume the following grammar fragment to generate a specific language:
where the terminals if, then, else, relop, id and num generates sets of strings given by following
regular definitions.
Specification of Tokens
Specification of tokens depends on the pattern of the lexeme. Here we will be using regular
expressions to specify the different types of patterns that can actually form tokens.
Although the regular expressions are inefficient in specifying all the patterns forming tokens. Yet it
reveals almost all types of pattern that forms a token.
1. String
2. Language
3. Regular Expression
1. String
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
In language theory, the term "word" is often used as synonyms for "string".
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, "banana" is a string of length six.
Operations on String:
1. Prefix of String
The prefix of the string is the preceding symbols present in the string and the string(s) itself.
2. Suffix of String
Suffix of the string is the ending symbols of the string and the string(s) itself.
The proper prefix of the string includes all the prefixes of the string excluding ∈ and the string(s)
itself.
The proper suffix of the string includes all the suffixes excluding ∈ and the string(s) itself.
4. Proper Suffix of String
5. Substring of String
The substring of a string s is obtained by deleting any prefix or suffix from the string.
The proper substring of a string s includes all the substrings of s excluding ∈ and the string(s) itself.
6. Proper Substring of String
7. Subsequence of String
The subsequence of the string is obtained by eliminating zero or more (not necessarily consecutive)
symbols from the string.
8. Concatenation of String
If s and t are two strings, then st denotes concatenation.
s = abc t = def
2. Language
Operation on Language
As we have learnt language is a set of strings that are constructed over some fixed alphabets. Now
the operation that can be performed on languages are:
1. Union
Union is the most common set operation. Consider the two languages L and M. Then the union of
these two languages is denoted by:
L ∪ M = { s | s is in L or s is in M}
That means the string s from the union of two languages can either be from language L or from
language M.
2. Concatenation
Concatenation links the string from one language to the string of another language in a series in all
possible ways. The concatenation of two different languages is denoted by:
3. Kleene Closure
Kleene closure of a language L provides you with a set of strings. This set of strings is obtained by
concatenating L zero or more time. The Kleene closure of the language L is denoted by:
4. Positive Closure
The positive closure on a language L provides a set of strings. This set of strings is obtained by
concatenating ‘L’ one or more times. It is denoted by:
It is similar to the Kleene closure. Except for the term L0, i.e. L+ excludes ∈ until it is in L itself.
So, these are the four operations that can be performed on the languages in the lexical analysis
phase.
3. Regular Expression
A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression
is helpful in describing the languages that can be built using operators such as union, concatenation,
and closure over the symbols.
A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular
expression using the rules given below.
The following rules define the regular expression over some alphabet Σ and the languages denoted
by these regular expressions.
1. ∈ is a regular expression that denotes a language L(∈). The language L(∈) has a set of
strings {∈} which means that this language has a single empty string.
2. If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a language L(a). The
language L(a) = {a} i.e. the language has only one string of length one and the string holds ‘a’
in the first position.
Regular Definitions:
d1 → r1
d2 → r2
………
dn → rn
letter → A | B | …. | Z | a | b | …. | z | digit → 0 | 1 | …. | 9
Peephole Optimization
Peephole optimization involves analyzing small sections of code, known as peepholes or windows,
and optimizing them by replacing inefficient or unnecessary instructions with smaller, faster
equivalents. This technique works on the principle of code replacement while preserving the
program output. By examining local parts of the code, compilers can make targeted optimizations
that have a significant impact on overall performance.
1. Identifying Patterns: Peephole optimization begins by scanning through the generated machine
code and identifying common patterns or sequences of instructions that can be optimized. These
patterns may involve unnecessary instructions, redundant computations, or opportunities for more
efficient instruction sequences.
2. Defining Peepholes: A peephole is a small window that moves through the generated code,
typically spanning a fixed number of consecutive instructions. The size of the peephole window is
determined based on the specific optimization goals and the complexity of the patterns targeted for
optimization. Common sizes range from 2 to 10 instructions.
3. Analyzing Peephole Contents: As the peephole window slides through the code, the compiler
analyzes the instructions contained within it. This analysis involves examining the sequence of
instructions, their operands, and their effects on program state (e.g., register values, memory
accesses).
4. Detecting Optimization Opportunities: Within each peephole window, the compiler looks for
opportunities to optimize the code. This may involve:
6. Moving the Peephole Window: After optimizing the instructions within a peephole window, the
window slides to the next position in the code, and the process repeats. Peephole optimization
continues until the entire generated code has been scanned and optimized, or until a predefined
limit on optimization iterations is reached.
7. Considering Constraints: Peephole optimization must operate within certain constraints to ensure
that optimizations do not inadvertently change program behavior or violate correctness properties.
These constraints may include preserving the semantics of the original program, adhering to CPU
architecture-specific constraints, and avoiding optimizations that could introduce new performance
bottlenecks or increase code size excessively.
8. Finalization and Code Emission: Once peephole optimization is complete, the compiler finalizes
the optimized machine code and emits it as the output of the compilation process. The optimized
code is then ready for further processing, such as linking and execution on the target platform.
2. Reducing Memory Footprint: Optimized code consumes fewer resources, leading to efficient
memory utilization.
3. Minimizing Code Size: Peephole optimization aims to produce compact code, reducing
storage requirements and facilitating faster program loading.
1. Redundant Load and Store Elimination: Eliminates redundancy in memory operations, optimizing
code like eliminating unnecessary load and store instructions.
Example:
// Initial code
y = x + 5;
i = y;
z = i;
w = z * 3;
// Optimized code
y = x + 5;
i = y;
w = y * 3;
x = 2 * 3;
// Optimized code
x = 6;
y = x * 2;
// Optimized code
y = x + x; // or y = x << 1;
// Initial code
y = x / 2;
// Optimized code
y = x >> 1;
Overall, peephole optimization plays a crucial role in improving the efficiency and performance of
compiled code by identifying and eliminating small-scale inefficiencies and redundancies in
generated machine code. While peephole optimization alone may not yield significant performance
gains, it complements other optimization techniques and contributes to overall code quality and
execution speed.
To understand operator precedence parsing better, let's consider the expression: 3 + 4 * 2. The
operators here are + and *, and they have different precedence levels. In this case, the * operator
has higher precedence than the + operator. So, the parsing process will give preference to the *
operator and construct the parse tree accordingly.
In operator precedence parsing, the parser examines the input expression one token at a time,
moving from left to right. It uses a stack data structure to keep track of the tokens and their
precedence levels as it processes them.
The two primary actions performed by the parser are shifting and reducing:
Shifting: As the parser reads each token from the input expression, it "shifts" or places it
onto the top of the stack. The stack holds a sequence of tokens encountered so far, forming a
partial expression.
Reducing: After shifting a token onto the stack, the parser checks if the token's operator
precedence is higher than that of the operator currently on top of the stack. If it is, the
parser "reduces" or combines the operands using the higher precedence operator. It then
replaces the operands and the operator on top of the stack with a new non-terminal symbol,
representing the simplified sub-expression.
The parser repeats these steps until the entire expression is parsed and reduced to a single non-
terminal symbol, which represents the fully simplified expression.
Stack: [3]
Stack: [3, +]
Stack: [3, +, 4]
4. Read '*': Since '*' has higher precedence than '+', reduce the stack. Apply the '*' operator to the
top two elements of the stack, i.e., 4 * 2. Replace the top three elements on the stack with the result
of the reduction.
Stack: [3, +, 8]
5. The expression is fully parsed, and the final result(8) is on the top of the stack.
2. Operator Precedence Table
The core of operator precedence parsing lies in the operator precedence table. This table contains all
the operators in the language along with their precedence levels. The parser refers to this table to
determine the order in which operators should be applied during reduction. Operators with higher
precedence appear higher in the table. In case of equal precedence, associativity rules are used to
decide the order of operations.
Consider the operator precedence table for a simple arithmetic language:
Operator Precedence
* High
/ High
+ Low
- Low
For example, with the expression 2 + 3 * 4, the parser will first shift 2 onto the stack, then +. When it
encounters 3, it compares the precedence of '*' (higher) with '+' (lower) in the table. Since '*' has
higher precedence, it reduces the stack by applying the '*' operation first before proceeding with '+',
resulting in the correct parse tree.
Expression: 5 * 3 + 6 / 2
Result: The parse tree represents the expression (5 * 3) + (6 / 2), and the final result is 18.
Advantages of Operator Precedence Parsing
Efficiency: It requires only one pass through the input expression, making it more efficient
for simple expressions.
Deterministic Parsing: Operator precedence parsing eliminates the need for backtracking,
resulting in deterministic parsing.
Limited Grammar Support: Operator precedence parsing may not handle all types of
grammars, especially those with left-recursive or ambiguous rules.
Lack of Context Sensitivity: The parser only considers the precedence of operators and may
miss certain context-specific language constructs.
Conclusion
where letter and digits are defined as - (letter → [A-Z a-z] & digit → [0-9])
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well
as lexemes that match the patterns for relop, id, and number.
To simplify matters, we make the common assumption that keywords are also reserved
words: that is they cannot be used as identifiers.
The num represents the unsigned integer and real numbers of Pascal.
Our lexical analyzer will strip out white space. It will do so by comparing a string against the
regular definition ws, below.
If a match for ws is found, the lexical analyzer does not return a token to the parser.
It is a directed labeled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
1. One state is labelled the Start State. It is the initial state of transition diagram where control
resides when we begin to recognize a token.
2. Position is a transition diagram are drawn as circles and are called states.
3. The states are connected by Arrows called edges. Labels on edges are indicating the input
characters
4. Zero or more final states or Accepting states are represented by double circle in which the
tokens has been found.
5. Example:
Where state "1" is initial state and state 3 is final state.
Here is the transition diagram of Finite Automata that recognizes the lexemes matching the
token relop.
Here is the Finite Automata Transition Diagram for the Identifiers and Keywords.
Here is the Finite Automata Transition Diagram for recognizing white spaces.
Note:
These Finite Automata can be constructed using either the transition diagram or the transition table
representation. Both transition diagrams and transition tables serve the same purpose of defining
and representing the behavior of an FA. They provide different visual and structural representations,
allowing designers to choose the format that best suits their preferences or requirements.
Lex
YACC
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the yacc
parser generator.
• First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a C program [Link].c.
• Finally, [Link].c is run through the C compiler to produce an object progra m [Link], which is the
lexical analyzer that transforms an input stream into a sequence of tokens.
Lex Specification
{ definitions }
%%
{ rules }
%%
{ user subroutines }
p1 {action1}
p2 {action2}
where pi is regular expression and actioni describes what action the lexical analyzer should take
User subroutines are auxiliary procedures needed by the actions. These can be compiled
separately and loaded with the lexical analyzer.
Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies
the structures of his input, together with code to be invoked as each such structure is recognized.
Yacc turns such a specification into a subroutine that handles the input process; frequently, it is
convenient and appropriate to have most of the flow of control in the user's application handled by
this subroutine.
Finite Automata
Finite Automata is one of the mathematical models that consist of a number of states and edges. It
is a transition diagram that recognizes a regular expression or grammar.
Dpn {actionn}
ii) there is at most one transition from each state on any input.
The following steps are involved in the construction of DFA from regular expression:
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar for the source language.
It reports any syntax errors in the program. It also recovers from commonly occurring errors
so that it can continue processing its input.
Parser verifies the structure generated by the tokens based on the grammar (performs
context free syntax analysis).
Issues:
Variable re-declaration
GRAMMARS
Constructs that begin with keywords like while or int, are relatively easy to parse, because
the keyword guides the choice of the grammar production that must be applied to match the
input. Consider the grammar for expressions, which present more of challenge, because of
the associativity and precedence of operators.
Error Handling:
For example:
It should recover from each error quickly enough to be able to detect subsequent errors.
Error-Recovery Strategies:
Once an error is detected, the parser should recover from the error.
Panic-mode: On discovering an error, the parser discards input symbols one at a time until
one of a designated set of synchronizing tokens is found. The synchronizing tokens are
usually delimiters, such as semicolon or}, whose role in the source program is clear and
unambiguous
Phrase-level: When an error is identified, a parser may perform local correction on the
remaining input;that is, it may replace a prefix of the remaining input by some string that
allows the parser to continue. A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a missing semicolon. Phrase-level
replacement has been used in several error-repairing compilers, as it can correct any input
string. Its major drawback is the difficulty it has in coping with situations in which the actual
error has occurred before the point of detection
Error-productions: The common errors that might be encountered are anticipated and
augment the grammar for the language at hand with productions that generate the
erroneous constructs.
Parser
Parser is a compiler that is used to break the data into smaller elements coming from lexical analysis
phase.
A parser takes input in the form of sequence of tokens and produces output in the form of parse
tree.
In the top down parsing, the parsing starts from the start symbol and transform it into the
input symbol.
In the bottom up parsing, the parsing starts with the input symbol and construct the parse
tree up to the start symbol by tracing out the rightmost derivations of string in reverse.
Example
Production
E→T
T→T*F
T → id
F→T
F → id
1. Shift-Reduce Parsing
a. LR( 1 )
b. SLR( 1 )
c. CLR ( 1 )
d. LALR( 1 )
Parsing is classified into two categories, i.e. Top-Down Parsing, and Bottom-Up Parsing. Top-Down
Parsing is based on Left Most Derivation whereas Bottom-Up Parsing is dependent on Reverse Right
Most Derivation.
The process of constructing the parse tree which starts from the root and goes down to the leaf is
Top-Down Parsing.
1. Top-Down Parsers constructs from the Grammar which is free from ambiguity and left
recursion.
1. Whenever a Non-terminal spends the first time then go with the first alternative and
compare it with the given I/P String
2. If matching doesn’t occur then go with the second alternative and compare it with the given
I/P String.
3. If matching is not found again then go with the alternative and so on.
4. Moreover, If matching occurs for at least one alternative, then the I/P string is parsed
successfully.
S()
{ Choose any S production, S ->X1X2…..Xk;
for (i = 1 to k)
{
If ( Xi is a non-terminal)
Call procedure Xi();
else if ( Xi equals the current input, increment input)
Else /* error has occurred, backtrack and try another possibility */
}
}
Non-Recursive Predictive Parsing: This type of parsing does not require backtracking. Predictive
parsers can be constructed for LL(1) grammar, the first ‘L’ stands for scanning the input from left to
right, the second ‘L’ stands for leftmost derivation, and ‘1’ for using one input symbol lookahead at
each step to make parsing action decisions.
1. In LL1, the first L stands for Left to Right, and the second L stands for Left-most Derivation. 1
stands for the number of Looks Ahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from the grammar which is free from left recursion, common
prefix, and ambiguity.
3. LL(1) parser depends on 1 look ahead symbol to predict the production to expand the parse
tree.
If FIRST(α) contains ε then add A -> α under M[A,c] for all c in FOLLOW(A).
Example:
Example:
S → iEtSS’/a
S’ →eS/ε
E→b
Important Notes
Eg - S -> aS | a
---- both productions go in a
Eg - S -> Sa | b
S -> Sa goes to FIRST(S) = b
S -> b goes to b, thus b has 2 entries hence not LL(1)
Every regular grammar need not be LL(1) because regular grammar may contain left factoring, left
recursion or ambiguity.
Disadvantages of Backtracking
Performance Impact: Backtracking can lead to inefficient parsing, particularly in cases where
there are numerous backtracking points or long ambiguous sequences in the input. In such
scenarios, alternative parsing techniques may be more efficient.
Complexity: Managing backtracking points and tracking alternative choices can introduce
additional complexity to the parsing algorithm, requiring careful implementation and
optimization.
Conclusion
Backtracking is a fundamental technique in top-down parsing that enables the handling of ambiguity
and the exploration of alternative choices. It allows parsers to systematically backtrack to previous
decision points and try different production rules when a chosen path fails. Although backtracking
provides the necessary flexibility to handle ambiguous grammars, its usage should be carefully
considered due to potential performance implications. Understanding and effectively implementing
backtracking in top-down parsing contributes to more robust parsing algorithms and enhances the
accuracy of language processing systems.
EXAMPLE :
Top-Down Parsing with Backtracking, Parser will attempt multiple rules or production to identify the
match for input string by backtracking at every step of derivation. So, if the applied production does
not give the input string as needed, or it does not match with the needed string, then it can undo
that shift.
S→aAd
A→bc|b
Make parse tree for the string a bd. Also, show parse Tree when Backtracking is required when the
wrong alternative is chosen.
Solution
Example2 − Write an Algorithm for Top-Down Parsing with Backtracking for the following Grammar.
S→aAd
A → bc| b
Solution
In the following Algorithm, there is a procedure for each Non-terminal S & A. Terminal symbols a, b,
c, d in Grammar are input symbols or Look ahead symbols to be read by Parser.
advance( ) − It is a function used to move the input pointer to the next input symbol.
Limitations of Top-Down Parsing with Backtracking
Following are some problems, which occur in Top-Down Parsing with Backtracking.
Backtracking − Backtracking looks very simple and easy to implement but choosing a
different alternative causes lot of problems which are given below −
Entries made in the symbol table during parsing have to be removed while
backtracking.
A → Aα|β
Choosing Right Production − The order in which alternatives are tested can change the
language accepted.
Difficult to locate error − When failure occurs, it is difficult to know where the error
occurred.
To implement a recursive descent parser, the grammar must hold the following properties:
If the grammar is not left-factored, the recursive descent parser will have to use backtracking.
Example
E –> T E’
E’ –> + T E’ | e
E –> E + T | T T –> F T’
T –> T * F | F T’ –> * F T’ | e
F –> ( E ) | id F –> ( E ) | id
**Here e is Epsilon
Example:
In the following Grammar, first symbol, i.e., if, while & begin uniquely determine, which of the
alternative to choose.
As Statement starting with if will be a conditional statement & statement starting with while will be
an Iterative Statement.
Example − Write down the algorithm using Recursive procedures to implement the following
Grammar.
E → TE′
E′ → +TE′
T → FT′
T′ →∗ FT′|ε
F → (E)|id
One of major drawback or recursive-descent parsing is that it can be implemented only for those
languages which support recursive procedure calls and it suffers from the problem of left-recursion.
1. Ease of use: Because recursive descent parsing closely mimics the grammar rules of the
language being parsed, it is simple to comprehend and use.
2. Readability: The parsing code is usually set up in a structured and modular way, which makes
it easier to read and maintain.
3. Recursive descent parsers can produce descriptive error messages, which make it simpler to
find and detect syntax mistakes in the input. 3. Error reporting.
4. Predictability: The predictable behavior of recursive descent parsers makes the parsing
process deterministic and clear.
1. Recursive descent parsers encounter difficulties with left-recursive grammar rules since they
can result in unbounded recursion. To effectively handle left recursion, care must be made to
avoid it or employ methods like memoization.
2. Recursive descent parsers rely on backtracking when internal alternatives to a grammar rule
are unsuccessful. This could result in inefficiencies, especially if the grammar contains a lot of
ambiguity or options.
3. Recursive descent parsers frequently adhere to the LL(1) constraint, which requires that they
only use one token of lookahead. The grammar's expressiveness is constrained by this
restriction because it is unable to handle some ambiguous or context-sensitive languages.
Predictive Parser is also another method that implements the technique of Top- Down parsing
without Backtracking. A predictive parser is an effective technique of executing recursive-descent
parsing by managing the stack of activation records, particularly.
Input Buffer − The input buffer includes the string to be parsed followed by an end
Stack − It contains a combination of grammar symbols with $ on the bottom of the stack. At
the start of Parsing, the stack contains the start symbol of Grammar followed by $.
The construction of a predictive parser is aided by two functions associated with a grammar G :
1. FIRST
2. FOLLOW
Rules for first( ):
follow(B).
All the terminals are written column-wise, and all the Non-terminals are written row wise.
Input :
Parsing Program − The parsing program performs some action by comparing the symbol on
top of the stack and the current input symbol to be read on the input buffer.
Actions − Parsing program takes various actions depending upon the symbol on the top of
the stack and the current input symbol. Various Actions taken are given below −
Left Factoring
Example:
E→E+T|T
T→T*F|F
F→(E)|id
E →TE’ E’ → +TE’ | ε
T →FT’
T’ → *FT’ | ε
F → (E)|id
First( ) :
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
LL(1) grammar:
The parsing table entries are single entries. So each location has not more than one entry.
LL(1) Parsing: Here the 1st L represents that the scanning of the Input will be done from the Left to
Right manner and the second L shows that in this parsing technique, we are going to use the Left
most Derivation Tree. And finally, the 1 represents the number of look-ahead, which means how
many symbols are you going to see when you want to make a decision.
Essential conditions to check first are as follows:
3. The grammar has to be left factored in so that the grammar is deterministic grammar.
S→iEtS | iEtSeS| a
E→b
S→iEtSS’|a
S’→ eS | ε
E→b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}
Since there are more than one production, the grammar is not LL(1) grammar. Actions performed in
predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
4. Parse the given input string using stack and parsing table.
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
*ε denotes epsilon
Step 1: The grammar satisfies all properties in step 1.
Note: Every grammar is not feasible for LL(1) Parsing table. It may be possible that one cell may
contain more than one production.
o Shift reduce parsing is a process of reducing a string to the start symbol of a grammar.
o Shift reduce parsing uses a stack to hold the grammar and an input tape to hold the string.
Sift reduce parsing performs the two actions: shift and reduce. That's why it is known as shift
reduces parsing.
At the shift action, the current symbol in the input string is pushed to a stack.
At each reduction, the symbols will replaced by the non-terminals. The symbol is the right
side of the production and non-terminal is the left side of the production.
Example:
Input string:a1-(a2+a3)
Parsing table:
Operator precedence grammar is kinds of shift reduce parsing method. It is applied to a small class of
operator grammars.
a ⋗ b means that terminal "a" has the higher precedence than terminal "b".
a ⋖ b means that terminal "a" has the lower precedence than terminal "b".
a ≐ b means that the terminal "a" and "b" both have same precedence.
Precedence table:
Parsing Action
Now scan the input string from left right until the ⋗ is encountered.
Scan towards left over all the equal precedence until the first left most ⋖ is encountered.
Example
Grammar:
1. E → E+T/T
2. T → T*F/F
3. F → id
Given string:
1. w = id + id * id
Now let us process the string with the help of the above precedence table:
There are two main categories of shift reduce parsing as follows:
1. Operator-Precedence Parsing
2. LR-Parser
LR Parser
LR parsing is one type of bottom up parsing. It is used to parse the large class of grammars.
"K" is the number of input symbols of the look ahead used to make number of parsing decision.
LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and LALR parsing.
LR algorithm:
The LR algorithm requires stack, input, output and parsing table. In all type of LR parsing, input,
output and stack are same but parsing table is different.
Input buffer is used to indicate end of input and it contains the string to be parsed followed by a $
Symbol.
A stack is used to contain a sequence of grammar symbols with a $ at the bottom of the stack.
Parsing table is a two dimensional array. It contains two parts: Action part and Go To part.
LR (1) Parsing
Augment Grammar
Augmented grammar G` will be generated if we add one more production in the given grammar G. It
helps the parser to identify when to stop the parsing and announce the acceptance of the input.
Example
Given grammar
S → AA
A → aA | b
S`→ S
S → AA
A → aA | b
An LR (0) item is a production G with dot at some position on the right side of the production.
LR(0) items is useful to indicate that how much of the input has been scanned up to a given point in
the process of parsing.
Example
Given grammar:
S → AA
A → aA | b
Add Augment Production and insert '•' symbol at the first position for every production in G
S` → •S
S → •AA
A → •aA
A → •b
I0 State:
Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •S
S → •AA
Add all productions starting with "A" in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.
I0= S` → •S
S → •AA
A → •aA
A → •b
I1= S` → S•
Add all productions starting with A in to I2 State because "•" is followed by the non-terminal. So, the
I2 State becomes
I2 =S→A•A
A → •aA
A → •b
A → a•A
A → •aA
A → •b
Drawing DFA:
If a state is going to some other state on a terminal then it correspond to a shift move.
If a state contain the final item in the particular row then write the reduce node completely.
Explanation:
I0 on S is going to I1 so write it as 1.
I0 on A is going to I2 so write it as 2.
I2 on A is going to I5 so write it as 5.
I3 on A is going to I6 so write it as 6.
I0, I2and I3on a are going to I3 so write it as S3 which means that shift 3.
1. S → AA ... (1)
2. A → aA ... (2)
3. A → b ... (3)
I1 contains the final item which drives(S` → S•), so action {I1, $} = Accept.
I4 contains the final item which drives A → b• and that production corresponds to the
production number 3 so write it as r3 in the entire row.
I5 contains the final item which drives S → AA• and that production corresponds to the
production number 1 so write it as r1 in the entire row.
I6 contains the final item which drives A → aA• and that production corresponds to the
production number 2 so write it as r2 in the entire row.
SLR (1) refers to simple LR Parsing. It is same as LR(0) parsing. The only difference is in the parsing
[Link] construct SLR (1) parsing table, we use canonical collection of LR (0) item.
In the SLR (1) parsing, we place the reduce move only in the follow of left hand side.
The steps which use to construct SLR (1) Table is given below:
If a state (Ii) is going to some other state (Ij) on a terminal then it corresponds to a shift move in the
action part.
If a state (Ii) is going to some other state (Ij) on a variable then it correspond to go to move in the Go
to part.
If a state (Ii) contains the final item like A → ab• which has no transitions to the next state then the
production is known as reduce production. For all terminals X in FOLLOW (A), write the reduce entry
along with their production numbers.
Example
1. S -> •Aa
2. A->αβ•
1. Follow(S) = {$}
S→E
E→E+T|T
T→T*F|F
F → id
Add Augment Production and insert '•' symbol at the first position for every production in G
S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
I0 State:
Add all productions starting with E in to I0 State because "." is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •E
E → •E + T
E → •T
Add all productions starting with T and F in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S` → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
Add all productions starting with T and F in I5 State because "." is followed by the non-terminal. So,
the I5 State becomes
I5 = E → E +•T
T → •T * F
T → •F
F → •id
Go to (I5, F) = Closure (T → F•) = (same as I3)
Go to (I5, id) = Closure (F → id•) = (same as I4)
Add all productions starting with F in I6 State because "." is followed by the non-terminal. So, the I6
State becomes
I6 = T → T * •F
F → •id
Drawing DFA:
Explanation:
I1 contains the final item which drives S → E• and follow (S) = {$}, so action {I1, $} = Accept
I2 contains the final item which drives E → T• and follow (E) = {+, $}, so action {I2, +} = R2,
action {I2, $} = R2
I3 contains the final item which drives T → F• and follow (T) = {+, *, $}, so action {I3, +} = R4,
action {I3, *} = R4, action {I3, $} = R4
I4 contains the final item which drives F → id• and follow (F) = {+, *, $}, so action {I4, +} = R5,
action {I4, *} = R5, action {I4, $} = R5
I7 contains the final item which drives E → E + T• and follow (E) = {+, $}, so action {I7, +} = R1,
action {I7, $} = R1
I8 contains the final item which drives T → T * F• and follow (T) = {+, *, $}, so action {I8, +} =
R3, action {I8, *} = R3, action {I8, $} = R3.
CLR refers to canonical lookahead. CLR parsing use the canonical collection of LR (1) items to build
the CLR (1) parsing table. CLR (1) parsing table produces the more number of states as compare to
the SLR (1) parsing.
In the CLR (1), we place the reduce node only in the lookahead symbols.
The look ahead is used to determine that where we place the final item.
The look ahead always add $ symbol for the argument production.
Example
CLR ( 1 ) Grammar
S → AA
A → aA
A→b
Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the lookahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "." is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in modified I0 State because "." is followed by the non-terminal.
So, the I0 State becomes.
I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
Add all productions starting with A in I2 State because "." is followed by the non-terminal. So, the I2
State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
Add all productions starting with A in I3 State because "." is followed by the non-terminal. So, the I3
State becomes
Add all productions starting with A in I6 State because "." is followed by the non-terminal. So, the I6
State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Drawing DFA:
1. S → AA ... (1)
2. A → aA ....(2)
3. A → b ... (3)
The placement of shift node in CLR (1) parsing table is same as the SLR (1) parsing table. Only
difference in the placement of reduce node.
I4 contains the final item which drives ( A → b•, a/b), so action {I4, a} = R3, action {I4, b} = R3.
I5 contains the final item which drives ( S → AA•, $), so action {I5, $} = R1.
I7 contains the final item which drives ( A → b•,$), so action {I7, $} = R3.
I8 contains the final item which drives ( A → aA•, a/b), so action {I8, a} = R2, action {I8, b} = R2.
I9 contains the final item which drives ( A → aA•, $), so action {I9, $} = R2.
LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the canonical
collection of LR (1) items.
In the LALR (1) parsing, the LR (1) items which have same productions but different look ahead are
combined to form a single set of items
LALR (1) parsing is same as the CLR (1) parsing, only difference in the parsing table.
Example
LALR ( 1 ) Grammar
S → AA
A → aA
A→b
Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the look ahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So, the
I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in modified I0 State because "•" is followed by the non-terminal.
So, the I0 State becomes.
I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
Add all productions starting with A in I2 State because "•" is followed by the non-terminal. So, the I2
State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
Add all productions starting with A in I3 State because "•" is followed by the non-terminal. So, the I3
State becomes
Add all productions starting with A in I6 State because "•" is followed by the non-terminal. So, the I6
State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
Go to (I6, a) = Closure (A → a•A, $) = (same as I6)
Go to (I6, b) = Closure (A → b•, $) = (same as I7)
If we analyze then LR (0) items of I3 and I6 are same but they differ only in their lookahead.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}
Clearly I3 and I6 are same in their LR (0) items but differ in their lookahead, so we can combine them
and called as I36.
The I4 and I7 are same but they differ only in their look ahead, so we can combine them and called
as I47.
The I8 and I9 are same but they differ only in their look ahead, so we can combine them and called
as I89.
Drawing DFA:
LALR (1) Parsing table:
As we have discussed YACC in the first unit of this tutorial so you can go through the concepts again
to make things more clear.
YACC
It is used to produce the source code of the syntactic analyzer of the language produced by
LALR (1) grammar.
The input of YACC is the rule or grammar and the output is a C program.
C Compiler