0% found this document useful (0 votes)
18 views241 pages

System Programming & Compiler Design Syllabus

for system programming and compiler design subject btech students

Uploaded by

kprabhpreet25
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views241 pages

System Programming & Compiler Design Syllabus

for system programming and compiler design subject btech students

Uploaded by

kprabhpreet25
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SYSTEM PROGRAMMING AND COMPILER DESIGN

Code: CSPC-402
By
Dr K P Sharma

Department of Computer Science and Engineering


Dr. B.R. Ambedkar National Institute of Technology
1
Jalandhar Punjab-144011
Course Syllabus
▪ DEPARTMENT: COMPUTER SCIENCE AND ENGINEERING
▪ COURSE CODE: CSPC-402
▪ COURSE TITLE: SYSTEM PROGRAMMING AND COMPILER DESIGN
▪ COURSE DESIGNATION: REQUIRED
▪ PRE-REQUISITES: NONE
▪ CONTACT HOURS/CREDIT SCHEME: (L-T-P-C: 3-0-0-3)
▪ COURSE ASSESSMENT METHODS: Two sessional exams and one end-semester exam, along with
assignments, presentations and class tests which may be conducted by the course coordinator in lieu of internal
assessment.
▪ COURSE OUTCOMES
▪ After the course completion, the student will be able to
▪ Understand the design aspects of various different language processors
▪ Identify theory and practice of compilation in the lexical analysis, syntax, and semantic analysis, code generation
and optimization phases of compilation.
▪ Exemplify and compare various function of parser along with its types for design of compiler.
▪ Understand a parser such as a bottom-up SLR parser without using compiler-generation tools.

Course
Program outcomes
Outcomes
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CSPC-402
CO 1 M M H
CO 2 H H M
CO 3 M L
CO 4 M L L 2
TOPICS TO BE COVERED
▪ Introduction to Language Processor: Overview, fundamentals of language
processing and symbol tables
▪ Assemblers and Macros: Pass structure of assemblers, design of two pass
assemblers using any hypothetical computer, design of macro processors
▪ Linker and Loaders: Relocation and linking operation, design of linker,
absolute and relocating loaders
▪ Compiler structure: analysis-synthesis model of compilation, various phases of
a compiler, tool based approach to compiler construction.
▪ Lexical analysis: interface with input, parser and symbol table, token, lexeme
and patterns. Difficulties in lexical analysis, Error reporting. Implementation.
Regular definition, Transition diagrams, LEX.
▪ Syntax analysis: CFGs, ambiguity, associativity, precedence, top down parsing,
recursive descent parsing, transformation on the grammars, predictive parsing,
bottom up parsing, operator precedence grammars, LR parsers (SLR, LALR,
LR), YACC.
▪ Syntax directed definitions: inherited and synthesized attributes, dependency
graph, evaluation order, bottom up and top down evaluation of attributes, L- and
S-attributed definitions. 3
TOPICS TO BE COVERED
▪ Type checking: type system, type expressions, structural and name equivalence of types,
type conversion, overloaded functions and operators, polymorphic functions.
▪ Run time system: storage organization, activation tree, activation record, parameter
passing, symbol table, dynamic storage allocation.
▪ Intermediate code generation: intermediate representations, translation of declarations,
assignments, control flow, boolean expressions and procedure calls. Implementation
issues.
▪ Code generation and instruction selection: issues, basic blocks and flow graphs,
register allocation, code generation, dag representation of programs, code generation from
dags, peep hole optimization, code generator generators, specifications of machine.
▪ TEXT BOOKS, AND/OR REFERENCE MATERIAL:
▪ D M Dhamdhere, System Programming, TMH, 1st ed., 2012
▪ V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools ,
Addison-Wesley, 1988.
▪ Fischer and R. LeBlanc. Crafting a Compiler , Benjamin Cummings, 1991..
▪ C. Holub. Compiler Design in C , Prentice-Hall Inc., 1993.
▪ Appel. Modern Compiler Implementation in C: Basic Design , Cambridge Press.
▪ Fraser and Hanson. A Retargetable C Compiler: Design and Implementation ,
4
Addison-Wesley.
Basic questions before start the Compiler
▪ What are different level of programming languages ?
▪ What is the role of translator ?
▪ What are different types of translator ?
▪ Why we need to study Compiler ?
▪ In which language, the code of first compiler was written ?
▪ Which language use compiler, which use interpreter, and which
used both ?
▪ How to design Compiler ?
▪ Difference between compiler and interpreter

5
6
Machine Language
▪ The machine-level language is a language that consists of a set of
instructions that are in the binary form 0 or 1.
▪ As we know that computers can understand only machine instructions,
which are in binary digits, i.e., 0 and 1.
▪ Creating a program in a machine-level language is a very difficult task as
it is not easy for the programmers to write the program in machine
instructions.
▪ It is error-prone as it is not easy to understand, and its maintenance is
also very high.
▪ A machine-level language is not portable, so if we write a program in one
computer will no longer be valid in another computer.
▪ The different processor architectures use different machine codes, for
example, a PowerPC processor contains RISC architecture, which
requires different code than intel x86 processor, which has a CISC
architecture.
7
Assembly Languages
▪ The assembly language contains some human-readable commands
such as mov, add, sub, etc.
▪ The problems which we were facing in machine-level language are
reduced to some extent by using an extended form of
machine-level language known as assembly language.
▪ Computers can only understand the machine-level instructions, so
we require a translator that converts the assembly code into
machine code.
▪ The translator used for translating the code is known as an
assembler.
▪ The assembly code is not faster than machine code because the
assembly language comes above the machine language in the
hierarchy.
8
High Level Languages
▪ The high-level language is a programming language that allows a
programmer to write the programs which are independent of a
particular type of computer.
▪ The high-level languages are considered as high-level because
they are closer to human languages than machine-level languages.
▪ A high-level language gets away from all the constraints of a
particular machine.
▪ A compiler is required to translate a high-level language into a
low-level language.

9
FORTRAN

▪ FORTRAN was first HLL written by John Backus and IBM in 1957.
▪ The original Fortran compiler was written entirely in assembly language.
▪ The first compiler had a huge impact on the programming languages and
computer science.
▪ The whole new field of compiler design was started
▪ More than half the programmers were using Fortran by 1958
▪ Led to enormous amount of theoretical work (lexical analysis, parsing,
optimization, structured programming, code generation, error recovery
etc.)

10
Translators
• Compilers
• Interpreters
• Assemblers

Regardless of what language you use, you eventually need to


convert your program (written in high or low level) into a
language that the computer can understand.

Two ways for doing that: compile the program or interpret the
program
Compilers:
• A compiler is a computer program that translates a program in a
source language into an equivalent program in a target language.
or
• Compilers: Translate a source (human-writable) program to an
executable (machine-readable) program.

• Translate the entire program.

• Convert the entire program to machine code, when the syntax


errors are removed then converted into the object code
Compilers Continue..
• Requires more main memory

• Slow for debugging and testing.

• Execution time is less.


Object code and machine code

• Object code is a portion of machine code that has not yet been
linked into a complete program.

• It is the machine code for one particular library or module that


will make up the completed product.

• Machine code is binary (1's and 0's) code that can be executed
directly by the CPU.
Interpreters:
• An interpreter is a computer program that translates and
executes instructions written in a computer programming
language line-by-line, unit by unit etc.,

• Interpreters: Convert a source program and execute it at the


same time.

• Translate the program line by line.

• Each time the program is executed ,every line is checked for


syntax error & then converted to equivalent machine code
directly
Interpreters Continue..
• An interpreter needs to be able to analyze, or parse, instructions
written in the source language.

•Requires less main memory.

•Source program and the interpreter are required for execution.

•Good for fast debugging and testing.

• Execution time is more.


Assemblers
• Programmers found difficult to read and write the program in
machine language.
•For a more convenient language they began to use a mnemonic for
each machine instruction, which they would translate into machine
language.
•Such a mnemonic machine language is now called an assembly
language.
• The input to an assembler program is called the source program, the
output is a machine language translation (object program).
Linker
• A program that takes as input the object files of one or more
separately compiled program modules, and links them together into
a complete executable program, resolving reference from one
module to another.
Loader
• A loader is the part of an operating system that is responsible for
loading programs into memory, preparing them for execution and then
executing them.

• The loader is usually a part of the operating system's kernel and


usually is loaded at system boot time and stays in memory until the
system is rebooted, shut down, or powered off.

•When the program finished, control must somehow be returned to the


operating system.
Loader continue..
• The assembler could place the object program directly in memory and
transfer control to it, thereby causing the machine language program to
be executed.
•However this would waste memory by leaving the assembler in
memory while the user’s program was being executed.
•To overcome the problem of wasted memory, system programmers
developed another component, called the Loader.
•Loader program is much smaller then the assembler, this makes more
memory available to the user’s program.
21
Why study compilers?
1. Compilers use the whole spectrum of language processing
technology
2. How are programming languages implemented?
Two major strategies: –
Interpreters (old and much less studied)
Compilers (very well understood with mathematical foundations)
3. Some environments provide both interpreter and compiler
Interpreter for development
Compiler for deployment
Java – Java compiler: Java to interpretable bytecode
Java JIT: bytecode to executable image

23
▪ Compiler is part of program development environment
▪ The other typical components of this environment are editor,
assembler, linker, loader, debugger, profiler etc.
▪ The compiler (and all other tools) must support each other for easy
program development

24
Difference between Compiler and Interpreters

25
COMPILERS
▪ A compiler is a system software takes a program written in a
source language and translates it into an equivalent program in a
target language.

source program COMPILER target program


( Normally the equivalent program in
( Normally a program written in machine code – relocatable object file)
a high-level programming language)

error messages
▪ As an important part of translation process, the compiler report to
its user the presence of error in the source program.

26
Abstract view

Source Machine
code Compiler code

errors
▪ Recognizes legal (and illegal) programs
▪ Generate correct code
▪ Manage storage of all variables and code
▪ Agreement on format for object code
Types of Compiler
Compilers are sometimes classified as
1. Single-pass
2. Two Pass Compilers or Multi-pass
▪ debugging or optimizing, depending on how they have been
constructed or what function they are supposed to perform.
▪ If we combine or group all the phases of compiler design in
a single module known as single pass compiler.
▪ In single pass Compiler source code directly transforms into
machine code. For example, Pascal language.
▪ Single pass compiler is faster and smaller than the multi pass
compiler

28
Continue..
▪ As a disadvantage of single pass compiler is that it is less
efficient in comparison with two pass or multi-pass compiler.

▪ Two pass Compiler is divided into two sections,


▪ Front end: It maps legal code into Intermediate Representation
(IR).
▪ Back end: It maps IR onto the target machine

29
Continue..
▪ In first pass, included phases are as Lexical analyzer, syntax analyzer,
semantic analyzer, intermediate code generator are work as front end.

▪ Analytic part means all phases analyze the High level language and
convert them three address code.
▪ First pass is platform independent because the output of first pass is
as three address code which is useful for every system and
requirement is to change the code optimization and code generator
phase which are comes to the second pass.

▪ In second Pass the included phases are as Code optimization and


Code generator are work as back end.
▪ Second pass is platform dependent because final stage of a typical
compiler converts the intermediate representation of program into an
executable set of instructions which is dependent on the system.

30
▪ A multi-pass Compiler is a type of compiler that processes
the source code or abstract syntax tree of a program multiple
times.
▪ It divided a large program into multiple small programs and
process them.
▪ It develops multiple intermediate codes.
▪ All of these multi-pass take the output of the previous phase as an
input. So it requires less memory. It is also known as 'Wide
Compiler'.

31
Tasks of Compiler
▪ Breaks up the source program into pieces and impose grammatical
structure on them
▪ Allows you to construct the desired target program from the
intermediate representation and also create the symbol table
▪ Compiles source code and detects errors in it
▪ Manage storage of all variables and codes.
▪ Support for separate compilation
▪ Read, analyze the entire program, and translate to semantically
equivalent
▪ Translating the source code into object code depending upon the
type of machine

32
Other Applications
▪ In addition to the development of a compiler, the techniques
used in compiler design can be applicable to many problems in
computer science.
• Techniques used in a lexical analyzer can be used in text editors,
information retrieval system, and pattern recognition programs.
• Techniques used in a parser can be used in a query processing system such
as SQL.
• Most of the techniques used in compiler design can be used in Natural
Language Processing (NLP) systems.

33
Practical applications
▪ Greedy algorithms - register allocation
▪ Graph algorithms - dead code elimination,
register allocation
▪ Dynamic programming - instruction selection
▪ Optimization techniques – instruction scheduling
▪ Finite automata – lexical analysis
▪ Pushdown automata – parsing
▪ Complex data structure – symbol table, data dependence graph
▪ Computer architecture – machine code generator

34
Analysis of the source program
▪ There are two major parts of a compiler: Analysis and Synthesis
▪ In analysis phase, an intermediate representation is created from
the given source program.
• Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.
▪ In synthesis phase, the equivalent target program is created from
this intermediate representation.
• Intermediate Code Generator, Code Generator, and Code Optimizer are the parts of this phase.

35
Phases of A Compiler

Source Lexical Syntax Semantic Intermediate Code Code Target


Program Analyzer Analyzer Analyzer Code Generator Optimizer Generator Program

• Each phase transforms the source program from one representation


into another representation.

• They communicate with error handlers.

• They communicate with the symbol table.

36
37
38
Phases of Compiler
▪ Lexical Analysis
▪ The first phase of scanner works as a text scanner. This phase scans
the source code as a stream of characters and converts it into
meaningful lexemes. Lexical analyzer represents these lexemes in
the form of tokens as:<token-name, attribute-value>
▪ Syntax Analysis
▪ The next phase is called the syntax analysis or parsing. It takes the
token produced by lexical analysis as input and generates a parse
tree (or syntax tree). In this phase, token arrangements are checked
against the source code grammar, i.e. the parser checks if the
expression made by the tokens is syntactically correct.

39
Phases of Compiler
▪ Semantic Analysis
▪ Semantic analysis checks whether the parse tree constructed
follows the rules of language. For example, assignment of values is
between compatible data types, and adding string to an integer.
Also, the semantic analyzer keeps track of identifiers, their types
and expressions; whether identifiers are declared before use or not
etc. The semantic analyzer produces an annotated syntax tree as an
output.
▪ Intermediate Code Generation
▪ After semantic analysis the compiler generates an intermediate
code of the source code for the target machine. It represents a
program for some abstract machine. It is in between the high-level
language and the machine language. This intermediate code should
be generated in such a way that it makes it easier to be translated
40
into the target machine code.
Phases of Compiler
▪ Code Optimization
▪ The next phase does code optimization of the intermediate code.
Optimization can be assumed as something that removes
unnecessary code lines, and arranges the sequence of statements in
order to speed up the program execution without wasting resources
(CPU, memory).
▪ Code Generation
▪ In this phase, the code generator takes the optimized representation
of the intermediate code and maps it to the target machine
language. The code generator translates the intermediate code into
a sequence of (generally) re-locatable machine code. Sequence of
instructions of machine code performs the task as the intermediate
code would do.
41
Phases of Compiler
▪ Symbol Table
▪ It is a data-structure maintained throughout all the phases of a
compiler. All the identifier's names along with their types are
stored here. The symbol table makes it easier for the compiler to
quickly search the identifier record and retrieve it. The symbol
table is also used for scope management.

42
43
Compiler Construction Tools
▪ Scanner Generators − These generators create the lexical analysis. The
fundamental lexical analysis is produced by Finite Automata which takes input
in the form of regular expressions.
▪ Example − LEX is a scanner generator provided by UNIX systems.
▪ Parser Generators − This software produces syntax analysis which takes input
in the form of the syntax of a programming language depends on context-free
grammar. Example YACC
▪ Syntax Directed Translation Engines − These make a set of routines that walk
the parse tree. The basic concept is that one or more translations are related to
each node of the parse tree, and each translation is represented because of
translation at its neighbour nodes in the tree.
▪ Data Flow Engines − It can generate an optimized code. These tools are used in
code optimization.
▪ Automatic Code Generators − These generators take input in the form of
intermediate code and convert it into machine language.

44
Token, Pattern and Lexeme
▪ Token: Token is a sequence of characters that can be treated as a
single logical entity. Typical tokens
are,
▪ 1) Identifiers 2) keywords 3) operators 4) special symbols
5)constants
▪ Pattern: A set of strings in the input for which the same token is
produced as output. This set of strings is described by a rule called
a pattern associated with the token.
▪ Lexeme: A lexeme is a sequence of characters in the source
program that is matched by the pattern for a token.

45
46
Lexical Analyzer
▪ Lexical Analysis is the first phase of compiler also known as scanner.
▪ Its main task is to read the input characters and produce as output a sequence of
tokens that the parser use for syntax analysis.
▪ It converts the High level input program into a sequence of Tokens.
▪ Lexical Analysis can be implemented with the Deterministic finite Automata
Example of token:
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc
Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline etc

47
Lexical analyzer

token
Source To semantic
Lexical Analyzer Parser
program analysis
getNextToken

Symbol
table
The role of lexical analyzer
▪ Its main task is to read the input characters and produce as output a sequence
of tokens that the parser uses for syntax analysis.
▪ Removes white spaces and comments from the source program which are in
the form of blank, tab, and newline characters.
▪ Correlates error messages with the source program
▪ In some compiler Lexical Analyzer is in charge of making a copy of the
source program with error message marked in it
▪ Helps you to expands the macros if it is found in the source program
▪ Read input characters from the source program
Why to separate Lexical analysis and parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
▪ A token is a pair a token name and an optional token value
▪ A pattern is a description of the form that the lexemes of a token
may take
▪ A lexeme is a sequence of characters in the source program that
matches the pattern for a token
Example

Toke Informal description Sample lexemes


n
if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=

id Letter followed by letter and digits pi, score, D2


number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “ sorrounded by “ “core dumped”

printf(“total = %d\n”, score);


Attributes for tokens
▪ E = M * C ** 2
• <id, pointer to symbol table entry for E>
• <assign-op>
• <id, pointer to symbol table entry for M>
• <mult-op>
• <id, pointer to symbol table entry for C>
• <exp-op>
• <number, integer value 2>
Lexical errors
▪ Some errors are out of power of lexical analyzer to recognize:
• fi (a == f(x)) …
▪ However it may be able to recognize errors like:
• d = 2r
▪ Such errors are recognized when no pattern for tokens matches a
character sequence
Error recovery
▪ Panic mode: successive characters are ignored until we reach to a
well formed token
▪ Delete one character from the remaining input
▪ Insert a missing character into the remaining input
▪ Replace a character by another character
▪ Transpose two adjacent characters
Input Buffering
▪ The lexical analyzer scans the input from left to right one
character at a time.
▪ It uses two pointers begin ptr(bp) and forward to keep track of
the pointer of the input scanned.
bp

i n t i , j ; i = i + 1 ; j = j + 1 ;

fp
Continue..
▪ Initially both the pointers point to the first character of the input string as
shown in fig
▪ The forward ptr moves ahead to search for end of lexeme.
▪ As soon as the blank space is encountered, it indicates end of lexeme.
▪ In the example as soon as ptr (fp) encounters a blank space the lexeme
“int” is identified.
▪ The fp will be moved ahead at white space, when fp encounters white
space, it ignore and moves ahead. then both the begin ptr(bp) and forward
ptr(fp) are set at next token.

▪ The input character is thus read from secondary storage, but reading in this
way from secondary storage is costly.
▪ Hence buffering technique is used.
▪ When referring to computer memory, the input buffer is a location that
holds all incoming information before it continues to the CPU for
processing.
▪ A block of data is first read into a buffer, and then scaned by lexical
analyzer.
▪ There are two methods used in this context: One Buffer Scheme, and Two
Buffer Scheme.
One Buffer Scheme
▪ In this scheme, only one buffer is used to store the input string
▪ But the problem with this scheme is that if lexeme is very long then it
crosses the buffer boundary,
▪ To scan rest of the lexeme the buffer has to be refilled, that makes
overwriting the first of lexeme.
Disadvantage:

▪ DECLARE (ARGl, ARG2, . . . , ARGn) in PL/1 program;


▪ It cannot determine whether the DECLARE is a keyword or an array name
until the character that follows the right parenthesis.
Two Buffer Scheme
▪ To overcome the problem of one buffer scheme, in this method two
buffers are used to store the input string.
▪ First buffer and second buffer are scanned alternately.
▪ When end of current buffer is reached the other buffer is filled.
▪ Initially both the bp and fp are pointing to the first character of first buffer.
▪ Then the fp moves towards right in search of end of lexeme. as soon as
blank character is recognized, the string between bp and fp is identified as
corresponding token.
▪ To identify, the boundary of first buffer end of buffer character should be
placed at the end first buffer.
▪ Similarly end of second buffer is also recognized by the end of buffer mark
present at the end of second buffer.
▪ Ends of the buffer halves require two tests for each advance of the forward
pointer.
▪ Test 1: For end of buffer.
▪ Test 2: To determine what character is read.
▪ The usage of sentinel reduces the two tests to one by extending each buffer
half to hold a sentinel character at the end.
▪ The sentinel is a special character that cannot be part of the source
program. (eof character is used as sentinel).
Scheme
▪ Consists of two buffers, each consists of N-character size which
are reloaded alternatively.
▪ N-Number of characters on one disk block, e.g.,1024 or 4096.
▪ N characters are read from the input file to the buffer using one
system read command.
▪ eof is inserted at the end if the number of characters is less than
N.
Optimize the code
▪ When fp encounters first eof, then one can recognize end of first buffer
and hence filling up second buffer is started.
▪ In the same way when second eof is obtained then it indicates of second
buffer.
▪ Alternatively both the buffers can be filled up until end of the input
program and stream of tokens is identified.
▪ This eof character introduced at the end is calling Sentinel which is used
to identify the end of buffer.
Advantages
▪ Most of the time, It performs only one test to see whether
forward pointer points to an eof.
▪ Only when it reaches the end of the buffer half or eof, it
performs more tests.
▪ Since N input characters are encountered between eofs, the
average number of tests per input character is very close to 1.
Specification of tokens
▪ In theory of compilation regular expressions are used to formalize
the specification of tokens
▪ Regular expressions are means for specifying regular languages
▪ Example:
• Letter_(letter_ | digit)*
▪ Each regular expression is a pattern specifying the form of strings
Regular expressions
▪ Ɛ is a regular expression, L(Ɛ) = {Ɛ}
▪ If a is a symbol in ∑then a is a regular expression, L(a) = {a}
▪ (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
▪ (r)(s) is a regular expression denoting the language L(r)L(s)
▪ (r)* is a regular expression denoting (L(r))*
▪ (r) is a regular expression denting L(r)
Regular definitions
d1 -> r1
d2 -> r2

dn -> rn

▪ Example:
letter_ -> A | B | … | Z | a | b | … | Z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
▪ One or more instances: (r)+
▪ Zero of one instances: r?
▪ Character classes: [abc]

▪ Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_(letter|digit)*
Recognition of tokens
▪ Starting point is the language grammar to understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt

expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
▪ The next step is to formalize the patterns:
digit -> [0-9]
Digits -> digit+
number -> digit(.digits)? (E[+-]? Digit)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
▪ We also need to handle whitespaces:
ws -> (blank | tab | newline)+
Transition diagrams
▪ Transition diagram for relop
Transition diagrams (cont.)
▪ Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
▪ Transition diagram for unsigned numbers
Transition diagrams (cont.)
▪ Transition diagram for whitespace
Finite Automata
▪ Regular expressions = specification
▪ Finite automata = implementation

▪ A finite automaton consists of


• An input alphabet Σ
• A set of states S
• A start state n
• A set of accepting states F ⊆ S
• A set of transitions state →input state

78
Finite Automata
▪ Transition
s1 →a s2
▪ Is read
In state s1 on input “a” go to state s2

▪ If end of input
• If in accepting state => accept, othewise => reject
▪ If no transition possible => reject

79
Finite Automata State Graphs

▪ A state

• The start state

• An accepting state

a
• A transition

80
A Simple Example

▪ A finite automaton that accepts only “1”

▪ A finite automaton accepts a string if we can follow transitions


labeled with the characters in the string from the start to some
accepting state

81
Another Simple Example

▪ A finite automaton accepting any number of 1’s followed by a


single 0
▪ Alphabet: {0,1}

▪ Check that “1110” is accepted but “110…” is not

82
And Another Example

▪ Alphabet {0,1}
▪ What language does this recognize?

1 0

0 0

1
1

83
And Another Example
▪ Alphabet still { 0, 1 }
1

▪ The operation of the automaton is not completely defined by the


input
• On input “11” the automaton could be in either state

84
Epsilon Moves

▪ Another kind of transition: ε-moves


ε
A B

• Machine can move from state A to state B


without reading input

85
Deterministic and Nondeterministic
Automata
▪ Deterministic Finite Automata (DFA)
• One transition per input per state
• No ε-moves
▪ Nondeterministic Finite Automata (NFA)
• Can have multiple transitions for one input in a given state
• Can have ε-moves
▪ Finite automata have finite memory
• Need only to encode the current state

86
Execution of Finite Automata
▪ A DFA can take only one path through the state graph
• Completely determined by input

▪ NFAs can choose


• Whether to make ε-moves
• Which of multiple transitions for a single input to take

87
Acceptance of NFAs
▪ An NFA can get into multiple states
1

0 1

• Input: 1 0 1

• Rule: NFA accepts if it can get in a final state

88
NFA vs. DFA (1)
▪ NFAs and DFAs recognize the same set of languages (regular
languages)

▪ DFAs are easier to implement


• There are no choices to consider

89
NFA vs. DFA (2)
▪ For a given language the NFA can be simpler than the DFA

1
0 0
NFA
0

1 0
0 0
DFA
1
1

• DFA can be exponentially larger than NFA


90
Regular Expressions to Finite Automata

▪ High-level sketch

NFA

Regular
expressions DFA

Lexical Table-driven
Specification Implementation of DFA

91
Regular Expressions to NFA (1)
▪ For each kind of rexp, define an NFA
• Notation: NFA for rexp A

• For ε
ε

• For input a
a

92
Regular Expressions to NFA (2)
▪ For AB

A ε
B

• For A | B
B ε
ε
ε
ε A

93
Regular Expressions to NFA (3)
▪ For A*
ε

A
ε

94
Example of RegExp -> NFA conversion
▪ Consider the regular expression
(1 | 0)*1
▪ The NFA is

ε C 1 E ε
A B 1
ε 0 G H ε I J
ε D F ε
ε
95
Next

NFA

Regular
expressions DFA

Lexical Table-driven
Specification Implementation of DFA

96
NFA to DFA. The Trick
▪ Simulate the NFA
▪ Each state of resulting DFA
= a non-empty subset of states of the NFA
▪ Start state
= the set of NFA states reachable through ε-moves from NFA start state
▪ Add a transition S →a S’ to DFA iff
• S’ is the set of NFA states reachable from the states in S after seeing the input
a
• considering ε-moves as well

97
NFA -> DFA Example
ε

ε C 1 E ε
A B 1
ε 0 G H ε I J
ε D F ε
ε
0
0 FGABCDHI
ABCDHI 0 1
1
1 EJGABCDHI

98
NFA to DFA. Remark
▪ An NFA may be in many states at any time

▪ How many different states ?

▪ If there are N states, the NFA must be in some subset of those N


states

▪ How many non-empty subsets are there?


• 2N - 1 = finitely many, but exponentially many

99
Implementation
▪ A DFA can be implemented by a 2D table T
• One dimension is “states”
• Other dimension is “input symbols”
• For every transition Si →a Sk define T[i,a] = k
▪ DFA “execution”
• If in state Si and input a, read T[i,a] = k and skip to state Sk
• Very efficient

100
Table Implementation of a DFA
0
0 T
S 0 1
1
1 U

0 1
S T U
T T U
U T U
101
Implementation (Cont.)
▪ NFA -> DFA conversion is at the heart of tools such as flex or jflex

▪ But, DFAs can be huge

▪ In practice, flex-like tools trade off speed for space in the choice of
NFA and DFA representations

102
Lexical Analyzer Generator - Lex

Lex Source program


Lexical Compiler [Link].c
lex.l

[Link].c C [Link]
compiler

Input stream Sequence


[Link]
of tokens
Structure of Lex programs

declarations
%%
translation rules Pattern {Action}
%%
auxiliary functions
Example
%{
Int installID() {/* funtion to install the
/* definitions of manifest constants
lexeme, whose first character is
LT, LE, EQ, NE, GT, GE, pointed to by yytext, and whose
IF, THEN, ELSE, ID, NUMBER, RELOP */ length is yyleng, into the symbol table
%} and return a pointer thereto */
}
/* regular definitions
delim [ \t\n] Int installNum() { /* similar to installID,
ws {delim}+ but puts numerical constants into a
letter [A-Za-z] separate table */
digit [0-9] }
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}

Lex patterns
1. Single Character 4. Repetition Operators
a → matches character a * → zero or more times

. → matches any character except newline Example: a* → "", a, aa, aaa...

2. Character Classes + → one or more times

[abc] → matches a, b, or c Example: a+ → a, aa, aaa...

[a-z] → matches any lowercase letter ? → zero or one time

[A-Z] → matches any uppercase letter Example: a? → "", a

[0-9] → matches digits {n} → exactly n times

[^abc] → matches any character except a, b, c Example: a{3} → aaa


{n,m} → between n and m times
3. Predefined Shorthands
Example: a{2,4} → aa, aaa, aaaa
\d → digit (same as [0-9])
\w → word character (letters, digits, underscore)
\s → whitespace (space, tab, newline)
(Note: Some versions of Lex may not support all
shorthands; safest way is to use [ ] ranges.)
106
Lex patterns
5. Alternation
a|b → matches a or b
Example: (yes|no) → matches "yes" or "no"
6. Grouping
(abc) → groups characters together
Example: (ab)+ → matches ab, abab, ababab
7. Anchors
^ → beginning of a line
$ → end of a line
Example: ^abc$ → matches exactly "abc"
8. Escape Sequences
\n → newline
\t → tab
\\ → backslash
\" → double quote
107
%{ /* -------------------------- Definitions Section --------------------------
- Include standard headers & Define variable */
#include <stdio.h>
int word_count = 0;
int number_count = 0;
int space_count = 0;
%} Input:
/* -------------------------- Regular Definitions -------------------------- */
digit [0-9]
letter [A-Za-z] hello 123 world 45
word {letter}+
number {digit}+
%%
/* -------------------------- Rules Section -------------------------- */
{word} { word_count++; }
{number} { number_count++; } Output:
[ \t\n]+ { space_count++; }
. { /* ignore everything else */ }
%% Words: 2
Numbers: 2
/* -------------------------- User Subroutines Section -------------------------- */
int main() { Whitespaces: 3
printf("Enter some text (Ctrl+D to end):\n");
yylex(); // start lexical analysis
printf("\nWords: %d\nNumbers: %d\nWhitespaces: %d\n",
word_count, number_count, space_count);
return 0;
}
int yywrap() {
return 1;
}
108
109
Continue..
▪ Lexical Analyzer reads the source program character by character
and returns the tokens of the source program.
Ex: newval = oldval + 12 => tokens: newval identifier
= assignment operator
oldval identifier
+ add operator
12 a number

For example, consider the program


int main()
{ // 2 variables
int a, b; a = 10; return 0; }
▪ Puts information about identifiers into the symbol table.

110
Continue..
▪ Count number of tokens :
int main() // 4 tokens
{ // 1
int a = 10, b = 20; // 9 tokens
printf("sum is :%d", a+b); // 9 tokens
return 0; // 3 tokens
} //1 token

▪ Count number of tokens :


int max(int i);

111
112
Answer is 41

113
Symbol Table
int a, b, c;
float i;

S. NO Variable name Types

1 a integer

2 b integer

3 c integer

114
Bootstrapping
▪ Bootstrapping is widely used in the compilation development.
▪ Bootstrapping is used to produce a self-hosting compiler.
Self-hosting compiler is a type of compiler that can compile its
own source code.
▪ Bootstrap compiler is used to compile the compiler and then
you can use this compiled compiler to compile everything else
as well as future versions of itself.
▪ A compiler can be characterized by three languages:
• Source Language
• Target Language
• Implementation Language

115
Bootstrapping
▪ The T- diagram shows a compiler SCIT for Source S, Target T,
implemented in I.

116
Bootstrapping
▪ Suppose we have a new language L, which we want to make available on several
machines say A and B.
▪ Follow some steps to produce a new language L for machine A:
▪ 1. Create a compiler SCAA for subset, S of the desired language, L using language
"A" and that compiler runs on machine A. Here the language S can have only
some core features of L and A can be the machine language or Assembly
language for Machine A.

▪ 2. Now, we can say that we are now available language S on machine A and now
we can code a program for compiler of language L in S for machine A i.e. Create
a compiler LCSA for language L written in a subset of L.

▪ 3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for
language L, which runs on machine A and produces code for machine A.
S
L
CS A CA A L
CA A 117
Bootstrapping

▪ The process described by the T-diagrams is called bootstrapping.


▪ Now we have full language L on machine A in the form of LCAA. Now Suppose
we want to produce another compiler for L to run on Machine B to produce code
for machine B i.e. LCLB now can compile it on LCAA

L
C LB L
CA A L
CA B

▪ Here, LCAB is called a cross-compiler, which running on a machine A and


generating the code for another machine B. This cross compiler now can be used
to generate a full compiler for L on machine B as.
L
L
C LB CA B L
CB B
118
Bootstrapping
▪ T-diagram for complete process.

119
Lex: A Lexical analyzer generator
▪ Lex is a computer program that generates lexical analyzers.
▪ Lex is commonly used with the yacc parser generator.
▪ Lex has a language for describing regular expression.
▪ It generate a pattern matcher for the regular expression
specifications provided to it as input.
▪ The input notation for the Lex tool is referred to as the Lex
language and tool itself is Lex compiler.
▪ Lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a
C program [Link].c.
▪ [Link].c is run through the C compiler to produce an object
program [Link], which is the lexical analyzer that transforms an
input stream into a sequence of tokens.
120
Continue..

121
How to install lex and Yacc
▪ Ubuntu does not come installed with a lex and yacc compiler to do so
install it first by
1. Opening the terminal
2. type - sudo apt-get install flex
3. enter your password
after installation of flex finishes
4. type - sudo apt-get install bison
5. enter your pasword.
Congratualtions Lex and Yacc have been installed on your system.
Command to run Lex
for compiling a lex program
1. write the lex program in a file and save it as file.l (where file is the name
of the file).
2. open the terminal and navigate to the directory where you have saved the
file.l
3. type: lex file.l
4. then type: cc [Link].c -ll
5. then type - ./[Link]
Your lex progam will be running now
Continue..
Lex Specification
A Lex program consists of three parts:
{ definitions } – opitional
{ rules } – essential
{ user subroutines } - essential
Commands to create a LA
lex file.l creates a C program [Link].c
type cc [Link].c –ll produce [Link]
[Link] is lexical analyzer that create token from input
▪ Definitions include declarations of variables, constants, and
regular definitions

125
Continue..
Rules are statements of the form
▪ p1 {action1}
▪ p2 {action2}
▪…
▪ pn {actionn}
▪ where pi is regular expression and actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.
Actions are written in C code.
▪ User subroutines are auxiliary procedures needed by the actions.
These can be compiled separately and loaded with the lexical
analyzer.

126
127
COUSINS OF COMPILER
1. Preprocessor
2. Assembler
3. Loader and Link-editor
Minimization of DFA

129
Syntax Analysis
Outline
▪ Role of parser
▪ Context free grammars
▪ Top down parsing
▪ Bottom up parsing
▪ Parser generators
The role of parser

token
Source Lexical Parse tree Rest of Front Intermediate
Parser
program Analyzer End representation
getNext
Token

Symbol
table
The role of parser
▪ The parser obtains a string of tokens from the lexical analyzer and
verifies that the string can be the grammar for the source language.
▪ It detects and reports any syntax errors and produces a parse tree
from which intermediate code can be generated.
▪ Recognize sentences in a language.
▪ Discover the structure of a document/program.
▪ The above tree is used later to guide translation
▪ Model using context free grammars
▪ Recognize using Push down automata/Table Driven Parsers

133
Limitations of regular languages
▪ How to describe language syntax precisely and conveniently. Can
regular expressions be used?
▪ Many languages are not regular, for example, string of balanced
parentheses – ((((…)))) – { (i ) i | i ≥ 0 } – There is no regular
expression for this language
▪ A finite automata may repeat states, however, it cannot remember
the number of times it has been to a particular state
▪ A more powerful language is needed to describe a valid string of
tokens

134
Syntax definition or Grammars
▪ Noam Chomsky gave a mathematical model of grammar in 1956
which is effective for writing computer languages.
▪ A grammar G can be formally written as a 4-tuple (N, T, S, P)
where
• N or VN is a set of variables or non-terminal symbols
• T or Σ is a set of Terminal symbols
• S is a special variable called the Start symbol, S ∈ N
• P is Production rules for Terminals and Non-terminals.
A production rule has the form 𝛼 → 𝛽, where 𝛼 and 𝛽 are strings
on 𝑉𝑁 ∪ Σ and least one symbol of 𝛼 belongs to VN.

135
Grammar
▪ Example
▪ Grammar G1: ({S, A, B}, {a, b}, S, {S →AB, A →a, B →b})
Here,
• S, A, and B are Non-terminal symbols;
• a and b are Terminal symbols
• S is the Start symbol, S ∈ N
• Productions, P : S →AB, A →a, B →b

136
Derivations from a Grammar
▪ Derivations mean replacing a given string’s non-terminal by the
right-hand side of the production rule.
▪ Strings may be derived from other strings using the productions in
a grammar. If a grammar G has a production α → β, we can say
that x α y derives x β y in G. This derivation is written as:

137
Types of Derivation
There are two types of derivations which are as follows −
▪Leftmost Derivation
A derivation A⇒*w is known as leftmost derivation if we use a
production only to the leftmost variable at each step. Here, * means
0, 1, 2,………..n number of derivations.
▪Rightmost Derivation
A derivation A⇒*w is the rightmost derivation if we use
production to the rightmost variable at each step. It is also
called Canonical Derivation.

138
Derivation Examples
▪ Example1 − Let G be a CFG with productions.
S → AA
A → αB
B → b
B → ε
▪ Find Leftmost and Rightmost derivation for string abab

139
Types of Grammars or Chomsky Hierarchy
▪ According to Noam Chomsky, there are four types of grammars:
Type 0, Type 1, Type 2, and Type 3. The following table shows
how they differ from each other:

140
Chomsky Hierarchy

141
142
Context free grammars
▪ Definition − A context-free grammar (CFG) consisting of a
finite set of grammar rules is a quadruple (N, T, P, S) where
• N is a set of non-terminal symbols.
• T is a set of terminals where N ∩ T = NULL.
• P is a set of rules, P: N → (N ∪ T)*, i.e., the left-hand side of
the production rule P does have any right context or left
context.
• S is the start symbol.
Context free grammars
▪ Example

▪ The grammar ({A}, {a, b, c}, P, A), P : A → aA, A → abc.

▪ The grammar ({S, a, b}, {a, b}, P, S), P: S → aSa, S → bSb, S → ε

▪ The grammar ({S, F}, {0, 1}, P, S), P: S → 00S | 11F, F → 00F | ε

144
Generation of Derivation Tree
▪ Example
Let a CFG {N,T,P,S} be
N = {S}, T = {a, b}, Starting symbol = S, P = S → SS | aSb | ε
One derivation from the above CFG is “abaabb”
S → SS → aSbS → abS → abaSb → abaaSbb → abaabb

145
Context free grammars

expression -> expression + term


▪ Terminals
expression -> expression – term
▪ Nonterminals expression -> term
▪ Start symbol term -> term * factor
▪ productions term -> term / factor
term -> factor
factor -> (expression)
factor -> id
Derivations
▪ Productions are treated as rewriting rules to generate a string
▪ Rightmost and leftmost derivations
• E -> E + E | E * E | -E | (E) | id
• Derivations for –(id+id)
• E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Parse trees
▪ -(id+id)
• E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Ambiguity
▪ For some strings there exist more than one parse tree
▪ Or more than one leftmost derivation
▪ Or more than one rightmost derivation
▪ Example: id+id*id
Elimination of ambiguity
Elimination of ambiguity (cont.)
▪ Idea:
• A statement appearing between a then and an else must be matched
Uses of grammars

E -> E + T | T
T -> T * F | F
F -> (E) | id

E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
Top Down Parsing
Introduction

▪ A Top-down parser tries to create a parse tree from the root


towards the leafs scanning input from left to right
▪ It can be also viewed as finding a leftmost derivation for an
input string
▪ Example: id+id*id

E E E E E E
E -> TE’ lm lm lm lm lm
E’ -> +TE’ | Ɛ T E’ T E’ T E’ T E’ T E’
T -> FT’
F T’ F T’ F T’ F T’ + T E’
T’ -> *FT’ | Ɛ
F -> (E) | id id id Ɛ id Ɛ
Recursive descent parsing
▪ Consists of a set of procedures, one for each nonterminal
▪ Execution begins with the procedure for start symbol
▪ A typical procedure for a non-terminal

void A() {
choose an A-production, A->X1X2..Xk
for (i=1 to k) {
if (Xi is a nonterminal
call procedure Xi();
else if (Xi equals the current input symbol a)
advance the input to the next symbol;
else /* an error has occurred */
}
}
Recursive descent parsing (cont)
▪ General recursive descent may require backtracking
▪ The previous code needs to be modified to allow backtracking
▪ In general form it cant choose an A-production easily.
▪ So we need to try all alternatives
▪ If one failed the input pointer needs to be reset and another
alternative should be tried
▪ Recursive descent parsers cant be used for left-recursive grammars
Example

S->cAd
A->ab | a Input: cad

S S S

c A d c A d c A d

a b a
158
Left recursion
• A top down parser with production
A → A α may loop forever

• From the grammar A → A α | β left


recursion may be eliminated by
transforming the grammar to

A→β R
R→αR|ε
159
Parse tree corresponding Parse tree corresponding
to a left recursive to the modified grammar
grammar
A A

A R

A R

β α α β α Є

Both the trees generate string βα*


160
Elimination of left recursion
▪ A grammar is left recursive if it has a non-terminal A such that
there is a derivation A=> Aα
▪ Top down parsing methods cant handle left-recursive grammars
▪ A simple rule for direct left recursion elimination:
+
• For a rule like:
• A -> A α|β
• We may replace it with
• A -> β A’
• A’ -> α A’ | ɛ
Removal of left recursion
In general

A Aα1 | Aα2 | ….. |Aαm


|β1 | β2 | …… | βn

transforms to

A β1A' | β2A' | …..| βnA'


A' α1A' | α2A' |…..| αmA' | Є
162
Left recursion hidden due to many
productions
• Left recursion may also be introduced by two or more grammar rules.
For example:

S Aa | b
A Ac | Sd | Є

there is a left recursion because

S → Aa → Sda

• In such cases, left recursion is removed systematically

– Starting from the first rule and replacing all the occurrences of the first
non terminal symbol

– Removing left recursion from the modified grammar

163
Removal of left recursion due to
many productions …
• After the first step (substitute S by its rhs in the rules) the
grammar becomes

S Aa | b
A Ac | Aad | bd | Є

• After the second step (removal of left recursion) the


grammar becomes

S Aa | b
A bdA' | A'
A' cA' | adA' | Є
164
Left recursion elimination (cont.)
▪ There are cases like following
• S -> Aa | b
• A -> Ac | Sd | ɛ
▪ Left recursion elimination algorithm:
• Arrange the nonterminals in some order A1,A2,…,An.
• For (each i from 1 to n) {
• For (each j from 1 to i-1) {
– Replace each production of the form Ai-> Aj γ by the production Ai
-> δ1 γ | δ2 γ | … |δk γ where Aj-> δ1 | δ2 | … |δk are
all current Aj productions
– }
– Eliminate left recursion among the Ai-productions
•}
Left factoring
• In top-down parsing when it is not clear which production to choose
for expansion of a symbol
defer the decision till we have seen enough input.

In general if A αβ1 | αβ2


defer decision by expanding A to αA' we can then expand A’ to β1 or

β2

• Therefore A α β1 | α β2
transforms to A αA’

A’ β1 | β2

166
Left factoring
▪ Left factoring is a grammar transformation that is useful for
producing a grammar suitable for predictive or top-down parsing.
▪ Consider following grammar:
• Stmt -> if expr then stmt else stmt
• | if expr then stmt
▪ On seeing input if it is not clear for the parser which production to
use
▪ We can easily perform left factoring:
• If we have A->αβ1 | αβ2 then we replace it with
• A -> αA’
• A’ -> β1 | β2
Left factoring (cont.)
▪ Algorithm
• For each non-terminal A, find the longest prefix α common to two or more
of its alternatives. If α<> ɛ, then replace all of A-productions A->αβ1
|αβ2 | … | αβn | γ by
• A -> αA’ | γ
• A’ -> β1 |β2 | … | βn
▪ Example:
• S -> I E t S | i E t S e S | a
• E -> b
Dangling else problem again
Dangling else problem can be handled by left factoring

stmt if expr then stmt else stmt


| if expr then stmt

can be transformed to

stmt if expr then stmt S' S' else stmt | Є

169
First and Follow
▪ FIRST(X) for a grammar symbol X is the set of terminals that
begin the strings derivable from X.
▪ Rules to compute FIRST set:
▪ If x is a terminal, then FIRST(x) = { ‘x’ }
▪ If x-> Є, is a production rule, then add Є to FIRST(x).
▪ If X->Y1 Y2 Y3….Yn is a production,
• FIRST(X) = FIRST(Y1)
• If FIRST(Y1) contains Є then FIRST(X) = { FIRST(Y1) – Є } U {
FIRST(Y2) }
• If FIRST (Yi) contains Є for all i = 1 to n, then add Є to FIRST(X).
Calculate first for following Grammars
Production Rules of Grammar
E -> TE’
E’ -> +T E’|Є
T -> F T’
T’ -> *F T’ | Є
F -> (E) | id

171
Production Rules of Grammar
E -> TE’
E’ -> +T E’|Є
T -> F T’
T’ -> *F T’ | Є
F -> (E) | id

FIRST sets
FIRST(E) = FIRST(T) = { ( , id }
FIRST(E’) = { +, Є }
FIRST(T) = FIRST(F) = { ( , id }
FIRST(T’) = { *, Є }
FIRST(F) = { ( , id }

172
(1) S → aBDh
B → cC
C → bC / ∈
D → EF
E→g/∈
F→f/∈

(2) S → AaAb / BbBa


A→∈
B→∈

173
Production Rules of Grammar
S -> ACB | Cbb | Ba
A -> da | BC
B -> g | Є
C -> h | Є

174
Production Rules of Grammar
S -> ACB | Cbb | Ba
A -> da | BC
B -> g | Є
C -> h | Є

FIRST sets
FIRST(S) = FIRST(ACB) U FIRST(Cbb) U FIRST(Ba)
= { d, g, h, b, a, Є}
FIRST(A) = { d } U FIRST(BC)
= { d, g, h, Є }
FIRST(B) = { g , Є }
FIRST(C) = { h , Є }

175
S→A
A → aB / Ad
B→b
C→g

E→E+T/T
T→TxF/F
F → (E) / id

176
Computing follow
▪ To compute FOLLOW(A) for all nonterminals A, apply following
rules until nothing can be added to any follow set:
1. Place $ in Follow(S) where S is the start symbol
2. If there is a production A-> αBβ then everything in First(β) except ɛ is
in Follow(B).
3. If there is a production A-> αB or a production A->αBβ where
First(β) contains ɛ, then everything in Follow(A) is in Follow(B)
1. Example!
178
Example
• Consider grammar for arithmetic expressions

E E+T|T
T T*F|F F ( E ) | id

• After removal of left recursion the grammar becomes

E T E’
E’ + T E’ | Є
T F T’
T’ * F T’ | Є F ( E ) | id

179
Predictive parsers
• A non recursive top down parsing method

• Parser “predicts” which production to use

• It removes backtracking by fixing one production for every


non-terminal and input token(s)

• Predictive parsers accept LL(k) languages


– First L stands for left to right scan of input
– Second L stands for leftmost derivation
– k stands for number of lookahead token

• In practice LL(1) is used


Predictive parsing
• Predictive parser can be implemented by maintaining an external

stack

Parse table is a two dimensional


array M[X,a] where “X” is a non
terminal and “a” is a terminal of the
grammar
Example
• Consider the grammar

E T E’
E' +T E' | Є
T F T'
T' * F T' | Є
F ( E ) | id
Parse table for the grammar

id + * ( ) $
E E TE’ E TE’

E’ E’ +TE’ E’ Є E’ Є

T T FT’ T FT’

T’ T’ Є T’ *FT’ T’ Є T’ Є

F F id F (E)

Blank entries are error states. For example E cannot


derive a string starting with ‘+’
Parsing algorithm
• The parser considers 'X' the symbol on top of stack, and 'a' the current input
symbol

• These two symbols determine the action to be taken by the parser

• Assume that '$' is a special token that is at the bottom of the stack and terminates
the input string

if X = a = $ then halt
if X = a ≠ $ then pop(x) and ip++

if X is a non terminal

then if M[X,a] = {X UVW}


then begin pop(X);
push(W,V,U)
End
else error
185
Example
Stack input action
$E id + id * id $ expand by E TE’

$E’T id + id * id $ expand by T FT’

$E’T’F id + id * id $ expand by F id

$E’T’id id + id * id $ pop id and ip++

$E’T’ + id * id $ expand by T’
Є
$E’ + id * id $ expand by E’
+TE’
$E’T+ + id * id $ pop + and ip++

$E’T id * id $ expand by T FT’


186
Example

Stack input action
$E’T’F id * id $ expand by F id
$E’T’id id * id $ pop id and ip++
$E’T’ * id $ expand by T’ *FT’
$E’T’F* * id $ pop * and ip++
$E’T’F id $ expand by F id
$E’T’id id $ pop id and ip++
$E’T’ $ expand by T’ Є
$E’ $ expand by E’ Є
$ $ halt
LL(1) Grammars
▪ Predictive parsers are those recursive descent parsers needing no
backtracking
▪ Grammars for which we can create predictive parsers are called
LL(1)
• The first L means scanning input from left to right
• The second L means leftmost derivation
• And 1 stands for using one input symbol for lookahead
▪ A grammar G is LL(1) if and only if whenever A-> α|βare two
distinct productions of G, the following conditions hold:
• For no terminal a do αandβ both derive strings beginning with a
• At most one of α or βcan derive empty string
• If α=> ɛ then βdoes not derive any string beginning with a terminal in
*
Follow(A).
Construction of predictive parsing table
▪ For each production A->α in grammar do the following:
1. For each terminal a in First(α) add A-> in M[A,a]
2. If ɛ is in First(α), then for each terminal b in Follow(A) add A-> ɛ to
M[A,b]. If ɛ is in First(α) and $ is in Follow(A), add A-> ɛ to M[A,$] as
well
▪ If after performing the above, there is no production in M[A,a] then
set M[A,a] to error
Example
First Follow

E -> TE’ F {(,id} {+, *, ), $}


T {(,id} {+, ), $}
E’ -> +TE’ | Ɛ {(,id} {), $}
E
T -> FT’ {+,ɛ} {), $}
E’
T’ -> *FT’ | Ɛ {*,ɛ} {+, ), $}
F -> (E) | id T’
Input Symbol
Non -
terminal id + * ( ) $
E E -> TE’ E -> TE’

E’ E’ -> +TE’ E’ -> Ɛ E’ -> Ɛ

T T -> FT’ T -> FT’

T’ T’ -> Ɛ T’ -> *FT’ T’ -> Ɛ T’ -> Ɛ

F F -> id F -> (E)


Another example

S -> iEtSS’ | a
S’ -> eS | Ɛ
E -> b

Input Symbol
Non -
terminal a b e i t $
S S -> a S -> iEtSS’

S’ S’ -> Ɛ S’ -> Ɛ
S’ -> eS
E E -> b
Practice Assignment
• Construct LL(1) parse table for the expression grammar
bexpr bexpr or bterm | bterm
bterm bterm and bfactor | bfactor
bfactor not bfactor | ( bexpr ) | true | false

• Steps to be followed
– Remove left recursion
– Compute first sets
– Compute follow sets
– Construct the parse table
Error handling
• Stop at the first error and print a message
– Compiler writer friendly
– But not user friendly

• Every reasonable compiler must recover from errors and identify as


many errors as possible

• However, multiple error messages due to a single fault must be


avoided

• Error recovery methods


– Panic mode

– Phrase level recovery

– Error productions

– Global correction
193
Panic mode
• Simplest and the most popular method

• Most tools provide for specifying panic mode


recovery in the grammar

• When an error is detected


– Discard tokens one at a time until a set of tokens is
found whose role is clear
– Skip to the next token that can be placed reliably in the
parse tree

194
Panic mode

• Consider following code
begin
a = b + c; x = p r ; h = x < 0;
end;

• The second expression has syntax error

• Panic mode recovery for begin-end block


skip ahead to next ‘;’ and try to parse the next
expression

• It discards one expression and tries to continue parsing

• May fail if no further ‘;’ is found

195
Phrase level recovery
• Make local correction to the input

• Works only in limited situations


– A common programming error which is easily detected
– For example insert a “;” after closing “-” of a class
definition

• Does not work very well!

196
Error productions
• Add erroneous constructs as productions in the grammar

• Works only for most common mistakes which can be


easily identified

• Essentially makes common errors as part of the grammar

• Complicates the grammar and does not work very well

197
Global corrections
• Considering the program as a whole find a correct
“nearby” program

• Nearness may be measured using certain metric

• PL/C compiler implemented this scheme:


anything could be compiled!

• It is complicated and not a very good idea!

198
Error Recovery in LL(1)
parser
• Error occurs when a parse table entry M[A,a] is empty

• Skip symbols in the input until a token in a selected set


(synch) appears

• Place symbols in follow(A) in synch set. Skip tokens


until
an element in follow(A) is seen.
Pop(A) and continue parsing

• Add symbol in first(A) in synch set. Then it may be


possible to resume parsing according to A if a symbol in
first(A) appears in input. 199
Practice Assignment
• Reading assignment: Read about error
recovery in LL(1) parsers
• Assignment to be submitted:
– introduce synch symbols (using both follow
and first sets) in the parse table created for the
boolean expression grammar in the previous
assignment
– Parse “not (true and or false)” and show
how error recovery works

200
Non-recursive predicting parsing
a + b $

Predictive
parsing output
stack X
Y program
Z
$
Parsing
Table
M
Predictive parsing algorithm
Set ip point to the first symbol of w;
Set X to the top stack symbol;
While (X<>$) { /* stack is not empty */
if (X is a) pop the stack and advance ip;
else if (X is a terminal) error();
else if (M[X,a] is an error entry) error();
else if (M[X,a] = X->Y1Y2..Yk) {
output the production X->Y1Y2..Yk;
pop the stack;
push Yk,…,Y2,Y1 on to the stack with Y1 on top;
}
set X to the top stack symbol;
}
Example
▪ id+id*id$

Matched Stack Input Action


E$ id+id*id$
Error recovery in predictive parsing
▪ Panic mode
• Place all symbols in Follow(A) into synchronization set for nonterminal A:
skip tokens until an element of Follow(A) is seen and pop A from stack.
• Add to the synchronization set of lower level construct the symbols that begin
higher level constructs
• Add symbols in First(A) to the synchronization set of nonterminal A
• If a nonterminal can generate the empty string then the production deriving
can be used as a default
• If a terminal on top of the stack cannot be matched, pop the terminal, issue a
message saying that the terminal was insterted
Example
Non - Input Symbol
terminal id + * ( ) $
E E -> TE’ E -> TE’ synch synch

E’ E’ -> +TE’ E’ -> Ɛ E’ -> Ɛ

T -> FT’ synch T -> FT’ synch synch


T
T’ -> Ɛ T’ -> *FT’ T’ -> Ɛ T’ -> Ɛ
T’

F F -> id synch synch F -> (E) synch synch

Stack Input Action


E$ )id*+id$ Error, Skip )
E$ id*+id$ id is in First(E)
TE’$ id*+id$
FT’E’$ id*+id$
idT’E’$ id*+id$
T’E’$ *+id$
*FT’E’$ *+id$
FT’E’$ +id$ Error, M[F,+]=synch
T’E’$ +id$ F has been poped
Bottom-up Parsing
Introduction

▪ Constructs parse tree for an input string beginning at the leaves


(the bottom) and working towards the root (the top)
▪ Example: id*id

E -> E + T | T id*id F * id T * id T*F F E


T -> T * F | F
F id T*F F
F -> (E) | id id F

id id F id T*F

id F id

id
Shift-reduce parser
▪ The general idea is to shift some symbols of input to the stack until
a reduction can be applied
▪ At each reduction step, a specific substring matching the body of a
production is replaced by the nonterminal at the head of the
production
▪ The key decisions during bottom-up parsing are about when to
reduce and about what production to apply
▪ A reduction is a reverse of a step in a derivation
▪ The goal of a bottom-up parser is to construct a derivation in
reverse:
• E=>T=>T*F=>T*id=>F*id=>id*id
Handle pruning

▪ A Handle is a substring that matches the body of a production


and whose reduction represents one step along the reverse of a
rightmost derivation

Right sentential form Handle Reducing production


id*id id F->id
F*id F T->
T*id id F
F->id
T*F T*F E->T*F
Shift reduce parsing
▪ A stack is used to hold grammar symbols
▪ Handle always appear on top of the stack
▪ Initial configuration:
Stack Input
$ w$
▪ Acceptance configuration
Stack Input
$S $
Shift reduce parsing (cont.)

▪ Basic operations:
• Shift
• Reduce
Stack Input Action
• Accept $ id*id$ shift
• Error $id *id$ reduce by F->id
▪ Example: id*id $F *id$ reduce by
$T *id$ shift
T->F
$T* id$ shift
$T*id $ reduce by F->id
$T*F $ reduce by
$T $ reduce by E->T
T->T*F
$E $ accept
Handle will appear on top of the stack

S S
A
B
B A
α β γ z α γ z
y x y
Stack Input Stack Input
$αβγ yz$ $αγ xyz$
$αβB yz$ $αBxy z$
$αβBy z$
Conflicts during shit reduce parsing
▪ Two kind of conflicts
• Shift/reduce conflict
• Reduce/reduce conflict
▪ Example:

Stack Input
… if expr then stmt else …$
LR Parsing
▪ The most prevalent type of bottom-up parsers
▪ LR(k), mostly interested on parsers with k<=1
▪ Why LR parsers?
• Table driven
• Can be constructed to recognize all programming language constructs
• Most general non-backtracking shift-reduce parsing method
• Can detect a syntactic error as soon as it is possible to do so
• Class of grammars for which we can construct LR parsers are superset of
those which we can construct LL parsers
States of an LR parser
▪ States represent set of items
▪ An LR(0) item of G is a production of G with the dot at some
position of the body:
• For A->XYZ we have following items
• A->.XYZ
• A->[Link]
• A->XY.Z
• A->XYZ.
• In a state having A->.XYZ we hope to see a string derivable from XYZ next
on the input.
• What about A->[Link]?
Constructing canonical LR(0) item sets

▪ Augmented grammar:
• G with addition of a production: S’->S
▪ Closure of item sets:
• If I is a set of items, closure(I) is a set of items constructed from I by the
following rules:
• Add every item in I to closure(I)
• If A->α.Bβ is in closure(I) and B->γ is a production then add the item
B->.γ to clsoure(I).
▪ Example: I0=closure({[E’->.E]}
E’->E E’->.E
E -> E + T | T E->.E+T
T -> T * F | F E->.T
T->.T*F
F -> (E) | id T->.F
F->.(E)
F->.id
Constructing canonical LR(0) item sets
(cont.)

▪ Goto (I,X) where I is an item set and X is a grammar symbol is


closure of set of all items [A-> αX. β] where [A-> α.X β] is in
I
▪ Example E’->E.
I1

E E->E.+T
I0=closure({[E’->.E]}
E’->.E I2
E->.E+T T
E’->T.
E->.T T->T.*F
T->.T*F I4
T->.F ( F->(.E)
F->.(E) E->.E+T
E->.T
F->.id T->.T*F
T->.F
F->.(E)
F->.id
Closure algorithm
SetOfItems CLOSURE(I) {
J=I;
repeat
for (each item A-> α.Bβ in J)
for (each prodcution B->γ of G)
if (B->.γ is not in J)
add B->.γ to J;
until no more items are added to J on one round;
return J;
GOTO algorithm
SetOfItems GOTO(I,X) {
J=empty;
if (A-> α.X β is in I)
add CLOSURE(A-> αX. β ) to J;
return J;
}
Canonical LR(0) items
Void items(G’) {
C= CLOSURE({[S’->.S]});
repeat
for (each set of items I in C)
for (each grammar symbol X)
if (GOTO(I,X) is not empty and not in C)
add GOTO(I,X) to C;
until no new set of items are added to C on a round;
}
E’->E
E -> E + T | T
Example
acc
T -> T * F | F
F -> (E) | id
$ I6 I9
E->E+.T
I1
+ T->.T*F T E->E+T.
E’->E. T->.F
T->T.*F
E E->E.+T
F->.(E)
F->.id
I0=closure({[E’->.E]} I2
E’->.E T I7
F I10
E->.E+T E’->T. * T->T*.F
F->.(E) T->T*F.
E->.T T->T.*F id F->.id
T->.T*F id
T->.F I5
F->.(E)
F->.id ( F->id. +
I4
F->(.E)
I8 I11
E->.E+T
E->.T
E E->E.+T )
T->.T*F F->(E.) F->(E).
T->.F
F->.(E)
F->.id

I3
T>F.
Use of LR(0) automaton
▪ Example: id*id

Line Stack Symbols Input Action


(1) 0 $ id*id$ Shift to 5
(2) 05 $id *id$ Reduce by F->id
(3) 03 $F *id$ Reduce by T->F
(4) 02 $T *id$ Shift to 7
(5) 027 $T* id$ Shift to 5
(6) 0275 $T*id $ Reduce by F->id
(7) 02710 $T*F $ Reduce by T->T*F
(8) 02 $T $ Reduce by E->T
(9) 01 $E $ accept
LR-Parsing model

INPUT a1 … ai … an $

Sm LR Parsing Program Output


Sm-1

$
ACTION GOTO
LR parsing algorithm
let a be the first symbol of w$;
while(1) { /*repeat forever */
let s be the state on top of the stack;
if (ACTION[s,a] = shift t) {
push t onto the stack;
let a be the next input symbol;
} else if (ACTION[s,a] = reduce A->β) {
pop |β| symbols of the stack;
let state t now be on top of the stack;
push GOTO[t,A] onto the stack;
output the production A->β;
} else if (ACTION[s,a]=accept) break; /* parsing is done */
else call error-recovery routine;
}
Example (0) E’->E
(1) E -> E + T
(2) E-> T
STATE ACTON GOTO
(3) T -> T * F
id + * ( ) $ E T F (4) T-> F
0 S5 S4 1 2 3
(5) F -> (E) id*id+id?
(6) F->id
1 S6 Acc
Line Stack Symbols Input Action
2 R2 S7 R2 R2
(1) 0 id*id+id$ Shift to 5
3 R4 R7 R4 R4
(2) 05 id *id+id$ Reduce by F->id
4 S5 S4 8 2 3 (3) 03 F *id+id$ Reduce by T->F
5 R6 R6 R6 R6 (4) 02 T *id+id$ Shift to 7
(5) 027 T* id+id$ Shift to 5
6 S5 S4 9 3 (6) 0275 T*id +id$ Reduce by F->id
7 S5 S4 10 (7) 02710 T*F +id$ Reduce by
T->T*F
8 S6 S11 (8) 02 T +id$ Reduce by E->T
9 R1 S7 R1 R1 (9) 01 E +id$ Shift
(10) 016 E+ id$ Shift
10 R3 R3 R3 R3
(11) 0165 E+id $ Reduce by F->id
11 R5 R5 R5 R5 (12) 0163 E+F $ Reduce by T->F
(13) 0169 E+T` $ Reduce by
E->E+T
(14) 01 E $ accept
Constructing SLR parsing table
▪ Method
• Construct C={I0,I1, … , In}, the collection of LR(0) items for G’
• State i is constructed from state Ii:
• If [A->α.aβ] is in Ii and Goto(Ii,a)=Ij, then set ACTION[i,a]
to “shift j”, here a is terminal symbol.
• If [A->α.] is in Ii, then set ACTION[i,a] to “reduce A->α” for
all a in follow(A)
• If {S’->S.] is in Ii, then set ACTION[I,$] to “Accept”
• If any conflicts appears then we say that the grammar is not SLR(1).
• If GOTO(Ii,A) = Ij then GOTO[i,A]=j
• All entries not defined by above rules are made “error”
• The initial state of the parser is the one constructed from the set of items
containing [S’->.S]
Example grammar which is not SLR(1)
S -> L=R | R
L -> *R | id
R -> L
I0 I1 I3 I5 I7
S’->.S S’->S. S ->R. L -> id. L -> *R.
S -> .L=R
S->.R I2 I4 I6
I8
L -> .*R | S ->L.=R L->*.R S->L=.R
R -> L.
L->.id R ->L. R->.L R->.L
R ->. L L->.*R L->.*R I9
L->.id L->.id S -> L=R.

Action
=
Shift 6
2 Reduce R->L
More powerful LR parsers
▪ Canonical-LR or just LR method
• Use lookahead symbols for items: LR(1) items
• Results in a large collection of items
▪ LALR: lookaheads are introduced in LR(0) items
Canonical LR(1) items
▪ In LR(1) items each item is in the form: [A->α.β,a]
▪ An LR(1) item [A->α.β,a] is valid for a viable prefix γ if there is a
derivation S=>δAw=>δαβw, where
• Γ= δα
• Either a is the first symbol of w, or w is ε and a is $
*
▪ Example: rm
• S->BB
• B->aB|b

*
S=>aaBab=>aaaBab
rm
Item [B->a.B,a] is valid for γ=aaa
and w=ab
Constructing LR(1) sets of items
SetOfItems Closure(I) {
repeat
for (each item [A->α.Bβ,a] in I)
for (each production B->γ in G’)
for (each terminal b in First(βa))
add [B->.γ, b] to set I;
until no more items are added to I;
return I;
}

SetOfItems Goto(I,X) {
initialize J to be the empty set;
for (each item [A->α.Xβ,a] in I)
add item [A->αX.β,a] to set J;
return closure(J);
}

void items(G’){
initialize C to Closure({[S’->.S,$]});
repeat
for (each set of items I in C)
for (each grammar symbol X)
if (Goto(I,X) is not empty and not in C)
add Goto(I,X) to C;
until no new sets of items are added to C;
}
Example
S’->S
S->CC
C->cC
C->d
Canonical LR(1) parsing table
▪ Method
• Construct C={I0,I1, … , In}, the collection of LR(1) items for G’
• State i is constructed from state Ii:
• If [A->α.aβ, b] is in Ii and Goto(Ii,a)=Ij, then set
ACTION[i,a] to “shift j”
• If [A->α., a] is in Ii, then set ACTION[i,a] to “reduce A->α”
• If {S’->.S,$] is in Ii, then set ACTION[I,$] to “Accept”
• If any conflicts appears then we say that the grammar is not LR(1).
• If GOTO(Ii,A) = Ij then GOTO[i,A]=j
• All entries not defined by above rules are made “error”
• The initial state of the parser is the one constructed from the set of items
containing [S’->.S,$]
Example
S’->S
S->CC
C->cC
C->d
LALR Parsing Table

▪ For the previous example we had:

I4
C->d. , c/d
I47
C->d. , c/d/$
I7
C->d. , $

⚫ State merges cant produce Shift-Reduce conflicts. Why?


⚫ But it may produce reduce-reduce conflict
Example of RR conflict in state merging
S’->S
S -> aAd | bBd | aBe | bAe
A -> c
B -> c
An easy but space-consuming LALR table
construction
▪ Method:
1. Construct C={I0,I1,…,In} the collection of LR(1) items.
2. For each core among the set of LR(1) items, find all sets having that core,
and replace these sets by their union.
3. Let C’={J0,J1,…,Jm} be the resulting sets. The parsing actions for state i,
is constructed from Ji as before. If there is a conflict grammar is not
LALR(1).
4. If J is the union of one or more sets of LR(1) items, that is J = I1 UI2…IIk
then the cores of Goto(I1,X), …, Goto(Ik,X) are the same and is a state
like K, then we set Goto(J,X) =k.
▪ This method is not efficient, a more efficient one is discussed in the
book
Compaction of LR parsing table
▪ Many rows of action tables are identical
• Store those rows separately and have pointers to them from different states
• Make lists of (terminal-symbol, action) for each state
• Implement Goto table by having a link list for each nonterinal in the form
(current state, next state)
Using ambiguous grammars STATE ACTON GO
TO
E->E+E id + * ( ) $ E

E->E*E 0 S3 S2 1

E->(E) 1 S4 S5 Acc

E->id 2 S3 S2 6

3 R4 R4 R4 R4

4 S3 S2 7

5 S3 S2 8
I0: E’->.E I1: E’->E. I2: E->(.E)
E->.E+E E->E.+E E->.E+E 6 S4 S5
E->.E*E E->E.*E E->.E*E 7 R1 S5 R1 R1
E->.(E) E->.(E)
E->.id E->.id 8 R2 R2 R2 R2

I4: E->E+.E I5: E->E*.E 9 E->(E.)


I6: I7:R3 R3
E->E+E. R3 R3
I3: E->.id E->.E+E E->(.E) E->E.+E E->E.+E
E->.E*E E->.E+E E->E.*E E->E.*E
E->.(E) E->.E*E
E->.id E->.(E) I8: E->E*E. I9: E->(E).
E->.id E->E.+E
E->E.*E
Readings
▪ Chapter 4 of the book
Error handling
▪ Common programming errors
• Lexical errors
• Syntactic errors
• Semantic errors
• Lexical errors
▪ Error handler goals
• Report the presence of errors clearly and accurately
• Recover from each error quickly enough to detect subsequent errors
• Add minimal overhead to the processing of correct progrms
Error-recover strategies
▪ Panic mode recovery
• Discard input symbol one at a time until one of designated set of
synchronization tokens is found
▪ Phrase level recovery
• Replacing a prefix of remaining input by some string that allows the parser to
continue
▪ Error productions
• Augment the grammar with productions that generate the erroneous
constructs
▪ Global correction
• Choosing minimal sequence of changes to obtain a globally least-cost
correction

You might also like