0% found this document useful (0 votes)
11 views105 pages

Evolution of Programming Languages

The document covers the evolution of programming languages, starting from Ada Lovelace's first code in 1883 to modern languages like Swift and Kotlin. It categorizes programming languages into low-level, high-level, and scripting languages, and discusses the importance of syntax in programming, including its rules and conventions. Additionally, it introduces formal methods for describing syntax, such as Backus-Naur Form and context-free grammars.

Uploaded by

yokesh.mca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views105 pages

Evolution of Programming Languages

The document covers the evolution of programming languages, starting from Ada Lovelace's first code in 1883 to modern languages like Swift and Kotlin. It categorizes programming languages into low-level, high-level, and scripting languages, and discusses the importance of syntax in programming, including its rules and conventions. Additionally, it introduces formal methods for describing syntax, such as Backus-Naur Form and context-free grammars.

Uploaded by

yokesh.mca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

PRINCIPLES OF PROGRAMMING LANGUAGE

UNIT 1
SYNTAX AND SEMANTICS
The Evolution of Programming Languages:

Programming Language is indeed the fundamental unit of today’s tech world. It is


considered as the set of commands and instructions that we give to the machines to
perform a particular task. For example, if you give some set of instructions to add two
numbers then the machine will do it for you and tell you the correct answer
accordingly. But do you know that Programming Languages are having a long and
rich history of their evolution? And with a similar concern, here in this article, we’ll
take a look at the evolution of Programming Languages over the period.
In the computer world, we have about 500+ programming languages with having their
own syntax and features. And if you type who’s the father of the computer, then the
search engine will show you the result as to Charles Babbage but the father of the
computer didn’t write the first code. It was Ada Lovelace who has written the first-
ever computer programming language and the year was 1883.
1883: The Journey starts from here…!!
 In the early days, Charles Babbage had made the device, but he was confused
about how to give instructions to the machine, and then Ada Lovelace wrote the
instructions for the analytical engine.
 The device was made by Charles Babbage and the code was written by Ada
Lovelace for computing Bernoulli’s number.
 First time in history that the capability of computer devices was judged.

1949: Assembly Language


 It is a type of low-level language.
 It mainly consists of instructions (kind of symbols) that only machines could
understand.
 In today’s time also assembly language is used in real-time programs such as
simulation flight navigation systems and medical equipment eg – Fly-by-wire
(FBW) systems.
 It is also used to create computer viruses.

1952: Autocode
 Developed by Alick Glennie.
 The first compiled computer programming language.
 COBOL and FORTRAN are the languages referred to as Autocode.

1957: FORTRAN
 Developers are John Backus and IBM.
 It was designed for numeric computation and scientific computing.
 Software for NASA probes voyager-1 (space probe) and voyager-2 (space probe)
was originally written in FORTRAN 5.

1958: ALGOL
 ALGOL stands for ALGOrithmic Language.
 The initial phase of the most popular programming languages of C, C++, and
JAVA.
 It was also the first language implementing the nested function and has a simple
syntax than FORTRAN.
 The first programming language to have a code block like “begin” that indicates
that your program has started and “end” means you have ended your code.

1959: COBOL
 It stands for COmmon Business-Oriented Language.
 In 1997, 80% of the world’s business ran on Cobol.
 The US internal revenue service scrambled its path to COBOL-based IMF
(individual master file) in order to pay the tens of millions of payments mandated
by the coronavirus aid, relief, and economic security.

1964: BASIC
 It stands for beginners All-purpose symbolic instruction code.
 In 1991 Microsoft released Visual Basic, an updated version of Basic
 The first microcomputer version of Basic was co-written by Bill Gates, Paul
Allen, and Monte Davidoff for their newly-formed company, Microsoft.

1972: C
 It is a general-purpose, procedural programming language and the most popular
programming language till now.
 All the code that was previously written in assembly language gets replaced by the
C language like operating system, kernel, and many other applications.
 It can be used in implementing an operating system, embedded system, and also
on the website using the Common Gateway Interface (CGI).
 C is the mother of almost all higher-level programming languages like C#, D, Go,
Java, JavaScript, Limbo, LPC, Perl, PHP, Python, and Unix’s C shell.
Some other programming languages that are popular among programmers are listed
below.

YEAR OF PROGRAMMING
RELEASE LANGUAGES FACTS

SQL was developed at IBM by Donald D.


Chamberlin and Raymond F. Boyce. The
SQL
earlier name was SEQUEL (Structured
1972 English Query Language).
YEAR OF PROGRAMMING
RELEASE LANGUAGES FACTS

It stands for MATrix LABoratory. It is used


for matrix manipulation, implementation of
MATLAB
an algorithm, and creation of a user
1978 interface.

C++ is the fastest high-level programming


language.
Objective-C, C++
Earlier, Apple Inc uses Objective-C to make
applications.
1983

It is a purely functional programming


Haskell
1990 language.

The language is very easy to


Python understand. Famous language among data
1991 scientists and analysts.

JAVA is everywhere. JAVA is the platform-


independent language.
PHP is a scripting language mainly used in
JAVA, PHP, web programming for connecting databases.
JavaScript JavaScript enables interactive web pages. JS
is the most popular programming language.
JS is famous for building a web application.
It makes our page interactive.
1995

C#(C-sharp) is mainly used for making


C# games. Unity engine uses C# for making
2000 amazing games for all platforms
YEAR OF PROGRAMMING
RELEASE LANGUAGES FACTS

GO language is developed in Google by


GO Robert Griesemer, Rob Pike, and Ken
2009 Thompson.

Kotlin is developed by JetBrains. It is used


Kotlin
2011 for making an android application.

Swift language is developed by Apple Inc. It


Swift
2014 is a general-purpose program

Types of Programming Languages


Programming languages can be broadly classified into three categories:

1. Low-Level Languages: These languages are closer to the machine


language and are used to write operating systems, device drivers, and
firmware. Examples include Assembly Language, C, and C++.

2. High-Level Languages: These languages are easier to learn and use than
low-level languages. They are used to write applications, games, and
websites. Examples include Java, Python, and Ruby.

3. Scripting Languages: These languages are used to automate repetitive


tasks, such as web development and system administration. Examples
include Perl, Python, and Ruby.
List of Programming Languages :

1. Java: Java is a high-level, object-oriented programming language


developed by Sun Microsystems (now owned by Oracle Corporation). Java
is designed to be platform-independent, meaning that Java code can run on
any computer with a Java Virtual Machine (JVM) installed. Java is widely
used for developing web applications, mobile apps, and enterprise
software.

2. Python: Python is a high-level, interpreted programming language that


emphasizes code readability and simplicity. Python is widely used for
scientific computing, data analysis, web development, and artificial
intelligence.

3. C: C is a low-level, compiled programming language that is widely used


for systems programming and embedded systems development. C is
known for its efficiency and control over hardware, making it an ideal
choice for developing operating systems, device drivers, and firmware.

4. C++: C++ is an extension of the C programming language that adds


support for object-oriented programming. C++ is widely used for
developing high- performance software, including operating systems,
games, and scientific simulations.

5. Ruby: Ruby is a high-level, interpreted programming language that


emphasizes simplicity and productivity. Ruby is widely used for web
development, automation, and scripting.

6. Swift: Swift is a high-level, compiled programming language developed by


Apple Inc. Swift is designed for developing applications for iOS, macOS,
and watchOS. Swift is known for its safety, speed, and expressiveness.

7. PHP: PHP is a server-side, interpreted programming language that is


widely used for developing web applications. PHP is known for its
simplicity and ease of use, making it a popular choice for web developers.

8. SQL: SQL (Structured Query Language) is a domain-specific language


used for managing relational databases. SQL is used to create, modify, and
query databases, and is widely used in business and data analysis.
9. Assembly language: Assembly language is a low-level programming
language that is used to write instructions for a computer's CPU. Assembly
language is difficult to read and write, but provides direct access to
hardware and can be used to write highly optimized code

Examples of low-level programming languages:

1. Machine Language: Machine language is the lowest-level programming


language that a computer can understand. It is a binary code consisting of
0's and 1's that correspond to machine instructions. Each computer
architecture has its specific machine language, which is difficult to read
and write.

2. Assembly Language: Assembly language is a low-level programming


language that is easier to read and write than machine language. Assembly
language uses mnemonics to represent machine instructions, making it
more human-readable. Assembly language programs are translated into
machine language by an assembler.

3. C: C is a high-level programming language that can also be considered a


low-level language due to its low-level memory access and pointer
manipulation capabilities. C provides direct access to hardware, making it
an ideal choice for systems programming and embedded systems
development.

4. Ada: Ada is a high-level programming language designed for safety-


critical systems, such as aerospace and defense applications. Ada provides
low-level access to hardware and memory management, making it suitable
for systems programming.

5. FORTRAN: FORTRAN (FORmula TRANslation) is a high-level


programming language that was designed for scientific and engineering
applications. FORTRAN provides low-level control over hardware,
allowing for efficient computation and numerical analysis.

Chapter 2

Describing sytax:
What is syntax in a programming language?

f you’re new to programming, you’ll find that learning syntax rules is a key part of
learning a programming language. Knowing what syntax is and why it’s important
will help you learn these rules and gain a better understanding of how code works
structurally.

Today, we’ll explore syntax in more detail, consider why it’s important, and learn
how it varies between a few major programming languages.

What is syntax?
Syntax is a set of rules that tell us what arrangements of characters create a valid
statement in a language. Human languages and programming languages are both
dependent on syntax. To use either type of language effectively, you need to know
how to fit elements together to achieve your goal. In the case of a human
language, this goal is successful communication. In the case of a programming
language, the goal is to issue a set of directives that a computer can read and act
on.

Syntax errors occur when the elements in a statement are disordered in a way that
impedes successful communication. Let’s look at a simple example from English:
1. The dog chased the rabbit.
2. The rabbit chased the dog.
3. Chased the rabbit the dog.

As you can see, word order matters a great deal in English. The words are the same in
these three sentences, but combining them in different ways results in very different
meanings. Sentence one contains a simple statement that accords with English syntax.
Sentence two switches the positions of the subject and direct object, so that while the
sentence still makes sense syntactically, the meaning is radically different. The third
sentence shuffles the words around in such a way that it’s difficult to read. It doesn’t
follow the conventions of English syntax, so it’s unclear what the sentence is saying.

Using correct syntax is arguably even more important in programming languages than
it is in human languages. If you’re learning a new human language and you only have
a beginner’s understanding of its syntax, you’ll often still be able to get your meaning
across. This is because syntax in human languages is often flexible, and human
listeners can problem-solve to figure out the meaning of an imperfect sentence.
Programming languages are less accommodating to syntax errors. Syntax determines
how we organize the elements in our code to make it legible to the computer. If
there’s a syntax error, the computer might not be able to read it.

Here are some examples of what programming syntax can determine:

 whether we use lower-case or upper-case characters


 how we notate code comments
 how we use whitespace
 how we indicate the relationships between statements (individual commands
issued to the computer)

Syntax is important in programming because it would be impossible to write


functioning code without it. Code is a set of instructions written in a language that a
computer can read and act on. If there are syntax errors in the code, the program
won’t work.

If the syntax of a language is not followed, the code will not be understood by a
compiler or interpreter.

Compilers convert programming languages like Java or C++ into binary code that
computers can understand. If the syntax is incorrect, the code will not compile.

Interpreters execute programming languages such as JavaScript or Python at


runtime. The incorrect syntax will cause the code to fail.

That’s why it is crucial that a programmer pays close attention to a language’s syntax.
No programmer likes to get a syntax error.

What Is Basic Syntax?


Basic syntax represents the fundamental rules of a programming
language. Without these rules, it is impossible to write
functioning code.

Every language has its own set of rules that make up its basic
syntax. Naming conventions are a primary component of basic
syntax conventions and vary by language.
 Case Sensitive. Java, C++, and Python are examples of
languages that are case-sensitive. Identifiers such as world
and World have different meanings in these languages.
Languages such as Basic and SQL are insensitive, meaning
world and World have the same meaning.
 Class Names. Java requires the first letter of each word in class
names be upper case. For example, class FirstJavaClass.
Languages such as C or C++ use an underscore to separate
words. In C, the class name would be first_java_class.
 Program Filenames. The name of a Java program file must
match the class name with the extension ‘*.java” added to
the name. For example, [Link] would be the
name of the program file for the class FirstJavaClass. C and
C++ files require a “*.c” or “*.cpp” extension but have no
other stipulations.

Different languages may have rules for adding comments, using


white space, or declaring variables.

Object-oriented languages such as Java and C use methods that


have different syntax requirements.

Why Is Syntax Important in Programming?


Syntax improves code readability. It ensures that the four C’s of coding are
maintained:

 Communication
 Code integration
 Consistency
 Clarity

The concept behind conventions is to make the code explain itself. If


the code is self-explanatory, the focus can be on design and
program improvements and not on what does this mean?
Using consistent standards means that code is predictable and
discoverable when read by other programmers.

When code does not follow conventions, it becomes


disorganized and difficult to read. It becomes what is known as
spaghetti code.

The term has a negative connotation indicating that the


programmer did not have the skills or experience needed to
write readable code.

FORMAL METHODS

The formal language-generation mechanisms, usually called grammars,


that are commonly used to describe the syntax of programming languages.

Backus-Naur Form and Context-Free Grammars

In the middle to late 1950s, two men, Noam Chomsky and John Backus, in
unrelated research efforts, developed the same syntax description formalism, which
subsequently became the most widely used method for programming language syntax.

Context-Free Grammars

 Two of these grammar classes, named context-free and regular, turned out to be
useful for describing the syntax of programming languages.
 The forms of the tokens of programming languages can be described by regular
grammars.
 The syntax of whole programming languages, with minor exceptions, can be
described by context-free grammars.

Origins of Backus-Naur Form

 ALGOL 58 was presented by John Backus, a prominent member of the ACM-


GAMM group, at an international conference in 1959 (Backus, 1959).
 This revised method of syntax description became known as Backus-Naur
Form, or simply BNF.
 BNF is a natural notation for describing syntax.
 In fact, something similar to BNF was used by Panini to describe the syntax of
Sanskrit several hundred years before Christ.
 The use of BNF in the ALGOL 60 report was not immediately accepted by
computer users, it soon became and is still the most popular method of
concisely describing programming language syntax.
 BNF is nearly identical to Chomsky’s generative devices for context-free
languages, called context-free grammars.

BNF Fundamentals

 A metalanguage is a language that is used to describe another language.


 BNF is a metalanguage for programming languages.
 BNF uses abstractions for syntactic structures. A simple assignment
statement, for example, might be represented by the abstraction. The actual
definition can be given by
 The text on the left side of the arrow, which is aptly called the left-hand side
(LHS), is the abstraction being defined.
<assign> → <var> = <expression>
 The text to the right of the arrow is the definition of the LHS. It is called the
right-hand side (RHS) and consists of some mixture of tokens, lexemes, and
references to other abstractions.
 This particular rule specifies that the abstraction is defined as an instance of the
abstraction, followed by the lexeme =, followed by an instance of the
abstraction.
 One example sentence whose syntactic structure is described by the rule is
total = subtotal1 + subtotal2
 The abstractions in a BNF description, or grammar, are often called
nonterminal symbols, or simply non-terminals, and the lexemes and tokens of
the rules are called terminal symbols, or simply terminals.
 Nonterminal symbols can have two or more distinct definitions, representing
two or more possible syntactic forms in the language.
 Multiple definitions can be written as a single rule, with the different
definitions separated by the symbol |, meaning logical OR.
 For example, an if statement can be described with the rules
<if_stmt> → if (<logic_expr>) <stmt>
<if_stmt> → if (<logic_expr>) <stmt> else <stmt>
or with the rule
<if_stmt> → if (<logic_expr>) <stmt>
| if ( <logic_expr>) <stmt> else <stmt>
 These rules, represents either a single statement or a compound statement.

Describing Lists

 Variable-length lists in mathematics are often written using an ellipsis (. . .); 1,


2, . . . is an example.
 BNF does not include the ellipsis, so an alternative method is required for
describing lists of syntactic elements in programming languages (for example, a
list of identifiers appearing on a data declaration statement).
 For BNF, the alternative is recursion. A rule is recursive if its LHS appears in
its RHS. The following rules illustrate how recursion is used to describe
lists:
<ident_list> → identifier
| identifier, <ident_list>
 This defines as either a single token (identifier) or an identifier followed by a
comma and another instance of.

The first step in learning any programming language is to


understand the basics such as phrase structure, proper syntax and
correctly structured code.

Why Is Syntax Important in Programming?


Syntax improves code readability. It ensures that the four C’s of coding are
maintained:

 Communication
 Code integration
 Consistency
 Clarity

The concept behind conventions is to make the code explain itself. If


the code is self-explanatory, the focus can be on design and
program improvements and not on what does this mean?
Using consistent standards means that code is predictable and
discoverable when read by other programmers.

When code does not follow conventions, it becomes


disorganized and difficult to read. It becomes what is known as
spaghetti code.

The term has a negative connotation indicating that the


programmer did not have the skills or experience needed to
write readable code.

Examples of low-level programming languages


Machine Language: Machine language is the lowest-level programming language
that a computer can understand. It is a binary code consisting of 0's and 1's that
correspond to machine instructions. Each computer architecture has its specific
machine language, which is difficult to read and write.

1. Assembly Language: Assembly language is a low-level programming


language that is easier to read and write than machine language.
Assembly language uses mnemonics to represent machine instructions,
making it more human-readable. Assembly language programs are
translated into machine language by an assembler.

2. C: C is a high-level programming language that can also be considered a


low-level language due to its low-level memory access and pointer
manipulation capabilities. C provides direct access to hardware, making
it an ideal choice for systems programming and embedded systems
development.

3. Ada: Ada is a high-level programming language designed for safety-


critical systems, such as aerospace and defense applications. Ada provides
low-level access to hardware and memory management, making it
suitable for systems programming.

4. FORTRAN: FORTRAN (FORmula TRANslation) is a high-level


programming language that was designed for scientific and engineering
applications. FORTRAN provides low-level control over hardware,
allowing for efficient computation and numerical analysis.
Key skills

1. Logical thinking: Programming requires logical thinking, the ability to


break down complex problems into smaller, manageable parts, and to
develop algorithms to solve them.

2. Attention to detail: Good programmers pay attention to detail,


writing clean, efficient, and error-free code.

3. Persistence: Programming can be challenging, and it often


requires persistence and patience to debug and solve problems.

4. Adaptability: Programming languages and technologies are constantly


evolving, so good programmers must be adaptable and willing to learn new
skills and techniques.

5. Collaboration: Programming often involves working in teams, so good


programmers must be able to collaborate effectively with others, share
their ideas, and give and receive constructive feedback.

6. Creativity: Programming can also be a creative process, requiring


programmers to come up with innovative solutions to problems and to
think outside the box.
Chapter 3
Context-free grammars

efinition − A context-free grammar (CFG) consisting of a finite set of grammar rules is


a quadruple (N, T, P, S) where
 N is a set of non-terminal symbols.

P is a set of rules, P: N → (N ∪ T)*, i.e., the left-hand side of the production


 T is a set of terminals where N ∩ T = NULL.

rule P does have any right context or left context.
 S is the start symbol.
Example
 The grammar ({A}, {a, b, c}, P, A), P : A → aA, A → abc.
 The grammar ({S, a, b}, {a, b}, P, S), P: S → aSa, S → bSb, S → ε
 The grammar ({S, F}, {0, 1}, P, S), P: S → 00S | 11F, F → 00F |
ε Generation of Derivation Tree
A derivation tree or parse tree is an ordered rooted tree that graphically represents the
semantic information a string derived from a context-free grammar.
Representation Technique
 Root vertex − Must be labeled by the start symbol.
 Vertex − Labeled by a non-terminal symbol.
 Leaves − Labeled by a terminal symbol or ε.

If S → x1x2 …… xn is a production rule in a CFG, then the parse tree / derivation tree
will be as follows −
There are two different approaches to draw a derivation tree −
Top-down Approach −
 Starts with the starting symbol S
 Goes down to tree leaves using productions
Bottom-up Approach −
 Starts from tree leaves
 Proceeds upward to the root which is the starting symbol S
Derivation or Yield of a Tree
The derivation or the yield of a parse tree is the final string obtained by concatenating
the labels of the leaves of the tree from left to right, ignoring the Nulls. However, if
all the leaves are Null, derivation is Null.
Example
Let a CFG {N,T,P,S} be
N = {S}, T = {a, b}, Starting symbol = S, P = S → SS | aSb | ε
One derivation from the above CFG is “abaabb”
S → SS → aSbS → abS → abaSb → abaaSbb → abaabb
Sentential Form and Partial Derivation Tree
A partial derivation tree is a sub-tree of a derivation tree/parse tree such that either all
of its children are in the sub-tree or none of them are in the sub-tree.
Example
If in any CFG the productions are −
S → AB, A → aaA | ε, B → Bb| ε
the partial derivation tree can be the following −
If a partial derivation tree contains the root S, it is called a sentential form. The
above sub-tree is also in sentential form.
Leftmost and Rightmost Derivation of a String
 Leftmost derivation − A leftmost derivation is obtained by applying
production to the leftmost variable in each step.
 Rightmost derivation − A rightmost derivation is obtained by applying
production to the rightmost variable in each step.
Example
Let any set of production rules in a CFG be
X → X+X | X*X |X| a
over an alphabet {a}.
The leftmost derivation for the string "a+a*a" may be −
X → X+X → a+X → a + X*X → a+a*X → a+a*a
The stepwise derivation of the above string is shown as below −
The rightmost derivation for the above string "a+a*a" may be −
X → X*X → X*a → X+X*a → X+a*a → a+a*a
The stepwise derivation of the above string is shown as below −
Left and Right Recursive Grammars
In a context-free grammar G, if there is a production in the form X → Xa where X is
a non-terminal and ‘a’ is a string of terminals, it is called a left recursive production.
The grammar having a left recursive production is called a left recursive grammar.
And if in a context-free grammar G, if there is a production is in the form X →
aX where X is a non-terminal and ‘a’ is a string of terminals, it is called a right
recursive production. The grammar having a right recursive production is called
a right recursive grammar.

Ambiguity in Context-Free Grammars


If a context free grammar G has more than one derivation tree for some string w ∈
L(G), it is called an ambiguous grammar. There exist multiple right-most or left-
most derivations for some string generated from that grammar.
Problem
Check whether the grammar G with production rules −
X → X+X | X*X |X| a
is ambiguous or not.
Solution
Let’s find out the derivation tree for the string "a+a*a". It has two leftmost derivations.
Derivation 1 − X → X+X → a +X → a+ X*X → a+a*X → a+a*a
Parse tree 1 −

Derivation 2 − X → X*X → X+X*X → a+ X*X → a+a*X → a+a*a


Parse tree 2 −
Since there are two parse trees for a single string "a+a*a", the grammar G is ambiguous.
CFL Closure Property

Previous Page

Next Page

Context-free languages are closed under −


 Union
 Concatenation
 Kleene Star operation
Union
Let L1 and L2 be two context free languages. Then L1 ∪ L2 is also context free.
Example
Let L1 = { anbn , n > 0}. Corresponding grammar G1 will have P: S1 → aAb|ab
Let L2 = { cmdm , m ≥ 0}. Corresponding grammar G2 will have P: S2 → cBb| ε
Union of L1 and L2, L = L1 ∪ L2 = { anbn } ∪ { cmdm }
The corresponding grammar G will have the additional production S → S1 | S2
Concatenation
If L1 and L2 are context free languages, then L1L2 is also context free.
Example
Union of the languages L1 and L2, L = L1L2 = { anbncmdm }
The corresponding grammar G will have the additional production S → S1 S2
Kleene Star
If L is a context free language, then L* is also context free.
Example
Let L = { anbn , n ≥ 0}. Corresponding grammar G will have P: S → aAb| ε
Kleene Star L1 = { anbn }*
The corresponding grammar G1 will have additional productions S1 → SS1 | ε
Context-free languages are not closed under −
 Intersection − If L1 and L2 are context free languages, then L1 ∩ L2 is not
necessarily context free.
 Intersection with Regular Language − If L1 is a regular language and L2 is a
context free language, then L1 ∩ L2 is a context free language.
 Complement − If L1 is a context free language, then L1’ may not be context
free.
CFG Simplification

In a CFG, it may happen that all the production rules and symbols are not needed for
the derivation of strings. Besides, there may be some null productions and unit
productions. Elimination of these productions and symbols is called simplification of
CFGs. Simplification essentially comprises of the following steps −
 Reduction of CFG
 Removal of Unit Productions
 Removal of Null Productions
Reduction of CFG
CFGs are reduced in two phases −
Phase 1 − Derivation of an equivalent grammar, G’, from the CFG, G, such that each
variable derives some terminal string.
Derivation Procedure −
Step 1 − Include all symbols, W1, that derive some terminal and initialize i=1.
Step 2 − Include all symbols, Wi+1, that derive Wi.
Step 3 − Increment i and repeat Step 2, until Wi+1 = Wi.
Step 4 − Include all production rules that have Wi in it.
Phase 2 − Derivation of an equivalent grammar, G”, from the CFG, G’, such that
each symbol appears in a sentential form.
Derivation Procedure −
Step 1 − Include the start symbol in Y1 and initialize i = 1.
Step 2 − Include all symbols, Yi+1, that can be derived from Yi and include all
production rules that have been applied.
Step 3 − Increment i and repeat Step 2, until Yi+1 = Yi.
Problem
Find a reduced grammar equivalent to the grammar G, having production rules, P: S
→ AC | B, A → a, C → c | BC, E → aA | e
Solution
Phase 1 −
T = { a, c, e }
W1 = { A, C, E } from rules A → a, C → c and E → aA
W2 = { A, C, E } U { S } from rule S → AC
W3 = { A, C, E, S } U ∅
Since W2 = W3, we can derive G’ as −
G’ = { { A, C, E, S }, { a, c, e }, P, {S}}
where P: S → AC, A → a, C → c , E → aA | e
Phase 2 −
Y1 = { S }
Y2 = { S, A, C } from rule S → AC
Y3 = { S, A, C, a, c } from rules A → a and C → c
Y4 = { S, A, C, a, c }
Since Y3 = Y4, we can derive G” as −
G” = { { A, C, S }, { a, c }, P, {S}}
where P: S → AC, A → a, C → c
Removal of Unit Productions
Any production rule in the form A → B where A, B ∈ Non-terminal is called unit
production..
Removal Procedure −

→ x occurs in the grammar. [x ∈ Terminal, x can be Null]


Step 1 − To remove A → B, add production A → x to the grammar rule whenever B

Step 2 − Delete A → B from the grammar.


Step 3 − Repeat from step 1 until all unit productions are removed.
Problem
Remove unit production from the following −
S → XY, X → a, Y → Z | b, Z → M, M → N, N → a
Solution −
There are 3 unit productions in the grammar −
Y → Z, Z → M, and M → N
At first, we will remove M → N.
As N → a, we add M → a, and M → N is removed.
The production set becomes
S → XY, X → a, Y → Z | b, Z → M, M → a, N → a
Now we will remove Z → M.
As M → a, we add Z→ a, and Z → M is removed.
The production set becomes
S → XY, X → a, Y → Z | b, Z → a, M → a, N → a
Now we will remove Y → Z.
As Z → a, we add Y→ a, and Y → Z is removed.
The production set becomes
S → XY, X → a, Y → a | b, Z → a, M → a, N → a
Now Z, M, and N are unreachable, hence we can remove those.
The final CFG is unit production free −
S → XY, X → a, Y → a | b
Removal of Null Productions
In a CFG, a non-terminal symbol ‘A’ is a nullable variable if there is a production A
→ ε or there is a derivation that starts at A and finally ends up with
ε: A →............→ ε
Removal Procedure
Step 1 − Find out nullable non-terminal variables which derive ε.
Step 2 − For each production A → a, construct all productions A → x where x is
obtained from ‘a’ by removing one or multiple non-terminals from Step 1.
Step 3 − Combine the original productions with the result of step 2 and remove ε -
productions.
Problem
Remove null production from the following −
S → ASA | aB | b, A → B, B → b | ∈
Solution −
There are two nullable variables − A and B
At first, we will remove B → ε.
After removing B → ε, the production set becomes −
S→ASA | aB | b | a, A ε B| b | &epsilon, B → b
Now we will remove A → ε.
After removing A → ε, the production set becomes −
S→ASA | aB | b | a | SA | AS | S, A → B| b, B → b
This is the final production set without null transition.
homsky Normal Form

A CFG is in Chomsky Normal Form if the Productions are in the following forms −
 A→a
 A → BC
 S→ε
where A, B, and C are non-terminals and a is terminal.
Algorithm to Convert into Chomsky Normal Form −
Step 1 − If the start symbol S occurs on some right side, create a new start
symbol S’ and a new production S’→ S.
Step 2 − Remove Null productions. (Using the Null production removal algorithm
discussed earlier)
Step 3 − Remove unit productions. (Using the Unit production removal algorithm
discussed earlier)
Step 4 − Replace each production A → B1…Bn where n > 2 with A → B1C where C
→ B2 …Bn. Repeat this step for all productions having two or more symbols in the
right side.
Step 5 − If the right side of any production is in the form A → aB where a is a
terminal and A, B are non-terminal, then the production is replaced by A → XB and
X → a. Repeat this step for every production which is in the form A → aB.
Problem
Convert the following CFG into CNF
S → ASA | aB, A → B | S, B → b | ε
Solution
(1) Since S appears in R.H.S, we add a new state S0 and S0→S is added to the
production set and it becomes −
S0→S, S→ ASA | aB, A → B | S, B → b | ∈
(2) Now we will remove the null productions
− B → ∈ and A → ∈
After removing B → ε, the production set becomes −
S0→S, S→ ASA | aB | a, A → B | S | ∈, B → b
After removing A → ∈, the production set becomes −
S0→S, S→ ASA | aB | a | AS | SA | S, A → B | S, B → b
(3) Now we will remove the unit productions.
After removing S → S, the production set becomes −
S0→S, S→ ASA | aB | a | AS | SA, A → B | S, B → b
After removing S0→ S, the production set becomes −
S0→ ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A → B | S, B → b
After removing A→ B, the production set becomes −
S0 → ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A→S|b
B→b
After removing A→ S, the production set becomes −
S0 → ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A → b |ASA | aB | a | AS | SA, B → b
(4) Now we will find out more than two variables in the R.H.S
Here, S0→ ASA, S → ASA, A→ ASA violates two Non-terminals in R.H.S.
Hence we will apply step 4 and step 5 to get the following final production set which
is in CNF −
S0→ AX | aB | a | AS | SA
S→ AX | aB | a | AS | SA
A → b |AX | aB | a | AS | SA
B→b
X→
SA
(5) We have to change the productions S0→ aB, S→ aB, A→ aB
And the final production set becomes −
S0→ AX | YB | a | AS | SA
S→ AX | YB | a | AS | SA
A → b A → b |AX | YB | a | AS | SA
B→b
X → SA
Y→a
Greibach Normal Form

Previous Page

Next Page

A CFG is in Greibach Normal Form if the Productions are in the following forms −
A→b
A → bD1…Dn
S→ε
where A, D1,....,Dn are non-terminals and b is a terminal.
Algorithm to Convert a CFG into Greibach Normal Form
Step 1 − If the start symbol S occurs on some right side, create a new start
symbol S’ and a new production S’ → S.
Step 2 − Remove Null productions. (Using the Null production removal algorithm
discussed earlier)
Step 3 − Remove unit productions. (Using the Unit production removal algorithm
discussed earlier)
Step 4 − Remove all direct and indirect left-recursion.
Step 5 − Do proper substitutions of productions to convert it into the proper form of
GNF.
Problem
Convert the following CFG into CNF
S → XY | Xn | p
X → mX | m
Y → Xn | o
Solution
Here, S does not appear on the right side of any production and there are no unit or null
productions in the production rule set. So, we can skip Step 1 to Step 3.
Step 4
Now after replacing
X in S → XY | Xo | p
with
mX | m
we obtain
S → mXY | mY | mXo | mo | p.
And after replacing
X in Y → Xn | o
with the right side
of X → mX | m
we obtain
Y → mXn | mn | o.
Two new productions O → o and P → p are added to the production set and then we
came to the final GNF as the following −
S → mXY | mY | mXC | mC | p
X → mX | m
Y → mXD | mD | o
O→o
P→p
Pumping Lemma for CFG

Lemma

∈ L of length ≥ p can be written as w = uvxyz, where vy ≠ ε, |vxy| ≤ p, and for all i ≥


If L is a context-free language, there is a pumping length p such that any string w

0, uvixyiz ∈ L.
Applications of Pumping Lemma
Pumping lemma is used to check whether a grammar is context free or not. Let us
take an example and show how it is checked.
Problem
Find out whether the language L = {xnynzn | n ≥ 1} is context free or not.
Solution
Let L is context free. Then, L must satisfy pumping lemma.
At first, choose a number n of the pumping lemma. Then, take z as 0n1n2n.
Break z into uvwxy, where
|vwx| ≤ n and vx ≠ ε.
Hence vwx cannot involve both 0s and 2s, since the last 0 and the first 2 are at least
(n+1) positions apart. There are two cases −
Case 1 − vwx has no 2s. Then vx has only 0s and 1s. Then uwy, which would have to
be in L, has n 2s, but fewer than n 0s or 1s.
Case 2 − vwx has no 0s.
Here contradiction occurs.
Hence, L is not a context-free language.
CHAPTER 4
ATTRIBUTE
GRAMMAR
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) are appended to one or more of its non-terminals in order to
provide context-sensitive information. Each attribute has well-defined domain of
values, such as integer, float, character, string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and
it can help specify the syntax and semantics of a programming language. Attribute
grammar (when viewed as a parse-tree) can pass values or information among the
nodes of a tree.
Example:
E → E + T { [Link] = [Link] + [Link] }
The right part of the CFG contains the semantic rules that specify how the grammar
should be interpreted. Here, the values of non-terminals E and T are added together
and the result is copied to the non-terminal E.
Semantic attributes may be assigned to their values from their domain at the time of
parsing and evaluated at the time of assignment or conditions. Based on the way the
attributes get their values, they can be broadly divided into two categories :
synthesized attributes and inherited attributes.
Synthesized attributes
These attributes get values from the attribute values of their child nodes. To illustrate,
assume the following production:
S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized
attribute, as the values of ABC are synthesized to S.
As in our previous example (E → E + T), the parent node E gets its value from its
child node. Synthesized attributes never take values from their parent nodes or any
sibling nodes.
Inherited attributes
In contrast to synthesized attributes, inherited attributes can take values from parent
and/or siblings. As in the following production,
S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C
can take values from S, A, and B.
Expansion : When a non-terminal is expanded to terminals as per a grammatical rule

Reduction : When a terminal is reduced to its corresponding non-terminal according


to grammar rules. Syntax trees are parsed top-down and left to right. Whenever
reduction occurs, we apply its corresponding semantic rules (actions).
Semantic analysis uses Syntax Directed Translations to perform the above tasks.
Semantic analyzer receives AST (Abstract Syntax Tree) from its previous stage (syntax
analysis).
Semantic analyzer attaches attribute information with AST, which are called
Attributed AST.
Attributes are two tuple value, <attribute name, attribute value>
For example:
int value = 5;
<type, “integer”>
<presentvalue, “5”>

For every production, we attach a semantic rule.


S-attributed SDT
If an SDT uses only synthesized attributes, it is called as S-attributed SDT. These
attributes are evaluated using S-attributed SDTs that have their semantic actions
written after the production (right hand side).

As depicted above, attributes in S-attributed SDTs are evaluated in bottom-up parsing,


as the values of the parent nodes depend upon the values of the child nodes.
L-attributed SDT
This form of SDT uses both synthesized and inherited attributes with restriction of not
taking values from right siblings.
In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling
nodes. As in the following production
S → ABC
S can take values from A, B, and C (synthesized). A can take values from S only. B
can take values from S and A. C can get values from S, A, and B. No non-terminal
can get values from the sibling to its right.
Attributes in L-attributed SDTs are evaluated by depth-first and left-to-right parsing
manner.
We may conclude that if a definition is S-attributed, then it is also L-attributed as L-
attributed definition encloses S-attributed definitions.

An attribute grammar is a device used to describe more of the structure of a


programming language than can be described with a context-free grammar. An
attribute grammar is an extension to a context-free [Link] extension allows
certain language rules to be conveniently described, such as type compatibility.
Before we formally define the form of attribute grammars, we must clarify the
concept of static semantics.

 Static Semantics
There are some characteristics of the structure of programming languages that
are difficult to describe with BNF, and some that are impossible. As an
example of a syntax rule that is difficult to specify with BNF, consider type
compatibility rules. In Java, for example, a floating-point value cannot be
assigned to an integer type variable, although the opposite is legal. Although
this restriction can be specified in BNF, it requires additional nonterminal
symbols and rules. If all of the typing rules of Java were specified in BNF, the
grammar would become too large to be useful, because the size of the grammar
determines the size of the syntax analyzer.

As an example of a syntax rule that cannot be specified in BNF, consider the


common rule that all variables must be declared before they are referenced. It
has been proven that this rule cannot be specified in BNF. These problems
exemplify the categories of language rules called static semantics rules. The
static semantics of a language is only indirectly related to the meaning of
programs during execution; rather, it has to do with the legal forms of programs
(syntax rather than semantics). Many static semantic rules of a language state
its type constraints. Static semantics is so named because the analysis required
to check these specifications can be done at compile
time.

Because of the problems of describing static semantics with BNF, a variety of


more powerful mechanisms has been devised for that task. One such
mechanism, attribute grammars, was designed by Knuth (1968a) to describe
both the syntax and the static semantics of programs. Attribute grammars are a
formal approach both to describing and checking the correctness of the static
semantics rules of a program. Although they are not always used in a formal
way in compiler design, the basic concepts of attribute grammars are at least
informally used in every compiler.
 Basic Concepts
Attribute grammars are context-free grammars to which have been added
attributes, attribute computation functions, and predicate functions. Attributes,
which are associated with grammar symbols (the terminal and nonterminal
symbols), are similar to variables in the sense that they can have values
assigned to them. Attribute computation functions, sometimes called semantic
functions, are associated with grammar rules. They are used to specify how
attribute values are computed. Predicate functions, which state the static
semantic rules of the language, are associated with grammar rules. These
concepts will become clearer after we formally define attribute grammars and
provide an example.

Attribute Grammars Defined

An attribute grammar is a grammar with the following additional features:

• Associated with each grammar symbol X is a set of attributes A(X). The set A(X)
consists of two disjoint sets S(X) and I(X), called synthesized and inherited attributes,
respectively. Synthesized attributes are used to pass semantic information up a parse
tree, while inherited attributes pass semantic information down and across a tree.
• Associated with each grammar rule is a set of semantic functions and a possibly
empty set of predicate functions over the attributes of the symbols in the grammar
rule. For a rule X0 SX1 c Xn, the synthesized attributes of X0 are computed with
semantic functions of the form S(X0) = f(A(X1), c , A(Xn)). So the value of a
synthesized attribute on a parse tree node depends only on the values of the attributes
on that node’s children nodes. Inherited attributes of symbols Xj, 1 … j … n (in the
rule above), are computed with a semantic function of the form I(Xj) = f(A(X0), c ,
A(Xn)). So the value of an inherited attribute on a parse tree node depends on the
attribute values of that node’s parent node and those of its sibling nodes. Note that, to
avoid circularity, inherited attributes are often restricted to functions of the form I(Xj)
= f(A(X0), c , A(X(j-1))). This form prevents an inherited attribute from depending on
itself or on attributes to the right in the
parse tree.

• A predicate function has the form of a Boolean expression on the union of the
attribute set {A(X0), c , A(Xn)} and a set of literal attribute values. The only
derivations allowed with an attribute grammar are those in which every predicate
associated with every nonterminal is true. A false predicate
function value indicates a violation of the syntax or static semantics rules
of the language.

A parse tree of an attribute grammar is the parse tree based on its underlying BNF
grammar, with a possibly empty set of attribute values attached to each node. If all the
attribute values in a parse tree have been computed, the tree is said to be fully
attributed. Although in practice it is not always done this way, it is convenient to
think of attribute values as being computed after the complete unattributed parse tree
has been constructed by the compiler.

Intrinsic Attributes
Intrinsic attributes are synthesized attributes of leaf nodes whose values are determined
outside the parse tree. For example, the type of an instance of a variable in a program
could come from the symbol table, which is used to store variable names and their
types. The contents of the symbol table are set based on earlier declaration statements.
Initially, assuming that an unattributed parse tree has been constructed and that
attribute values are needed, the only attributes with values are the intrinsic attributes
of the leaf nodes. Given the intrinsic attribute values on a parse tree, the semantic
functions can be used to compute the remaining attribute values.
Examples of Attribute Grammars

As a very simple example of how attribute grammars can be used to describe static
semantics, consider the following fragment of an attribute grammar that describes the
rule that the name on the end of an Ada procedure must match the procedure’s name.
(This rule cannot be stated in BNF.) The string attribute of <proc_name>, denoted by
<proc_name>.string, is the actual string of characters that were found immediately
following the reserved word procedure by the compiler. Notice that when there is
more than one occurrence of a nonterminal in a syntax rule in an attribute grammar,
the nonterminals are subscripted with brackets to distinguish them. Neither the
subscripts nor the brackets are part of the described language.

In this example, the predicate rule states that the name string attribute of the
<proc_name> nonterminal in the subprogram header must match the name string
attribute of the <proc_name> nonterminal following the end of the subprogram. Next,
we consider a larger example of an attribute grammar. In this case, the example
illustrates how an attribute grammar can be used to check the type rules of a simple
assignment statement. The syntax and static semantics of this assignment statement
are as follows: The only variable names are A, B, and C. The right side of the
assignments can be either a variable or an expression in the form of a variable added
to another variable. The variables can be one of two types: int or real. When there are
two variables on the right side of an assignment, they need not be the same type. The
type of the expression when the operand types are not the same is always real. When
they are the same, the expression type is that of the operands. The type of the left side
of the assignment must match the type of the right side. So the types of operands in
the right side can be mixed, but the assignment is valid only if the target and the value
resulting from evaluating the right side have the same type. The attribute grammar
specifies these static semantic rules.

The syntax portion of our example attribute grammar is


<assign> → <var> = <expr>
<expr> → <var> <var>
| <var>
<var> → A | B | C

EXAMPLE 3.6 An Attribute Grammar for Simple Assignment Statements


1. Syntax rule: <assign> → <var> = <expr>
Semantic rule: <expr>.expected_type ← <var>.actual_type

2. Syntax rule: <expr> → <var>[2] <var>[3]


Semantic rule: <expr>.actual_type ←
if (<var>[2].actual_type = int) and
(<var>[3].actual_type = int)
then int
else real
end if
Predicate: <expr>.actual_type == <expr>.expected_type

3. Syntax rule: <expr> → <var>


Semantic rule: <expr>.actual_type ← <var>.actual_type
Predicate: <expr>.actual_type == <expr>.expected_type

4. Syntax rule: <var> → A | B | C


Semantic rule: <var>.actual_type ← look-up(<var>.string)
The look-up function looks up a given variable name in the symbol table and returns
the variable’s type.

A parse tree of the sentence A = A B generated by the grammar in Example 3.6 is


shown in Figure 3.6. As in the grammar, bracketed numbers are added after the
repeated node labels in the tree so they can be referenced unambiguously.
Computing attribute values

The process of computing the attribute values of a parse tree, which is sometimes
called decorating the parse tree. If all attributes were inherited, this could proceed in a
completely top-down order, from the root to the leaves. Alternatively, it could
proceed in a completely bottomup order, from the leaves to the root, if all the
attributes were [Link] our grammar has both synthesized and inherited
attributes, the evaluation process cannot be in any single direction. The following is
an evaluation of the attributes, in an order in which it is possible to
compute them:

1. <var>.actual_type ← look-up(A) (Rule 4)


2. <expr>.expected_type ← <var>.actual_type (Rule 1)
3. <var>[2].actual_type ← look-up(A) (Rule 4)
<var>[3].actual_type ← look-up(B) (Rule 4)
4. <expr>.actual_type ← either int or real (Rule 2)
5. <expr>.expected_type == <expr>.actual_type is either
TRUE or FALSE (Rule 2)

The tree in Figure 3.7 shows the flow of attribute values. Solid lines are used for the
parse tree; dashed lines show attribute flow in the tree. The tree in Figure 3.8 shows
the final attribute values on the nodes. In this example, A is defined as a real and B is
defined as an int. Determining attribute evaluation order for the general case of an
attribute grammar is a complex problem, requiring the construction of a dependency
mgraph to show all attribute dependencies.
Checking the static semantic rules of a language is an essential part of all compilers.
Even if a compiler writer has never heard of an attribute grammar, he or she would
need to use their fundamental ideas to design the checks of static semantics rules for
his or her compiler.

One of the main difficulties in using an attribute grammar to describe all of the syntax
and static semantics of a real contemporary programming language is the size and
complexity of the attribute grammar. The large number of attributes and semantic
rules required for a complete programming language make such grammars difficult to
write and read. Furthermore, the attribute values on a large parse tree are costly to
evaluate. On the other hand, less formal attribute grammars are a powerful and
commonly used tool for compiler writers, who are more interested in the process of
producing a compiler than they are in formalism.
CHAPTER 5

Describing semantics
What is semantic?
In programming language theory, semantics is the rigorous mathematical study of the
meaning of programming [Link] assigns computational meaning to
valid strings in a programming language syntax.
Semantics describes the processes a computer follows when executing a program in
that specific language. This can be shown by describing the relationship between the
input and output of a program, or an explanation of how the program will be executed
on a certain platform, hence creating a model of computation.

Semantics may refer to any of the following:

1. With computer programming and programming languages, semantics refers to the


meaning of an instruction and not the syntax.

For example, if a programmer writes a function that compiles or


is interpreted successfully, but doesn't work, that would be a semantic error.

By contrast, if the programmer misspells a keyword or function name, and the


program cannot be successfully compiled or interpreted, that would be a syntax error.

2. Semantics is the study of language meanings by examining a single word, signs, or


group of text. Semantics is also a branch of linguistics that studies how meaning is
constructed, communicated, and how meaning changes over time. In linguistics,
there are

Types of semantics
 Formal semantics
 Lexical semantics
 Conceptual semantics.

Formal semantics
With formal semantics, words and meaning are examined from a philosophical or
mathematical standpoint. This branch of semantics develops models to help determine
the truth behind words instead of only examining real-world examples.

Lexical semantics
Lexical semantics are the most commonly-known type of semantics. This branch
looks for the meaning of individual words by considering the context and text
surrounding them.

Conceptual semantics
With conceptual semantics, the dictionary definition of the word is examined before
any context is applied. After examining the definition, the context is examined by
looking for connecting words, how meaning gets assigned, and how meaning may
change over time. If the word conveys context, it may be called a sign in conceptual
semantics.

Examples of semantics
Below are examples of how the semantics of different words may be interpreted by
how they're used.

 Run
The word "run" can have many different meanings. For example, "Joe enjoys
running" describes a person (Joe) who runs for exercise. However, "My computer is
running" doesn't mean the computer is running like a human; it means the computer
is operational.
 Crash
A "crash" could refer to a car accident, a drop in the Stock Market, attending a party
without being invited, or a computer problem.

 Move
"Move" can describe taking something and putting it elsewhere, pushing or pulling
something to an alternate location, or something that stirs emotion.

Dynamic Semantics
The difficult task of describing the dynamic semantics, or meaning, of the
expressions, statements, and program units of a programming language. Because of
the power and naturalness of the available notation, describing syntax is a relatively
simple matter. On the other hand, no universally accepted notation or approach has
been devised for dynamic semantics.

There are several different reasons underlying the need for a methodology and
notation for describing semantics. Programmers obviously need to know precisely
what the statements of a language do before they can use them effectively in their
programs. Compiler writers must know exactly what language constructs mean to
design implementations for them correctly. If there were a precise semantics
specification of a programming language, programs written in the language
potentially could be proven correct without testing. Also, compilers could be shown to
produce programs that exhibited exactly the behavior given in the language definition;
that is, their correctness could be verified. A complete specification of the syntax and
semantics of a programming language could be used by a tool to generate a compiler
for the language automatically. Finally, language designers, who would develop the
semantic descriptions of their languages, could in the process discover ambiguities
and inconsistencies in their designs.

Software developers and compiler designers typically determine the semantics of


programming languages by reading English explanations in language manuals.
Because such explanations are often imprecise and incomplete, this approach is
clearly unsatisfactory. Due to the lack of complete semantics specifications of
programming languages, programs are rarely proven correct without testing, and
commercial compilers are never generated automatically from language descriptions.
 Operational Semantics
The idea behind operational semantics is to describe the meaning of a statement
or program by specifying the effects of running it on a machine. The effects on
the machine are viewed as the sequence of changes in its state, where the
machine’s state is the collection of the values in its storage.

An obvious operational semantics description, then, is given by executing a


compiled version of the program on a computer. Most programmers have, on at
least one occasion, written a small test program to determine the meaning of
some programming language construct, often while learning the language.
Essentially, what such a programmer is doing is using operational semantics to
determine the meaning of the construct.

There are several problems with using this approach for complete formal
semantics descriptions. First, the individual steps in the execution of machine
language and the resulting changes to the state of the machine are too small and
too numerous. Second, the storage of a real computer is too large and complex.
There are usually several levels of memory devices, as well as connections to
enumerable other computers and memory devices through networks. Therefore,
machine languages and real computers are not used for formal operational
semantics. Rather, intermediate-level languages and interpreters for idealized
computers are designed specifically for the process.

There are different levels of uses of operational semantics. At the highest level,
the interest is in the final result of the execution of a complete program. This is
sometimes called natural operational semantics. At the lowest level,
operational semantics can be used to determine the precise meaning of a
program through an examination of the complete sequence of state changes that
occur when the program is executed. This use is sometimes called structural
operational semantics.
 Denotational Semantics
Denotational semantics is the most rigorous and most widely known formal
method for describing the meaning of programs. It is solidly based on recursive
function theory. A thorough discussion of the use of denotational semantics to
describe the semantics of programming languages is necessarily long and
complex. It is our intent to provide the reader with an introduction to the central
concepts of denotational semantics, along with a few simple examples that are
relevant to programming language specifications.
The process of constructing a denotational semantics specification for a
programming language requires one to define for each language entity both a
mathematical object and a function that maps instances of that language entity
onto instances of the mathematical object. Because the objects are rigorously
defined, they model the exact meaning of their corresponding entities. The idea
is based on the fact that there are rigorous ways of manipulating mathematical
objects but not programming language constructs. The difficulty with this
method lies in creating the objects and the mapping functions. The method is
named denotational because the mathematical objects denote the meaning of
their corresponding syntactic entities.

The mapping functions of a denotational semantics programming language


specification, like all functions in mathematics, have a domain and a range. The
domain is the collection of values that are legitimate parameters to the function;
the range is the collection of objects to which the parameters are mapped. In
denotational semantics, the domain is called the syntactic domain, because it is
syntactic structures that are mapped. The range is called the semantic domain.
Denotational semantics is related to operational semantics. In operational
semantics, programming language constructs are translated into simpler
programming language constructs, which become the basis of the meaning of
the construct. In denotational semantics, programming language constructs are
mapped to mathematical objects, either sets or, more often, functions. However,
unlike operational semantics, denotational semantics does not model the step-
by- step computational processing of programs.

Static semantics
Principles

 The program is not evaluated, i.e. the semantics don't directly guide the
semantic algorithm's flow (which follows the program's structure instead)
 Should terminate on every program
 Must not depend on external events (such as IO, system errors, etc)
 The algorithm should not need to visit the same node in the AST twice (or more
than n times, for n fixed for the algorithm)
 The output tells you something about how the program is defined, not what
it does
Application

 Type checking
 Warnings / coding suggestions
 Optimizing
 Compiling

Dynamic semantics
Principles

 The program is evaluated - this can mean for example that semantics of control
flow instructions affect the way the semantic algorithm is evaluated itself
 Can loop, halting problem can be undecidable
 Can depend on arbitrary real-world events
 It may be necessary to evaluate the same part of the AST indefinite number
of times
 The output tells you the results of the implemented

algorithm Application

 Well, running programs


 Different dynamic semantics can be used to describe runtime in different
environments (such as test / prod / 1987 washing machine chip)

Static and dynamic difference


Static and Dynamic Semantics Syntax concerns the form of a valid program, while
semantics concerns its meaning Static semantic rules are enforced by a compiler at
compile time Implemented in semantic analysis phase of the compiler Context-free
grammars are not powerful enough to describe certain rules, such as checking variable
declaration with variable use Examples: Type checking; Identifiers are used in
appropriate context; Check subroutine call arguments; Check labels Dynamic
semantic rules are enforced by the compiler by generating code to perform the checks
Examples: Array subscript values are within bounds; Arithmetic errors; Pointers are
not dereferenced unless pointing to valid object; A variable is used but hasn’t been
initialized Some languages (Euclid, Eiffel) allow programmers to add explicit
dynamic
semantic checks in the form of assertions, e.g. assert denominator not= 0 When a
check fails at run time, an exception is raised

Chapter 6
Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes modified source code from
language preprocessors that are written in the form of sentences. The lexical analyzer
breaks these syntaxes into a series of tokens, by removing any whitespace or
comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer
works closely with the syntax analyzer. It reads character streams from the source
code, checks for legal tokens, and passes the data to the syntax analyzer when it
demands.

Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are
some predefined rules for every lexeme to be identified as a valid token. These rules
are defined by grammar rules, by means of a pattern. A pattern explains what can be a
token, and these patterns are defined by means of regular expressions.
In programming language, keywords, constants, identifiers, strings, numbers,
operators and punctuations symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is a
set of English language alphabets.
Strings
Any finite sequence of alphabets (characters) is called a string. Length of the string is
the total number of occurrence of alphabets, e.g., the length of the string tutorialspoint
is 14 and is denoted by |tutorialspoint| = 14. A string having no alphabets, i.e. a string
of zero length is known as an empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains the following symbols:-

Arithmetic Symbols Addition(+), Subtraction(-), Modulo(%),


Multiplication(*), Division(/)

Punctuation Comma(,), Semicolon(;), Dot(.), Arrow(->)

Assignment =

Special Assignment +=, /=, *=, -=

Comparison ==, !=, <, <=, >, >=

Preprocessor #

Location Specifier &

Logical &, &&, |, ||, !

Shift Operator >>, >>>, <<, <<<

Language
A language is considered as a finite set of strings over some finite set of alphabets.
Computer languages are considered as finite sets, and mathematically set operations
can be performed on them. Finite languages can be described by means of regular
expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid
string/token/lexeme that belong to the language in hand. It searches for the pattern
defined by the language rules.
Regular expressions have the capability to express finite languages by defining a
pattern for finite strings of symbols. The grammar defined by regular expressions is
known as regular grammar. The language defined by regular grammar is known
as regular language.
Regular expression is an important notation for specifying patterns. Each pattern
matches a set of strings, so regular expressions serve as names for a set of strings.
Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition. Regular
languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which
can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
 Union of two languages L and M is written
as L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written
as LM = {st | s is in L and t is in M}
 The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
 Union : (r)|(s) is a regular expression denoting L(r) U L(s)
 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
 Kleene closure : (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)
Precedence and Associativity
 *, concatenation (.), and | (pipe sign) are left associative
 * has the highest precedence
Concatenation (.) has the second highest precedence.
 | (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrence of x.
i.e., it can generate { e, x, xx, xxx, xxxx, … } x+
means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x* x?
means at most one occurrence of x
i.e., it can generate either {x} or {e}.
[a-z] is all lower-case alphabets of English language. [A-
Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Representation occurrence of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representation of language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted
solution is to use finite automata for verification.
Finite Automata
Finite automata is a state machine that takes a string of symbols as input and changes its
state accordingly. Finite automata is a recognizer for regular expressions. When a
regular expression string is fed into finite automata, it changes its state for each literal.
If the input string is successfully processed and the automata reaches its final state, it
is accepted, i.e., the string just fed was said to be a valid token of the language in
hand.
The mathematical model of finite automata consists of:
 Finite set of states (Q)
 Finite set of input symbols (Σ)
 One Start state (q0)
 Set of final states (qf)
 Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a finite set of input
symbols (Σ), Q × Σ ➔ Q
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
 States : States of FA are represented by circles. State names are of the state is
written inside the circle.
 Start state : The state from where the automata starts, is known as start state.
Start state has an arrow pointed towards it.
 Intermediate states : All intermediate states has at least two arrows; one
pointing to and another pointing out from them.
 Final state : If the input string is successfully parsed, the automata is expected
to be in this state. Final state is represented by double circles. It may have any
odd number of arrows pointing to it and even number of arrows pointing out
from it. The number of odd arrows are one greater than even, i.e. odd =
even+1.
 Transition : The transition from one state to another state happens when a
desired symbol in the input is found. Upon transition, automata can either move
to next state or stay in the same state. Movement from one state to another is
shown as a directed arrow, where the arrows points to the destination state. If
automata stays on the same state, an arrow pointing from a state to itself is
drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1. FA =
{Q(q0, qf), Σ(0,1), q0, qf, δ}

Longest Match Rule


When the lexical analyzer read the source-code, it scans the code letter by letter; and
when a whitespace, operator symbol, or special symbols occurs, it decides that a word
is completed.
For example:
int intvalue;
While scanning both lexemes till ‘int’, the lexical analyzer cannot determine whether
it is a keyword int or the initials of identifier int value.
The Longest Match Rule states that the lexeme scanned should be determined based
on the longest match among all the tokens available.
The lexical analyzer also follows rule priority where a reserved word, e.g., a
keyword, of a language is given priority over user input. That is, if the lexical
analyzer finds a lexeme that matches with any existing reserved word, it should
generate an error.

What is Lexical Analysis?


Lexical Analysis is the very first phase in the compiler designing. A Lexer takes the
modified source code which is written in the form of sentences . In other words, it
helps you to convert a sequence of characters into a sequence of tokens. The lexical
analyzer breaks this syntax into a series of tokens. It removes any extra space or
comment written in the source code.
Programs that perform Lexical Analysis in compiler design are called lexical
analyzers or lexers. A lexer contains tokenizer or scanner. If the lexical analyzer
detects that the token is invalid, it generates an error. The role of Lexical Analyzer in
compiler design is to read character streams from the source code, check for legal
tokens, and pass the data to the syntax analyzer when it demands.

Example
How Pleasant Is The Weather?
See this Lexical Analysis example; Here, we can easily recognize that there are five
words How Pleasant, The, Weather, Is. This is very natural for us as we can recognize
the separators, blanks, and the punctuation symbol.

HowPl easantIs Th ewe ather?


Now, check this example, we can also read this. However, it will take some time
because separators are put in the Odd Places. It is not something which comes to you
immediately.
Basic Terminologies
What’s a lexeme?
A lexeme is a sequence of characters that are included in the source program
according to the matching pattern of a token. It is nothing but an instance of a token.

What’s a token?
Tokens in compiler design are the sequence of characters which represents a unit of
information in the source program.

What is Pattern?
A pattern is a description which is used by the token. In the case of a keyword which
uses as a token, the pattern is a sequence of characters.

Lexical Analyzer Architecture: How tokens are recognized


The main task of lexical analysis is to read input characters in the code and produce
tokens.

Lexical analyzer scans the entire source code of the program. It identifies each token
one by one. Scanners are usually implemented to produce tokens only when requested
by a parser. Here is how recognition of tokens in compiler design works-

Lexical Analyzer Architecture


1. “Get next token” is a command which is sent from the parser to the
lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it
finds the next token.
3. It returns the token to Parser.

Lexical Analyzer skips whitespaces and comments while creating these tokens. If
any error is present, then Lexical analyzer will correlate that error with the source file
and line number.

Roles of the Lexical analyzer


Lexical analyzer performs below given tasks:

 Helps to identify token into the symbol table


 Removes white spaces and comments from the source program
 Correlates error messages with the source program
 Helps you to expands the macros if it is found in the source program
 Read input characters from the source program

Example of Lexical Analysis, Tokens, Non-Tokens


Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
Examples of Tokens created
Lexeme Token

int Keyword

maximum Identifier
( Operator

int Keyword

x Identifier

, Operator

int Keyword

Y Identifier

) Operator

{ Operator

If Keyword

Examples of Nontokens
Type Examples

Comment // This will compare 2 numbers

Pre-processor directive #include <stdio.h>

Pre-processor directive #define NUMS 8,9

Macro NUMS

Whitespace /n /b /t

Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical
error. Important facts about the lexical error:

 Lexical errors are not very common, but it should be managed by a scanner
 Misspelling of identifiers, operators, keyword are considered as lexical errors
 Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.

Error Recovery in Lexical Analyzer


Here, are a few most common error recovery techniques:

 Removes one character from the remaining input


 In the panic mode, the successive characters are always ignored until we
reach a well-formed token
 By inserting the missing character into the remaining input
 Replace a character with another character
 Transpose two serial characters

Lexical Analyzer vs. Parser


Lexical Analyser Parser

Scan Input program Perform syntax analysis

Create an abstract representation of the


Identify Tokens
code

Insert tokens into Symbol


Update symbol table entries
Table

It generates a parse tree of the source


It generates lexical errors
code

Why separate Lexical and Parser?

 The simplicity of design: It eases the process of lexical analysis and the
syntax analysis by eliminating unwanted tokens
 To improve compiler efficiency: Helps you to improve compiler efficiency
 Specialization: specialized techniques can be applied to improves the
lexical analysis process
 Portability: only the scanner requires to communicate with the outside world
 Higher portability: input-device-specific peculiarities restricted to the lexer
Advantages of Lexical analysis

 Lexical analyzer method is used by programs like compilers which can use
the parsed data from a programmer’s code to create a compiled binary
executable code
 It is used by web browsers to format and display a web page with the help
of parsed data from JavsScript, HTML, CSS
 A separate lexical analyzer helps you to construct a specialized and potentially
more efficient processor for the task

Disadvantage of Lexical analysis

 You need to spend significant time reading the source program and partitioning
it in the form of tokens
 Some regular expressions are quite difficult to understand compared to PEG
or EBNF rules
 More effort is needed to develop and debug the lexer and its token descriptions
 Additional runtime overhead is required to generate the lexer tables
and construct the tokens

Summary

 Lexical analysis is the very first phase in the compiler design


 Lexemes and Tokens are the sequence of characters that are included in the
source program according to the matching pattern of a token
 Lexical analyzer is implemented to scan the entire source code of the program
 Lexical analyzer helps to identify token into the symbol table
 A character sequence which is not possible to scan into any valid token is a
lexical error
 Removes one character from the remaining input is useful Error
recovery method
 Lexical Analyser scan the input program while parser perform syntax analysis
 It eases the process of lexical analysis and the syntax analysis by eliminating
unwanted tokens
 Lexical analyzer is used by web browsers to format and display a web
page with the help of parsed data from JavsScript, HTML, CSS
 The biggest drawback of using Lexical analyzer is that it needs additional
runtime overhead is required to generate the lexer tables and construct
the tokens
CHAPTER 7
PARSING
Parsing is performed at the syntax analysis phase where a stream of tokens is taken
as input from the lexical analyzer and the parser produces the parser tree for the tokens
while checking the stream of tokens against the syntax errors.
Role of Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by
the lexical analyzer are grouped according to the syntactic rules of the language. This
is done by a parser. The parser obtains a string of tokens from the lexical analyzer and
verifies that the string can be the grammar for the source language. It detects and
reports any syntax errors and produces a parse tree from which intermediate code can
be generated.

Types of Parsing
The parsing is divided into two types, which are as follows:
1. Top-down Parsing
2. Bottom-up Parsing

Top-Down Parsing

Top-down parsing attempts to build the parse tree from the root node to the leaf node.
The top-down parser will start from the start symbol and proceed to the string. It
follows the leftmost derivation. In leftmost derivation, the leftmost non-terminal in
each sentential is always chosen.
1. Recursive parsing or predictive parsing are other names for top-down parsing.
2. A parse tree is built for an input string using bottom-up parsing.
3. When parsing is done top-down, the input symbol is first transformed into the
start symbol.
The top-down parsing is further categorized as follows:
1. With Backtracking:
 Brute Force Technique
2. Without Backtracking:
 Recursive Descent Parsing
 Predictive Parsing or Non-Recursive Parsing or LL(1) Parsing or Table Driver
Parsing

Bottom-up Parsing

Bottom-up parsing builds the parse tree from the leaf node to the root node. The
bottom-up parsing will reduce the input string to the start symbol. It traces the
rightmost derivation of the string in reverse. Bottom-up parsers are also known as
shift-reduce parsers.
1. Shift-reduce parsing is another name for bottom-up parsing.
2. A parse tree is built for an input string using bottom-up parsing.
3. When parsing from the bottom up, the process begins with the input symbol and
builds the parse tree up to the start symbol by reversing the rightmost string
derivations.
Generally, bottom-up parsing is categorized into the following types:
1. LR parsing/Shift Reduce Parsing: Shift reduce Parsing is a process of parsing a
string to obtain the start symbol of the grammar.
 LR(0)
 SLR(1)
 LALR
 CLR
2. Operator Precedence Parsing: The grammar defined using operator grammar is
known as operator precedence parsing. In operator precedence parsing there should
be no null production and two non-terminals should not be adjacent to each other.
1 Top down parsing

We have defined the language of a CFG in a “generative” fashion: the language is


the set ofstrings which can be derived from a sentence symbol via a set of
rules. Parsing inverts the process of generating a string. The parsing problem
is, given a CFG and a terminal string,to find a derivation of the string, or report
that none exists.

There are two important styles of parsing: top-down and bottom-up. In actuality, very
many parsing algorithms have been proposed, not all of which fit neatly into one of
these categories. But the top-down/bottom-up distinction fits the parsing
algorithms in CS143 very well, andis very helpful for understanding parsing
algorithms.

Top-down parsing can be thought of as expanding a parse tree or derivation from the
sentence symbol until the frontier matches the desired input string (or until it
becomes clear that there is no way to make them match). Obviously, there are in
general an infinite number of trees that can be expanded. The tree that is expanded
depends on which productions areused to replace the nonterminals in the frontier
of the tree at each step of the [Link] differences among top-down parsing
methods are in the methods used to choose this production.

Top-down parsing by guessing

To introduce the idea of top-down parsing, let’s temporarily disregard the exact
method useto pick the next production to expand. Instead, we’ll do it by
making a “lucky guess” at each step about which production to expand. Later,
instead of guessing, we’ll look up what to do in a table.

At any point during the parse, the state of the parser is characterized by two items: the
remaining input and the stack. Once you know the values of these variables, you
know what will happen with the rest of the parse. The input contains a sequence of
terminal symbols.
Its initial value is the input string to be parsed, x, with the leftmost symbol
first. The stack contains terminal and nonterminal symbols. Initially, the stack
contains one symbol, which is S, the sentence symbol of the CFG.

Intuitively, the stack stores a partially expanded parse tree. The top of the stack
is a prediction about what the parser will see next. If the top symbol is a
terminal, it must match the next symbol in the input stream. If the top symbol
is a nonterminal, there must be some way to expand it to a string that matches
the beginning of the input (top-down parsing is sometimes called “predictive
parsing”).

Reflecting the discussion of the previous paragraph, there are two basic actions
that occur during parsing. When there is a terminal symbol on top of the stack,
it is matched with the next symbol on the input. Matching compares the two
symbols; if they are not equal, the input string is rejected, meaning that the
input string is not in the language of the CFG. If the symbols match, the top
symbol is popped off of the stack an→d the input symbol is removed from the
front
of the input. When there is a nonterminal A on top of the stack, it is expanded by
choosing a production A α from the CFG, popping A from the stack, and
pushing the symbols in α, rightmost first, onto the stack (so that the leftmost
symbol of α ends up on top of the stack). (Note: if α = ϵ, expanding has the
effect of simply popping A off of the stack.)

Example

Let’s parse the input “aab” using the CFG:

S → AS |
BA → a
B → b
Here are the steps of the parse. $ represents “end of file” and “bottom of
stack.” The actionsexplain how we get from one step to the next. The top of the
stack is drawn on the left. Whether we expand or match is determined by
whether the top symbol is a nonterminal orterminal; if it’s a nonterminal, we
have to guess which production with that symbol on the left-hand side should
be expanded. At the end of the parse, we accept the input string if theinput and
stack are both empty. If there is a mismatch, we reject
parse (top) stack action
aab$ S$ expand S → AS
aab$ AS$ expand A → a
aab$ aS$ match
ab$ S$ expand S → AS
ab$ AS$ expand A → a
match
ab$ aS$
expand S → B
b$ S$
b$ B$ expand B→ b
b$ b$ match
$ $ accept

If this parse algorithm accepts, we can be sure that the input is in the language
of the CFG. If the parse does not accept, the string may or may not be in the
language. However, if we assume that the best possible guess is made at each
step, the parsing algorithm will accept if it is possible to do so, and does not
accept exactly when the input is not in the language.

A derivation and parse tree can be reconstructed from the history of expansions in
the parse. The derivation in this case is
S =⇒ AS =⇒ aS =⇒ aAS =⇒ aaS =⇒ aaB =⇒ aab.

This is a leftmost derivation because we always expanded the top symbol on the
stack, whichwas the leftmost symbol of the string derived up to that point.

LL(1) parsing

If a computer could make lucky guesses with 100% reliability, as we have


assumed above, parsing would be one of the less important applications.
Instead, it is possible to construct a table where the proper action can be looked
up, based on the symbols on the top of the stack and at the beginning of the
input. This algorithm is called LL(1) parsing (for “leftmost (parse) lookahead 1
symbol”). LL(1) parsing is almost the simplest top-down parsing scheme one
could imagine. In essence, it says: “don’t choose a production unless it at least
has a chance of matching the next input symbol.”

LL(1) parsing is very efficient (the running time is linear in the length of the
input string,with a very low constant factor). There is a tradeoff, however. Not
all CFGs can be handled by LL(1) parsing. If a CFG cannot be handled, we
say it is “not LL(1).” There is an LL(1) parse table generation algorithm that
either successfully builds a parse table (if the CFG is LL(1)) or reports that the
CFG is not LL(1) and why.

The LL(1) parse table construction is somewhat involved, so it will take a little
while to go through all the steps. Before doing so, we first describe a
transformation that increases thechances that a CFG will be LL(1).

 Removing left recursion

A CFG is said to be left recursive if there exists a nonterminal A ⇒s u c h that A


=+ Aα (i.e., Acan expand to a string beginning with A). Left recursion in a CFG
is fatal for LL(1) parsing, and for most other top-down parsing algorithms. To
see why, i m a gi→ne that the parse table says to expand A β when a is at the
beginning of the input. The parse algorithm will go through a sequence of
expand steps until it has Aα on the top of the stack, without matchingany
inputs. But now we have A on top of the stack and a at the beginning of the
input, so we’ll do exactly the same thing ad infinitum. (LL(k) algorithms
generalize LL(1) parsing to use the first k input symbols to decide what to
expand. Although more powerful than LL(1) parsing when k > 1, the same
argument shows that they are also unable to deal with left recursion.)

Fortunately, there is an algorithm to eliminate left recursion from any CFG,


without changingits language. The general transformation is fairly difficult and
not really practical, so we’ll consider the important special case of removing
immediate left recursion, which is when there is a production of the form A →
Aα in the grammar.
Suppose we have a producti→on A Aα in our CFG. First, collect all of the
productions having A on the left-hand side, and combine them into a single
production using EBNF notation of the form A → Aα | β. For example,
suppose the productions are

A → Aα1
A → Aα2
A → β1
A → β2

The EBNF production is A → A(α1 | α2) | (β1 | β2).


A production of the form A → Aα | β can be written equivalently and non-
recursively as
A → βα∗ (to see this, expand A repeatedly and notice the pattern: A =⇒ Aα =⇒
Aαα =⇒
. . . =⇒ β . . . αα).
All that remains is to convert this back to a (non-EBNF) CFG. Here we exploit
the earlier observation that α∗ can be expanded left recursively or right
recursively:
A →
βB B →
ϵ
B → αB

If β = β1 | β2 and α = α1 | α2 as in the example above, this finally becomes

A →
β1B A

β2B B →
ϵ
B →
α 1B B →
α 2B

Example

Consider the unambiguous expression grammar of the previous lecture, which had
the left- recursive productions

E → E+T
E →T
which can be rewritten as E E + T T . In turn, this is E T ( + T )∗, which can
be → | →
re-expanded right recu rsively i nto

E → TA
A → + TA A → ϵ
IMPORTANT NOTE: A real understanding of this method requires a hands- on
[Link] generating some strings from each set of productions and see why
they do the same thing. Try working through this on the whole grammar. Try
making up some other grammars and trying it. Otherwise, you won’t learn it!

Left factoring

A CFG has common left factors if the right-hand sides of two productions with
the same left-hand symbol start with the same symbol. The presence of
common left factors CFGs sabotages LL(1) parsing.

Suppose we have productions

A → αβ A
→ αγ
We can convert these to EBNF, also: A → α(β | γ), and convert back:

A →
αBB
→β
B → γ

Example

Suppose we had a CFG with productions

E → T+E
E →T

This converts to E → T ( + E | ϵ), which converts back to

E → TA A → +EA → ϵ

A CFG may still not be LL(1) even after eliminating left recursion and left
factors, but it
certainly will not be LL(1) if they are not eliminated.

Unfortunately, while these transformations preserve the language of a CFG, they


do not preserve the structure of the parse tree. The structure can be recovered from
the parse with some trouble, but it is a distinct disadvantage of LL(1) parsing that
the original grammar must be rewritten into a form that is probably not as natural
as the original.

LL(1) parsing example

Before discussing in detail the construction the LL(1) parse tables, let’s seen an
example of how the parsing algorithm uses one. This is the grammar that
results when left recursion is removed from the unambiguous expression
grammar, as described above. We will call this our LL(1) expression grammar.
It will be referred to frequently below.

E → TA
A → + TA |
ϵ T→ FB
B → ∗FB |
ϵF → (E) | a

Here is the LL(1) parse table for the grammar:

a + ∗
( ) $
E E → E →
TA TA
T T → T →
FB FB
F F →a F →
(E)
A A → + A → A →
TA ϵ ϵ
B B→ϵ B → B → B →
∗FB ϵ ϵ

The parsing algorithm is as before, except that when the top symbol on the stack is
a nonterminal, we don’t guess. Instead, we look in the table, in the row for the top-
of-stack symbol and column for the next input symbol (which can be $ if the entire
input has been consumed). The table entry uniquely determines which production
to expand.
Here is what happens when we parse the string a + a ∗ a using this table.

(top) stack parse action


E$ a+a*a$ expand E TA
TA$ a+a*a$ expand → T
FB FBA$ →
a+a*a$ expand F a
aBA$ a+a*a$ match
BA$ +a*a$ expand B→ ϵ
A$ +a*a$ expand A→ +TA
+TA$ +a*a$ match
TA$ a*a$ expand → T
FB FBA$ →
a*a$ expand F a
aBA$ a*a$ match
BA$ *a$ expand B→ ∗ FB
*FBA$ *a$ match
FBA$ a$ expand F→
a aBA$ a$ match
BA$ $ expand B→ ϵ
A$ $ expand A→ ϵ
$ $ accept
LL(1) Parse Table Construction

The basic idea behind the LL(1) parse table is to find the set of terminal symbols
that can appear at the beginning of a string expanded from a production. If the
nonterminal on top of the stack is A and the next input is a, a→nd there a→re
two productio→ns A α and A β, we choose A α if α can expand to
something
beginning with a. The choice must be unique; if β also expands to a string
beginning with a, the grammar is not LL(1) (LL(1) parsing won’t work, so we
either have to change the grammar or use a different parsing algorithm).

Constructing the LL(1) parse table requires some preliminary definitions and
[Link] first is the set of nullable nonterminals. To compute them,
we need the more general concept of a nullable string:

Definition 8 A string α in (V ∪ Σ)∗ is nullable if α =⇒∗ ϵ.

We would like to determine the set of nullable symbols in a context-free


grammar. Instead of giving an algorithm to compute these sets, we give a set of
rules that can be applied iteratively to compute them. The order of application
of the rules does not matter, so long as each rule is eventually applied
whenever it can change the computed function (and as longis the rule is not
applied to unnecessary strings). The rules for nullable strings compute a
Boolean function Nullable. The domain of the function is all the strings
consisting of single nonterminal symbols, or right-hand-sides of productions.
Initially, we assume that Nullable is false for all strings; the rules can be
applied to set it to true for some strings, whereupon other rules can set it to true
for other strings. The process terminates when no rule can change Nullable
from false to true for any string.

1. Nullable(X1X2 . . . Xn) = true if, for all 1 ≤ i ≤ n, Nullable(Xi).


2. Nullable(A) = true if there is a production in the CFG A → α and
Nullable(α).

In particular, rule 1 implies that ϵ is nullable, since f≤o r a≤l l i i n


applies to no i
whatsoever, so all (zero) symbols are

[Link] is an example:

S → ABC
A → ϵ
B → ϵ
C → AB

Initially, Nullable is false for everything. Then

Nullable(ϵ) = true rule 1 (X1X2 . . . Xn = ϵ)


Nullable(A) = true rule 2 ( A→
ϵ) Nullable(B) = true rule 2
( B→ ϵ) Nullable(AB) = true
rule 1
Nullable(C) = true rule 2
AB)
(C→
Nullable(ABC) = true rule 1
Nullable(S) = true rule 2 (S → ABC)

In this case, everything is nullable, which is unusual.

By a much easier computation, the nullable symbols in the LL(1) expression


grammar are
A and B.

The next function that needs to be computed is the FNE set. FNE stands for
“First, no ϵ”. It is unique to this course, but I think it is easier to understand
than the standard approach, which I will also describe. FNE (α) is the set of
terminal symbols which can appear at the beginning of a terminal string
derived from α.

Definition 9 If α is a string in (V ∪ Σ)∗, then FNE (α) is


{a | a ∈ Σ ∧ α =⇒∗ ax, for some x ∈ Σ∗.}.
(Note that since we’ve assumed there are no useless symbols in the CFG being
analyzed, thedefinition would be equivalent if we relaxed the requirement that
x be a terminal string.)

The domain of FNE consists of the set of all nonterminals and suffixes (i.e.,
tails) of strings appearing on the right-hand sides of productions, but the
codomain of FNE consists of the subsets of Σ (i.e., FNE (α) is a set of
terminal symbols). As with Nullable, we start witha “small” initial value for
FNE , the function that maps every string to the empty set, andapply a set of
rules in no particular order until no rules can change FNE .

Rule 2 below is partially obvious: the FNE set of a string X1X2 . . . Xn always
includes the
FNE of X1. However, it may include the FNE of X2 and later symbols,
also: If X1 is
nullable, there is a derivation X1X2 . . . Xn =⇒∗ ϵa . . . = a . . ., where X1
“expands” into the
empty string, so the first terminal symbol actually derives from X2. However,
even if X1 is nullable, it may end up contributing terminals to the FNE of X1X2 .
. . Xn because there may also be derivations where X1 expands to a non-empty
string (nullable symbols may expand to ϵ, but they may expand to other strings
as well). Clearly, if X1 . . . Xk are all nullable, the first terminal may come from
Xk+1. The first rule also implies that FNE (ϵ) = ∅.

The justification for rule 3 below is simple:


If α =⇒∗ ax, and A → α is a production in
the
CFG, then A =⇒ α =⇒∗ ax, so a ∈ FNE (A).

1. FNE (a) = {a}


2. FNE (X1X2 . . . Xn) = if Nullable(X1)
then FNE (X1) ∪ FNE (X2 . . . Xn)
else FNE (X1).
3. FNE (A) = ∪{FNE (α) | A → α ∈ P }
Let’s use these rules to compute the FNE sets for our LL(1) expression CFG.
These are purposely done in a suboptimal order to show that rules sometimes
have to be used on thesame string multiple times before the correct answer is
reached.

String Added Reason


+ + rule 1
∗ ∗ rule 1
( ( rule 1
a a rule 1
+ TA + rule 2
A + rule 3
∗FB ∗ rule 2
B rule 3 ∗

F a rule 3
FB a rule 2
T a rule 3
TA a rule 2
(E) ( rule 2
F ( rule 3
FB ( rule 2
T ( rule 3
TA ( rule 2
E a, ( rule 3

Note: You should work through these in detail to see how the rules work, and try
anotherorder to see if you get the same result.
The resulting FNE sets for the nonterminals in the grammar are:

E { a, ( }
T { a, ( }
F { a, ( }
A {+ }
B {* }
Note: You should check the definition of FNE above and re-inspect the LL(1)
expressiongrammar to see why this makes sense.

This CFG didn’t use rule 2 in its full generality. Here is a simple example
that does.

S → ABC
A → aA |
ϵB → b | ϵ
C → c|d

Note that A and B are nullable. Applying the rules gives:

Strin Added Reaso


g n
a a rule 1
b b rule 1
c c rule 1
d d rule 1
C c, d rule 3
B b rule 2
aA a rule 2
A a rule 3
ABC a, b, c, d rule 2
S a, b, c, d rule 3

FNE (ABC) includes terminals from A but also from B (because A is nullable)
and C (because B is nullable). Suggestion: Show for each of member of FNE
(S) that there is a way to derive a string from S beginning with that
terminal.

I will also describe the “standard” approach that appears in all textbooks on
the subject,in case you need to talk to someone else about this material who
didn’t take this class. The standard approach is not to define FNE sets, but
FIRST sets. The only difference is that FIRST (α) includes ϵ if α is nullable.
Rules for computing FIRST can be given that are very similar to the rules for
FNE. FIRST sets are defined the way they are because the concept generalizes
to LL(k) parse table construction for k > 1. But we don’t care about that, and
inclusion of ϵ seems to lead to confusion, so we use FNE sets, instead.

The last set that needs to be defined before constructing the LL(1) parse table is
the FOL- LOW set for each nonterminal. The FOLLOW sets are only used for
nullable productions. Recall that the LL(1) parse table selects a production
based on the nonterminal on top of the stack and the next input symbol. We
want to choose a production that is consistent with
the next input. FNE sets tell us what we want to know if the next terminal is
going to be derived from the top nonterminal on the stack (we just pick the
production whose right-handside has the next input in it’s FNE set). But,
suppose the nonterminal on top of the stack “expands” to ϵ! Then the next
nonterminal is going to come from some symbol after that nonterminal. The
FOLLOW sets say exactly which terminals can occur immediately aftera
nonterminal.

Definition 10 FOLLOW (A) = {a | S$ =⇒∗


and a ∈ (Σ ∪ {$}).
αAaβ}, for some α ∈ (V ∪ Σ)∗, β ∈ (V ∪ Σ)∗,

This definition derives strings from S$ because we want $ to be in the


FOLLOW set of a nonterminal when the next input can be the end-of-file
marker. The domain of FOLLOWis just the set of nonterminals (not strings, in
contrast to the previous functions). The FOLLOW sets are computed in the
same style as Nullable and FNE. Initially, the FOLLOW sets are all empty,
then the following rules are applied in any order until no more changes are
possible.

1. $ ∈ FOLLOW (S)

2. FOLLOW (B) ⊇ FNE (β) when A → αBβ appears in the CFG.

3. FOLLOW (B) ⊇ FOLLOW (A) when A → αBβ appears in the CFG and
β is nullable.

Rule 1 is justified since S derive a complete input string, which will be


followed by $. Rule2 looks for productions where B is followed by something
that can have a at the beginning (so a immediately follows whatever B
expands to).
Rule 3 deals with a more subtle case.
If a is in FOLLOW (A), and β is nullable, the follow derivation exists: S =⇒∗
. . . Aa . . .
=⇒
. . . αBβa . . . =⇒∗ . . . αBa . . ., so a is in FOLLOW (B), too.

Let’s compute the FOLLOW sets for the LL(1) expression grammar.

String Added
Reason E
$ rule 1
E ) rule 2 (F → (E))
T + rule 2 (E → TA)
F * rule 2 (T → FB)
A $,) rule 3 (E → TA, ϵ
nullable) T $,) rule 3 (E → TA,
A nullable)B +,$,) rule 3 (T → FB,
ϵ nullable) F +,$,) rule 3 (T → FB,
B nullable)
The resulting FOLLOW sets are:

E { $, ) }
T { +, $, ) }
F { ∗, +, $, ) }
A { $, ) }
B { +, $, ) }

The LL(1) parse table is a two-dimensional array, which we will call Table.
The rows are indexed by nonterminals (for what is on the top of the stack) and
the columns by terminals (the next input symbol). Given A on the top of the
stack and a next in the input, we needto choose a production A → α to expand.
Table[A, a] should be A → α only if there is someway to expand A → α that
can match a next in the input.
There are two ways that expanding A → α can expand to something that matches
a. One way is if α expands to something beginning with a, so we set Table[A, a]
= A → α whenever a ∈ FNE (α). The other way is if α expands to ϵ, and
some symbol after α expands to a string beginning with a, so we also set
Table[A, a] = A → α when α is nullable and a ∈ FOLLOW (A).
The LL(1) parse table for the LL(1) expression grammar appears in full,
above, but let’s look at a few examples. E → TA appears in Table[E, a]
because a ∈ FNE (TA). The only productions with nullable right-hand
sides in this grammar are A → ϵ and B → ϵ. Table[B, +] = B → ϵ because +
∈ FOLLOW (B).

There is one more extremely important point to make about the LL(1) parse
table: If the rules above set Table[A, a] to two different productions, the parse table
construction fails. In this case, the CFG is not LL(1). LL(1) parsing demands
that we be able to choose exactlythe next production to expand based solely on
whether it is consistent with the next input symbol. So the LL(1) parse table
construction not only builds parse tables, it is a test for whether a CFG is LL(1)
or not.

What about the entries that have nothing in them? They are error entries: if the
parser ever looks at them, the input string can be rejected immediately. There
is no way to expandanything that can match the next input.

In spite of its limitations, LL(1) parsing is one of the two most widely used parsing
algorithms. The parsers can be built automatically, and the parsing algorithm is
easy to understand. It is usually simple to do processing associated with
individual grammar rules, during the parsing process (this is called syntax-
directed translation). Furthermore, it can also be used as the basis for simple
hand-written parsers, which allow a great deal of flexibility in handling special
cases and in error recovery and reporting (see below). However, LL(1) parsing
has
some limitations that can be annoying. One of them is that it is not as general
as the other common parsing algorithm, so there are some common
grammatical constructs it cannot handle. Another is the requirement to
eliminate left recursion and left factors, which can require rearranging a
grammar in ways that make it less clear.
LL(1) parsing and recursive descent

LL(1) parsing can be used with an automatic parser generator. It is also


appropriate asa basis for a simple hand-written parsing style, called recursive
descent. The idea behind recursive descent parsing is to write a collection of
recursive functions, one associated with each nonterminal symbol in the
grammar. The function for a nonterminal is responsible for reading and parsing
the part of the input string that that nonterminal expands to.

The parsing function for B in our LL(1) expression grammar might

be: int parse_B() {


next = peektok(); /* look at next input, but don’t remove
it */if (next == ’*’) {
gettok(); /* remove ’*’ from input */
parse_F(); /* should check error returns for
these, */ parse_B(); /* but I want to keep the
code short */ return 1; /* successfully parsed
B */
}
else if ((next == ’+’) || (next == ’)’) || (next ==
EOF)) { return 1; /* successfully parsed B */
}
else {
error("got %s, but expected *, +, ) or EOF while parsing B\
n", next); return 0;
}
}

Recursive descent parsing is very widely used, because it requires no special parser
generation tools, it can be extended in ad hoc ways (for example, looking ahead to
several inputs, or looking at other context, when the next input does not uniquely
determine the production choice), and it allows the user great freedom in
generating error messages and doing error recovery.

A looser style allows parsing of EBNF directly. For example:


int parse_T() {
parse_F(); /* again, I should check the return
code */ while ((next = peektok()) == ’*’) {
gettok(); /* remove ’*’ from
input */ parse_F();
}
}

While substantially different from the previous code, this is still basically the
same thing (it chooses whether to parse F or not at each point based on one
lookahead symbol.

2 Bottom-up parsing

The other major class of parsing methods are the bottom-up algorithms. As the
name suggests, bottom-up methods work by building a parse tree from the leaves
up. This involves reading the input until the right-hand side of a production is
recognized, then reducing the production instead of expanding productions until
they match the input.

Shift-reduce parsing algorithms

The bottom-up algorithms we will study are all shift-reduce algorithms. Shift-
reduce parsinguses the same data structures as LL parsing: a stack and the
input stream. However, the stack is used in a different way. Terminal symbols
are shifted onto the stack, that is, a symbol is removed from the beginning of
the input and pushed onto the stack. If a sequenceof symbols on the top of the
stack matches the right-hand side of some production, they can be reduced by
popping them and then pushing the symbol from the left-hand side of the same
production. The parse is successful when the entire input has been shifted on
the stack and reduced to the sentence symbol of the grammar.

As with LL parsing, there are choices to be made during this algorithm: there
may be several productions whose right-hand sides match the stack at any time.
Which one shouldbe reduced? As with LL parsing, the choice is made by doing a
table lookup with information from the current state of the parse.

To show how the basic parsing algorithm works, we do a simple example


where we hide thedetails of the parse table and, instead, make the choices by
guessing. Consider the simple CFG:
S → (S)
S → a
Here is the sequence of parser configurations that happens when parsing “((a)).”
The topof stack will be on the right to make the shifting more obvious.
Stack (top) input action
$ ((a))$ shift
$( (a))$ shift
$(( a))$ shift
$((a ))$ reduce S → a
$((S ))$ shift
$((S) )$ reduce S→ (S)
$(S )$ shift
$(S) $ reduce S→ (S)
$S $ accept

We can extract a derivation and parse tree from a shift-reduce parse.


⇒ The ⇒ sequenc⇒e
of reductions represents the reverse of a derivation. In this case, it is S =
(S) = ((S)) =
((a)). Although the example grammar doesn’t show it, it is also a rightmost
derivation. Tosee this, consider the CFG:

S → AB
A → a
B → b

The only string in the language is ab. Here is a parse:

Stack (top) input action


$ ab$ shift
$a b$ reduce A → a
$A b$ shift
$Ab $ reduce B→ b
$AB $ reduce S→ AB
$S $ accept
Although A → a is reduced before B → b, the derivation is reversed, so we
end up with
S =⇒ AB =⇒ Ab =⇒ ab, a rightmost derivation
LR(0) parsing

The first shift-reduce parsing algorithm we will discuss is LR(0) parsing. It is


an instanceof LR(k) parsing (which stands for “left-to-right (parsing),
rightmost (derivation) (with lookahead) of k symbols”). LR(0) parsing is pretty
much useless as a stand-alone parsing algorithm, but it is the basis for other
extremely useful algorithms.

The key idea in all LR parsing algorithms is to run a finite-state automaton


from the bottomof the parse stack to the top. The state of this automaton (at the
top of the stack) is used, along with some lookahed symbols, to choose
whether to shift another symbol or reduce aproduction, and (if the latter) which
production to reduce. The construction of the finite automaton is somewhat
involved.

The states of the automaton (which we will call the LR(0) state machine)
consist of LR(0) items. An LR(0) item is a production with a position
marked in it.

Definition 11 An LR(0) item is a pair consisting of a production A → α


and a position i, where 0 ≤ i ≤ |α|, where |α| is the length of α.

An item is written A → α • β, where • marks the position. For example, A →


•ab, A → a • b, and A → ab• are all items. An ϵ production has only one item,
which is written A → •ϵ. Note that this would be equivalent to A → ϵ•, if we
ever wrote that, which we don’t.

The finite-state automaton in this case is called the LR(0) machine. Each state
of the LR(0) machine is a set of items. If two states have the same set of items,
they aren’t two states – they are the same state. Intuitively, the LR(0) machine
keeps track of the productions that might eventually be reduced when more
stuff is pushed on the stack. The positions keep track of how much of the
production is already on the stack.

For convenience, the first step of any LR parsing algorithm is to add a new
sentence symbol
SJ and a new production SJ → S to the CFG. Let’s use the first example CFG
above.
SJ → S
S →
(S)
S → a

The first state starts with the item that says: “we are parsing a sentence, and we
haven’t seen anything yet:” SJ → •S.
The construction of a state starts with a kernel, which is a core set of items.
The kernel of the our first state is SJ → •S. The kernel is extended via the
closure operation. The closure
operation takes into account that whenever we are in the middle of an
→i t e m • A α Bβ (meaning “I might eventually be able to →reduce A αBβ, and
α is already on the stack”),it could be that the parser will see something that
can be reduced to B. The items that reflect this are those of the form B → •γ
(meaning “I might eventually be able to reduce B → γ, and I haven’t seen
anything in γ yet”). These are called closure items.

Definition 12 The LR(0) closure step adds all items of the form B → •γ to a
state when- ever the state contains an item A → α • Bβ.

There are two important points to make about the closure step:


When a closure item begins with a nonterminal, adding it to the state
may causeadditional closure items to be added.

The state is a set of items, which means it has no duplicates – “adding”
items thatalready there has no effect.

To be explicit, the procedure for

closure is:repeat until no change


if there is an item A → α • Bβ in the state add B → •γ to
the statefor all productions B → γ in the grammar
Our
first LR(0) state has a kernel → •S.
→of •SJ
Cl→osu• re introduces items S (S) andS a. No additional items can or
should be added (there are no more dots in front of non-terminals). So the
first state is:

A box is drawn around the state to indicate that the closure operation has been
performed.

To complete the state machine, we need to define the transitions between the
states. This is done by the goto function. If our state is q, whenever there is at
least one item of the form A → α • Xβ, where X is a terminal or nonterminal,
goto(q, X) is defined. It is theset of all items A → αX • β where A → α • Xβ is
an item in q. Important note: Never do this to items like A → •ϵ, which is a
reduce item. The goto function doesn’t apply to such
productions, because ϵ is contains neither terminal nor nonterminal symbols. I
would love tohave a quarter where no students add transitions from A → •ϵ to
A → ϵ•.

For each symbol X, goto generates the kernel of a successor state to q. There are
severalimportant points to notice:


For a given symbol X, goto operates on all of the items where X is the
next [Link] is only one successor on each X.
• If qJ = goto(q, X), all of the items in qJ are of the →f o rm A• αX •
β. I.e., the
is always
immediately after an X.

Once the kernels of the successor states have been generated, the closure operation
is applied to each to complete the set of items in the state. Given an LR(0) state, it
is possible to determine which of its items were in its kernel and which were
introduced by closur•e by checking whether the is at the beginning of the item
→ •
(closure) or not (kernel) (the oneexception to this r ule is the item that started
everything, SJ S, which is introduced by neither operation).
If, after applying closure, the set of items is exactly the same as some other set
of items, the two states are actually the same state. (An optimization to this is
to look at whether the kernel generated by goto is the same as the kernel of an
existing state, in which case thestates can be merged before wasting a closure
operation.)

Let’s apply the goto operation to the one state we have generated so far (let us
call it 0). There are three symbols of interest: “S”, “(”, and “a”. In each case,
there is only one item with that symbol next.

goto(0, S) = {SJ → S•}


goto(0, “(” ) = {S →
(•S)}goto(0, a) = {S →
a•}

In the resulting DFA, there is a transition from state qi to qj on symbol X if


and only if
qj = closure(goto(qi, X)). The initial state of the DFA is the one whose kernel
is SJ → •S.
The complete LR(0) machine for the current example is:
3 4
) 4 S →
(S)•

Let’s consider state 2 in more detail. The state was first generated by g{ot o→(0, J
(J) = S
(•S)}; this is the kernel of state 2. Because of the .•. . S . . . in the kernel item, closure
adds ite→ms• S (→S) •and S a, completing the state. goto(2{,J (→J ) ) •als}o
generates S ( S) , so at this point, we know we are going to state 2
(looping back to it, actually). Or we waituntil we generate the closure of our
“new” state, and then notice that we have exactly the same items we had in
state 2 and merge the states. State 2 is different from state 0 becausethe kernel
items are different (even though the closure items are the same).

It is illuminating to see how this machine directs parsing. When an item h a •s . . . a


. . . for some terminal a, we call it a shift item. It says that a should be shifted
onto the stack if it appears as the next input symbol.

An item of the form →A • α is a reduce item. It indicates that, when this state is
reached, the pro→duction A α should be reduced (α will be guaranteed to be
on top of the stack if theparser gets to this state). →Red•uci ng the item SJ
S accepts
→ •
the input s t r i n g . Important: an item like A ϵ is a reduce item; since the
RHS
is the empty string, position 0 is theend of the item.

Unlike the example of shift-reduce parsing, an LR(0) parser does not actually
shift symbolsonto the stack. Instead, it shifts states. No information is lost because
of the special structure of the LR(0) machine. Notice that every transition into a
state
has exactly the same symbol labelling it. If you see a state q on the stack, it is as
though a were shifted onto the [Link] stack will also have states corresponding
to nonterminal symbols.

In more detail, the LR(0) parsing algorithm starts with the first state (0 in
our example)and executes the following steps repeatedly:

shift If the next input is a and there is a transition on a from the top state on
the stack(call it qi) to some state qj, push qj on the stack and remove a
from the input.
reduce If the state has a reduce item A → α•

1. Pop one state on the stack for every symbol in α (note: symbols
associated withthese states will always match symbols in α).
2. Let the top state on the stack now be qi. There will be a transition in
the LR(0)machine on A to a state qj. Push qj on the stack
error If the state has no reduce item, the next input is a, and there is no
transition on a, report a parse error and halt.
accept When the item →S J S• is reduced accept if the next input symbol is
$, otherwise report an error and halt. (This rule is a bit weird. The
remaining LR-style parsing algorithms don’t need to check if the input
is empty. We define it this way so LR(0) parsing can do our simple
example grammar.)

LR(0) parsing requires that each of these steps be uniquely determined by the
LR(0) machineand the input. Therefore, if a state has a reduce item, it must not
have any other reduce items or shift items. With this restriction, the current
state determines whether to shift or reduce, and which production to reduce,
without looking at the next input. If it shifts, it can read the next input to see
which state to shift.

Let’s parse the input “((a))” using this LR(0) machine.

Stack input action


(top)
0 ((a)) shift 2
$ $
02 (a))$ shift 2
$(
022 a))$ shift 5
$((
0225 ))$ reduce S → a
$((a
0223 ))$ shift 4
$((S
02234 ))$ reduce S →
$((S) (S)
023 )$ shift 4
$(S
0234 $ reduce S →
$(S) (S)
01 $ accept
$S
At each step, we have listed the symbols associated with the states on the stack
(associatingthe “bottom of stack” symbol, $, with state 0). Let’s look a→t the
reductions of S (S) in more detail. When the first such reduction occurs, the
stack is 02234; three symbols are popped of (because the length of “(S)” is 3),
leaving a stack of 02. There is a transition from the top state, 2, on S to state 3,

so we push a 3, leaving 023 on the stack. The second time it reduces S (S), the
stack is 0234. When three states are popped, this leaves a stack with j →u s t 0
on it(so top-of-stack state is different this time). There is a transition from state
0 to
state 1 on S, so the new stack is 01. At this point, the action is to reduce SJ S
and the input has been consumed, so the parser accepts. (It helps to visualize
parsing by tracingthe top-of-stack state in the diagram with your finger as you
step through the parse.)

SLR(1) parsing

Here is an example of a CFG that is not LR(0):

0 SJ → S
1 S → Aa
2 S → Bb
3 S → ac
4 A → a
5 B → a
The productions are numbered because the numbers are used in the parse table
construction below.

Here is the LR(0) machine that results


0
S → •S

S →•
Aa
S
S′ → S•
S → •
Bb

S → •
ac

A → •a
B → •a
3
S → A a S →
• Aa•
a

5
S b
→ B • S →
b Bb•

The machine is not LR(0) because of shift/reduce and reduce/reduce conflicts in


state 6 (there is a shift item and two reduce items in the state, so the parser doesn’t
know whetherto shift or reduce, and if it decided to reduce, anyway, it wouldn’t
know which productionto reduce). Hence, this grammar is not LR(0).
However, if we allowed the parser to base its choice on the next input symbol,
the correct choice could be made reliably. If you examine the grammar
carefu→lly,
you can see that A ashould only be reduced w h →e n the next input is a, B a should
only be reduced when the next input is b, and, if the next input is c, the parser
should shift.

How could we determine this algorithmically? The next three parsing


algorithms all do itin different ways. The simplest method is SLR(1) parsing,
which uses FOLLOW sets to compute lookaheads for actions. Using the rules
for computing FOLLOW sets in L{ L}( 1) parsing, we {co}mpute FOLLOW (S{)
}
=
$ , FOLLOW (A) = a , and FOLLOW (B) = b . We can then associate
each reduce item with a lookahead set consisting of the FOLLOW set of the
symbol on the left-hand side of the item. State 6 would then look like:

S → a•c
A → a•, {a}
B → b•, {b}

Since the lookahead sets for each shift item are disjoint from each other, and
disjoint from the symbols that can be shifted, this state has no SLR(1) conflicts.
The parse table for SLR(1) has the same format as for the two other shift-
reduce parsing algorithms that are going to be discussed, LR(1) and LALR(1).
The parse table consists of two two-dimensional arrays: an ACTION table and
a GOTO table. Here is the SLR(1) parse table for the CFG above.

ACTION GOTO
a b c $ S A B
0 s6 1 2 4
1 acc
2 s3
3 r1
4 s5
5 r2
6 r4 r5 s7
7 r3
The rows of both tables are indexed by the states of the LR(0) machine. The
columns of theACTION table are indexed by terminal symbols and the end-of-
file marker $. The columns of the GOTO table are indexed by nonterminals.

The entries of the ACTION table are shift actions, of the form sn, where n is
index of an LR(0) state, or rp, where p is the index of a production in the CFG.
At each step, the parser looks in ACTION[q, a], where q is the current LR(0)
state and a is the next input [Link] the entry is “sn,” it shifts state number n
onto the stack. If the entry is “rp,” it reduces production p. If there is no entry
in the ACTION table, a parse error is reported (the inputis not in the language
of the CFG).

The GOTO table gives the next-state transition function for nonterminals
(given a current LR state and a nonterminal, it gives the next LR state). It is
used during reductions: after popping states for the right-hand side of a
production, it designates the state to be pushed for the left-hand side symbol.
Empty entries in the GOTO table are never referenced, even if the input string
is not parse-able (if there were a parse error, it would have been caught earlier
when an empty entry of the ACTION table was referenced).

As with LL(1) parsing, all LR parsing algorithms require that the table
uniquely determinethe next parse action: there must be at most one action in
each entry of the parse table, or the CFG cannot be handled by the parsing
algorithm, in which case the CFG is said tobe “not SLR(1)” (or LR(1) or
LALR(1)). When there are multiple actions in a table entry, there is said to
be a conflict. There can be shift/reduce conflicts (if there is a shift and a
reduce action in the same table entry), or reduce/reduce conflicts (if there are
several reduce actions in the same table entry). There is no such thing as a
shift/shift conflict, because the
LR machine goto operation ensure that each state has at most one successor.
Conflicts onlyshow up in the ACTION table, not the GOTO table.

The above table has no conflicts. The most interesting row is for state 6 (the
one with the LR(0) conflicts). Observe that the row has two different reduce
actions and a shift action, yetthey do not conflict because they are in
different columns (because their lookahead symbolsare disjoint).

LR parse algorithm

The table-driven version of the SLR(1) parsing algorithm is exactly the


same for LR(1) and LALR(1) parsing. The differences among these
algorithms are in the table constructions.

When production p is reduced, the parser looks up the production, and pops
one state off ofthe stack for each symbol on the right-hand side of
production
p. Then it looks up an entry in the GOTO table. The columns of the GOTO
table are indexed by nonterminal symbols, and the entries in the table are the
indexes of LR(0) states. The parser pushes onto the parse stack the state in
GOTO[n, A], where n is the LR(0) state on top of the stack (after popping
one state for each right-hand symbol), and A is the nonterminal on the left-
hand side of production p.

We assume state 0 is the start state, whose k{e r n →e l i•s }SJ S .


Initially, the stack has state 0 and nothing else on it, and the input is the
input string followed by the end-of-file marker,
$.

1. Let q to the top state on the stack, let a be the next input symbol, and
let act be
ACTION [q, a].
2. If act is “sn”, push n on the stack and remove a from the input;

3. Else, if act is “rp”, and supposing production p is A → α,

(a) Pop length(α) states off of the stack. Let qJ be the top of stack symbol
immediatelyafter doing this.
(b) Push GOTO[qJ, A] onto the stack;
4. Else, if act is “acc”, accept the input;

5. Else, if act is “error” (an empty entry), report an error (the input string
is not in thelanguage of the CFG).
The “accept” action only ever appears in the column $, so it will only be found
when theinput has been exhausted. Unlike LR(0) parsing, it is not necessary to
have a separate checkto see if the input is empty before accepting.

Example SLR(1) parse

As an example, let us parse the input “ab” using our CFG and table.

Stack input
(top)
0 ab$
$
06 b$
$a
04 b$
$B
045 $
$Bb
01 $
$S
accept

Filling in the parse tables

The ACTION table is filled in according to the following rules.

• If there is a transition from state q to state number n on terminal


symbol a, set
ACTION [q, a] = sn.
• there is a reduce i t e m A α in state q and a is in FOLLOW (A), set
If → •
ACTION [q, a] =rp, where →
p is the number of A → α• , unless the
item is S J
S , in which case set ACTION [q, a] = acc
(in this case a = $).
• ACTION [q, a] already has an entry when one of the rules above applies,
If
report thatthe CFG is “not SLR(1)” and halt.

The definition GOTO table is quite simple: if there is a transition from state
q to state n
on nonterminal A in the LR(0) machine, set GOTO[q, A] = n.

You should reconstruct the table above to see how the rules apply to it. We also
recommendthat you try doing the SLR(1) parse table construction for the CFG
S → Sa | ϵ.

What is Recursive Descent Parser?


Recursive Descent Parser uses the technique of Top-Down Parsing without
backtracking. It can be defined as a Parser that uses the various recursive
procedure to process the input string with no backtracking. It can be simply
performed using a Recursive language. The first symbol of the string of R.H.S of
production will uniquely determine the correct alternative to choose.
The major approach of recursive-descent parsing is to relate each non-terminal
with a procedure. The objective of each procedure is to read a sequence of input
characters that can be produced by the corresponding non-terminal, and return a
pointer to the root of the parse tree for the non-terminal. The structure of the
procedure is prescribed by the productions for the equivalent non-terminal.
The recursive procedures can be simply to write and adequately effective if written
in a language that executes the procedure call effectively. There is a procedure for
each non-terminal in the grammar. It can consider a global variable lookahead,
holding the current input token and a procedure match (Expected Token) is the
action of recognizing the next token in the parsing process and advancing the input
stream pointer, such that lookahead points to the next token to be parsed. Match ()
is effectively a call to the lexical analyzer to get the next token.
For example, input stream is a + b$.
lookahead == a
match()
lookahead == +
match ()
lookahead == b
……………………….
……………………….
In this manner, parsing can be done.
Example − In the following Grammar, first symbol, i.e., if, while & begin uniquely
determine, which of the alternative to choose.
As Statement starting with if will be a conditional statement & statement starting
with while will be an Iterative Statement.
Stmt → If condition then Stmt else Stmt
| While condition do Stmt
| begin Stmt end.
Example − Write down the algorithm using Recursive procedures to implement the
following Grammar.
E → TE′
E′ → +TE′
T → FT′
T′ →∗ FT′|ε
F → (E)|id
One of major drawback or recursive-descent parsing is that it can be implemented
only for those languages which support recursive procedure calls and it suffers
from the problem of left-recursion.
BOTTOM-UP OR SHIFT REDUCE PARSERS
In this article, we are discussing the Bottom Up parser. Bottom-up Parsers / Shift
Reduce Parsers Build the parse tree from leaves to root. Bottom-up parsing can be
defined as an attempt to reduce the input string w to the start symbol of grammar by
tracing out the rightmost derivations of w in reverse. Eg.

Classification of Bottom-up Parsers:


A general shift reduce parsing is LR parsing. The L stands for scanning the input from
left to right and R stands for constructing a rightmost derivation in reverse.
Benefits of LR parsing:
1. Many programming languages using some variations of an LR parser. It should be
noted that C++ and Perl are exceptions to it.
2. LR Parser can be implemented very efficiently.
3. Of all the Parsers that scan their symbols from left to right, LR Parsers detect
syntactic errors, as soon as possible.
In this discussion, we will explore the construction of the GOTO graph for a grammar
using all four LR parsing techniques. The GOTO graph is particularly useful in
solving questions in the GATE exam as it allows for a more efficient analysis of the
given grammar.
To construct the GOTO graph using LR(0) parsing, we rely on two essential
functions: Closure() and Goto().
Firstly, we introduce the concept of an augmented grammar. In the augmented
grammar, a new start symbol, S’, is added, along with a production S’ -> S. This
addition helps the parser determine when to stop parsing and signal the acceptance of
input. For example, if we have a grammar S -> AA and A -> aA | b, the augmented
grammar will be S’ -> S and S -> AA.
Next, we define LR(0) items. An LR(0) item of a grammar G is a production of G
with a dot (.) positioned at some point on the right-hand side. For instance, given the
production S -> ABC, we obtain four LR(0) items: S -> .ABC, S -> [Link], S ->
AB.C, and S -> ABC. It is worth noting that the production A -> ε generates only one
item: A -> .ε.
By utilizing the Closure() function, we can calculate the closure of a set of LR(0)
items. The closure operation involves expanding the items by considering the
productions that have the dot right before the non-terminal symbol. This step helps us
identify all the possible items that can be derived from the current set.
The Goto() function is employed to construct the transitions between LR(0) items in
the GOTO graph. It determines the next set of items by shifting the dot one position to
the right. This process allows us to navigate through the graph and track the parsing
progress.
Augmented Grammar: If G is a grammar with start symbol S then G’, the
augmented grammar for G, is the grammar with new start symbol S’ and a production
S’ -> S. The purpose of this new starting production is to indicate to the parser when it
should stop parsing and announce acceptance of input. Let a grammar be S -> AA A -
> aA | b, The augmented grammar for the above grammar will be S’ -> S S -> AA A -
> aA | b.
LR(0) Items: An LR(0) is the item of a grammar G is a production of G with a dot at
some position in the right side. S -> ABC yields four items S -> .ABC S -> [Link] S ->
AB.C S -> ABC. The production A -> ε generates only one item A -> .ε
Closure Operation: If I is a set of items for a grammar G, then closure(I) is the set of
items constructed from I by the two rules:
1. Initially every item in I is added to closure(I).
2. If A -> α.Bβ is in closure(I) and B -> γ is a production then add the item B -> .γ to
I, If it is not already there. We apply this rule until no more items can be added to
closure(I).
Eg:

Goto Operation : Goto(I, X) =


1. Add I by moving dot after X.
2. Apply closure to first

step.
Construction of GOTO graph-
 State I0 – closure of augmented LR(0) item
 Using I0 find all collection of sets of LR(0) items with the help of DFA
 Convert DFA to LR(0) parsing table
Construction of LR(0) parsing table:
 The action function takes as arguments a state i and a terminal a (or $ , the input
end marker). The value of ACTION[i, a] can have one of four forms:
1. Shift j, where j is a state.
2. Reduce A -> β.
3. Accept
4. Error
 We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii , A]
= Ij then GOTO also maps a state i and a nonterminal A to state j.
Eg: Consider the grammar S ->AA A -> aA | b Augmented grammar S’ -> S S -> AA
A -> aA | b The LR(0) parsing table for above GOTO graph will be

Action part of the table contains all the terminals of the grammar whereas the goto
part contains all the nonterminals. For every state of goto graph we write all the goto
operations in the table. If goto is applied to a terminal then it is written in the action
part if goto is applied on a nonterminal it is written in goto part. If on applying goto a
production is reduced ( i.e if the dot reaches at the end of production and no further
closure can be applied) then it is denoted as Ri and if the production is not reduced
(shifted) it is denoted as Si. If a production is reduced it is written under the terminals
given by follow of the left side of the production which is reduced for ex: in I 5 S->AA
is reduced so R1 is written under the terminals in follow(S)={$} (To know more about
how to calculate follow function: Click here ) in LR(0) parser. If in a state the start
symbol of grammar is reduced it is written under $ symbol as accepted.

NOTE: If in any state both reduced and shifted productions are present or two
reduced productions are present it is called a conflict situation and the grammar is not
LR grammar.

NOTE:
1. Two reduced productions in one state – RR conflict.
2. One reduced and one shifted production in one state – SR conflict. If no SR or RR
conflict present in the parsing table then the grammar is LR(0) grammar. In above
grammar no conflict so it is LR(0) grammar.

You might also like