Evolution of Programming Languages
Evolution of Programming Languages
UNIT 1
SYNTAX AND SEMANTICS
The Evolution of Programming Languages:
1952: Autocode
Developed by Alick Glennie.
The first compiled computer programming language.
COBOL and FORTRAN are the languages referred to as Autocode.
1957: FORTRAN
Developers are John Backus and IBM.
It was designed for numeric computation and scientific computing.
Software for NASA probes voyager-1 (space probe) and voyager-2 (space probe)
was originally written in FORTRAN 5.
1958: ALGOL
ALGOL stands for ALGOrithmic Language.
The initial phase of the most popular programming languages of C, C++, and
JAVA.
It was also the first language implementing the nested function and has a simple
syntax than FORTRAN.
The first programming language to have a code block like “begin” that indicates
that your program has started and “end” means you have ended your code.
1959: COBOL
It stands for COmmon Business-Oriented Language.
In 1997, 80% of the world’s business ran on Cobol.
The US internal revenue service scrambled its path to COBOL-based IMF
(individual master file) in order to pay the tens of millions of payments mandated
by the coronavirus aid, relief, and economic security.
1964: BASIC
It stands for beginners All-purpose symbolic instruction code.
In 1991 Microsoft released Visual Basic, an updated version of Basic
The first microcomputer version of Basic was co-written by Bill Gates, Paul
Allen, and Monte Davidoff for their newly-formed company, Microsoft.
1972: C
It is a general-purpose, procedural programming language and the most popular
programming language till now.
All the code that was previously written in assembly language gets replaced by the
C language like operating system, kernel, and many other applications.
It can be used in implementing an operating system, embedded system, and also
on the website using the Common Gateway Interface (CGI).
C is the mother of almost all higher-level programming languages like C#, D, Go,
Java, JavaScript, Limbo, LPC, Perl, PHP, Python, and Unix’s C shell.
Some other programming languages that are popular among programmers are listed
below.
YEAR OF PROGRAMMING
RELEASE LANGUAGES FACTS
2. High-Level Languages: These languages are easier to learn and use than
low-level languages. They are used to write applications, games, and
websites. Examples include Java, Python, and Ruby.
Chapter 2
Describing sytax:
What is syntax in a programming language?
f you’re new to programming, you’ll find that learning syntax rules is a key part of
learning a programming language. Knowing what syntax is and why it’s important
will help you learn these rules and gain a better understanding of how code works
structurally.
Today, we’ll explore syntax in more detail, consider why it’s important, and learn
how it varies between a few major programming languages.
What is syntax?
Syntax is a set of rules that tell us what arrangements of characters create a valid
statement in a language. Human languages and programming languages are both
dependent on syntax. To use either type of language effectively, you need to know
how to fit elements together to achieve your goal. In the case of a human
language, this goal is successful communication. In the case of a programming
language, the goal is to issue a set of directives that a computer can read and act
on.
Syntax errors occur when the elements in a statement are disordered in a way that
impedes successful communication. Let’s look at a simple example from English:
1. The dog chased the rabbit.
2. The rabbit chased the dog.
3. Chased the rabbit the dog.
As you can see, word order matters a great deal in English. The words are the same in
these three sentences, but combining them in different ways results in very different
meanings. Sentence one contains a simple statement that accords with English syntax.
Sentence two switches the positions of the subject and direct object, so that while the
sentence still makes sense syntactically, the meaning is radically different. The third
sentence shuffles the words around in such a way that it’s difficult to read. It doesn’t
follow the conventions of English syntax, so it’s unclear what the sentence is saying.
Using correct syntax is arguably even more important in programming languages than
it is in human languages. If you’re learning a new human language and you only have
a beginner’s understanding of its syntax, you’ll often still be able to get your meaning
across. This is because syntax in human languages is often flexible, and human
listeners can problem-solve to figure out the meaning of an imperfect sentence.
Programming languages are less accommodating to syntax errors. Syntax determines
how we organize the elements in our code to make it legible to the computer. If
there’s a syntax error, the computer might not be able to read it.
If the syntax of a language is not followed, the code will not be understood by a
compiler or interpreter.
Compilers convert programming languages like Java or C++ into binary code that
computers can understand. If the syntax is incorrect, the code will not compile.
That’s why it is crucial that a programmer pays close attention to a language’s syntax.
No programmer likes to get a syntax error.
Every language has its own set of rules that make up its basic
syntax. Naming conventions are a primary component of basic
syntax conventions and vary by language.
Case Sensitive. Java, C++, and Python are examples of
languages that are case-sensitive. Identifiers such as world
and World have different meanings in these languages.
Languages such as Basic and SQL are insensitive, meaning
world and World have the same meaning.
Class Names. Java requires the first letter of each word in class
names be upper case. For example, class FirstJavaClass.
Languages such as C or C++ use an underscore to separate
words. In C, the class name would be first_java_class.
Program Filenames. The name of a Java program file must
match the class name with the extension ‘*.java” added to
the name. For example, [Link] would be the
name of the program file for the class FirstJavaClass. C and
C++ files require a “*.c” or “*.cpp” extension but have no
other stipulations.
Communication
Code integration
Consistency
Clarity
FORMAL METHODS
In the middle to late 1950s, two men, Noam Chomsky and John Backus, in
unrelated research efforts, developed the same syntax description formalism, which
subsequently became the most widely used method for programming language syntax.
Context-Free Grammars
Two of these grammar classes, named context-free and regular, turned out to be
useful for describing the syntax of programming languages.
The forms of the tokens of programming languages can be described by regular
grammars.
The syntax of whole programming languages, with minor exceptions, can be
described by context-free grammars.
BNF Fundamentals
Describing Lists
Communication
Code integration
Consistency
Clarity
If S → x1x2 …… xn is a production rule in a CFG, then the parse tree / derivation tree
will be as follows −
There are two different approaches to draw a derivation tree −
Top-down Approach −
Starts with the starting symbol S
Goes down to tree leaves using productions
Bottom-up Approach −
Starts from tree leaves
Proceeds upward to the root which is the starting symbol S
Derivation or Yield of a Tree
The derivation or the yield of a parse tree is the final string obtained by concatenating
the labels of the leaves of the tree from left to right, ignoring the Nulls. However, if
all the leaves are Null, derivation is Null.
Example
Let a CFG {N,T,P,S} be
N = {S}, T = {a, b}, Starting symbol = S, P = S → SS | aSb | ε
One derivation from the above CFG is “abaabb”
S → SS → aSbS → abS → abaSb → abaaSbb → abaabb
Sentential Form and Partial Derivation Tree
A partial derivation tree is a sub-tree of a derivation tree/parse tree such that either all
of its children are in the sub-tree or none of them are in the sub-tree.
Example
If in any CFG the productions are −
S → AB, A → aaA | ε, B → Bb| ε
the partial derivation tree can be the following −
If a partial derivation tree contains the root S, it is called a sentential form. The
above sub-tree is also in sentential form.
Leftmost and Rightmost Derivation of a String
Leftmost derivation − A leftmost derivation is obtained by applying
production to the leftmost variable in each step.
Rightmost derivation − A rightmost derivation is obtained by applying
production to the rightmost variable in each step.
Example
Let any set of production rules in a CFG be
X → X+X | X*X |X| a
over an alphabet {a}.
The leftmost derivation for the string "a+a*a" may be −
X → X+X → a+X → a + X*X → a+a*X → a+a*a
The stepwise derivation of the above string is shown as below −
The rightmost derivation for the above string "a+a*a" may be −
X → X*X → X*a → X+X*a → X+a*a → a+a*a
The stepwise derivation of the above string is shown as below −
Left and Right Recursive Grammars
In a context-free grammar G, if there is a production in the form X → Xa where X is
a non-terminal and ‘a’ is a string of terminals, it is called a left recursive production.
The grammar having a left recursive production is called a left recursive grammar.
And if in a context-free grammar G, if there is a production is in the form X →
aX where X is a non-terminal and ‘a’ is a string of terminals, it is called a right
recursive production. The grammar having a right recursive production is called
a right recursive grammar.
Previous Page
Next Page
In a CFG, it may happen that all the production rules and symbols are not needed for
the derivation of strings. Besides, there may be some null productions and unit
productions. Elimination of these productions and symbols is called simplification of
CFGs. Simplification essentially comprises of the following steps −
Reduction of CFG
Removal of Unit Productions
Removal of Null Productions
Reduction of CFG
CFGs are reduced in two phases −
Phase 1 − Derivation of an equivalent grammar, G’, from the CFG, G, such that each
variable derives some terminal string.
Derivation Procedure −
Step 1 − Include all symbols, W1, that derive some terminal and initialize i=1.
Step 2 − Include all symbols, Wi+1, that derive Wi.
Step 3 − Increment i and repeat Step 2, until Wi+1 = Wi.
Step 4 − Include all production rules that have Wi in it.
Phase 2 − Derivation of an equivalent grammar, G”, from the CFG, G’, such that
each symbol appears in a sentential form.
Derivation Procedure −
Step 1 − Include the start symbol in Y1 and initialize i = 1.
Step 2 − Include all symbols, Yi+1, that can be derived from Yi and include all
production rules that have been applied.
Step 3 − Increment i and repeat Step 2, until Yi+1 = Yi.
Problem
Find a reduced grammar equivalent to the grammar G, having production rules, P: S
→ AC | B, A → a, C → c | BC, E → aA | e
Solution
Phase 1 −
T = { a, c, e }
W1 = { A, C, E } from rules A → a, C → c and E → aA
W2 = { A, C, E } U { S } from rule S → AC
W3 = { A, C, E, S } U ∅
Since W2 = W3, we can derive G’ as −
G’ = { { A, C, E, S }, { a, c, e }, P, {S}}
where P: S → AC, A → a, C → c , E → aA | e
Phase 2 −
Y1 = { S }
Y2 = { S, A, C } from rule S → AC
Y3 = { S, A, C, a, c } from rules A → a and C → c
Y4 = { S, A, C, a, c }
Since Y3 = Y4, we can derive G” as −
G” = { { A, C, S }, { a, c }, P, {S}}
where P: S → AC, A → a, C → c
Removal of Unit Productions
Any production rule in the form A → B where A, B ∈ Non-terminal is called unit
production..
Removal Procedure −
A CFG is in Chomsky Normal Form if the Productions are in the following forms −
A→a
A → BC
S→ε
where A, B, and C are non-terminals and a is terminal.
Algorithm to Convert into Chomsky Normal Form −
Step 1 − If the start symbol S occurs on some right side, create a new start
symbol S’ and a new production S’→ S.
Step 2 − Remove Null productions. (Using the Null production removal algorithm
discussed earlier)
Step 3 − Remove unit productions. (Using the Unit production removal algorithm
discussed earlier)
Step 4 − Replace each production A → B1…Bn where n > 2 with A → B1C where C
→ B2 …Bn. Repeat this step for all productions having two or more symbols in the
right side.
Step 5 − If the right side of any production is in the form A → aB where a is a
terminal and A, B are non-terminal, then the production is replaced by A → XB and
X → a. Repeat this step for every production which is in the form A → aB.
Problem
Convert the following CFG into CNF
S → ASA | aB, A → B | S, B → b | ε
Solution
(1) Since S appears in R.H.S, we add a new state S0 and S0→S is added to the
production set and it becomes −
S0→S, S→ ASA | aB, A → B | S, B → b | ∈
(2) Now we will remove the null productions
− B → ∈ and A → ∈
After removing B → ε, the production set becomes −
S0→S, S→ ASA | aB | a, A → B | S | ∈, B → b
After removing A → ∈, the production set becomes −
S0→S, S→ ASA | aB | a | AS | SA | S, A → B | S, B → b
(3) Now we will remove the unit productions.
After removing S → S, the production set becomes −
S0→S, S→ ASA | aB | a | AS | SA, A → B | S, B → b
After removing S0→ S, the production set becomes −
S0→ ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A → B | S, B → b
After removing A→ B, the production set becomes −
S0 → ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A→S|b
B→b
After removing A→ S, the production set becomes −
S0 → ASA | aB | a | AS | SA, S→ ASA | aB | a | AS | SA
A → b |ASA | aB | a | AS | SA, B → b
(4) Now we will find out more than two variables in the R.H.S
Here, S0→ ASA, S → ASA, A→ ASA violates two Non-terminals in R.H.S.
Hence we will apply step 4 and step 5 to get the following final production set which
is in CNF −
S0→ AX | aB | a | AS | SA
S→ AX | aB | a | AS | SA
A → b |AX | aB | a | AS | SA
B→b
X→
SA
(5) We have to change the productions S0→ aB, S→ aB, A→ aB
And the final production set becomes −
S0→ AX | YB | a | AS | SA
S→ AX | YB | a | AS | SA
A → b A → b |AX | YB | a | AS | SA
B→b
X → SA
Y→a
Greibach Normal Form
Previous Page
Next Page
A CFG is in Greibach Normal Form if the Productions are in the following forms −
A→b
A → bD1…Dn
S→ε
where A, D1,....,Dn are non-terminals and b is a terminal.
Algorithm to Convert a CFG into Greibach Normal Form
Step 1 − If the start symbol S occurs on some right side, create a new start
symbol S’ and a new production S’ → S.
Step 2 − Remove Null productions. (Using the Null production removal algorithm
discussed earlier)
Step 3 − Remove unit productions. (Using the Unit production removal algorithm
discussed earlier)
Step 4 − Remove all direct and indirect left-recursion.
Step 5 − Do proper substitutions of productions to convert it into the proper form of
GNF.
Problem
Convert the following CFG into CNF
S → XY | Xn | p
X → mX | m
Y → Xn | o
Solution
Here, S does not appear on the right side of any production and there are no unit or null
productions in the production rule set. So, we can skip Step 1 to Step 3.
Step 4
Now after replacing
X in S → XY | Xo | p
with
mX | m
we obtain
S → mXY | mY | mXo | mo | p.
And after replacing
X in Y → Xn | o
with the right side
of X → mX | m
we obtain
Y → mXn | mn | o.
Two new productions O → o and P → p are added to the production set and then we
came to the final GNF as the following −
S → mXY | mY | mXC | mC | p
X → mX | m
Y → mXD | mD | o
O→o
P→p
Pumping Lemma for CFG
Lemma
0, uvixyiz ∈ L.
Applications of Pumping Lemma
Pumping lemma is used to check whether a grammar is context free or not. Let us
take an example and show how it is checked.
Problem
Find out whether the language L = {xnynzn | n ≥ 1} is context free or not.
Solution
Let L is context free. Then, L must satisfy pumping lemma.
At first, choose a number n of the pumping lemma. Then, take z as 0n1n2n.
Break z into uvwxy, where
|vwx| ≤ n and vx ≠ ε.
Hence vwx cannot involve both 0s and 2s, since the last 0 and the first 2 are at least
(n+1) positions apart. There are two cases −
Case 1 − vwx has no 2s. Then vx has only 0s and 1s. Then uwy, which would have to
be in L, has n 2s, but fewer than n 0s or 1s.
Case 2 − vwx has no 0s.
Here contradiction occurs.
Hence, L is not a context-free language.
CHAPTER 4
ATTRIBUTE
GRAMMAR
Attribute grammar is a special form of context-free grammar where some additional
information (attributes) are appended to one or more of its non-terminals in order to
provide context-sensitive information. Each attribute has well-defined domain of
values, such as integer, float, character, string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and
it can help specify the syntax and semantics of a programming language. Attribute
grammar (when viewed as a parse-tree) can pass values or information among the
nodes of a tree.
Example:
E → E + T { [Link] = [Link] + [Link] }
The right part of the CFG contains the semantic rules that specify how the grammar
should be interpreted. Here, the values of non-terminals E and T are added together
and the result is copied to the non-terminal E.
Semantic attributes may be assigned to their values from their domain at the time of
parsing and evaluated at the time of assignment or conditions. Based on the way the
attributes get their values, they can be broadly divided into two categories :
synthesized attributes and inherited attributes.
Synthesized attributes
These attributes get values from the attribute values of their child nodes. To illustrate,
assume the following production:
S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized
attribute, as the values of ABC are synthesized to S.
As in our previous example (E → E + T), the parent node E gets its value from its
child node. Synthesized attributes never take values from their parent nodes or any
sibling nodes.
Inherited attributes
In contrast to synthesized attributes, inherited attributes can take values from parent
and/or siblings. As in the following production,
S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C
can take values from S, A, and B.
Expansion : When a non-terminal is expanded to terminals as per a grammatical rule
Static Semantics
There are some characteristics of the structure of programming languages that
are difficult to describe with BNF, and some that are impossible. As an
example of a syntax rule that is difficult to specify with BNF, consider type
compatibility rules. In Java, for example, a floating-point value cannot be
assigned to an integer type variable, although the opposite is legal. Although
this restriction can be specified in BNF, it requires additional nonterminal
symbols and rules. If all of the typing rules of Java were specified in BNF, the
grammar would become too large to be useful, because the size of the grammar
determines the size of the syntax analyzer.
• Associated with each grammar symbol X is a set of attributes A(X). The set A(X)
consists of two disjoint sets S(X) and I(X), called synthesized and inherited attributes,
respectively. Synthesized attributes are used to pass semantic information up a parse
tree, while inherited attributes pass semantic information down and across a tree.
• Associated with each grammar rule is a set of semantic functions and a possibly
empty set of predicate functions over the attributes of the symbols in the grammar
rule. For a rule X0 SX1 c Xn, the synthesized attributes of X0 are computed with
semantic functions of the form S(X0) = f(A(X1), c , A(Xn)). So the value of a
synthesized attribute on a parse tree node depends only on the values of the attributes
on that node’s children nodes. Inherited attributes of symbols Xj, 1 … j … n (in the
rule above), are computed with a semantic function of the form I(Xj) = f(A(X0), c ,
A(Xn)). So the value of an inherited attribute on a parse tree node depends on the
attribute values of that node’s parent node and those of its sibling nodes. Note that, to
avoid circularity, inherited attributes are often restricted to functions of the form I(Xj)
= f(A(X0), c , A(X(j-1))). This form prevents an inherited attribute from depending on
itself or on attributes to the right in the
parse tree.
• A predicate function has the form of a Boolean expression on the union of the
attribute set {A(X0), c , A(Xn)} and a set of literal attribute values. The only
derivations allowed with an attribute grammar are those in which every predicate
associated with every nonterminal is true. A false predicate
function value indicates a violation of the syntax or static semantics rules
of the language.
A parse tree of an attribute grammar is the parse tree based on its underlying BNF
grammar, with a possibly empty set of attribute values attached to each node. If all the
attribute values in a parse tree have been computed, the tree is said to be fully
attributed. Although in practice it is not always done this way, it is convenient to
think of attribute values as being computed after the complete unattributed parse tree
has been constructed by the compiler.
Intrinsic Attributes
Intrinsic attributes are synthesized attributes of leaf nodes whose values are determined
outside the parse tree. For example, the type of an instance of a variable in a program
could come from the symbol table, which is used to store variable names and their
types. The contents of the symbol table are set based on earlier declaration statements.
Initially, assuming that an unattributed parse tree has been constructed and that
attribute values are needed, the only attributes with values are the intrinsic attributes
of the leaf nodes. Given the intrinsic attribute values on a parse tree, the semantic
functions can be used to compute the remaining attribute values.
Examples of Attribute Grammars
As a very simple example of how attribute grammars can be used to describe static
semantics, consider the following fragment of an attribute grammar that describes the
rule that the name on the end of an Ada procedure must match the procedure’s name.
(This rule cannot be stated in BNF.) The string attribute of <proc_name>, denoted by
<proc_name>.string, is the actual string of characters that were found immediately
following the reserved word procedure by the compiler. Notice that when there is
more than one occurrence of a nonterminal in a syntax rule in an attribute grammar,
the nonterminals are subscripted with brackets to distinguish them. Neither the
subscripts nor the brackets are part of the described language.
In this example, the predicate rule states that the name string attribute of the
<proc_name> nonterminal in the subprogram header must match the name string
attribute of the <proc_name> nonterminal following the end of the subprogram. Next,
we consider a larger example of an attribute grammar. In this case, the example
illustrates how an attribute grammar can be used to check the type rules of a simple
assignment statement. The syntax and static semantics of this assignment statement
are as follows: The only variable names are A, B, and C. The right side of the
assignments can be either a variable or an expression in the form of a variable added
to another variable. The variables can be one of two types: int or real. When there are
two variables on the right side of an assignment, they need not be the same type. The
type of the expression when the operand types are not the same is always real. When
they are the same, the expression type is that of the operands. The type of the left side
of the assignment must match the type of the right side. So the types of operands in
the right side can be mixed, but the assignment is valid only if the target and the value
resulting from evaluating the right side have the same type. The attribute grammar
specifies these static semantic rules.
The process of computing the attribute values of a parse tree, which is sometimes
called decorating the parse tree. If all attributes were inherited, this could proceed in a
completely top-down order, from the root to the leaves. Alternatively, it could
proceed in a completely bottomup order, from the leaves to the root, if all the
attributes were [Link] our grammar has both synthesized and inherited
attributes, the evaluation process cannot be in any single direction. The following is
an evaluation of the attributes, in an order in which it is possible to
compute them:
The tree in Figure 3.7 shows the flow of attribute values. Solid lines are used for the
parse tree; dashed lines show attribute flow in the tree. The tree in Figure 3.8 shows
the final attribute values on the nodes. In this example, A is defined as a real and B is
defined as an int. Determining attribute evaluation order for the general case of an
attribute grammar is a complex problem, requiring the construction of a dependency
mgraph to show all attribute dependencies.
Checking the static semantic rules of a language is an essential part of all compilers.
Even if a compiler writer has never heard of an attribute grammar, he or she would
need to use their fundamental ideas to design the checks of static semantics rules for
his or her compiler.
One of the main difficulties in using an attribute grammar to describe all of the syntax
and static semantics of a real contemporary programming language is the size and
complexity of the attribute grammar. The large number of attributes and semantic
rules required for a complete programming language make such grammars difficult to
write and read. Furthermore, the attribute values on a large parse tree are costly to
evaluate. On the other hand, less formal attribute grammars are a powerful and
commonly used tool for compiler writers, who are more interested in the process of
producing a compiler than they are in formalism.
CHAPTER 5
Describing semantics
What is semantic?
In programming language theory, semantics is the rigorous mathematical study of the
meaning of programming [Link] assigns computational meaning to
valid strings in a programming language syntax.
Semantics describes the processes a computer follows when executing a program in
that specific language. This can be shown by describing the relationship between the
input and output of a program, or an explanation of how the program will be executed
on a certain platform, hence creating a model of computation.
Types of semantics
Formal semantics
Lexical semantics
Conceptual semantics.
Formal semantics
With formal semantics, words and meaning are examined from a philosophical or
mathematical standpoint. This branch of semantics develops models to help determine
the truth behind words instead of only examining real-world examples.
Lexical semantics
Lexical semantics are the most commonly-known type of semantics. This branch
looks for the meaning of individual words by considering the context and text
surrounding them.
Conceptual semantics
With conceptual semantics, the dictionary definition of the word is examined before
any context is applied. After examining the definition, the context is examined by
looking for connecting words, how meaning gets assigned, and how meaning may
change over time. If the word conveys context, it may be called a sign in conceptual
semantics.
Examples of semantics
Below are examples of how the semantics of different words may be interpreted by
how they're used.
Run
The word "run" can have many different meanings. For example, "Joe enjoys
running" describes a person (Joe) who runs for exercise. However, "My computer is
running" doesn't mean the computer is running like a human; it means the computer
is operational.
Crash
A "crash" could refer to a car accident, a drop in the Stock Market, attending a party
without being invited, or a computer problem.
Move
"Move" can describe taking something and putting it elsewhere, pushing or pulling
something to an alternate location, or something that stirs emotion.
Dynamic Semantics
The difficult task of describing the dynamic semantics, or meaning, of the
expressions, statements, and program units of a programming language. Because of
the power and naturalness of the available notation, describing syntax is a relatively
simple matter. On the other hand, no universally accepted notation or approach has
been devised for dynamic semantics.
There are several different reasons underlying the need for a methodology and
notation for describing semantics. Programmers obviously need to know precisely
what the statements of a language do before they can use them effectively in their
programs. Compiler writers must know exactly what language constructs mean to
design implementations for them correctly. If there were a precise semantics
specification of a programming language, programs written in the language
potentially could be proven correct without testing. Also, compilers could be shown to
produce programs that exhibited exactly the behavior given in the language definition;
that is, their correctness could be verified. A complete specification of the syntax and
semantics of a programming language could be used by a tool to generate a compiler
for the language automatically. Finally, language designers, who would develop the
semantic descriptions of their languages, could in the process discover ambiguities
and inconsistencies in their designs.
There are several problems with using this approach for complete formal
semantics descriptions. First, the individual steps in the execution of machine
language and the resulting changes to the state of the machine are too small and
too numerous. Second, the storage of a real computer is too large and complex.
There are usually several levels of memory devices, as well as connections to
enumerable other computers and memory devices through networks. Therefore,
machine languages and real computers are not used for formal operational
semantics. Rather, intermediate-level languages and interpreters for idealized
computers are designed specifically for the process.
There are different levels of uses of operational semantics. At the highest level,
the interest is in the final result of the execution of a complete program. This is
sometimes called natural operational semantics. At the lowest level,
operational semantics can be used to determine the precise meaning of a
program through an examination of the complete sequence of state changes that
occur when the program is executed. This use is sometimes called structural
operational semantics.
Denotational Semantics
Denotational semantics is the most rigorous and most widely known formal
method for describing the meaning of programs. It is solidly based on recursive
function theory. A thorough discussion of the use of denotational semantics to
describe the semantics of programming languages is necessarily long and
complex. It is our intent to provide the reader with an introduction to the central
concepts of denotational semantics, along with a few simple examples that are
relevant to programming language specifications.
The process of constructing a denotational semantics specification for a
programming language requires one to define for each language entity both a
mathematical object and a function that maps instances of that language entity
onto instances of the mathematical object. Because the objects are rigorously
defined, they model the exact meaning of their corresponding entities. The idea
is based on the fact that there are rigorous ways of manipulating mathematical
objects but not programming language constructs. The difficulty with this
method lies in creating the objects and the mapping functions. The method is
named denotational because the mathematical objects denote the meaning of
their corresponding syntactic entities.
Static semantics
Principles
The program is not evaluated, i.e. the semantics don't directly guide the
semantic algorithm's flow (which follows the program's structure instead)
Should terminate on every program
Must not depend on external events (such as IO, system errors, etc)
The algorithm should not need to visit the same node in the AST twice (or more
than n times, for n fixed for the algorithm)
The output tells you something about how the program is defined, not what
it does
Application
Type checking
Warnings / coding suggestions
Optimizing
Compiling
Dynamic semantics
Principles
The program is evaluated - this can mean for example that semantics of control
flow instructions affect the way the semantic algorithm is evaluated itself
Can loop, halting problem can be undecidable
Can depend on arbitrary real-world events
It may be necessary to evaluate the same part of the AST indefinite number
of times
The output tells you the results of the implemented
algorithm Application
Chapter 6
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes modified source code from
language preprocessors that are written in the form of sentences. The lexical analyzer
breaks these syntaxes into a series of tokens, by removing any whitespace or
comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer
works closely with the syntax analyzer. It reads character streams from the source
code, checks for legal tokens, and passes the data to the syntax analyzer when it
demands.
Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are
some predefined rules for every lexeme to be identified as a valid token. These rules
are defined by grammar rules, by means of a pattern. A pattern explains what can be a
token, and these patterns are defined by means of regular expressions.
In programming language, keywords, constants, identifiers, strings, numbers,
operators and punctuations symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is a
set of English language alphabets.
Strings
Any finite sequence of alphabets (characters) is called a string. Length of the string is
the total number of occurrence of alphabets, e.g., the length of the string tutorialspoint
is 14 and is denoted by |tutorialspoint| = 14. A string having no alphabets, i.e. a string
of zero length is known as an empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language contains the following symbols:-
Assignment =
Preprocessor #
Language
A language is considered as a finite set of strings over some finite set of alphabets.
Computer languages are considered as finite sets, and mathematically set operations
can be performed on them. Finite languages can be described by means of regular
expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid
string/token/lexeme that belong to the language in hand. It searches for the pattern
defined by the language rules.
Regular expressions have the capability to express finite languages by defining a
pattern for finite strings of symbols. The grammar defined by regular expressions is
known as regular grammar. The language defined by regular grammar is known
as regular language.
Regular expression is an important notation for specifying patterns. Each pattern
matches a set of strings, so regular expressions serve as names for a set of strings.
Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition. Regular
languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which
can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
Union of two languages L and M is written
as L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written
as LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
Precedence and Associativity
*, concatenation (.), and | (pipe sign) are left associative
* has the highest precedence
Concatenation (.) has the second highest precedence.
| (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrence of x.
i.e., it can generate { e, x, xx, xxx, xxxx, … } x+
means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x* x?
means at most one occurrence of x
i.e., it can generate either {x} or {e}.
[a-z] is all lower-case alphabets of English language. [A-
Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Representation occurrence of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representation of language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted
solution is to use finite automata for verification.
Finite Automata
Finite automata is a state machine that takes a string of symbols as input and changes its
state accordingly. Finite automata is a recognizer for regular expressions. When a
regular expression string is fed into finite automata, it changes its state for each literal.
If the input string is successfully processed and the automata reaches its final state, it
is accepted, i.e., the string just fed was said to be a valid token of the language in
hand.
The mathematical model of finite automata consists of:
Finite set of states (Q)
Finite set of input symbols (Σ)
One Start state (q0)
Set of final states (qf)
Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a finite set of input
symbols (Σ), Q × Σ ➔ Q
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
States : States of FA are represented by circles. State names are of the state is
written inside the circle.
Start state : The state from where the automata starts, is known as start state.
Start state has an arrow pointed towards it.
Intermediate states : All intermediate states has at least two arrows; one
pointing to and another pointing out from them.
Final state : If the input string is successfully parsed, the automata is expected
to be in this state. Final state is represented by double circles. It may have any
odd number of arrows pointing to it and even number of arrows pointing out
from it. The number of odd arrows are one greater than even, i.e. odd =
even+1.
Transition : The transition from one state to another state happens when a
desired symbol in the input is found. Upon transition, automata can either move
to next state or stay in the same state. Movement from one state to another is
shown as a directed arrow, where the arrows points to the destination state. If
automata stays on the same state, an arrow pointing from a state to itself is
drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1. FA =
{Q(q0, qf), Σ(0,1), q0, qf, δ}
Example
How Pleasant Is The Weather?
See this Lexical Analysis example; Here, we can easily recognize that there are five
words How Pleasant, The, Weather, Is. This is very natural for us as we can recognize
the separators, blanks, and the punctuation symbol.
What’s a token?
Tokens in compiler design are the sequence of characters which represents a unit of
information in the source program.
What is Pattern?
A pattern is a description which is used by the token. In the case of a keyword which
uses as a token, the pattern is a sequence of characters.
Lexical analyzer scans the entire source code of the program. It identifies each token
one by one. Scanners are usually implemented to produce tokens only when requested
by a parser. Here is how recognition of tokens in compiler design works-
Lexical Analyzer skips whitespaces and comments while creating these tokens. If
any error is present, then Lexical analyzer will correlate that error with the source file
and line number.
#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
Examples of Tokens created
Lexeme Token
int Keyword
maximum Identifier
( Operator
int Keyword
x Identifier
, Operator
int Keyword
Y Identifier
) Operator
{ Operator
If Keyword
Examples of Nontokens
Type Examples
Macro NUMS
Whitespace /n /b /t
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical
error. Important facts about the lexical error:
Lexical errors are not very common, but it should be managed by a scanner
Misspelling of identifiers, operators, keyword are considered as lexical errors
Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.
The simplicity of design: It eases the process of lexical analysis and the
syntax analysis by eliminating unwanted tokens
To improve compiler efficiency: Helps you to improve compiler efficiency
Specialization: specialized techniques can be applied to improves the
lexical analysis process
Portability: only the scanner requires to communicate with the outside world
Higher portability: input-device-specific peculiarities restricted to the lexer
Advantages of Lexical analysis
Lexical analyzer method is used by programs like compilers which can use
the parsed data from a programmer’s code to create a compiled binary
executable code
It is used by web browsers to format and display a web page with the help
of parsed data from JavsScript, HTML, CSS
A separate lexical analyzer helps you to construct a specialized and potentially
more efficient processor for the task
You need to spend significant time reading the source program and partitioning
it in the form of tokens
Some regular expressions are quite difficult to understand compared to PEG
or EBNF rules
More effort is needed to develop and debug the lexer and its token descriptions
Additional runtime overhead is required to generate the lexer tables
and construct the tokens
Summary
Types of Parsing
The parsing is divided into two types, which are as follows:
1. Top-down Parsing
2. Bottom-up Parsing
Top-Down Parsing
Top-down parsing attempts to build the parse tree from the root node to the leaf node.
The top-down parser will start from the start symbol and proceed to the string. It
follows the leftmost derivation. In leftmost derivation, the leftmost non-terminal in
each sentential is always chosen.
1. Recursive parsing or predictive parsing are other names for top-down parsing.
2. A parse tree is built for an input string using bottom-up parsing.
3. When parsing is done top-down, the input symbol is first transformed into the
start symbol.
The top-down parsing is further categorized as follows:
1. With Backtracking:
Brute Force Technique
2. Without Backtracking:
Recursive Descent Parsing
Predictive Parsing or Non-Recursive Parsing or LL(1) Parsing or Table Driver
Parsing
Bottom-up Parsing
Bottom-up parsing builds the parse tree from the leaf node to the root node. The
bottom-up parsing will reduce the input string to the start symbol. It traces the
rightmost derivation of the string in reverse. Bottom-up parsers are also known as
shift-reduce parsers.
1. Shift-reduce parsing is another name for bottom-up parsing.
2. A parse tree is built for an input string using bottom-up parsing.
3. When parsing from the bottom up, the process begins with the input symbol and
builds the parse tree up to the start symbol by reversing the rightmost string
derivations.
Generally, bottom-up parsing is categorized into the following types:
1. LR parsing/Shift Reduce Parsing: Shift reduce Parsing is a process of parsing a
string to obtain the start symbol of the grammar.
LR(0)
SLR(1)
LALR
CLR
2. Operator Precedence Parsing: The grammar defined using operator grammar is
known as operator precedence parsing. In operator precedence parsing there should
be no null production and two non-terminals should not be adjacent to each other.
1 Top down parsing
There are two important styles of parsing: top-down and bottom-up. In actuality, very
many parsing algorithms have been proposed, not all of which fit neatly into one of
these categories. But the top-down/bottom-up distinction fits the parsing
algorithms in CS143 very well, andis very helpful for understanding parsing
algorithms.
Top-down parsing can be thought of as expanding a parse tree or derivation from the
sentence symbol until the frontier matches the desired input string (or until it
becomes clear that there is no way to make them match). Obviously, there are in
general an infinite number of trees that can be expanded. The tree that is expanded
depends on which productions areused to replace the nonterminals in the frontier
of the tree at each step of the [Link] differences among top-down parsing
methods are in the methods used to choose this production.
To introduce the idea of top-down parsing, let’s temporarily disregard the exact
method useto pick the next production to expand. Instead, we’ll do it by
making a “lucky guess” at each step about which production to expand. Later,
instead of guessing, we’ll look up what to do in a table.
At any point during the parse, the state of the parser is characterized by two items: the
remaining input and the stack. Once you know the values of these variables, you
know what will happen with the rest of the parse. The input contains a sequence of
terminal symbols.
Its initial value is the input string to be parsed, x, with the leftmost symbol
first. The stack contains terminal and nonterminal symbols. Initially, the stack
contains one symbol, which is S, the sentence symbol of the CFG.
Intuitively, the stack stores a partially expanded parse tree. The top of the stack
is a prediction about what the parser will see next. If the top symbol is a
terminal, it must match the next symbol in the input stream. If the top symbol
is a nonterminal, there must be some way to expand it to a string that matches
the beginning of the input (top-down parsing is sometimes called “predictive
parsing”).
Reflecting the discussion of the previous paragraph, there are two basic actions
that occur during parsing. When there is a terminal symbol on top of the stack,
it is matched with the next symbol on the input. Matching compares the two
symbols; if they are not equal, the input string is rejected, meaning that the
input string is not in the language of the CFG. If the symbols match, the top
symbol is popped off of the stack an→d the input symbol is removed from the
front
of the input. When there is a nonterminal A on top of the stack, it is expanded by
choosing a production A α from the CFG, popping A from the stack, and
pushing the symbols in α, rightmost first, onto the stack (so that the leftmost
symbol of α ends up on top of the stack). (Note: if α = ϵ, expanding has the
effect of simply popping A off of the stack.)
Example
S → AS |
BA → a
B → b
Here are the steps of the parse. $ represents “end of file” and “bottom of
stack.” The actionsexplain how we get from one step to the next. The top of the
stack is drawn on the left. Whether we expand or match is determined by
whether the top symbol is a nonterminal orterminal; if it’s a nonterminal, we
have to guess which production with that symbol on the left-hand side should
be expanded. At the end of the parse, we accept the input string if theinput and
stack are both empty. If there is a mismatch, we reject
parse (top) stack action
aab$ S$ expand S → AS
aab$ AS$ expand A → a
aab$ aS$ match
ab$ S$ expand S → AS
ab$ AS$ expand A → a
match
ab$ aS$
expand S → B
b$ S$
b$ B$ expand B→ b
b$ b$ match
$ $ accept
If this parse algorithm accepts, we can be sure that the input is in the language
of the CFG. If the parse does not accept, the string may or may not be in the
language. However, if we assume that the best possible guess is made at each
step, the parsing algorithm will accept if it is possible to do so, and does not
accept exactly when the input is not in the language.
A derivation and parse tree can be reconstructed from the history of expansions in
the parse. The derivation in this case is
S =⇒ AS =⇒ aS =⇒ aAS =⇒ aaS =⇒ aaB =⇒ aab.
This is a leftmost derivation because we always expanded the top symbol on the
stack, whichwas the leftmost symbol of the string derived up to that point.
LL(1) parsing
LL(1) parsing is very efficient (the running time is linear in the length of the
input string,with a very low constant factor). There is a tradeoff, however. Not
all CFGs can be handled by LL(1) parsing. If a CFG cannot be handled, we
say it is “not LL(1).” There is an LL(1) parse table generation algorithm that
either successfully builds a parse table (if the CFG is LL(1)) or reports that the
CFG is not LL(1) and why.
The LL(1) parse table construction is somewhat involved, so it will take a little
while to go through all the steps. Before doing so, we first describe a
transformation that increases thechances that a CFG will be LL(1).
A → Aα1
A → Aα2
A → β1
A → β2
A →
β1B A
→
β2B B →
ϵ
B →
α 1B B →
α 2B
Example
Consider the unambiguous expression grammar of the previous lecture, which had
the left- recursive productions
E → E+T
E →T
which can be rewritten as E E + T T . In turn, this is E T ( + T )∗, which can
be → | →
re-expanded right recu rsively i nto
E → TA
A → + TA A → ϵ
IMPORTANT NOTE: A real understanding of this method requires a hands- on
[Link] generating some strings from each set of productions and see why
they do the same thing. Try working through this on the whole grammar. Try
making up some other grammars and trying it. Otherwise, you won’t learn it!
Left factoring
A CFG has common left factors if the right-hand sides of two productions with
the same left-hand symbol start with the same symbol. The presence of
common left factors CFGs sabotages LL(1) parsing.
A → αβ A
→ αγ
We can convert these to EBNF, also: A → α(β | γ), and convert back:
A →
αBB
→β
B → γ
Example
E → T+E
E →T
E → TA A → +EA → ϵ
A CFG may still not be LL(1) even after eliminating left recursion and left
factors, but it
certainly will not be LL(1) if they are not eliminated.
Before discussing in detail the construction the LL(1) parse tables, let’s seen an
example of how the parsing algorithm uses one. This is the grammar that
results when left recursion is removed from the unambiguous expression
grammar, as described above. We will call this our LL(1) expression grammar.
It will be referred to frequently below.
E → TA
A → + TA |
ϵ T→ FB
B → ∗FB |
ϵF → (E) | a
a + ∗
( ) $
E E → E →
TA TA
T T → T →
FB FB
F F →a F →
(E)
A A → + A → A →
TA ϵ ϵ
B B→ϵ B → B → B →
∗FB ϵ ϵ
The parsing algorithm is as before, except that when the top symbol on the stack is
a nonterminal, we don’t guess. Instead, we look in the table, in the row for the top-
of-stack symbol and column for the next input symbol (which can be $ if the entire
input has been consumed). The table entry uniquely determines which production
to expand.
Here is what happens when we parse the string a + a ∗ a using this table.
E$ a+a*a$ expand E TA
TA$ a+a*a$ expand → T
FB FBA$ →
a+a*a$ expand F a
aBA$ a+a*a$ match
BA$ +a*a$ expand B→ ϵ
A$ +a*a$ expand A→ +TA
+TA$ +a*a$ match
TA$ a*a$ expand → T
FB FBA$ →
a*a$ expand F a
aBA$ a*a$ match
BA$ *a$ expand B→ ∗ FB
*FBA$ *a$ match
FBA$ a$ expand F→
a aBA$ a$ match
BA$ $ expand B→ ϵ
A$ $ expand A→ ϵ
$ $ accept
LL(1) Parse Table Construction
The basic idea behind the LL(1) parse table is to find the set of terminal symbols
that can appear at the beginning of a string expanded from a production. If the
nonterminal on top of the stack is A and the next input is a, a→nd there a→re
two productio→ns A α and A β, we choose A α if α can expand to
something
beginning with a. The choice must be unique; if β also expands to a string
beginning with a, the grammar is not LL(1) (LL(1) parsing won’t work, so we
either have to change the grammar or use a different parsing algorithm).
Constructing the LL(1) parse table requires some preliminary definitions and
[Link] first is the set of nullable nonterminals. To compute them,
we need the more general concept of a nullable string:
[Link] is an example:
S → ABC
A → ϵ
B → ϵ
C → AB
The next function that needs to be computed is the FNE set. FNE stands for
“First, no ϵ”. It is unique to this course, but I think it is easier to understand
than the standard approach, which I will also describe. FNE (α) is the set of
terminal symbols which can appear at the beginning of a terminal string
derived from α.
The domain of FNE consists of the set of all nonterminals and suffixes (i.e.,
tails) of strings appearing on the right-hand sides of productions, but the
codomain of FNE consists of the subsets of Σ (i.e., FNE (α) is a set of
terminal symbols). As with Nullable, we start witha “small” initial value for
FNE , the function that maps every string to the empty set, andapply a set of
rules in no particular order until no rules can change FNE .
Rule 2 below is partially obvious: the FNE set of a string X1X2 . . . Xn always
includes the
FNE of X1. However, it may include the FNE of X2 and later symbols,
also: If X1 is
nullable, there is a derivation X1X2 . . . Xn =⇒∗ ϵa . . . = a . . ., where X1
“expands” into the
empty string, so the first terminal symbol actually derives from X2. However,
even if X1 is nullable, it may end up contributing terminals to the FNE of X1X2 .
. . Xn because there may also be derivations where X1 expands to a non-empty
string (nullable symbols may expand to ϵ, but they may expand to other strings
as well). Clearly, if X1 . . . Xk are all nullable, the first terminal may come from
Xk+1. The first rule also implies that FNE (ϵ) = ∅.
F a rule 3
FB a rule 2
T a rule 3
TA a rule 2
(E) ( rule 2
F ( rule 3
FB ( rule 2
T ( rule 3
TA ( rule 2
E a, ( rule 3
Note: You should work through these in detail to see how the rules work, and try
anotherorder to see if you get the same result.
The resulting FNE sets for the nonterminals in the grammar are:
E { a, ( }
T { a, ( }
F { a, ( }
A {+ }
B {* }
Note: You should check the definition of FNE above and re-inspect the LL(1)
expressiongrammar to see why this makes sense.
This CFG didn’t use rule 2 in its full generality. Here is a simple example
that does.
S → ABC
A → aA |
ϵB → b | ϵ
C → c|d
FNE (ABC) includes terminals from A but also from B (because A is nullable)
and C (because B is nullable). Suggestion: Show for each of member of FNE
(S) that there is a way to derive a string from S beginning with that
terminal.
I will also describe the “standard” approach that appears in all textbooks on
the subject,in case you need to talk to someone else about this material who
didn’t take this class. The standard approach is not to define FNE sets, but
FIRST sets. The only difference is that FIRST (α) includes ϵ if α is nullable.
Rules for computing FIRST can be given that are very similar to the rules for
FNE. FIRST sets are defined the way they are because the concept generalizes
to LL(k) parse table construction for k > 1. But we don’t care about that, and
inclusion of ϵ seems to lead to confusion, so we use FNE sets, instead.
The last set that needs to be defined before constructing the LL(1) parse table is
the FOL- LOW set for each nonterminal. The FOLLOW sets are only used for
nullable productions. Recall that the LL(1) parse table selects a production
based on the nonterminal on top of the stack and the next input symbol. We
want to choose a production that is consistent with
the next input. FNE sets tell us what we want to know if the next terminal is
going to be derived from the top nonterminal on the stack (we just pick the
production whose right-handside has the next input in it’s FNE set). But,
suppose the nonterminal on top of the stack “expands” to ϵ! Then the next
nonterminal is going to come from some symbol after that nonterminal. The
FOLLOW sets say exactly which terminals can occur immediately aftera
nonterminal.
1. $ ∈ FOLLOW (S)
3. FOLLOW (B) ⊇ FOLLOW (A) when A → αBβ appears in the CFG and
β is nullable.
Let’s compute the FOLLOW sets for the LL(1) expression grammar.
String Added
Reason E
$ rule 1
E ) rule 2 (F → (E))
T + rule 2 (E → TA)
F * rule 2 (T → FB)
A $,) rule 3 (E → TA, ϵ
nullable) T $,) rule 3 (E → TA,
A nullable)B +,$,) rule 3 (T → FB,
ϵ nullable) F +,$,) rule 3 (T → FB,
B nullable)
The resulting FOLLOW sets are:
E { $, ) }
T { +, $, ) }
F { ∗, +, $, ) }
A { $, ) }
B { +, $, ) }
The LL(1) parse table is a two-dimensional array, which we will call Table.
The rows are indexed by nonterminals (for what is on the top of the stack) and
the columns by terminals (the next input symbol). Given A on the top of the
stack and a next in the input, we needto choose a production A → α to expand.
Table[A, a] should be A → α only if there is someway to expand A → α that
can match a next in the input.
There are two ways that expanding A → α can expand to something that matches
a. One way is if α expands to something beginning with a, so we set Table[A, a]
= A → α whenever a ∈ FNE (α). The other way is if α expands to ϵ, and
some symbol after α expands to a string beginning with a, so we also set
Table[A, a] = A → α when α is nullable and a ∈ FOLLOW (A).
The LL(1) parse table for the LL(1) expression grammar appears in full,
above, but let’s look at a few examples. E → TA appears in Table[E, a]
because a ∈ FNE (TA). The only productions with nullable right-hand
sides in this grammar are A → ϵ and B → ϵ. Table[B, +] = B → ϵ because +
∈ FOLLOW (B).
There is one more extremely important point to make about the LL(1) parse
table: If the rules above set Table[A, a] to two different productions, the parse table
construction fails. In this case, the CFG is not LL(1). LL(1) parsing demands
that we be able to choose exactlythe next production to expand based solely on
whether it is consistent with the next input symbol. So the LL(1) parse table
construction not only builds parse tables, it is a test for whether a CFG is LL(1)
or not.
What about the entries that have nothing in them? They are error entries: if the
parser ever looks at them, the input string can be rejected immediately. There
is no way to expandanything that can match the next input.
In spite of its limitations, LL(1) parsing is one of the two most widely used parsing
algorithms. The parsers can be built automatically, and the parsing algorithm is
easy to understand. It is usually simple to do processing associated with
individual grammar rules, during the parsing process (this is called syntax-
directed translation). Furthermore, it can also be used as the basis for simple
hand-written parsers, which allow a great deal of flexibility in handling special
cases and in error recovery and reporting (see below). However, LL(1) parsing
has
some limitations that can be annoying. One of them is that it is not as general
as the other common parsing algorithm, so there are some common
grammatical constructs it cannot handle. Another is the requirement to
eliminate left recursion and left factors, which can require rearranging a
grammar in ways that make it less clear.
LL(1) parsing and recursive descent
Recursive descent parsing is very widely used, because it requires no special parser
generation tools, it can be extended in ad hoc ways (for example, looking ahead to
several inputs, or looking at other context, when the next input does not uniquely
determine the production choice), and it allows the user great freedom in
generating error messages and doing error recovery.
While substantially different from the previous code, this is still basically the
same thing (it chooses whether to parse F or not at each point based on one
lookahead symbol.
2 Bottom-up parsing
The other major class of parsing methods are the bottom-up algorithms. As the
name suggests, bottom-up methods work by building a parse tree from the leaves
up. This involves reading the input until the right-hand side of a production is
recognized, then reducing the production instead of expanding productions until
they match the input.
The bottom-up algorithms we will study are all shift-reduce algorithms. Shift-
reduce parsinguses the same data structures as LL parsing: a stack and the
input stream. However, the stack is used in a different way. Terminal symbols
are shifted onto the stack, that is, a symbol is removed from the beginning of
the input and pushed onto the stack. If a sequenceof symbols on the top of the
stack matches the right-hand side of some production, they can be reduced by
popping them and then pushing the symbol from the left-hand side of the same
production. The parse is successful when the entire input has been shifted on
the stack and reduced to the sentence symbol of the grammar.
As with LL parsing, there are choices to be made during this algorithm: there
may be several productions whose right-hand sides match the stack at any time.
Which one shouldbe reduced? As with LL parsing, the choice is made by doing a
table lookup with information from the current state of the parse.
S → AB
A → a
B → b
The states of the automaton (which we will call the LR(0) state machine)
consist of LR(0) items. An LR(0) item is a production with a position
marked in it.
The finite-state automaton in this case is called the LR(0) machine. Each state
of the LR(0) machine is a set of items. If two states have the same set of items,
they aren’t two states – they are the same state. Intuitively, the LR(0) machine
keeps track of the productions that might eventually be reduced when more
stuff is pushed on the stack. The positions keep track of how much of the
production is already on the stack.
For convenience, the first step of any LR parsing algorithm is to add a new
sentence symbol
SJ and a new production SJ → S to the CFG. Let’s use the first example CFG
above.
SJ → S
S →
(S)
S → a
The first state starts with the item that says: “we are parsing a sentence, and we
haven’t seen anything yet:” SJ → •S.
The construction of a state starts with a kernel, which is a core set of items.
The kernel of the our first state is SJ → •S. The kernel is extended via the
closure operation. The closure
operation takes into account that whenever we are in the middle of an
→i t e m • A α Bβ (meaning “I might eventually be able to →reduce A αBβ, and
α is already on the stack”),it could be that the parser will see something that
can be reduced to B. The items that reflect this are those of the form B → •γ
(meaning “I might eventually be able to reduce B → γ, and I haven’t seen
anything in γ yet”). These are called closure items.
Definition 12 The LR(0) closure step adds all items of the form B → •γ to a
state when- ever the state contains an item A → α • Bβ.
There are two important points to make about the closure step:
•
When a closure item begins with a nonterminal, adding it to the state
may causeadditional closure items to be added.
•
The state is a set of items, which means it has no duplicates – “adding”
items thatalready there has no effect.
A box is drawn around the state to indicate that the closure operation has been
performed.
To complete the state machine, we need to define the transitions between the
states. This is done by the goto function. If our state is q, whenever there is at
least one item of the form A → α • Xβ, where X is a terminal or nonterminal,
goto(q, X) is defined. It is theset of all items A → αX • β where A → α • Xβ is
an item in q. Important note: Never do this to items like A → •ϵ, which is a
reduce item. The goto function doesn’t apply to such
productions, because ϵ is contains neither terminal nor nonterminal symbols. I
would love tohave a quarter where no students add transitions from A → •ϵ to
A → ϵ•.
For each symbol X, goto generates the kernel of a successor state to q. There are
severalimportant points to notice:
•
For a given symbol X, goto operates on all of the items where X is the
next [Link] is only one successor on each X.
• If qJ = goto(q, X), all of the items in qJ are of the →f o rm A• αX •
β. I.e., the
is always
immediately after an X.
Once the kernels of the successor states have been generated, the closure operation
is applied to each to complete the set of items in the state. Given an LR(0) state, it
is possible to determine which of its items were in its kernel and which were
introduced by closur•e by checking whether the is at the beginning of the item
→ •
(closure) or not (kernel) (the oneexception to this r ule is the item that started
everything, SJ S, which is introduced by neither operation).
If, after applying closure, the set of items is exactly the same as some other set
of items, the two states are actually the same state. (An optimization to this is
to look at whether the kernel generated by goto is the same as the kernel of an
existing state, in which case thestates can be merged before wasting a closure
operation.)
Let’s apply the goto operation to the one state we have generated so far (let us
call it 0). There are three symbols of interest: “S”, “(”, and “a”. In each case,
there is only one item with that symbol next.
Let’s consider state 2 in more detail. The state was first generated by g{ot o→(0, J
(J) = S
(•S)}; this is the kernel of state 2. Because of the .•. . S . . . in the kernel item, closure
adds ite→ms• S (→S) •and S a, completing the state. goto(2{,J (→J ) ) •als}o
generates S ( S) , so at this point, we know we are going to state 2
(looping back to it, actually). Or we waituntil we generate the closure of our
“new” state, and then notice that we have exactly the same items we had in
state 2 and merge the states. State 2 is different from state 0 becausethe kernel
items are different (even though the closure items are the same).
An item of the form →A • α is a reduce item. It indicates that, when this state is
reached, the pro→duction A α should be reduced (α will be guaranteed to be
on top of the stack if theparser gets to this state). →Red•uci ng the item SJ
S accepts
→ •
the input s t r i n g . Important: an item like A ϵ is a reduce item; since the
RHS
is the empty string, position 0 is theend of the item.
Unlike the example of shift-reduce parsing, an LR(0) parser does not actually
shift symbolsonto the stack. Instead, it shifts states. No information is lost because
of the special structure of the LR(0) machine. Notice that every transition into a
state
has exactly the same symbol labelling it. If you see a state q on the stack, it is as
though a were shifted onto the [Link] stack will also have states corresponding
to nonterminal symbols.
In more detail, the LR(0) parsing algorithm starts with the first state (0 in
our example)and executes the following steps repeatedly:
shift If the next input is a and there is a transition on a from the top state on
the stack(call it qi) to some state qj, push qj on the stack and remove a
from the input.
reduce If the state has a reduce item A → α•
1. Pop one state on the stack for every symbol in α (note: symbols
associated withthese states will always match symbols in α).
2. Let the top state on the stack now be qi. There will be a transition in
the LR(0)machine on A to a state qj. Push qj on the stack
error If the state has no reduce item, the next input is a, and there is no
transition on a, report a parse error and halt.
accept When the item →S J S• is reduced accept if the next input symbol is
$, otherwise report an error and halt. (This rule is a bit weird. The
remaining LR-style parsing algorithms don’t need to check if the input
is empty. We define it this way so LR(0) parsing can do our simple
example grammar.)
LR(0) parsing requires that each of these steps be uniquely determined by the
LR(0) machineand the input. Therefore, if a state has a reduce item, it must not
have any other reduce items or shift items. With this restriction, the current
state determines whether to shift or reduce, and which production to reduce,
without looking at the next input. If it shifts, it can read the next input to see
which state to shift.
SLR(1) parsing
0 SJ → S
1 S → Aa
2 S → Bb
3 S → ac
4 A → a
5 B → a
The productions are numbered because the numbers are used in the parse table
construction below.
S → •
ac
A → •a
B → •a
3
S → A a S →
• Aa•
a
5
S b
→ B • S →
b Bb•
S → a•c
A → a•, {a}
B → b•, {b}
Since the lookahead sets for each shift item are disjoint from each other, and
disjoint from the symbols that can be shifted, this state has no SLR(1) conflicts.
The parse table for SLR(1) has the same format as for the two other shift-
reduce parsing algorithms that are going to be discussed, LR(1) and LALR(1).
The parse table consists of two two-dimensional arrays: an ACTION table and
a GOTO table. Here is the SLR(1) parse table for the CFG above.
ACTION GOTO
a b c $ S A B
0 s6 1 2 4
1 acc
2 s3
3 r1
4 s5
5 r2
6 r4 r5 s7
7 r3
The rows of both tables are indexed by the states of the LR(0) machine. The
columns of theACTION table are indexed by terminal symbols and the end-of-
file marker $. The columns of the GOTO table are indexed by nonterminals.
The entries of the ACTION table are shift actions, of the form sn, where n is
index of an LR(0) state, or rp, where p is the index of a production in the CFG.
At each step, the parser looks in ACTION[q, a], where q is the current LR(0)
state and a is the next input [Link] the entry is “sn,” it shifts state number n
onto the stack. If the entry is “rp,” it reduces production p. If there is no entry
in the ACTION table, a parse error is reported (the inputis not in the language
of the CFG).
The GOTO table gives the next-state transition function for nonterminals
(given a current LR state and a nonterminal, it gives the next LR state). It is
used during reductions: after popping states for the right-hand side of a
production, it designates the state to be pushed for the left-hand side symbol.
Empty entries in the GOTO table are never referenced, even if the input string
is not parse-able (if there were a parse error, it would have been caught earlier
when an empty entry of the ACTION table was referenced).
As with LL(1) parsing, all LR parsing algorithms require that the table
uniquely determinethe next parse action: there must be at most one action in
each entry of the parse table, or the CFG cannot be handled by the parsing
algorithm, in which case the CFG is said tobe “not SLR(1)” (or LR(1) or
LALR(1)). When there are multiple actions in a table entry, there is said to
be a conflict. There can be shift/reduce conflicts (if there is a shift and a
reduce action in the same table entry), or reduce/reduce conflicts (if there are
several reduce actions in the same table entry). There is no such thing as a
shift/shift conflict, because the
LR machine goto operation ensure that each state has at most one successor.
Conflicts onlyshow up in the ACTION table, not the GOTO table.
The above table has no conflicts. The most interesting row is for state 6 (the
one with the LR(0) conflicts). Observe that the row has two different reduce
actions and a shift action, yetthey do not conflict because they are in
different columns (because their lookahead symbolsare disjoint).
LR parse algorithm
When production p is reduced, the parser looks up the production, and pops
one state off ofthe stack for each symbol on the right-hand side of
production
p. Then it looks up an entry in the GOTO table. The columns of the GOTO
table are indexed by nonterminal symbols, and the entries in the table are the
indexes of LR(0) states. The parser pushes onto the parse stack the state in
GOTO[n, A], where n is the LR(0) state on top of the stack (after popping
one state for each right-hand symbol), and A is the nonterminal on the left-
hand side of production p.
1. Let q to the top state on the stack, let a be the next input symbol, and
let act be
ACTION [q, a].
2. If act is “sn”, push n on the stack and remove a from the input;
(a) Pop length(α) states off of the stack. Let qJ be the top of stack symbol
immediatelyafter doing this.
(b) Push GOTO[qJ, A] onto the stack;
4. Else, if act is “acc”, accept the input;
5. Else, if act is “error” (an empty entry), report an error (the input string
is not in thelanguage of the CFG).
The “accept” action only ever appears in the column $, so it will only be found
when theinput has been exhausted. Unlike LR(0) parsing, it is not necessary to
have a separate checkto see if the input is empty before accepting.
As an example, let us parse the input “ab” using our CFG and table.
Stack input
(top)
0 ab$
$
06 b$
$a
04 b$
$B
045 $
$Bb
01 $
$S
accept
The definition GOTO table is quite simple: if there is a transition from state
q to state n
on nonterminal A in the LR(0) machine, set GOTO[q, A] = n.
You should reconstruct the table above to see how the rules apply to it. We also
recommendthat you try doing the SLR(1) parse table construction for the CFG
S → Sa | ϵ.
step.
Construction of GOTO graph-
State I0 – closure of augmented LR(0) item
Using I0 find all collection of sets of LR(0) items with the help of DFA
Convert DFA to LR(0) parsing table
Construction of LR(0) parsing table:
The action function takes as arguments a state i and a terminal a (or $ , the input
end marker). The value of ACTION[i, a] can have one of four forms:
1. Shift j, where j is a state.
2. Reduce A -> β.
3. Accept
4. Error
We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii , A]
= Ij then GOTO also maps a state i and a nonterminal A to state j.
Eg: Consider the grammar S ->AA A -> aA | b Augmented grammar S’ -> S S -> AA
A -> aA | b The LR(0) parsing table for above GOTO graph will be
–
Action part of the table contains all the terminals of the grammar whereas the goto
part contains all the nonterminals. For every state of goto graph we write all the goto
operations in the table. If goto is applied to a terminal then it is written in the action
part if goto is applied on a nonterminal it is written in goto part. If on applying goto a
production is reduced ( i.e if the dot reaches at the end of production and no further
closure can be applied) then it is denoted as Ri and if the production is not reduced
(shifted) it is denoted as Si. If a production is reduced it is written under the terminals
given by follow of the left side of the production which is reduced for ex: in I 5 S->AA
is reduced so R1 is written under the terminals in follow(S)={$} (To know more about
how to calculate follow function: Click here ) in LR(0) parser. If in a state the start
symbol of grammar is reduced it is written under $ symbol as accepted.
NOTE: If in any state both reduced and shifted productions are present or two
reduced productions are present it is called a conflict situation and the grammar is not
LR grammar.
NOTE:
1. Two reduced productions in one state – RR conflict.
2. One reduced and one shifted production in one state – SR conflict. If no SR or RR
conflict present in the parsing table then the grammar is LR(0) grammar. In above
grammar no conflict so it is LR(0) grammar.