Compiler Construction
A SIMPLE SYNTAX-DIRECTED TRANSLATOR
Introduction, CFG and Parse Tree, Grammar for Arithmetic Expressions, Syntax-directed Translation
Introduction
As explained in the previous chapter, there are two parts to compilation: analysis and synthesis. the
analysis part of a compiler breaks up a source program into constituent pieces and produces an
internal representation for it, called intermediate code. The synthesis part translates the intermediate
code into the target program. The analysis part is called the compiler front-end and synthesis part is
called the compiler back-end. Following is a simplified view of the compiler front-end.
Fig 1. A simplified model of a compiler front-end
Analysis is organized around the syntax of the source language. The syntax of a programming
language describes the proper form of its programs, while the semantics of the language defines what
its programs mean; that is, what each program does when it executes. For specifying syntax, we use
context-free grammars (or simply grammar) notation.
Let us first talk about the syntax tree. In general, a tree is a data structure consisting of one or more
nodes (or vertices). The highest (top) node is the root node. All nodes except the root have a unique
parent node; the root has no parent. A parent node may have one or more child nodes (or children).
The children of same node are termed as siblings. A node with no children is called a leaf node. Nodes
between root and leaves are commonly known as interior nodes.
A syntax tree basically is of two types:
• Abstract syntax tree
• Concrete syntax tree (often called the Parse tree or Derivation tree)
Abstract syntax tree (or simply a syntax tree) is a tree whose root node and the interior nodes
represent operators (or operations) and whose leaf nodes represent the operands.
The construction of syntax tree for simple arithmetic expressions, assignment statements and other
statements like if-else, for-loop, while-loop etc. has already been discussed in the class.
Concrete syntax tree (or Parse tree) is a tree that pictorially shows how the start symbol of a
grammar generates a string of the language. A grammar derives strings by beginning with the start
symbol and repeatedly replacing a nonterminal by the body of a production for that nonterminal.
A parse tree, according to the grammar, is a tree with the following properties: (1) The root is labeled
by the start symbol, (2) Each leaf is labeled by a terminal or by ε, (3) Each interior node is labeled by
a nonterminal.
1
Compiler Construction
Context-Free Grammar and Parse Tree
A context-free grammar (or grammar) describes the hierarchical structure of most programming
language constructs. For example, we can write a rule for the syntax of if-else statement as follows:
statement → if ( expression ) statement else statement
Here, arrow → means “can have the form”. Using the variable “expr” to denote an expression and
the variable “stmt” to denote a statement, this syntax rule can be expressed as:
stmt → if ( expr ) stmt else stmt
Such a rule is called a production. In a production, lexical elements like the keyword if and the
parentheses are called terminals. Variables like expr and stmt represent sequences of terminals and
are called nonterminals. In general, a grammar has four components: (1) a set of terminals, (2) a set
of nonterminals, (3) a set of productions, and (4) a start symbol.
• Terminals are the basic or elementary symbols of the language defined by the grammar.
• Nonterminals are syntactic variables that define a language construct.
• The productions or rules of a grammar specify the manner in which the terminals and
nonterminals can be combined. Each production or rule has three parts:
o a nonterminal called the head or left-side of the production
o an arrow symbol “→”, that is read as “can have the form”.
o a body or right-side of the production having zero or more terminals and/or nonterminals.
• Start symbol is the first nonterminal from which a grammar starts.
Ex.1 » Consider the grammar: S→AB
A→a│aA
B→b│bB
Let the input string be “aaaabb”.
The derivation of this string from the given grammar is
illustrated by the parse tree in Fig 2.
Each node in the tree is labeled by a grammar symbol.
From left to right, the leaves of a parse tree form the yield
of the tree which is the string generated or derived from the
nonterminal at the root of the parse tree. In the present
example, the yield is “aaaabb”.
Another definition of the language generated by a grammar
is: the set of strings that can be generated by some parse
trees. The process of finding a parse tree for a given input
string is called parsing that string.
Fig 2. Parse tree for “aaaabb”
Work out: Construct a parse tree for the input string “a = b = c” using the following grammar.
right → letter = right │ letter
letter → a │ b │ c │ … │ z
2
Compiler Construction
Grammar for Arithmetic Expressions
Let us see an important grammar that represents arithmetic expressions in infix form. We discuss
three slightly different grammars according to the operators involved.
(1) G1: Grammar for arithmetic expressions involving plus and minus operators.
expr → expr + digit │ expr ‒ digit │ digit
digit → 0 │ 1 │ 2 │ … │ 9
(2) G2: Grammar for arithmetic expressions involving four arithmetic operators: +, ‒, ∗, /.
expr → expr + term │ expr ‒ term │ term
term → term ∗ digit │ term / digit │ digit
digit → 0 │ 1 │ 2 │ … │ 9
(3) G3: Grammar for arithmetic expressions involving four arithmetic operators and parenthesis.
expr → expr + term │ expr ‒ term │ term
term → term ∗ factor │ term / factor │ factor
factor → ( expr ) │ digit
digit → 0 │ 1 │ 2 │ … │ 9
Examples of language strings for each of the above grammar:
➢ The grammar G1 accepts the language strings such as: 1, 2+3, 9‒5+2 etc.
➢ The grammar G2 accepts the language strings such as: 1, 2+3, 9‒5+2, 9‒5∗2, 8+4/2∗3‒0 etc.
➢ The grammar G3 accepts the language strings such as: 1, 2+3, 9‒5+2, 9‒5∗2, (9‒5)∗2 etc.
Translation of Arithmetic Expressions from Infix Form to Postfix Form
Besides specifying the syntax of a language, a context-free grammar can be used to help guide the
translation of programs. In this section, we introduce a grammar-oriented compiling technique known
as syntax-directed translation. For simplicity, we consider the syntax-directed translation of infix
expressions to postfix form, a notation in which operators appear after their operands. For example,
the postfix form of the expression 9‒5+2 is 95‒2+.
Our ultimate objective is to create a syntax-directed translator that maps infix arithmetic expressions
into postfix expressions.
Before we delve into the details, let us check our ability to do infix-to-postfix translations manually.
Infix form Postfix form 9-5*2+3/7
9-5+2 95-2+
9 - 5 * 2 + 3 / 7
9-(5+2) 952+- 9 - 52* + 3 / 7
9-5*2 952*-
9-5*2+3/7 952*-37/+ 9 - 52* + 37/
9 - 52* + 37/
6+(5-2)*(8/3)+1 652-83/*+1+
(2+3)/5-(7+2)*9 952*- + 37/
5*3/(6-2)/8/(4*1) 952*-37/+
3
Compiler Construction
To achieve the objective of building infix-to-postfix translator, there are three main steps:
Step 1: Specify the source language in terms of grammar. Here, the source language is the arithmetic
expressions in infix form.
Step 2: Define the target language formally. Here, the target language consists of the postfix form of
the arithmetic expressions.
Step 3: Specify translation from source language to target language. That is, lay out the design or
procedure to translate/convert infix expression into postfix form.
◈◈◈
Writing the grammar for infix expressions: As well as step 1 is concerned, we have already built a
grammar (e.g., grammar G1) that represents infix arithmetic expressions. So, we move forward to the
next step, the Step 2 (defining the target language, i.e., postfix form).
Defining the Postfix Form: The postfix notation for an expression E can be defined inductively as
follows:
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix
notation for E is E1 E2 op.
3. If E is a parenthesized expression of the form (E1), then the postfix notation for E is the same
as the postfix notation for E1.
Specifying translation from infix to postfix form: This step is termed as syntax-directed translation
and is accomplished through one of the two methods.
• Syntax-directed definition – It attaches semantic rules to the productions in a grammar.
• Syntax-directed translation scheme, or simply Translation scheme – It attaches program
fragments to the productions in a grammar.
◈◈◈
Syntax-Directed Definition (SDD)
A syntax-directed definition associates
1. With each grammar symbol, a set of attributes, and
2. With each production, a set of semantic rules for computing the values of the attributes
associated with the symbols appearing in the production.
An attribute is any quantity associated with a programming construct. For example, data type may be
considered as attribute of an expression. Since we use grammar symbols (nonterminals and terminals)
to represent programming constructs, therefore the notion of attribute is limited to the certain character
such as t (for type, i.e., data type).
To demonstrate the above idea, let us consider grammar G1 with a little bit alteration that the name
“digit” is replaced with the name “term”. Here is the grammar:
expr → expr + term │ expr ‒ term │ term
term → 0 │ 1 │ 2 │ … │ 9
Now, we express the syntax-directed definition for infix-to-postfix translation, using the grammar. We
first associate attribute t with nonterminals. Then, we attach rules to the productions of the grammar.
4
Compiler Construction
These are semantic rules that describe how the attributes are computed at those nodes of the parse
tree where the production in question is used to relate a node to its children.
Fig 3. Syntax-directed definition for infix-to-postfix translation
Note that the symbol || in the semantic rule is the operator for string concatenation. The postfix form
of a digit is the digit itself; e.g., the semantic rule associated with the production term → 9 defines
term.t to be 9 itself whenever this production is used at a node in a parse tree. The other digits are
translated similarly. As another example, when the production expr → term is applied, the value of
term.t becomes the value of expr.t. In the productions expr → expr1 + term, and expr → expr1 ‒ term,
the use of expr and expr1 is to just make distinction between which expr is on the left-hand side and
which is on the right-hand side.
Let us see how to use this syntax-directed definition. For a given input string x, construct a parse tree
for x. Then, apply the semantic rules to evaluate attributes at each node in the parse tree, as follows.
Suppose a node N in a parse tree is labeled by the grammar symbol X. We write X.a to denote the
value of attribute a of X at that node. A parse tree showing the attribute values at each node is called
an annotated parse tree. For example, an annotated parse tree for 9‒5+2 with an attribute t
associated with the nonterminals expr and term, is shown in Fig. 4. The value 95‒2+ of the attribute
at the root node is the postfix notation for 9‒5+2.
Fig 4. Annotated parse tree showing attribute values at nodes
An attribute is said to be synthesized attribute if its value at a parse-tree node N is determined from
attribute values at the children of N and at N itself. For instance, attribute t is a synthesized attribute.
5
Compiler Construction
Synthesized attributes have the desirable property that they can be evaluated during a single bottom-
up traversal of a parse tree. There is another type of attributes called inherited attributes. An inherited
attribute is one whose value at a parse-tree node is determined from attribute values at the node
itself, its parent, and its siblings in the parse tree.
Work out: Consider grammar G2 and build a syntax-directed definition to translate infix expressions
to postfix form. Also draw an annotated parse tree for the input string 9‒5∗2.
Translation Scheme
A syntax-directed translation scheme is a notation for specifying a translation by attaching program
fragments to productions in a grammar. A translation scheme is like a syntax-directed definition,
except that the order of evaluation of the semantic rules is explicitly specified.
Program fragments embedded within production bodies are called semantic actions. The position at
which an action is to be executed is shown by enclosing it between curly braces and writing it within
the production body.
Fig 5. (Syntax-directed) Translation Scheme for infix-to-postfix translation
When drawing a parse tree for a translation scheme, we indicate a semantic action by constructing an
extra child for it, connected by a dashed line to the node that corresponds to the head of the production.
For example, Fig. 6 depicts a parse tree built using the above translation scheme; the parse tree prints
the postfix form 95‒2+ of the infix expression 9‒5+2 provided we perform left-to-right depth-first
traversal of the tree and execute each print statement appearing at a leaf node.
Fig 6. Parse tree showing semantic actions for translating 9‒5+2 into 95‒2+.
Work out: Consider grammar G2 and build a (syntax-directed) translation scheme to translate infix
expressions to postfix form. Draw a parse tree for the input string 9‒5∗2 showing semantic actions for
translating 9‒5∗2 to 952∗‒.
6
Compiler Construction
LAB WORK
Write a C/C++ program to demonstrate the working of getchar() and putchar() functions.
[These functions are later used in a translator program for converting arithmetic expressions from
infix notation to postfix notation.]
PROGRAM LOGIC:
Accept from the keyboard any input followed by a period. The same is echoed as output, like a
typewriter does. The echo operation continues in an interactive manner if no period is entered at the
end of the input.
PROGRAM:
/* typewriter program */
/* This example shows the functionality of getchar() and putchar() */
#include <stdio.h>
#include <iostream>
int main()
{
char c;
puts("Enter text followed by a period: ");
do
{
c = getchar();
putchar(c);
}
while (c != '.');
putchar('\n');
return 0;
}
INPUT & OUTPUT:
Sample 1
Enter text followed by a period:
Input: Pakistan is our dear homeland.
Output: Pakistan is our dear homeland.
Sample 2
Enter text followed by a period:
Input: Programming is fun
Output: Programming is fun
Input: Happiness is the key to success
Output: Happiness is the key to success
Input: well done.
Output: well done.