BCS 324: COMPILER DESIGN &
CONSTRUCTION
Introduction to Compiling – Part B
1. A simple one-pass compiler
2. Lexical analysis
Dr. D. K. Muyobo
I. A simple one-pass compiler
A programming language can be defined
by describing
◦ What its program’s syntax looks like, and
◦ What its statements mean (i.e. semantics)
We use BNF (Backus-Naur Form) or
context-free grammars to specify the
syntax of a language
Then we use informal descriptions to
specify semantics of the language.
In addition to specifying syntax,
◦ Context-free grammar can also be used to guide the translation
of programs
◦ Such grammar-oriented compiling is known as syntax-directed
translation.
We use an example of a compiler to translate infix
expressions into polish postfix notation where:
◦ The lexical analyzer converts the stream of input characters into
a stream of tokens
◦ The token stream becomes input to the syntax-directed
translator.
◦ The syntax-directed translator then produces intermediate code
representation as output, as shown in Fig 1.
3
Syntax-
Character Lexical Token Intermediate
directed
stream analyzer stream representation
translator
Fig.1: Syntax-directed translator
4
Syntax definition
A context-free grammar describes the
hierarchical structure of many programming
language constructs e.g.
◦ The branch statement in C language has the form:
if(expression)statement else statement
◦ It is the concat of opening parenthesis, an expression, closing
parenthesis, statement, else keyword, statement.
◦ We can use variables as follows:
expr – to denote expression, and
stmt – to denote statement.
5
Thus, we have:
◦ stmt if(expr)stmt else stmt
Such a rule is called a production.
Individual lexical units of a production e.g.
if, parenthesis etc. are called tokens.
Variables like expr and stmt represent
sequences of tokens and are called
non-terminals.
6
Components of a context-free grammar
1. A set of tokens known as terminal symbols
2. A set of non-terminals e.g. expr, stmt.
3. A set of productions where each production
consists of a non-terminal called leftside of
the production, an arrow, a sequence of
tokens and/or non-terminals called the
rightside of the production
4. A designation of one of the non-terminals as
the start symbol.
7
The convention followed lists grammars
by their productions listing start symbol
first
Digits and signs such as <= and while etc.
are terminals e.g.
◦ list list + digit
◦ list list - digit
◦ list digit
◦ digit 0|1|2|3|4|5|6|7|8|9
8
The right sides of the 3 productions with
non-terminal list on the leftside can be
grouped equivalently as:
◦ list list+digit|list–digit|digit
Tokens of the grammar are the symbols
+-0123456789
9
Parse trees
Parse trees
◦ Pictorially show how the start symbol of a
grammar derives a string in the language
If non-terminal A has a production
AXYZ such that A is the start symbol,
then its parse tree is
A
X Y Z
Fig.2: Parse tree for non-terminal A
10
Properties of parse trees
1. Root is labeled by a start symbol
2. Each leaf is labeled by a terminal (token)
3. Each interior node is labeled by a non-terminal
4. The children of a node in a parse tree form the
rightside of its production while the statement
symbol forms the leftside e.g. node A with children
XYZ has the production
AXYZ
NB: Parsing is the process of finding a parse
tree for a given string of tokens.
11
Ambiguity
Ambiguity is caused by a string with more
than one parse tree
This is because it will have more than one
meaning
See example in Fig.3 below.
12
(a) (b)
string string
string + string -
string string
- string 2 9
string string + string
9 5 5 2
Fig. 3:Two parse trees for 9-5+2
13
Associativity of operators
9+5+2 (9+5)+2 and
9-5-2 (9-5)-2 and
The operand 5 given a plus operator to its right
and left will first take the left one i.e.
◦ (9+5) and thereafter take +2
This means math operators are all left
associative e.g.
◦ In the second case, the number 5 with a minus
operator on both sides will first take (9-5) and then -
2.
14
On the other hand, the assignment operator (=)
in C language is right associative e.g.
◦ a=b=c a=(b=c)
This statement has the following grammar:
◦ rightletter=right|letter
◦ lettera|b|…|z
NB: A left associative parse tree grows down towards the
left while a right associative parse tree grows down
towards the right as shown in Fig. 4.
15
(a) left associative (b) right associative
list right
list - digit =
letter right
- digit 2 a
list letter = right
digit 5 b letter
c
9
Fig. 4:Two parse trees for (a) 9-5-2 and (b) a=b=c
16
Precedence of operators
When more than one operator is used,
we follow precedence where
◦ The (* and /) operators are higher in
precedence than the (+ and -) operators
meaning that they take their operands first.
17
Constructing a grammar for arithmetic
operators
◦ Constructed from a table showing
associativity and precedence of the operators
◦ We create two non-terminal expressions
expr and term for the two levels of
precedence (i.e. *,/ and +,-)
◦ Then we create a third non-terminal factor
that will be used for generating basic units
such as digits and parenthesized expressions.
factordigit|(expr)
18
This gives us the following productions for the
*, / level
termterm*factor
|term/factor
|factor
Or: termterm*factor|term/factor|factor
And the following productions for the +, - level
exprexpr+term
|expr-term
|term
Or: exprexpr+term|expr-term|term
19
Resulting grammar is as follows
exprexpr+term|expr-term|term
termterm*factor|term/factor|factor
factordigit|(expr)
20
Syntax-directed translation
In order to translate a program, a
compiler needs to keep track of many
quantities of information e.g.
◦ Type of construct, location of first instruction
in the target code, number of instructions
generated etc.
◦ These are called attributes associated with
the constructs.
21
Syntax-directed definitions
◦ Uses grammar to specify the structure of the input
◦ A translation is an input-output mapping in the
following manner:
First construct a parse tree for X.
Suppose a node n in the parse tree is labeled by the grammar
symbol X, then we write X.a to denote the value of attribute
a of X at that node.
◦ A parse tree showing the attribute values at
each node is called an annotated parse
tree e.g.
In Fig.5, the 9-5+2 is translated from infix to postfix
notation, with attribute values attached to nodes.
22
expr.t=95-2+
expr.t=95- + term.t=2
- term.t=5 2
expr.t=9
term.t=9 5
Fig. 5: Annotated parse tree for 9-5+2
23
Exercise
◦ Study Fig.5 carefully and then draw a
corresponding ordinary parse tree for it.
24
Parsing
Definitions
◦ Parsing is the process of determining if a string of
tokens can be generated by a grammar
◦ Parsing may also be defined as the process of finding a
parse tree for a given string of tokens.
NB:
◦ A parser must be able to construct the tree,
otherwise the translation cannot be guaranteed
correct.
25
Parsing methods
◦ There are two main parsing methods,
distinguished by the order in which nodes in
the tree are constructed
Top-down methods
Construction here starts at the root and proceeds towards
the leaves
Bottom-up methods
Construction here starts at the leaves and proceeds
towards the root.
26
Justification for choosing a parsing
method
◦ Top-down methods
May be chosen when efficient parsers need to be
constructed easily
◦ Bottom-up methods
May be chosen when a large class of grammars
need to be generated.
27
Top-down parsing
◦ As an example, we construct a grammar for
subset of Pascal language.
◦ We construct an array, which uses dotdot
token in place of the actual “..” so as to
emphasize that the two dots are treated as
one token
28
type simple
| id
| array [simple] of type
simple integer
| char
| num dotdot num
29
We start with the root and label it with
starting non-terminal, and then
repeatedly perform the following two
steps:
1. At node n, labeled with non-terminal A,
select one of the productions for A and
construct children at n for the symbols on
the right side of the production
2. Find the next node at which a sub-tree is to
be constructed.
30
The current token being scanned in the input is
frequently referred to as the lookahead symbol
Initially, lookahead symbol is the first (leftmost
token) of the string
For example, in the Pascal array
array [num dotdot num] of integer
token array is initially the lookahead symbol
and the known part of the tree is only the root,
labeled with the starting non-terminal type.
NB: The generated tree should match the input string
31
(d) type
(a) type
array [ simple ] of type
(b) type
num dotdot num simple
array [ simple ] of type
(e) type
(c) type array [ simple ] of type
array [ simple ] of type num dotdot num simple
num dotdot num integer
Fig.6: Steps in top-down construction of a parse tree
32
2. Lexical analysis
The role of the lexical analyzer
◦ Its task is to read a character input stream and
produce as output a sequence of tokens that the
parser uses for syntax analysis
◦ May perform the secondary role as user interface
It is implemented as a subroutine of the lexical
analyzer as shown below.
Read
character
Lexical Pass token
Input and its attributes Parser
Pushback analyzer
character
Fig.7: Lexical analysis
33
In Fig.7, if the lexical analyzer reads >
followed by =, then it passes to the parser
the token >=
If it only reads > and no = symbol
following, then it pushes the other
character back to the input to act as the
beginning of next token and passes only >
to the parser.
34
Functions of the lexical analyzer
The functions of the lexical analyzer are:
1. Removes white space e.g. tabs, newlines, blanks etc.
2. Removes comments from source program
3. Corrects errors from source program
4. Recognizes and takes constants as digits i.e. takes
digit tokens e.g. num accompanied by their constant
value, such as <num,21><+><num,9>
5. Recognizes identifiers and keywords e.g.
Totalmarks = test1+test2*3 becomes
id1 = id2+id3*3
35
Patterns, tokens, and lexemes
Pattern
◦ Is a rule describing a set of input strings (lexemes)
associated with the token
Tokens
◦ Terminal symbols generated from lexemes e.g.
keywords, operators, identifiers, constants, literal
strings, parentheses, punctuation symbols, commas
and semicolons
Lexeme
◦ A sequence of characters in the source program that
is matched by the pattern for a token.
36
Lexical errors and error recovery actions
Wrongly spelt words e.g. fi. Change to if
Missing characters e.g. flot. Change to float
Extra characters e.g. flooat. Remove extra
character
Incorrect characters. Correct as appropriate.
Panic mode recovery
◦ If lexer cannot correct a certain error, it panics and
deletes the remaining input until it can find a well-
formed token.
37