0% found this document useful (0 votes)
16 views37 pages

Compiler Design: Syntax and Parsing

The document discusses the design and construction of a simple one-pass compiler, focusing on lexical analysis and syntax-directed translation. It explains the use of context-free grammars to define programming language syntax and the process of parsing, including the concepts of tokens, parse trees, and operator precedence. Additionally, it outlines the role of the lexical analyzer in transforming character input streams into tokens for further processing by the parser.

Uploaded by

bevonomwenga17
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views37 pages

Compiler Design: Syntax and Parsing

The document discusses the design and construction of a simple one-pass compiler, focusing on lexical analysis and syntax-directed translation. It explains the use of context-free grammars to define programming language syntax and the process of parsing, including the concepts of tokens, parse trees, and operator precedence. Additionally, it outlines the role of the lexical analyzer in transforming character input streams into tokens for further processing by the parser.

Uploaded by

bevonomwenga17
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

BCS 324: COMPILER DESIGN &

CONSTRUCTION

Introduction to Compiling – Part B


1. A simple one-pass compiler
2. Lexical analysis

Dr. D. K. Muyobo
I. A simple one-pass compiler
 A programming language can be defined
by describing
◦ What its program’s syntax looks like, and
◦ What its statements mean (i.e. semantics)
 We use BNF (Backus-Naur Form) or
context-free grammars to specify the
syntax of a language
 Then we use informal descriptions to
specify semantics of the language.
 In addition to specifying syntax,
◦ Context-free grammar can also be used to guide the translation
of programs
◦ Such grammar-oriented compiling is known as syntax-directed
translation.
 We use an example of a compiler to translate infix
expressions into polish postfix notation where:
◦ The lexical analyzer converts the stream of input characters into
a stream of tokens
◦ The token stream becomes input to the syntax-directed
translator.
◦ The syntax-directed translator then produces intermediate code
representation as output, as shown in Fig 1.

3
Syntax-
Character Lexical Token Intermediate
directed
stream analyzer stream representation
translator

Fig.1: Syntax-directed translator

4
Syntax definition
 A context-free grammar describes the
hierarchical structure of many programming
language constructs e.g.
◦ The branch statement in C language has the form:
 if(expression)statement else statement
◦ It is the concat of opening parenthesis, an expression, closing
parenthesis, statement, else keyword, statement.
◦ We can use variables as follows:
 expr – to denote expression, and
 stmt – to denote statement.

5
 Thus, we have:
◦ stmt  if(expr)stmt else stmt
 Such a rule is called a production.
 Individual lexical units of a production e.g.
if, parenthesis etc. are called tokens.
 Variables like expr and stmt represent
sequences of tokens and are called
non-terminals.

6
Components of a context-free grammar

1. A set of tokens known as terminal symbols


2. A set of non-terminals e.g. expr, stmt.
3. A set of productions where each production
consists of a non-terminal called leftside of
the production, an arrow, a sequence of
tokens and/or non-terminals called the
rightside of the production
4. A designation of one of the non-terminals as
the start symbol.

7
 The convention followed lists grammars
by their productions listing start symbol
first
 Digits and signs such as <= and while etc.
are terminals e.g.
◦ list  list + digit
◦ list  list - digit
◦ list  digit
◦ digit  0|1|2|3|4|5|6|7|8|9

8
 The right sides of the 3 productions with
non-terminal list on the leftside can be
grouped equivalently as:
◦ list  list+digit|list–digit|digit
 Tokens of the grammar are the symbols
+-0123456789

9
Parse trees
 Parse trees
◦ Pictorially show how the start symbol of a
grammar derives a string in the language
 If non-terminal A has a production
AXYZ such that A is the start symbol,
then its parse tree is
A

X Y Z

Fig.2: Parse tree for non-terminal A

10
 Properties of parse trees
1. Root is labeled by a start symbol
2. Each leaf is labeled by a terminal (token)
3. Each interior node is labeled by a non-terminal
4. The children of a node in a parse tree form the
rightside of its production while the statement
symbol forms the leftside e.g. node A with children
XYZ has the production
AXYZ
NB: Parsing is the process of finding a parse
tree for a given string of tokens.
11
Ambiguity
 Ambiguity is caused by a string with more
than one parse tree
 This is because it will have more than one
meaning
 See example in Fig.3 below.

12
(a) (b)
string string

string + string -
string string
- string 2 9
string string + string

9 5 5 2

Fig. 3:Two parse trees for 9-5+2

13
Associativity of operators
 9+5+2  (9+5)+2 and
 9-5-2  (9-5)-2 and
 The operand 5 given a plus operator to its right
and left will first take the left one i.e.
◦ (9+5) and thereafter take +2
 This means math operators are all left
associative e.g.
◦ In the second case, the number 5 with a minus
operator on both sides will first take (9-5) and then -
2.

14
 On the other hand, the assignment operator (=)
in C language is right associative e.g.
◦ a=b=c  a=(b=c)
 This statement has the following grammar:
◦ rightletter=right|letter
◦ lettera|b|…|z

NB: A left associative parse tree grows down towards the


left while a right associative parse tree grows down
towards the right as shown in Fig. 4.

15
(a) left associative (b) right associative
list right

list - digit =
letter right
- digit 2 a
list letter = right

digit 5 b letter

c
9
Fig. 4:Two parse trees for (a) 9-5-2 and (b) a=b=c

16
Precedence of operators
 When more than one operator is used,
we follow precedence where
◦ The (* and /) operators are higher in
precedence than the (+ and -) operators
meaning that they take their operands first.

17
Constructing a grammar for arithmetic
operators
◦ Constructed from a table showing
associativity and precedence of the operators
◦ We create two non-terminal expressions
expr and term for the two levels of
precedence (i.e. *,/ and +,-)
◦ Then we create a third non-terminal factor
that will be used for generating basic units
such as digits and parenthesized expressions.
factordigit|(expr)

18
 This gives us the following productions for the
*, / level
termterm*factor
|term/factor
|factor
Or: termterm*factor|term/factor|factor
 And the following productions for the +, - level
exprexpr+term
|expr-term
|term
Or: exprexpr+term|expr-term|term

19
 Resulting grammar is as follows
exprexpr+term|expr-term|term
termterm*factor|term/factor|factor
factordigit|(expr)

20
Syntax-directed translation
 In order to translate a program, a
compiler needs to keep track of many
quantities of information e.g.
◦ Type of construct, location of first instruction
in the target code, number of instructions
generated etc.
◦ These are called attributes associated with
the constructs.

21
 Syntax-directed definitions
◦ Uses grammar to specify the structure of the input
◦ A translation is an input-output mapping in the
following manner:
 First construct a parse tree for X.
 Suppose a node n in the parse tree is labeled by the grammar
symbol X, then we write X.a to denote the value of attribute
a of X at that node.
◦ A parse tree showing the attribute values at
each node is called an annotated parse
tree e.g.
 In Fig.5, the 9-5+2 is translated from infix to postfix
notation, with attribute values attached to nodes.

22
expr.t=95-2+

expr.t=95- + term.t=2

- term.t=5 2
expr.t=9

term.t=9 5

Fig. 5: Annotated parse tree for 9-5+2

23
 Exercise
◦ Study Fig.5 carefully and then draw a
corresponding ordinary parse tree for it.

24
Parsing
 Definitions
◦ Parsing is the process of determining if a string of
tokens can be generated by a grammar
◦ Parsing may also be defined as the process of finding a
parse tree for a given string of tokens.
 NB:
◦ A parser must be able to construct the tree,
otherwise the translation cannot be guaranteed
correct.

25
 Parsing methods
◦ There are two main parsing methods,
distinguished by the order in which nodes in
the tree are constructed
 Top-down methods
 Construction here starts at the root and proceeds towards
the leaves
 Bottom-up methods
 Construction here starts at the leaves and proceeds
towards the root.

26
 Justification for choosing a parsing
method
◦ Top-down methods
 May be chosen when efficient parsers need to be
constructed easily
◦ Bottom-up methods
 May be chosen when a large class of grammars
need to be generated.

27
 Top-down parsing
◦ As an example, we construct a grammar for
subset of Pascal language.
◦ We construct an array, which uses dotdot
token in place of the actual “..” so as to
emphasize that the two dots are treated as
one token

28
type  simple
| id
| array [simple] of type
simple  integer
| char
| num dotdot num

29
 We start with the root and label it with
starting non-terminal, and then
repeatedly perform the following two
steps:
1. At node n, labeled with non-terminal A,
select one of the productions for A and
construct children at n for the symbols on
the right side of the production
2. Find the next node at which a sub-tree is to
be constructed.
30
 The current token being scanned in the input is
frequently referred to as the lookahead symbol
 Initially, lookahead symbol is the first (leftmost
token) of the string

 For example, in the Pascal array


array [num dotdot num] of integer
token array is initially the lookahead symbol
and the known part of the tree is only the root,
labeled with the starting non-terminal type.
NB: The generated tree should match the input string
31
(d) type
(a) type
array [ simple ] of type
(b) type
num dotdot num simple
array [ simple ] of type
(e) type

(c) type array [ simple ] of type

array [ simple ] of type num dotdot num simple

num dotdot num integer

Fig.6: Steps in top-down construction of a parse tree

32
2. Lexical analysis
 The role of the lexical analyzer
◦ Its task is to read a character input stream and
produce as output a sequence of tokens that the
parser uses for syntax analysis
◦ May perform the secondary role as user interface
 It is implemented as a subroutine of the lexical
analyzer as shown below.
Read
character
Lexical Pass token
Input and its attributes Parser
Pushback analyzer
character

Fig.7: Lexical analysis

33
 In Fig.7, if the lexical analyzer reads >
followed by =, then it passes to the parser
the token >=
 If it only reads > and no = symbol
following, then it pushes the other
character back to the input to act as the
beginning of next token and passes only >
to the parser.

34
Functions of the lexical analyzer
 The functions of the lexical analyzer are:
1. Removes white space e.g. tabs, newlines, blanks etc.
2. Removes comments from source program
3. Corrects errors from source program
4. Recognizes and takes constants as digits i.e. takes
digit tokens e.g. num accompanied by their constant
value, such as <num,21><+><num,9>
5. Recognizes identifiers and keywords e.g.
Totalmarks = test1+test2*3 becomes
id1 = id2+id3*3

35
Patterns, tokens, and lexemes
 Pattern
◦ Is a rule describing a set of input strings (lexemes)
associated with the token
 Tokens
◦ Terminal symbols generated from lexemes e.g.
keywords, operators, identifiers, constants, literal
strings, parentheses, punctuation symbols, commas
and semicolons
 Lexeme
◦ A sequence of characters in the source program that
is matched by the pattern for a token.

36
Lexical errors and error recovery actions

 Wrongly spelt words e.g. fi. Change to if


 Missing characters e.g. flot. Change to float
 Extra characters e.g. flooat. Remove extra
character
 Incorrect characters. Correct as appropriate.
 Panic mode recovery
◦ If lexer cannot correct a certain error, it panics and
deletes the remaining input until it can find a well-
formed token.

37

You might also like