Unit - II
Chapter 3
Lexical Analysis and
Lexical Analyzer Generators
Partha Sarathi Chakraborty
Assistant Professor
Department of Computer Science and Engineering
SRM University, Delhi NCR Campus
(C) 2014, Prepared by Partha Sarathi Chakraborty
Outline
Lexical Analysis
Role of Lexical Analyzer
Recognition of Tokens
Lexical Analyzer Generator: Lex/Flex
Design Aspects of Lexical Analyzer
The Reason Why Lexical Analysis is a
Separate Phase (Issues in Lexical Analysis)
Simplifies the design of the compiler
(C) 2014, Prepared by Partha Sarathi Chakraborty
LL(1) or LR(1) with 1 lookahead would not be possible
Provides efficient implementation
Systematic techniques to implement lexical analyzers
by hand or automatically
Stream buffering methods to scan input
Improves portability
Non-standard symbols and alternate character
encodings can be more easily translated
(C) 2014, Prepared by Partha Sarathi Chakraborty
Interaction of the Lexical
Analyzer with the Parser
Source
Program
Lexical
Analyzer
Token,
tokenval
Parser
Get next
token
error
error
Symbol Table
Attributes of Tokens
(C) 2014, Prepared by Partha Sarathi Chakraborty
y := 31 + 28*x
Lexical analyzer
<id, y> <assign, > <num, 31> <+, > <num, 28> <*, > <id, x>
token
tokenval
(token attribute)
Parser
(C) 2014, Prepared by Partha Sarathi Chakraborty
(C) 2014, Prepared by Partha Sarathi Chakraborty
Example
Tokens, Patterns, and Lexemes
A token is a classification of lexical units
(C) 2014, Prepared by Partha Sarathi Chakraborty
For example: id and num
Lexemes are the specific character strings that
make up a token
For example: abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
For example: letter followed by letters and digits and
non-empty sequence of digits
Example
Consider Pascal Statement
(C) 2014, Prepared by Partha Sarathi Chakraborty
const pi = 3.1416;
10
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
Example:
fi(a==f(x))
(C) 2014, Prepared by Partha Sarathi Chakraborty
fi is a valid lexeme for the token id to the parser.
Simplest recovery strategy is panic mode recovery.
Other possible error-recovery actions are:
Delete one character from the remaining input.
Insert a missing character into the remaining input.
Replace a character by another character.
Transpose two adjacent characters.
Another way is to Minimum distance error correction is a
convenient theoretical yardstick, but it is not generally
used in practice, because it too costly to implement.
11
(C) 2014, Prepared by Partha Sarathi Chakraborty
Input Buffering
Two-buffer input scheme to look ahead on
the input and identify tokens
Buffer pairs
Sentinels (Guards)
12
(C) 2014, Prepared by Partha Sarathi Chakraborty
Specification of Patterns for
Tokens: Terminology
An alphabet is a finite set of symbols
(characters)
A string s is a finite sequence of symbols
from
|s| denotes the length of string s
denotes the empty string, thus || = 0
A language is a specific set of strings over
some fixed alphabet
13
(C) 2014, Prepared by Partha Sarathi Chakraborty
Specification of Patterns for
Tokens: String Operations
The concatenation of two strings x and y is
denoted by xy
The exponentation of a string s is defined
by
s0 =
si = si-1s for i > 0
(note that s = s = s)
14
(C) 2014, Prepared by Partha Sarathi Chakraborty
Specification of Patterns for
Tokens: Language Operations
Union
L M = {s | s L or s M}
Concatenation
LM = {xy | x L and y M}
Exponentiation
L0 = {}; Li = Li-1L
Kleene closure
L* = i=0,, Li
Positive closure
L+ = i=1,, Li
15
Specification of Patterns for
Tokens: Regular Expressions
(C) 2014, Prepared by Partha Sarathi Chakraborty
Basis symbols:
is a regular expression denoting language {}
a is a regular expression denoting {a}
If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
r | s is a regular expression denoting L(r) M(s)
rs is a regular expression denoting L(r)M(s)
r* is a regular expression denoting L(r)*
(r) is a regular expression denoting L(r)
A language defined by a regular expression is
called a regular set
16
(C) 2014, Prepared by Partha Sarathi Chakraborty
Specification of Patterns for
Tokens: Regular Definitions
Naming convention for regular expressions:
d 1 r1
d 2 r2
d n rn
where ri is a regular expression over
{d1, d2, , di-1 }
Each dj in ri is textually substituted in ri
17
Specification of Patterns for
Tokens: Regular Definitions
(C) 2014, Prepared by Partha Sarathi Chakraborty
Example:
letter A | B | | Z | a | b | | z
digit 0 | 1 | | 9
id letter ( letter | digit )*
Cannot use recursion, this is illegal:
digits digit digits | digit
18
(C) 2014, Prepared by Partha Sarathi Chakraborty
Specification of Patterns for
Tokens: Notational Shorthands
We frequently use the following shorthands:
r+ = rr*
r? = r |
[a-z] = a | b | c | | z
For example:
digit [0-9]
num digit+ (. digit+)? ( E (+|-)? digit+ )?
19
Regular Definitions and
Grammars
(C) 2014, Prepared by Partha Sarathi Chakraborty
Grammar
stmt if expr then stmt
| if expr then stmt else stmt
|
expr term relop term
| term
term id
| num
Regular definitions
if if
then then
else else
relop < | <= | <> | > | >= | =
id letter ( letter | digit )*
num digit+ (. digit+)? ( E (+|-)? digit+ )?
20
Implementing a Scanner Using
Transition Diagrams
relop < | <= | <> | > | >= | =
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
<
=
>
other
=
>
id letter ( letter | digit )*
start
letter
5
6
return(relop, LE)
return(relop, NE)
4 * return(relop, LT)
return(relop, EQ)
=
7 return(relop, GE)
other
8 * return(relop, GT)
letter or digit
10
other
11 * return(gettoken(),
install_id())
(C) 2014, Prepared by Partha Sarathi Chakraborty
Implementing a Scanner Using
Transition Diagrams (Code)
token nexttoken()
{ while (1) {
switch (state) {
case 0: c = nextchar();
if (c==blank || c==tab || c==newline) {
state = 0;
lexeme_beginning++;
}
else if (c==<) state = 1;
else if (c===) state = 5;
else if (c==>) state = 6;
else state = fail();
break;
case 1:
case 9: c = nextchar();
if (isletter(c)) state = 10;
else state = fail();
break;
case 10: c = nextchar();
if (isletter(c)) state = 10;
else if (isdigit(c)) state = 10;
else state = 11;
break;
Decides what
other start state
is applicable
int fail()
{ forward = token_beginning;
swith (start) {
case 0: start = 9; break;
case 9: start = 12; break;
case 12: start = 20; break;
case 20: start = 25; break;
case 25: recover(); break;
default: /* error */
}
return start;
}
21
22
(C) 2014, Prepared by Partha Sarathi Chakraborty
The Lex and Flex Scanner
Generators
Lex and its newer cousin flex are scanner
generators
Systematically translate regular definitions
into C source code for efficient scanning
Generated code is easy to integrate in C
applications
23
(C) 2014, Prepared by Partha Sarathi Chakraborty
Creating a Lexical Analyzer with
Lex and Flex
lex
source
program
lex.l
[Link].c
input
stream
lex or flex
compiler
[Link].c
C
compiler
[Link]
[Link]
sequence
of tokens
24
(C) 2014, Prepared by Partha Sarathi Chakraborty
Lex Specification
A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
The translation rules are of the form:
p1
{ action1 }
p2
{ action2 }
pn
{ actionn }
25
(C) 2014, Prepared by Partha Sarathi Chakraborty
Regular Expressions in Lex
x
match the character x
\.
match the character .
string match contents of string of characters
.
match any character except newline
^
match beginning of a line
$
match the end of a line
[xyz] match one character x, y, or z (use \ to escape -)
[^xyz]match any character except x, y, and z
[a-z] match one of a to z
r*
closure (match zero or more occurrences)
r+
positive closure (match one or more occurrences)
r?
optional (match zero or one occurrence)
r1 r2
match r1 then r2 (concatenation)
r1|r2
match r1 or r2 (union)
(r)
grouping
r1\r2
match r1 when followed by r2
{d}
match the regular expression defined by d
26
(C) 2014, Prepared by Partha Sarathi Chakraborty
Lex actions
BEGIN: It indicates the start state. The lexical analyzer starts at
state 0.
ECHO: It emits the input as it is.
yytext: When lexer matches or recognizes the token from input
token then the lexeme is stored in a null terminated string called
yytext.
yylex(): As soon as call to yylex() is encountered scanner starts
scanning the source program.
yywrap(): It is called when scanner encounters eof i.e. return 0. If
returns 0 then scanner continues scanning.
yyin: It is the standard input file that stores input source program.
yyleng: when a lexer reconizes token then the lexeme is stored in a
null terminated string called yytext. It stores the length or number of
characters in the input string. The value in yyleng is same as strlen()
functions.
27
(C) 2014, Prepared by Partha Sarathi Chakraborty
Installing Software
Download Flex 2.5.4a
Download Bison 2.4.1
Download DevC++
Install Flex at "C:\GnuWin32"
Install Bison at "C:\GnuWin32"
Install DevC++ at "C:\Dev-Cpp"
Open Environment Variables.
Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to path.
(C) 2014, Prepared by Partha Sarathi Chakraborty
28
For Windows 8
29
Example Lex Specification 1
(C) 2014, Prepared by Partha Sarathi Chakraborty
Translation
rules
%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("%s\n", yytext); }
.|\n {}
%%
int main( )
{
yylex( );
}
int yywrap( )
{
return 1;
}
Contains
the matching
lexeme
Invokes
the lexical
analyzer
lex spec.l
gcc [Link].c -ll
./[Link] < spec.l
(C) 2014, Prepared by Partha Sarathi Chakraborty
30
Execution Lex Specification 1
Digit
only
printed
31
Example Lex Specification 2
(C) 2014, Prepared by Partha Sarathi Chakraborty
Translation
rules
%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim
[ \t]+
%%
\n
{ ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim}
{ ch+=yyleng; wd++; }
.
{ ch++; }
%%
main()
{ yylex();
printf("%8d%8d%8d\n", nl, wd, ch);
}
Regular
definition
32
Example Lex Specification 3
(C) 2014, Prepared by Partha Sarathi Chakraborty
Translation
rules
%{
#include <stdio.h>
Regular
%}
definitions
digit
[0-9]
letter
[A-Za-z]
id
{letter}({letter}|{digit})*
%%
{digit}+ { printf(number: %s\n, yytext); }
{id}
{ printf(ident: %s\n, yytext); }
.
{ printf(other: %s\n, yytext); }
%%
main()
{ yylex();
}
33
(C) 2014, Prepared by Partha Sarathi Chakraborty
Example Lex Specification 4
%{ /* definitions of manifest constants */
#define LT (256)
%}
delim
[ \t\n]
ws
{delim}+
letter
[A-Za-z]
digit
[0-9]
id
{letter}({letter}|{digit})*
number
{digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}
{ }
if
{return IF;}
then
{return THEN;}
else
{return ELSE;}
{id}
{yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
<
{yylval = LT; return RELOP;}
<=
{yylval = LE; return RELOP;}
=
{yylval = EQ; return RELOP;}
<>
{yylval = NE; return RELOP;}
>
{yylval = GT; return RELOP;}
>=
{yylval = GE; return RELOP;}
%%
int install_id()
Return
token to
parser
Token
attribute
Install yytext as
identifier in symbol table
Lex Program
34
(Write a program in Lex to identify identifier and keyword in a sentence.)
%{
#include<stdio.h>
static int key_word=0;
static int identifier=0;
%}
%%
(C) 2014, Prepared by Partha Sarathi Chakraborty
"include"|"for"|"define" {key_word++;printf("keyword found");}
"int"|"char"|"float"|"double" {identifier++;printf("identifier found");}
%%
int main()
{
printf("enter the sentence");
yylex();
printf("keyword are: %d\n and identifier are:%d\n",key_word,identifier);
}
int yywrap()
{
return 1;
}
35
Design of a Lexical Analyzer
Generator
(C) 2014, Prepared by Partha Sarathi Chakraborty
Translate regular expressions to NFA
Translate NFA to an efficient DFA
Optional
regular
expressions
NFA
DFA
Simulate NFA
to recognize
tokens
Simulate DFA
to recognize
tokens
36
Nondeterministic Finite
Automata
(C) 2014, Prepared by Partha Sarathi Chakraborty
Definition: an NFA is a 5-tuple (S,,,s0,F)
where
S is a finite set of states
is a finite set of input symbol alphabet
is a mapping from S to a set of states
s0 S is the start state
F S is the set of accepting (or final) states
37
(C) 2014, Prepared by Partha Sarathi Chakraborty
Transition Graph
An NFA can be diagrammatically
represented by a labeled directed graph
called a transition graph
a
start
0
b
S = {0,1,2,3}
= {a,b}
s0 = 0
F = {3}
38
Transition Table
(C) 2014, Prepared by Partha Sarathi Chakraborty
The mapping of an NFA can be
represented in a transition table
(0,a) = {0,1}
(0,b) = {0}
(1,b) = {2}
(2,b) = {3}
State
Input
a
Input
b
{0, 1}
{0}
{2}
{3}
39
(C) 2014, Prepared by Partha Sarathi Chakraborty
The Language Defined by an
NFA
An NFA accepts an input string x iff there is some
path with edges labeled with symbols from x in
sequence from the start state to some accepting
state in the transition graph
A state transition from one state to another on the
path is called a move
The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for the
example NFA
40
Design of a Lexical Analyzer
Generator: RE to NFA to DFA
(C) 2014, Prepared by Partha Sarathi Chakraborty
Lex specification with
regular expressions
p1
p2
pn
{ action1 }
{ action2 }
{ actionn }
NFA
start
s0
N(p1)
N(p2)
N(pn)
action1
action2
actionn
Subset construction
(optional)
DFA
41
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to NFA
(Thompsons Construction)
start
start
r1 | r2
start
r1 r2
r*
start
start
i
i
N(r1)
N(r2)
i N(r1)
N(r2) f
N(r)
42
Combining the NFAs of a Set of
Regular Expressions
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
a
{ action1 }
abb { action2 }
a*b+ { action3 }
start
start
1
3
start
1
3
a
a
b
4
8
b
b
43
Simulating the Combined NFA
Example 1
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
7
a
action1
action2
action3
none
action3
Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action
44
Simulating the Combined NFA
Example 2
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
7
a
action1
action2
action3
none
action2
action3
When two or more accepting states are reached, the
first action given in the Lex specification is executed
45
Deterministic Finite Automata
(C) 2014, Prepared by Partha Sarathi Chakraborty
A deterministic finite automaton is a special case
of an NFA
No state has an -transition
For each state s and input symbol a there is at most one
edge labeled a leaving s
Each entry in the transition table is a single state
At most one path exists to accept a string
Simulation algorithm is simple
46
(C) 2014, Prepared by Partha Sarathi Chakraborty
Example DFA
A DFA that accepts (a|b)*abb
b
start
a
b
1
a
2
a
47
(C) 2014, Prepared by Partha Sarathi Chakraborty
Conversion of an NFA into a
DFA
The subset construction algorithm converts an
NFA into a DFA using:
-closure(s) = {s} {t | s t}
-closure(T) = sT -closure(s)
move(T,a) = {t | s a t and s T}
The algorithm produces:
Dstates is the set of states of the new DFA
consisting of sets of states of the NFA
Dtran is the transition table of the new DFA
48
-closure and move Examples
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
b
a
-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
-closure({7}) = {7}
move({7},b) = {8}
-closure({8}) = {8}
move({8},a) =
none
Also used to simulate NFAs
49
(C) 2014, Prepared by Partha Sarathi Chakraborty
Simulating an NFA using
-closure and move
S := -closure({s0})
Sprev :=
a := nextchar()
while S do
Sprev := S
S := -closure(move(S,a))
a := nextchar()
end do
if Sprev F then
execute action in Sprev
return yes
else return no
50
(C) 2014, Prepared by Partha Sarathi Chakraborty
The Subset Construction
Algorithm
Initially, -closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
mark T
for each input symbol a do
U := -closure(move(T,a))
if U is not in Dstates then
add U as an unmarked state to Dstates
end if
Dtran[T,a] := U
end do
end do
51
Subset Construction Example 1
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
C
start
B
a
b
a
D
a
Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}
10
52
Subset Construction Example 2
(C) 2014, Prepared by Partha Sarathi Chakraborty
start
1
3
a
a
b
a1
2
4
8
a3
a2
a3
C
b
start
D
a
a1
a3
a2 a3
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
53
Minimizing the Number of States
of a DFA
(C) 2014, Prepared by Partha Sarathi Chakraborty
C
start
B
a
b
a
D
a
start
D
a
54
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly
The important states of an NFA are those
without an -transition, that is if
move({s},a) for some a then s is an
important state
The subset construction algorithm uses only
the important states when it determines
-closure(move(T,a))
55
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly (Algorithm)
Augment the regular expression r with a
special end symbol # to make accepting
states important: the new expression is r#
Construct a syntax tree for r#
Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
56
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#
concatenation
(C) 2014, Prepared by Partha Sarathi Chakraborty
#
6
closure
b
4
alternation
|
a
position
number
(for leafs )
57
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly: Annotating the Tree
nullable(n): the subtree at node n generates
languages including the empty string
firstpos(n): set of positions that can match the first
symbol of a string generated by the subtree at
node n
lastpos(n): the set of positions that can match the
last symbol of a string generated be the subtree at
node n
followpos(i): the set of positions that can follow
position i in the tree
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly: Annotating the Tree
Node n
nullable(n)
firstpos(n)
lastpos(n)
Leaf
true
Leaf i
false
{i}
{i}
|
/ \
c1
c2
nullable(c1)
or
nullable(c2)
firstpos(c1)
firstpos(c2)
lastpos(c1)
lastpos(c2)
/ \
c1
c2
nullable(c1)
and
nullable(c2)
if nullable(c1) then
firstpos(c1)
firstpos(c2)
else firstpos(c1)
if nullable(c2) then
lastpos(c1)
lastpos(c2)
else lastpos(c2)
*
|
c1
true
firstpos(c1)
lastpos(c1)
58
59
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#
{1, 2, 3}
(C) 2014, Prepared by Partha Sarathi Chakraborty
{1, 2, 3}
{1, 2, 3}
nullable
{1, 2, 3}
{1, 2}
{1, 2}
{1, 2}
{1, 2}
{1} a {1}
1
{3}
{4}
{6} # {6}
6
{5} b {5}
5
{4} b {4}
4
{3} a {3}
3
{2} b {2}
2
{5}
{6}
firstpos
lastpos
60
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly: followpos
for each node n in the tree do
if n is a cat-node with left child c1 and right child c2 then
for each i in lastpos(c1) do
followpos(i) := followpos(i) firstpos(c2)
end do
else if n is a star-node
for each i in lastpos(n) do
followpos(i) := followpos(i) firstpos(n)
end do
end if
end do
From Regular Expression to DFA
Directly: Algorithm
(C) 2014, Prepared by Partha Sarathi Chakraborty
s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
mark T
for each input symbol a do
let U be the set of positions that are in followpos(p)
for some position p in T,
such that the symbol at position p is a
if U is not empty and not in Dstates then
add U as an unmarked state to Dstates
end if
Dtran[T,a] := U
end do
end do
61
62
(C) 2014, Prepared by Partha Sarathi Chakraborty
From Regular Expression to DFA
Directly: Example
Node
followpos
{1, 2, 3}
{1, 2, 3}
{4}
{5}
{6}
1,2,3
b
start
b
a
a
b
1,2,
3,4
a
1,2,
3,5
a
1,2,
3,6
63
(C) 2014, Prepared by Partha Sarathi Chakraborty
Time-Space Tradeoffs
Automaton
Space
(worst case)
Time
(worst case)
NFA
O(|r|)
O(|r||x|)
DFA
O(2|r|)
O(|x|)