0% found this document useful (0 votes)
8 views60 pages

4 Lexical Analysis

The document discusses lexical analysis in compiler design. It covers the role of the lexical analyzer, token recognition using regular expressions and finite state machines, and implementation techniques like input buffering. Issues in lexical analysis like errors and portability are also covered.

Uploaded by

vivek shah
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views60 pages

4 Lexical Analysis

The document discusses lexical analysis in compiler design. It covers the role of the lexical analyzer, token recognition using regular expressions and finite state machines, and implementation techniques like input buffering. Issues in lexical analysis like errors and portability are also covered.

Uploaded by

vivek shah
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Programming Language Translators

Unit 1 - Part2 : Lexical Analysis

Contents
The role of the lexical analyzer S ecification of to!ens " #egular $x ression #ecognition of to!ens " %inite State &achines %rom a #egular $x ressions to an '%A " Thomson Construciton Con(ert '%A to )%A " Su*set Construction &inimizing )%A

The #ole of Lexical Analyzer


Lexical analyzer is the first hase of a com iler, -ts main tas! is to rea. in ut characters an. [Link] as out ut a se/uence of to!ens that arser uses for syntax analysis,

%unctions of Lexical Analyzer


Scans the source rogram character *y character [Link] the stream of to!ens after recognizing them -.entifies the lexical errors $liminates the 0hite s ace an. comments -t generates the sym*ol ta*le -t !ee s trac! of line num*er -t re orts error encountere. 0hile generating the to!ens

-ssues in Lexical Analysis


There are se(eral reasons for se arating the analysis hase of com iling into lexical analysis an. arsing:
" Sim ler .esign " Com iler efficiency " Com iler orta*ility

Simpler design
Lexical an. Syntax analysis are se arate. for sim lifying the 0or! of the arsing hase $x: Comments an. 0hite s aces are eliminate. in lexical analysis

-ssues in Lexical Analysis


Improve the efficiency
Large amount of time is s ent on [Link] the source file an. it coul. *e inefficient if each hase has to rea. the source file -ncrease the efficiency 3lexical analyzer use the -'PUT 4U%%$#-'5 techni/ue

Enhanced portability
-n ut al ha*et eculiarities an. other .e(ice s ecific anomalies can *e restricte. to the lexical analyzer The #e resentation of s ecial or non [Link]. sym*ol can *e isolate. in the lexical analyzer $x: 678 in Pascal can *e isolate.

To!ens3 Patterns3 Lexemes


A lexeme is a se/uence of characters in the source rogram that is matche. *y the attern for a to!en, A lexeme is a *asic lexical unit of a language com rising one or se(eral 0or.s3 the elements of 0hich .o not se arately con(ey the meaning of the 0hole, The lexemes of a rogramming language inclu.e its i.entifier3 literals3 o erators3 an. s ecial 0or.s, A to!en of a language is a category of its lexemes, A attern is a rule .escri*ing the set of lexemes that can re resent as articular to!en in source rogram,

$xam les of To!ens

$xam le 1

Token ?ey0or. S l, Sym*ol [Link] #eAo [Link] S l, Sym*ol

-f < a = * >
Lexeme if < a = * > Pattern -f3 int3 float3for30hile@@, Any unctuation or s ecial sym*ols Letter follo0e. *y letter or .igit B3=3=B3C3CB3DB Letter follo0e. *y letter or .igit Any unctuation or s ecial sym*ols

Lexeme an. To!en: $xam le 2


Index = 2 * count +17; Lexemes
-[Link]

Tokens
-.entifier e/ualAsign o intAliteral multiAo [Link] lusAo intAliteral semicolon

B
2

H Count G
19

1E

To!en attri*utes
Ihen more than one attern matches a lexeme3 a lexical analyzer must ro(i.e a..itional information a*out articular lexeme that match to the su*se/uent hase of the com iler,

$xam le: $ B & H C HH 2


=i.3 ointer to sym*ol ta*le entry for $C =assign-o C =i.3 ointer to sym*ol ta*le entry for &C =mult-o C =i.3 ointer to sym*ol ta*le entry for CC =ex -o C =num*er3 integer (alue 2C
11

Lexical $rrors
%e0 errors are .iscerni*le at the lexical le(el alone3 *ecause a lexical analyzer has a (ery localize. (ie0 of a source rogram, Let some other hase of com iler [Link] any error, Panic mo.e $rror reco(ery

12

Lexical $rrors

Some errors are [Link]. an. rectifie. at lexical le(el alone

$x: cBHa* G 1EF this ty e of error is [Link]. an. rectifie. at lexical le(el 0ith hel of panic mode techni/ue<.elete successi(e characters from the remaining in ut until the lexical analyzer can fin. a 0ell-forme. to!en>

Lexical analysis can8t [Link] some a

earances

$x: fi<xBB1E> : here fi is [Link] as a (ali. to!en

Other error recovery actions


)eleting an extraneous characters -nserting a missing character #e lacing an incorrect character *y a correct character Trans osing t0o [Link] characters

1+

Input Buffering
There are three general approaches to the implementation of a lexical analy er! 1, Use a lexical analyzer generator 2, Irite the lexical analyzer in a con(entional rogramming language +, Irite the lexical analyzer in assem*ly language, -n one *uffer techni/ue3 the last lexeme [Link] rocess 0ill *e o(er-0ritten 0hen 0e reloa. the *uffer, T0o-*uffer scheme [Link] large loo! ahea. safely Ie use a *uffer .i(i.e. into t0o '-characters hal(es,

Buffer Pairs
T0o *uffers of the same size3 say KE;23 are alternately reloa.e., T0o ointers to the in ut are maintaine.:
Pointer lexeme_Begin mar!s the *eginning of the current lexeme, Pointer forward scans ahea. until a attern match is foun.,

"ode to advance for#ard pointer


-f for0ar. at en. of first half then *egin reloa. secon. halfF for0ar.:Bfor0ar. G 1F $n. $lse if for0ar. at en. of secon. half then *egin reloa. first halfF mo(e for0ar. to *eginning of first half $n. $lse for0ar.:Bfor0ar. G 1F

Sentinels
-f 0e use the in ut *uffer scheme3 0e must chec! each time 0e mo(e the for0ar. ointer that 0e ha(e noot mo(e. off one half of the *uffer, -f 0e .o then 0e must reloa. the other half, Ie can [Link] the t0o tests to one if 0e exten. each *uffer half to hol. a sentinel character at the en., The sentinal is a s ecial character that can not *e art of the source rogram A natural choice is eof,

$ B & H eof C H H 2 eof

eof

Lookahead code #ith sentinels


for0ar.:Bfor0ar.G1F -f for0ar. L B $M% then *egin -f for0ar. at en. of first half then *egin reloa. secon. halfF for0ar.:Bfor0ar. G 1F $n. $lse if for0ar. at en. of secon. half then *egin reloa. first halfF mo(e for0ar. to *eginning of first half $n. $lse terminate lexical analysisF

S ecification of To!ens
To!ens are s ecifie. *y #egular $x ression, #egular ex ressions are an im ortant notation for s ecifying atterns, Ie ha(e to .iscuss the follo0ing:
" " " " M eration on languages #egular ex ressions #egular .efinitions 'otational shorthan.s

1;

Languages
$lphabet % a finite set of sym*ols String :

%inite se/uence of sym*ols on an al ha*et Sentence an. 0or. are also use. in terms of string is the em ty string NsN is the length of string s,

Lang&age: Sets of strings o(er some fixe. al ha*et


the em ty set is a language, OP the set containing em ty string is a language The set of 0ell-forme. C rograms is a language The set of all ossi*le [Link] is a language,

2E

M erations on Languages

21

#egular $x ressions
#egular ex ression is a com act notation for .escri*ing string, -n Pascal3 an [Link] is a letter follo0e. *y zero or more letter or .igits letter<letter [Link]>H |: or *: zero or more instance

22

#ules
is a regular ex ression that .enotes OP3 the set containing em ty string, -f a is a sym*ol in 3 then a is a regular ex ression that .enotes O aP3 the set containing the string a, Su ose R an. S are regular ex ressions .enoting the language L< R> an. L<S>3 then " <R> N<S> is a regular ex ression .enoting L<R>L<S>, " <R>,<S> is regular ex ression .enoting L <R>,L<S>, " <R>H is a regular ex ression .enoting <L<R>>H, " <#> is a regular ex ression .enoting L<#>,

2+

[Link] Con(entions
The unary o erator H has the highest [Link] an. is left associati(e, Concatenation has the secon. highest [Link] an. is left associati(e, N has the lo0est [Link] an. is left associati(e, <a>N<b>*<c>aNb*c

2K

$xam le of #egular $x ressions

21

Pro erties of #egular $x ression

22

#egular )efinitions
To 0rite regular ex ression for some languages can *e .ifficult3 *ecause their regular ex ressions can *e /uite com lex, -n those cases3 0e may use regular definitions, -f is an al ha*et of *asic sym*ols3 then a regular definition is a se/uence of .efinitions of the form:

d1r1 d2r2
,,,

dnrn
0here each di is a .istinct name3 an. each ri is a regular ex ression o(er the sym*ols in Od1,d2,,di-1P3 i,e,3 the *asic sym*ols an. the re(iously .efine. names,

29

$xam les of #egular )efinitions

$xam le: Unsigne. num*ers

2:

'otational Shorthan.s
The follo0ing shorthan.s are often use.:

rG B r rH rQ B r Ra- S B a b c @
$xam les:
.igit R0-9S num .igitG <, .igitG>Q < <$ N e> <G->Q .igitG >Q

2;

To!en #ecognition
Lexical analysis is rocess of recognizing to!ens from in ut source rogram
So&rce Program

Inp&t b&ffer

Transition diagram

'$

Pattern Pattern matching $lgorithm

Transition )iagrams
(elop # #= #$ $ $= =
start

= $ other

2 + K

(et&rn <relop3 LE> (et&rn <relop3 NE>

H (et&rn <relop3 L

>

= $

1 2

(et&rn <relop3 E!> = 9 (et&rn <relop3 "E> other : H (et&rn <relop3 " >

id letter ) letterdigit >H


start

letter or digit letter

1E

other

11 H (et&rn <get token <>3


install _id <>>

-m lementation
to%en nextto%en&' ( w)ile &1' ( *witc) &*tate' ( ca*e +, c = nextc)ar&'; if &c==-lan% || c==ta- || c==newline' ( *tate = +; lexeme_-eginning++; . el*e if &c==/#0' *tate = 1; el*e if &c==/=0' *tate = 1; el*e if &c==/$0' *tate = 2; el*e *tate = fail&'; -rea%; ca*e 1, 3 ca*e 4, c = nextc)ar&'; if &i*letter&c'' *tate = 1+; el*e *tate = fail&'; -rea%; ca*e 1+, c = nextc)ar&'; if &i*letter&c'' *tate = 1+; el*e if &i*digit&c'' *tate = 1+; el*e *tate = 11; -rea%; 3

)[Link] the next start state to chec!

int fail&' ( forward = to%en_-eginning; *wit) &*tart' ( ca*e +, *tart = 4; -rea%; ca*e 4, *tart = 12; -rea%; ca*e 12, *tart = 2+; -rea%; ca*e 2+, *tart = 21; -rea%; ca*e 21, reco5er&'; -rea%; default, 6* error *6 . return *tart; .

%inite Automata
A recognizer for a language is a rogram that ta!es as in ut a string x an. ans0er TyesU if x is a sentence of the language an. TnoU other0ise, Ie com ile a regular ex ression into a recognizer *y constructing a generalize. transition .iagram calle. a finite automaton, A finite automaton can *e deterministic or nondeterministic3 0here [Link] means that more than one transition out of a state may *e ossi*le on the same in ut sym*ol,

)eterministic " faster recognizer3 *ut it may ta!e more s ace '[Link] " slo0er3 *ut it may ta!e less s ace )eterministic automatons are [Link] use. lexical analyzers,

++

'[Link] %inite Automata <'%A>


A [Link] finite automaton <'%A> is a mathematical [Link] that consists of: A set of states S A set of in ut sym*ols A transition function move that ma s state-sym*ol airs to sets of states A state s0 that is .istinguishe. as the start initial! state

A set of states " .istinguishe. as acce#ting final! states,

- transitions are allo0e. in '%As, -n other 0or.s3 0e can mo(e from one state to another one 0ithout consuming any sym*ol, A '%A acce ts a string x3 if an. only if there is a ath from the starting state to one of acce ting states such that [Link] la*els along this ath s ell out x,

+K

'%A
An '%A can *e re resente. .iagrammatically *y a la*ele. .irecte. gra h3 calle. a transition gra#$3 in 0hich the [Link] are the states an. the la*ele. [Link] re resent the transition function, % nondeterministic finite automaton recognizing t$e language a&b!*abb

+1

'%A Transition Ta*le


The easiest im lementation is a transition ta*le in 0hich there is a ro0 for each state an. a column for each in ut sym*ol an. 3 if necessary,

+2

$xam le 2

+9

)eterministic %inite Automata <)%A>


A )%A is a s ecial case of a '%A in 0hich " no state has an -transition " for each state s an. in ut sym*ol a3 there is at most one [Link] la*ele. a lea(ing s,

+:

Simulating a )%A

+;

)%A $xam le:

KE

%rom #egular $x ression to '%A


Thom son8s Construction
This is one 0ay to con(ert a #egular $x ression into a '%A, Thomson8s Construction is sim le an. systematic metho., -t guarantees that the resulting '%A 0ill ha(e exactly one final state3 an. one start state, Construction starts from sim lest arts <al ha*et sym*ols>, To create a '%A for a com lex regular ex ression3 '%As of its su*-ex ressions are com*ine. to create its '%A3

K1

&etho.
-n ut: a regular ex ression r o(er an al ha*et , Mut ut: an '%A ' acce ting ( r! %irst arse r into its constituent su*ex ressions, Construct '%A8s for each of the *asic sym*ols in r, " for " for a in - for the regular ex ression s&t3

-for the regular ex ression st3

K2

&etho.,,,
%or the regular ex ression s*3

%or the arenthesize. regular ex ression <s>3 use ' s! itself as the '%A, $(ery time 0e construct a ne0 state3 0e gi(e it a .istinct name,

K+

$xam le - construct ' r! for r) a&b!*abb

KK

$xam le <-->

K1

$xam le,,,

K2

Con(ersion of an '%A into )%A


Subset construction algorithm:
-n the transition ta*le of an '%A3 each entry is a set of statesF in the transition ta*le of a )%A3 each entry is Just a single state, The general [Link] *ehin. the '%A-to-)%A construction is that each )%A state corres on.s to a set of '%A states, The )%A uses its state to !ee trac! of all ossi*le states the '%A can *e in after [Link] each in ut sym*ol,

K9

Su*set Construction
- constructing a )%A from an '%A
-n ut : An '%A ',

Mut ut : A )%A ) acce ting the same language, &etho.:


Ie construct a transition ta*le )tran for ), $ach )%A state is a set of '%A states an. 0e construct )tran so that ) 0ill simulate Tin arallelU all ossi*le mo(es ' can ma!e on a gi(en in ut string,

K:

Su*set Construction <-->


s re resents an '%A state * re resents a set of '%A states,
*"los&re)s+ % set of ,'$ states reachable from ,'$ states s on *transition alone *"los&re)T+ % Set of ,'$ states reachable from some ,'$ state s in T on *transition alone move)T-a+ % Set of ,'$ states to #hich there is a transition on inp&t .a from some ,'$ state s in T

M erations on '%A states

K;

The Su*set Construction <--->


-nitially3 -closure<sE> is the only state in +states an. it is unmar!e. #hile there is an unmar!e. state * in +states do mar! * for each in ut sym*ol a do , :B -closure<move<*,a>> if , is not in +states then a.. , as an unmar!e. state to +states end if +tranR*3aS :B , end do end do

1E

Su*set Construction <-V>


<W-closure com utation> p&sh all states in T onto stack/ initiali e *clos&re)T+ to T/ w)ile stack is not empty do -egin pop t- the top element- off the stack/ for each state & #ith edge from t to & labeled do if & is not in *clos&re)T+ do -egin add & to *clos&re)T+ / p&sh & onto stack end end

11

$xam le

12

$xam le@@
'irst #e calc&late% *clos&re)0+ -closure<E> B OE3 13 23 K3 9P <all states reacha*le from E on -mo(es> Let ABOE3 13 23 K3 9P *e a state of ne0 )%A ) *clos&re(move($-a++ and *clos&re (move($-b++ -closure move A3a>> B -closure move - E31323K39P3a>>P B -closure<O+3:P> < since move<23a>B+ an. move<93a>B:> -closure<O+3:P> B O1323+3K32393:P <since +2 1 K3 2 93 an. 1 2 all *y -mo(es> Let 4BO1323+3K32393:P *e a ne0 state, )efine )tranRA3aS B 4,

1+

$xam le@
* : -closure move A3*>> B -closure move -E31323K39P3*>> B -closure<O1P> < since move<K3*>B1> %rom this 0e ha(e : -closure<O1P> B O1323K313239P <since 12 1 K3 2 93 an. 1 2 all *y -mo(es> Let CBO1323K313239P *e a ne0 state, )efine )tranRA3*S B C,

1K

$xam le ,,,

11

$xam le ,,,

12

&inimizing the num*er of states in )%A


&inimize the num*er of states of a )%A *y [Link] all grou s of states that can *e .istinguishe. *y some in ut string, $ach grou of states that cannot *e .istinguishe. is then merge. into a single state, )%A &inimization is a fairly [Link].a*le rocess3 an. is useful in se(eral areas

19

&inimizing the num*er of states in )%A <-->

1:

Construct 'e0 Partition

1;

$xam le

C
start a a a a a

start

a a

2E

You might also like