Programming Language Translators
Unit 1 - Part2 : Lexical Analysis
Contents
The role of the lexical analyzer S ecification of to!ens " #egular $x ression #ecognition of to!ens " %inite State &achines %rom a #egular $x ressions to an '%A " Thomson Construciton Con(ert '%A to )%A " Su*set Construction &inimizing )%A
The #ole of Lexical Analyzer
Lexical analyzer is the first hase of a com iler, -ts main tas! is to rea. in ut characters an. [Link] as out ut a se/uence of to!ens that arser uses for syntax analysis,
%unctions of Lexical Analyzer
Scans the source rogram character *y character [Link] the stream of to!ens after recognizing them -.entifies the lexical errors $liminates the 0hite s ace an. comments -t generates the sym*ol ta*le -t !ee s trac! of line num*er -t re orts error encountere. 0hile generating the to!ens
-ssues in Lexical Analysis
There are se(eral reasons for se arating the analysis hase of com iling into lexical analysis an. arsing:
" Sim ler .esign " Com iler efficiency " Com iler orta*ility
Simpler design
Lexical an. Syntax analysis are se arate. for sim lifying the 0or! of the arsing hase $x: Comments an. 0hite s aces are eliminate. in lexical analysis
-ssues in Lexical Analysis
Improve the efficiency
Large amount of time is s ent on [Link] the source file an. it coul. *e inefficient if each hase has to rea. the source file -ncrease the efficiency 3lexical analyzer use the -'PUT 4U%%$#-'5 techni/ue
Enhanced portability
-n ut al ha*et eculiarities an. other .e(ice s ecific anomalies can *e restricte. to the lexical analyzer The #e resentation of s ecial or non [Link]. sym*ol can *e isolate. in the lexical analyzer $x: 678 in Pascal can *e isolate.
To!ens3 Patterns3 Lexemes
A lexeme is a se/uence of characters in the source rogram that is matche. *y the attern for a to!en, A lexeme is a *asic lexical unit of a language com rising one or se(eral 0or.s3 the elements of 0hich .o not se arately con(ey the meaning of the 0hole, The lexemes of a rogramming language inclu.e its i.entifier3 literals3 o erators3 an. s ecial 0or.s, A to!en of a language is a category of its lexemes, A attern is a rule .escri*ing the set of lexemes that can re resent as articular to!en in source rogram,
$xam les of To!ens
$xam le 1
Token ?ey0or. S l, Sym*ol [Link] #eAo [Link] S l, Sym*ol
-f < a = * >
Lexeme if < a = * > Pattern -f3 int3 float3for30hile@@, Any unctuation or s ecial sym*ols Letter follo0e. *y letter or .igit B3=3=B3C3CB3DB Letter follo0e. *y letter or .igit Any unctuation or s ecial sym*ols
Lexeme an. To!en: $xam le 2
Index = 2 * count +17; Lexemes
-[Link]
Tokens
-.entifier e/ualAsign o intAliteral multiAo [Link] lusAo intAliteral semicolon
B
2
H Count G
19
1E
To!en attri*utes
Ihen more than one attern matches a lexeme3 a lexical analyzer must ro(i.e a..itional information a*out articular lexeme that match to the su*se/uent hase of the com iler,
$xam le: $ B & H C HH 2
=i.3 ointer to sym*ol ta*le entry for $C =assign-o C =i.3 ointer to sym*ol ta*le entry for &C =mult-o C =i.3 ointer to sym*ol ta*le entry for CC =ex -o C =num*er3 integer (alue 2C
11
Lexical $rrors
%e0 errors are .iscerni*le at the lexical le(el alone3 *ecause a lexical analyzer has a (ery localize. (ie0 of a source rogram, Let some other hase of com iler [Link] any error, Panic mo.e $rror reco(ery
12
Lexical $rrors
Some errors are [Link]. an. rectifie. at lexical le(el alone
$x: cBHa* G 1EF this ty e of error is [Link]. an. rectifie. at lexical le(el 0ith hel of panic mode techni/ue<.elete successi(e characters from the remaining in ut until the lexical analyzer can fin. a 0ell-forme. to!en>
Lexical analysis can8t [Link] some a
earances
$x: fi<xBB1E> : here fi is [Link] as a (ali. to!en
Other error recovery actions
)eleting an extraneous characters -nserting a missing character #e lacing an incorrect character *y a correct character Trans osing t0o [Link] characters
1+
Input Buffering
There are three general approaches to the implementation of a lexical analy er! 1, Use a lexical analyzer generator 2, Irite the lexical analyzer in a con(entional rogramming language +, Irite the lexical analyzer in assem*ly language, -n one *uffer techni/ue3 the last lexeme [Link] rocess 0ill *e o(er-0ritten 0hen 0e reloa. the *uffer, T0o-*uffer scheme [Link] large loo! ahea. safely Ie use a *uffer .i(i.e. into t0o '-characters hal(es,
Buffer Pairs
T0o *uffers of the same size3 say KE;23 are alternately reloa.e., T0o ointers to the in ut are maintaine.:
Pointer lexeme_Begin mar!s the *eginning of the current lexeme, Pointer forward scans ahea. until a attern match is foun.,
"ode to advance for#ard pointer
-f for0ar. at en. of first half then *egin reloa. secon. halfF for0ar.:Bfor0ar. G 1F $n. $lse if for0ar. at en. of secon. half then *egin reloa. first halfF mo(e for0ar. to *eginning of first half $n. $lse for0ar.:Bfor0ar. G 1F
Sentinels
-f 0e use the in ut *uffer scheme3 0e must chec! each time 0e mo(e the for0ar. ointer that 0e ha(e noot mo(e. off one half of the *uffer, -f 0e .o then 0e must reloa. the other half, Ie can [Link] the t0o tests to one if 0e exten. each *uffer half to hol. a sentinel character at the en., The sentinal is a s ecial character that can not *e art of the source rogram A natural choice is eof,
$ B & H eof C H H 2 eof
eof
Lookahead code #ith sentinels
for0ar.:Bfor0ar.G1F -f for0ar. L B $M% then *egin -f for0ar. at en. of first half then *egin reloa. secon. halfF for0ar.:Bfor0ar. G 1F $n. $lse if for0ar. at en. of secon. half then *egin reloa. first halfF mo(e for0ar. to *eginning of first half $n. $lse terminate lexical analysisF
S ecification of To!ens
To!ens are s ecifie. *y #egular $x ression, #egular ex ressions are an im ortant notation for s ecifying atterns, Ie ha(e to .iscuss the follo0ing:
" " " " M eration on languages #egular ex ressions #egular .efinitions 'otational shorthan.s
1;
Languages
$lphabet % a finite set of sym*ols String :
%inite se/uence of sym*ols on an al ha*et Sentence an. 0or. are also use. in terms of string is the em ty string NsN is the length of string s,
Lang&age: Sets of strings o(er some fixe. al ha*et
the em ty set is a language, OP the set containing em ty string is a language The set of 0ell-forme. C rograms is a language The set of all ossi*le [Link] is a language,
2E
M erations on Languages
21
#egular $x ressions
#egular ex ression is a com act notation for .escri*ing string, -n Pascal3 an [Link] is a letter follo0e. *y zero or more letter or .igits letter<letter [Link]>H |: or *: zero or more instance
22
#ules
is a regular ex ression that .enotes OP3 the set containing em ty string, -f a is a sym*ol in 3 then a is a regular ex ression that .enotes O aP3 the set containing the string a, Su ose R an. S are regular ex ressions .enoting the language L< R> an. L<S>3 then " <R> N<S> is a regular ex ression .enoting L<R>L<S>, " <R>,<S> is regular ex ression .enoting L <R>,L<S>, " <R>H is a regular ex ression .enoting <L<R>>H, " <#> is a regular ex ression .enoting L<#>,
2+
[Link] Con(entions
The unary o erator H has the highest [Link] an. is left associati(e, Concatenation has the secon. highest [Link] an. is left associati(e, N has the lo0est [Link] an. is left associati(e, <a>N<b>*<c>aNb*c
2K
$xam le of #egular $x ressions
21
Pro erties of #egular $x ression
22
#egular )efinitions
To 0rite regular ex ression for some languages can *e .ifficult3 *ecause their regular ex ressions can *e /uite com lex, -n those cases3 0e may use regular definitions, -f is an al ha*et of *asic sym*ols3 then a regular definition is a se/uence of .efinitions of the form:
d1r1 d2r2
,,,
dnrn
0here each di is a .istinct name3 an. each ri is a regular ex ression o(er the sym*ols in Od1,d2,,di-1P3 i,e,3 the *asic sym*ols an. the re(iously .efine. names,
29
$xam les of #egular )efinitions
$xam le: Unsigne. num*ers
2:
'otational Shorthan.s
The follo0ing shorthan.s are often use.:
rG B r rH rQ B r Ra- S B a b c @
$xam les:
.igit R0-9S num .igitG <, .igitG>Q < <$ N e> <G->Q .igitG >Q
2;
To!en #ecognition
Lexical analysis is rocess of recognizing to!ens from in ut source rogram
So&rce Program
Inp&t b&ffer
Transition diagram
'$
Pattern Pattern matching $lgorithm
Transition )iagrams
(elop # #= #$ $ $= =
start
= $ other
2 + K
(et&rn <relop3 LE> (et&rn <relop3 NE>
H (et&rn <relop3 L
>
= $
1 2
(et&rn <relop3 E!> = 9 (et&rn <relop3 "E> other : H (et&rn <relop3 " >
id letter ) letterdigit >H
start
letter or digit letter
1E
other
11 H (et&rn <get token <>3
install _id <>>
-m lementation
to%en nextto%en&' ( w)ile &1' ( *witc) &*tate' ( ca*e +, c = nextc)ar&'; if &c==-lan% || c==ta- || c==newline' ( *tate = +; lexeme_-eginning++; . el*e if &c==/#0' *tate = 1; el*e if &c==/=0' *tate = 1; el*e if &c==/$0' *tate = 2; el*e *tate = fail&'; -rea%; ca*e 1, 3 ca*e 4, c = nextc)ar&'; if &i*letter&c'' *tate = 1+; el*e *tate = fail&'; -rea%; ca*e 1+, c = nextc)ar&'; if &i*letter&c'' *tate = 1+; el*e if &i*digit&c'' *tate = 1+; el*e *tate = 11; -rea%; 3
)[Link] the next start state to chec!
int fail&' ( forward = to%en_-eginning; *wit) &*tart' ( ca*e +, *tart = 4; -rea%; ca*e 4, *tart = 12; -rea%; ca*e 12, *tart = 2+; -rea%; ca*e 2+, *tart = 21; -rea%; ca*e 21, reco5er&'; -rea%; default, 6* error *6 . return *tart; .
%inite Automata
A recognizer for a language is a rogram that ta!es as in ut a string x an. ans0er TyesU if x is a sentence of the language an. TnoU other0ise, Ie com ile a regular ex ression into a recognizer *y constructing a generalize. transition .iagram calle. a finite automaton, A finite automaton can *e deterministic or nondeterministic3 0here [Link] means that more than one transition out of a state may *e ossi*le on the same in ut sym*ol,
)eterministic " faster recognizer3 *ut it may ta!e more s ace '[Link] " slo0er3 *ut it may ta!e less s ace )eterministic automatons are [Link] use. lexical analyzers,
++
'[Link] %inite Automata <'%A>
A [Link] finite automaton <'%A> is a mathematical [Link] that consists of: A set of states S A set of in ut sym*ols A transition function move that ma s state-sym*ol airs to sets of states A state s0 that is .istinguishe. as the start initial! state
A set of states " .istinguishe. as acce#ting final! states,
- transitions are allo0e. in '%As, -n other 0or.s3 0e can mo(e from one state to another one 0ithout consuming any sym*ol, A '%A acce ts a string x3 if an. only if there is a ath from the starting state to one of acce ting states such that [Link] la*els along this ath s ell out x,
+K
'%A
An '%A can *e re resente. .iagrammatically *y a la*ele. .irecte. gra h3 calle. a transition gra#$3 in 0hich the [Link] are the states an. the la*ele. [Link] re resent the transition function, % nondeterministic finite automaton recognizing t$e language a&b!*abb
+1
'%A Transition Ta*le
The easiest im lementation is a transition ta*le in 0hich there is a ro0 for each state an. a column for each in ut sym*ol an. 3 if necessary,
+2
$xam le 2
+9
)eterministic %inite Automata <)%A>
A )%A is a s ecial case of a '%A in 0hich " no state has an -transition " for each state s an. in ut sym*ol a3 there is at most one [Link] la*ele. a lea(ing s,
+:
Simulating a )%A
+;
)%A $xam le:
KE
%rom #egular $x ression to '%A
Thom son8s Construction
This is one 0ay to con(ert a #egular $x ression into a '%A, Thomson8s Construction is sim le an. systematic metho., -t guarantees that the resulting '%A 0ill ha(e exactly one final state3 an. one start state, Construction starts from sim lest arts <al ha*et sym*ols>, To create a '%A for a com lex regular ex ression3 '%As of its su*-ex ressions are com*ine. to create its '%A3
K1
ðo.
-n ut: a regular ex ression r o(er an al ha*et , Mut ut: an '%A ' acce ting ( r! %irst arse r into its constituent su*ex ressions, Construct '%A8s for each of the *asic sym*ols in r, " for " for a in - for the regular ex ression s&t3
-for the regular ex ression st3
K2
ðo.,,,
%or the regular ex ression s*3
%or the arenthesize. regular ex ression <s>3 use ' s! itself as the '%A, $(ery time 0e construct a ne0 state3 0e gi(e it a .istinct name,
K+
$xam le - construct ' r! for r) a&b!*abb
KK
$xam le <-->
K1
$xam le,,,
K2
Con(ersion of an '%A into )%A
Subset construction algorithm:
-n the transition ta*le of an '%A3 each entry is a set of statesF in the transition ta*le of a )%A3 each entry is Just a single state, The general [Link] *ehin. the '%A-to-)%A construction is that each )%A state corres on.s to a set of '%A states, The )%A uses its state to !ee trac! of all ossi*le states the '%A can *e in after [Link] each in ut sym*ol,
K9
Su*set Construction
- constructing a )%A from an '%A
-n ut : An '%A ',
Mut ut : A )%A ) acce ting the same language, ðo.:
Ie construct a transition ta*le )tran for ), $ach )%A state is a set of '%A states an. 0e construct )tran so that ) 0ill simulate Tin arallelU all ossi*le mo(es ' can ma!e on a gi(en in ut string,
K:
Su*set Construction <-->
s re resents an '%A state * re resents a set of '%A states,
*"los&re)s+ % set of ,'$ states reachable from ,'$ states s on *transition alone *"los&re)T+ % Set of ,'$ states reachable from some ,'$ state s in T on *transition alone move)T-a+ % Set of ,'$ states to #hich there is a transition on inp&t .a from some ,'$ state s in T
M erations on '%A states
K;
The Su*set Construction <--->
-nitially3 -closure<sE> is the only state in +states an. it is unmar!e. #hile there is an unmar!e. state * in +states do mar! * for each in ut sym*ol a do , :B -closure<move<*,a>> if , is not in +states then a.. , as an unmar!e. state to +states end if +tranR*3aS :B , end do end do
1E
Su*set Construction <-V>
<W-closure com utation> p&sh all states in T onto stack/ initiali e *clos&re)T+ to T/ w)ile stack is not empty do -egin pop t- the top element- off the stack/ for each state & #ith edge from t to & labeled do if & is not in *clos&re)T+ do -egin add & to *clos&re)T+ / p&sh & onto stack end end
11
$xam le
12
$xam le@@
'irst #e calc&late% *clos&re)0+ -closure<E> B OE3 13 23 K3 9P <all states reacha*le from E on -mo(es> Let ABOE3 13 23 K3 9P *e a state of ne0 )%A ) *clos&re(move($-a++ and *clos&re (move($-b++ -closure move A3a>> B -closure move - E31323K39P3a>>P B -closure<O+3:P> < since move<23a>B+ an. move<93a>B:> -closure<O+3:P> B O1323+3K32393:P <since +2 1 K3 2 93 an. 1 2 all *y -mo(es> Let 4BO1323+3K32393:P *e a ne0 state, )efine )tranRA3aS B 4,
1+
$xam le@
* : -closure move A3*>> B -closure move -E31323K39P3*>> B -closure<O1P> < since move<K3*>B1> %rom this 0e ha(e : -closure<O1P> B O1323K313239P <since 12 1 K3 2 93 an. 1 2 all *y -mo(es> Let CBO1323K313239P *e a ne0 state, )efine )tranRA3*S B C,
1K
$xam le ,,,
11
$xam le ,,,
12
&inimizing the num*er of states in )%A
&inimize the num*er of states of a )%A *y [Link] all grou s of states that can *e .istinguishe. *y some in ut string, $ach grou of states that cannot *e .istinguishe. is then merge. into a single state, )%A &inimization is a fairly [Link].a*le rocess3 an. is useful in se(eral areas
19
&inimizing the num*er of states in )%A <-->
1:
Construct 'e0 Partition
1;
$xam le
C
start a a a a a
start
a a
2E