Prof.
Maya B S
Assistant Professor
Department of CS&E
BIT, Bangalore
Outline
Role of lexical analyzer
Input Buffering
Specification of tokens
Recognition of tokens
The role of lexical analyzer
token
Source Lexical To semantic
program Parser analysis
Analyzer
getNextToken
Symbol
table
Why to separate Lexical analysis
and parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair a token name and an optional token
value
A pattern is a description of the form that the lexemes
of a token may take.
A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
Token Informal description Sample lexemes
if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=
id Letter followed by letter and digits pi, score, D2
number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “ sorrounded by “ “core dumped”
printf(“total = %d\n”, score);
Attributes for tokens
E = M * C ** 2
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
Lexical errors
Some errors are out of power of lexical analyzer to
recognize:
fi (a == f(x)) …
However it may be able to recognize errors like:
d = 2r
Such errors are recognized when no pattern for tokens
matches a character sequence
Error recovery
Panic mode: successive characters are ignored until we
reach to a well formed token.
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
Input buffering
The lexical analyser uses two pointers to read tokens.
Lexeme-begin Pointer (lb)marks the beginning of
the current lexeme ,whose extent are attempting to
determine.
Search-pointer(forward-pointer) (sp)that keep track
of the portion of the input string scanned.
Initially both pointers point to the beginning of a lexeme
in figure.
The forward pointer “sp” then start scanning forward to
search for the end of lexeme. The end of the lexeme is
indicated by blank space .
Lb sp
Begin I := I + 1 : J
Reading input character by character from the secondary
storage is costly.
A block of data is read first into a buffer and then scanned by
the lexical analyser.
because of the amount of time taken to process character and
the large number of character that must be processed during the
compilation of a large source program , specialized buffering
technique have been developed to reduce the amount of
overhead required to process a single input character.
Two methods are available
One buffer scheme: There are problems if a lexeme crosses the
buffer boundary. To scan the rest of the lexeme, the buffer has
to be refilled there by overwriting the first part of the lexeme.
Two buffer scheme: Buffer1 and buffer2 are scanned
alternatively when the end of the current buffer is reached,
other buffer is filled. Hence the problem encountered in the
previous method is solved.
Two buffer scheme
Each buffer is of the same size N.
N is usually of disk block 4096bytes.
Using one system read command we can read N character into
buffer, rather than using one system call per character.
If fewer than N character remain in the input file, then a special
character represented by eof marks the end of the source file
and is different from any possible character of the source
program.
E = M * C * * 2 eof
Advancing forward requires that we first test whether we have
reached the end of one of the buffer and if so, we must reload
the other buffer from the input and move forward to the
beginning of the newly loaded buffer.
Each time we advance forward for each character read, we
make two test.
One for the end of the buffer
One to determine what character is read.
We can combine the buffer end test with the test for current
character if we extend each buffer to hold a sentinel character
at the end.
The sentinels is a special character that cannot be part of the
source program and a natural choice is the character eof.
Eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of buffer means the
input is at an end.
Algorithm for forward pointer with sentinels
E = M eof * C * * 2 eof eof
Switch (*forward++) {
case eof:
if (forward is at end of first buffer) {
reload second buffer;
forward = beginning of second buffer;
}
else if {forward is at end of second buffer) {
reload first buffer;\
forward = beginning of first buffer;
}
else /* eof within a buffer marks the end of input */
terminate lexical analysis;
break;
cases for the other characters;
}
Specification of tokens
Strings and language:
Alphabet: An alphabet is any finite set of symbols.
Symbols are letter,digit,punctuation.
Ex: set {0,1} is binary alphabet.
String: A string over an alphabet is finite sequence of symbol
drawn from that alphabet.
The length of string is |S|.
Ex: S=banana, |S|= 6
Empty string epsilon.(Ɛ)
1.Ɛs=sƐ=s (empty string is identity under concatenation)
2. x=dog, y=cat, xy=dogcat
3. S^3 =SSS
Language: A language is any countable set of string over some
fixed alphabet.
Empty set {Ɛ} ,set containing only the empty string , are called
abstract language.
Operations on language:
UNION
CONCATENATION
CLOSURE Kleene , Positive
operations Definition and notation
Union of two languages L and M L U M = {s | s is in L or s is in M}
Concatenation of two languages L and LM = {st | s is in L and t is in M}
M
The Kleene Closure of a language L L* = Zero or more occurrence of
language L.
The positive Closure of a language L L+= one or more occurrence of language
L.
Let L be the set of letter {A,B,--Za,b---z},the alphabet of
uppercase and lower case and it is language ,all of whose string
happen to be length one.
Let D be the set of digits {0,1,---9} ,alphabet of digits and it is
language of string length one.
Some of the other language can be constructed from language L and D
using operation.
1. L U D= is the set of letter and digits , strictly speaking the language with -
--------- string of length one, each of which strings either one letter or one
digits.
2. LD=is the set of -----------string of length two, each consisting of one
letter followed by one digit.
3. L4 = is the set of all ----------letter string.
4. D+ = is the set of string of one or more digits.
5. L(L U D)*= is the set of all string of letter and digits beginning with
letter.
Regular Expressions
In theory of compilation regular expressions are used to
formalize the specification of tokens
Regular expressions are means for specifying regular languages
Example:
Letter_(letter_ | digit)*
Each regular expression is a pattern specifying the form of
strings.
RE are built recursively out of smaller RE, using below rules.
Each RE ‘r’ denotes a language L(r ), which is also define
recursively from the language denoted by ‘r’ subexpressions.
The rules denote RE over some alphabet ∑ & language that those expression
denote.
Basis: There are two rules that form basis
1. Ɛ is a regular expression, L(Ɛ) = {Ɛ},i,.e language contain empty
string.
2. If ‘a’ is a symbol in ∑ then a is a regular expression, L(a) =
{a},i.e language with one string of length one.
Induction: suppose r and s are RE denoting language L(r )&
L(s )respectively.
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denting L(r), this rule say that we can
add additional pair of parenthesis around expressions without
changing the language.
Examples of RE
Let ∑={a,b}
1. RE a|b denotes language ----------
2. (a|b)(a|b) denotes language_________
3. a * denotes language_________
4. a|a*b denotes language______________
5. (a|b)*___________
6. (a*b*)*______________
Algebraic laws of RE
A language that can be defined by RE is called a “regular set”
If two RE r and s denote the same regular set, then both are
equivalent r=s.
Ex=(a|b)= (b|a)
Below table show the algebraic laws that hold for arbitrary RE r,
s,and t.
No LAW Description
1 r|s=s|r | is commutative
2 r |(s|t)=(r|s)|t | is associative
3 r(st)=(rs)t Concatenation is associative
4 r(s|t)=rs|rt Concatenation distributes over |
5 Ɛr=rƐ=r Ɛ is the identity for concatenation
6 r*= (r|Ɛ)* Ɛ is guaranteed in a closure
7 r**=r* * is idempotent
Regular definitions
We may give names to RE and use those names in subsequent expressions,
if the names were themself symbols.
If ∑ is an alphabet of basic symbols, then regular definition is a sequence of
definition of the form.
d1 -> r1
d2 -> r2
…
dn -> rn
Where ,
Each di is a new symbol , not in ∑ and not the same as any other of the di
and
Each ri is RE over the alphabet ∑ U {d1,d2,….di-1}
Example:
letter_ -> A | B | … | Z | a | b | … | Z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+= rr*
Zero of one instances: r? = r| Ɛ
Character classes: [abc]=[a|b|c]
[Link] Regular Definition for C language identifier in both
extended and short notation.
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_(letter|digit)*
2. Write a RD for unsigned number(integer or floating point) are string
5280, 0.01234,6.336E4, 1.89E-4
digit0|1|2|-----|9
digits digit digit*
optionalFraction.digits | Ɛ
optionalExponent(E(+|-| Ɛ)digits)| Ɛ
Number digits optionalFraction optionalExponent
Short notation:
digit[0-9]
digits digits+
Number digits(.digits)?(E[+-]?digits)?
Recognition of tokens
Study about how to take pattern for all needed tokens and build
a piece of code that examines the input string and find a prefix
that is a lexeme matching one of the character.
Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.) for braching
statement
The next step is to define all the terminals recognize from
lexical analysis phase for patterns using RD:
digit -> [0-9]
Digits -> digit+
number -> digit(.digits)? (E[+-]? Digit)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
Write regular definition to handle whitespaces:
ws -> (blank | tab | newline)+
Below table hows the each lexeme , which token name is returned
to the parser and what attribute value.
Lexeme(pattern) Token-name Attribute value
Any ws - -
if if -
Then Then -
Else Else -
Any id Id Pointer to table entry
Any number Number Pointer to table entry
< Relop LT
> Relop GT
<= Relop LE
= Relop EQ
<> Relop NT
>= Relop GE
Transition diagrams
An intermediate step in the construction of a lexical analyser ,
we first convert pattern into flowcharts called transition
diagram(TD).
TD have a collection of nodes or circles called states.
Each states represents a condition that could occur during the
process of scanning the input looking for a lexeme that matches
one of several patterns.
Edges are directed from one state of the TD to another.
Each edge is labelled by a symbol or set of symbols.
Assume all our TD are deterministic, meaning that there is
never more than one edge out of a given state with a given
symbol among its labels.
Certain states are said to be accepting or final states. These
states indicate that lexeme has been found. We always indicate
an accepting state by a double circle.
If there is an action to be taken , typically returning s token and
attribute value to the parser . We shall attach that action to the
accepting state.
In addition , if it is necessary to retract the forward pointer one
position (i.e the lexeme does not include the symbol that got us
to accepting state), then we shall additionally place a * near
that accepting state.
One state is designated the start state or initial state, it is
indicated by an edge, labelled start entering from no where.
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
An input alphabet
A set of states S
A start state n
A set of accepting states F S
A set of transitions state input state
35
Finite Automata
Transition
s1 a s2
Is read
In state s1 on input “a” go to state s2
If end of input
If in accepting state => accept, othewise => reject
If no transition possible => reject
36
Finite Automata State Graphs
A state
• The start state
• An accepting state
a
• A transition
37
1.A Simple Example
A finite automaton that accepts only “1”
A finite automaton accepts a string if we can follow
transitions labeled with the characters in the string
from the start to some accepting state
38
2. Another Simple Example
A finite automaton accepting any number of 1’s
followed by a single 0
Alphabet: {0,1}
Check that “1110” is accepted but “110…” is not
39
3. Another Example
Alphabet {0,1}
What language does this recognize?
1 0
0 0
1
1
40
Epsilon Moves
Another kind of transition: -moves
A B
• Machine can move from state A to state B
without reading input
41
Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA)
One transition per input per state
No -moves
Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input in a given state
Can have -moves
Finite automata have finite memory
Need only to encode the current state
42
Execution of Finite Automata
A DFA can take only one path through the state graph
Completely determined by input
NFAs can choose
Whether to make -moves
Which of multiple transitions for a single input to take
43
Acceptance of NFAs
An NFA can get into multiple states
1
0 1
• Input: 1 0 1
• Rule: NFA accepts if it can get in a final state
44
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages
(regular languages)
DFAs are easier to implement
There are no choices to consider
45
NFA vs. DFA (2)
For a given language the NFA can be simpler than the
DFA
1
0 0
NFA
0
1 0
0 0
DFA
1
1
• DFA can be exponentially larger than NFA
46
Regular Expressions to Finite Automata
High-level sketch
NFA
Regular
expressions DFA
Lexical Table-driven
Specification Implementation of DFA
47
[Link] diagrams for relational operator (VVIMP)
Transition diagram for relop, we must retract state 4,
has * to indicate that we must retract the input one
position.
Architecture of a transition-diagram-based lexical analyzer for relational
operator
TOKEN getRelop()
{
TOKEN retToken = new (RELOP)
while (1) { /* repeat character processing until a
return or failure occurs */
switch(state) {
case 0: c= nextchar();
if (c == ‘<‘) state = 1;
else if (c == ‘=‘) state = 5;
else if (c == ‘>’) state = 6;
else fail(); /* lexeme is not a relop */
break;
case 1: c=nextchar();
If(c==“=‘”) state=2;
Else if(c==“>”) state=3;
Else state= 4;
Break;
Case 2: [Link]=LE;
return(retToken);
break;
case 3: [Link]=NE;
return(retToken);
break;
Case 4: retract();
[Link] = LT;
return(retToken);
Break;
Case 5: [Link]=EQ;
return(retToken);
break;
Case 6: c=nextchar();
if(c==“=“) state=7;
else state=8;
Case 7:[Link]=GE;
return(retToken);
break;
case 8: retract();
[Link] = GT;
}
2. Transition diagram for reserved words and identifiers.
Keywords like if,then,else,char,const etc are reserved words, so they are not
identifier even though they look like identifier. Above diagram work for both
identifier and reserved words.
Install the reserved word in the symbol table initially.
A field of the symbol table entry indicates that these string are never ordinary
identifier and tells which token they represent.
When we find an identifier ,a call to installID place it in the
symbol table if it is not already there and returns a pointer to the
symbol table entry for the lexeme found.
Any identifier not in the symbol table during lexical analysis
cannot be a reserved word, so it is id.
Program segment for identifier or reserverd
{starting in state 0}
If the next character is a letter then
Advance the input;
{now in state 1}
While the next character is a letter or digit do
Advance the input; {stay in state 1}
End while;
{go to state 2, without advancing the input}
Accept;
Else
{error or other cases}
End if
3. Transition diagram for unsigned numbers.
4. Transition diagram for whitespace
Write separate transition diagram for each keyword If, then,
case, char, continue.
Write transition diagram for C comment lines.
{state 1}
If the next character is “/” then
Advance the input ; {state 2;}
If the next character is “*” then
Advance the input ; {state 3;}
Done:=false;
While not done do
While the next input character is not “*” do
Advance the input;
End while;
Advance the input; {state 4;} other *
/ 2 * * /
1 2 3 other 4 5
2
While the next input character is “*” do
Advance the input;
End while;
If the next input character is “/” then
Done:=true;
End if
Advance the input;
End while;
Accept;{state 5;}
Else { other processing}
End if;
Else {other processing}
End if;
Lexical Analyzer Generator - Lex
Lex Source program Lexical [Link].c
lex.l Compiler
[Link].c
C [Link]
compiler
Sequence
Input stream [Link]
of tokens
Structure of Lex programs
Syntax:
declarations
%%
translation rules Pattern {Action}
%%
auxiliary functions
Example for relational operator
%{
Int installID() {/* funtion to install the
/* definitions of manifest constants
lexeme, whose first character is
LT, LE, EQ, NE, GT, GE, pointed to by yytext, and whose
IF, THEN, ELSE, ID, NUMBER, RELOP */ length is yyleng, into the symbol
%} table and return a pointer thereto
*/
/* regular definitions }
delim [ \t\n]
ws {delim}+ Int installNum() { /* similar to
installID, but puts numerical
letter [A-Za-z]
constants into a separate table */
digit [0-9]
}
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}
…
Thank You