0% found this document useful (0 votes)
27 views62 pages

Lexical Analysis in Compiler Design

The document outlines the role of a lexical analyzer in compiler design, emphasizing the separation of lexical analysis and parsing for improved efficiency and portability. It discusses tokens, patterns, and lexemes, along with input buffering techniques, error recovery methods, and the specification of tokens using regular expressions. Additionally, it covers the construction of finite automata and transition diagrams for recognizing tokens in programming languages.

Uploaded by

AAA School
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views62 pages

Lexical Analysis in Compiler Design

The document outlines the role of a lexical analyzer in compiler design, emphasizing the separation of lexical analysis and parsing for improved efficiency and portability. It discusses tokens, patterns, and lexemes, along with input buffering techniques, error recovery methods, and the specification of tokens using regular expressions. Additionally, it covers the construction of finite automata and transition diagrams for recognizing tokens in programming languages.

Uploaded by

AAA School
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Prof.

Maya B S
Assistant Professor
Department of CS&E
BIT, Bangalore
Outline
 Role of lexical analyzer
 Input Buffering
 Specification of tokens
 Recognition of tokens
The role of lexical analyzer
token
Source Lexical To semantic
program Parser analysis
Analyzer
getNextToken

Symbol
table
Why to separate Lexical analysis
and parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
 A token is a pair a token name and an optional token
value
 A pattern is a description of the form that the lexemes
of a token may take.
 A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
Token Informal description Sample lexemes
if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=

id Letter followed by letter and digits pi, score, D2


number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “ sorrounded by “ “core dumped”

printf(“total = %d\n”, score);


Attributes for tokens
 E = M * C ** 2
 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>
Lexical errors
 Some errors are out of power of lexical analyzer to
recognize:
 fi (a == f(x)) …
 However it may be able to recognize errors like:
 d = 2r
 Such errors are recognized when no pattern for tokens
matches a character sequence
Error recovery
 Panic mode: successive characters are ignored until we
reach to a well formed token.
 Delete one character from the remaining input
 Insert a missing character into the remaining input
 Replace a character by another character
 Transpose two adjacent characters
Input buffering
The lexical analyser uses two pointers to read tokens.
 Lexeme-begin Pointer (lb)marks the beginning of
the current lexeme ,whose extent are attempting to
determine.
 Search-pointer(forward-pointer) (sp)that keep track
of the portion of the input string scanned.
Initially both pointers point to the beginning of a lexeme
in figure.
The forward pointer “sp” then start scanning forward to
search for the end of lexeme. The end of the lexeme is
indicated by blank space .
Lb sp
Begin I := I + 1 : J
 Reading input character by character from the secondary
storage is costly.

 A block of data is read first into a buffer and then scanned by


the lexical analyser.

 because of the amount of time taken to process character and


the large number of character that must be processed during the
compilation of a large source program , specialized buffering
technique have been developed to reduce the amount of
overhead required to process a single input character.
Two methods are available
 One buffer scheme: There are problems if a lexeme crosses the
buffer boundary. To scan the rest of the lexeme, the buffer has
to be refilled there by overwriting the first part of the lexeme.

 Two buffer scheme: Buffer1 and buffer2 are scanned


alternatively when the end of the current buffer is reached,
other buffer is filled. Hence the problem encountered in the
previous method is solved.
Two buffer scheme
 Each buffer is of the same size N.

 N is usually of disk block 4096bytes.

 Using one system read command we can read N character into


buffer, rather than using one system call per character.

 If fewer than N character remain in the input file, then a special


character represented by eof marks the end of the source file
and is different from any possible character of the source
program.
E = M * C * * 2 eof
 Advancing forward requires that we first test whether we have
reached the end of one of the buffer and if so, we must reload
the other buffer from the input and move forward to the
beginning of the newly loaded buffer.

 Each time we advance forward for each character read, we


make two test.

 One for the end of the buffer

 One to determine what character is read.

 We can combine the buffer end test with the test for current
character if we extend each buffer to hold a sentinel character
at the end.
 The sentinels is a special character that cannot be part of the
source program and a natural choice is the character eof.

 Eof retains its use as a marker for the end of the entire input.

 Any eof that appears other than at the end of buffer means the
input is at an end.
Algorithm for forward pointer with sentinels

E = M eof * C * * 2 eof eof


Switch (*forward++) {
case eof:
if (forward is at end of first buffer) {
reload second buffer;
forward = beginning of second buffer;
}
else if {forward is at end of second buffer) {
reload first buffer;\
forward = beginning of first buffer;
}
else /* eof within a buffer marks the end of input */
terminate lexical analysis;
break;
cases for the other characters;
}
Specification of tokens
 Strings and language:
Alphabet: An alphabet is any finite set of symbols.
Symbols are letter,digit,punctuation.
Ex: set {0,1} is binary alphabet.
String: A string over an alphabet is finite sequence of symbol
drawn from that alphabet.
The length of string is |S|.
Ex: S=banana, |S|= 6
Empty string epsilon.(Ɛ)
1.Ɛs=sƐ=s (empty string is identity under concatenation)
2. x=dog, y=cat, xy=dogcat
3. S^3 =SSS
Language: A language is any countable set of string over some
fixed alphabet.
 Empty set {Ɛ} ,set containing only the empty string , are called
abstract language.

Operations on language:
 UNION
 CONCATENATION
 CLOSURE Kleene , Positive
operations Definition and notation

Union of two languages L and M L U M = {s | s is in L or s is in M}

Concatenation of two languages L and LM = {st | s is in L and t is in M}


M

The Kleene Closure of a language L L* = Zero or more occurrence of


language L.

The positive Closure of a language L L+= one or more occurrence of language


L.
 Let L be the set of letter {A,B,--Za,b---z},the alphabet of
uppercase and lower case and it is language ,all of whose string
happen to be length one.
 Let D be the set of digits {0,1,---9} ,alphabet of digits and it is
language of string length one.
 Some of the other language can be constructed from language L and D
using operation.
1. L U D= is the set of letter and digits , strictly speaking the language with -
--------- string of length one, each of which strings either one letter or one
digits.
2. LD=is the set of -----------string of length two, each consisting of one
letter followed by one digit.
3. L4 = is the set of all ----------letter string.
4. D+ = is the set of string of one or more digits.
5. L(L U D)*= is the set of all string of letter and digits beginning with
letter.
Regular Expressions
 In theory of compilation regular expressions are used to
formalize the specification of tokens
 Regular expressions are means for specifying regular languages
 Example:
 Letter_(letter_ | digit)*

 Each regular expression is a pattern specifying the form of


strings.
 RE are built recursively out of smaller RE, using below rules.
 Each RE ‘r’ denotes a language L(r ), which is also define
recursively from the language denoted by ‘r’ subexpressions.
The rules denote RE over some alphabet ∑ & language that those expression
denote.

Basis: There are two rules that form basis


1. Ɛ is a regular expression, L(Ɛ) = {Ɛ},i,.e language contain empty
string.
2. If ‘a’ is a symbol in ∑ then a is a regular expression, L(a) =
{a},i.e language with one string of length one.
Induction: suppose r and s are RE denoting language L(r )&
L(s )respectively.
 (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
 (r)(s) is a regular expression denoting the language L(r)L(s)
 (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denting L(r), this rule say that we can
add additional pair of parenthesis around expressions without
changing the language.
Examples of RE
 Let ∑={a,b}
1. RE a|b denotes language ----------
2. (a|b)(a|b) denotes language_________
3. a * denotes language_________
4. a|a*b denotes language______________
5. (a|b)*___________
6. (a*b*)*______________
Algebraic laws of RE

A language that can be defined by RE is called a “regular set”

If two RE r and s denote the same regular set, then both are
equivalent r=s.

Ex=(a|b)= (b|a)

Below table show the algebraic laws that hold for arbitrary RE r,
s,and t.
No LAW Description

1 r|s=s|r | is commutative

2 r |(s|t)=(r|s)|t | is associative

3 r(st)=(rs)t Concatenation is associative

4 r(s|t)=rs|rt Concatenation distributes over |

5 Ɛr=rƐ=r Ɛ is the identity for concatenation

6 r*= (r|Ɛ)* Ɛ is guaranteed in a closure

7 r**=r* * is idempotent
Regular definitions
 We may give names to RE and use those names in subsequent expressions,
if the names were themself symbols.
 If ∑ is an alphabet of basic symbols, then regular definition is a sequence of
definition of the form.
d1 -> r1
d2 -> r2

dn -> rn
Where ,
 Each di is a new symbol , not in ∑ and not the same as any other of the di
and
 Each ri is RE over the alphabet ∑ U {d1,d2,….di-1}

Example:
letter_ -> A | B | … | Z | a | b | … | Z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
 One or more instances: (r)+= rr*
 Zero of one instances: r? = r| Ɛ
 Character classes: [abc]=[a|b|c]
[Link] Regular Definition for C language identifier in both
extended and short notation.
 letter_ -> [A-Za-z_]
 digit -> [0-9]
 id -> letter_(letter|digit)*
2. Write a RD for unsigned number(integer or floating point) are string
5280, 0.01234,6.336E4, 1.89E-4

 digit0|1|2|-----|9
 digits digit digit*
 optionalFraction.digits | Ɛ
 optionalExponent(E(+|-| Ɛ)digits)| Ɛ
 Number digits optionalFraction optionalExponent

Short notation:
 digit[0-9]
 digits digits+
 Number digits(.digits)?(E[+-]?digits)?
Recognition of tokens
 Study about how to take pattern for all needed tokens and build
a piece of code that examines the input string and find a prefix
that is a lexeme matching one of the character.
 Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt

expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.) for braching
statement
 The next step is to define all the terminals recognize from
lexical analysis phase for patterns using RD:
digit -> [0-9]
Digits -> digit+
number -> digit(.digits)? (E[+-]? Digit)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
 Write regular definition to handle whitespaces:
ws -> (blank | tab | newline)+
Below table hows the each lexeme , which token name is returned
to the parser and what attribute value.
Lexeme(pattern) Token-name Attribute value
Any ws - -
if if -
Then Then -
Else Else -
Any id Id Pointer to table entry
Any number Number Pointer to table entry
< Relop LT
> Relop GT
<= Relop LE
= Relop EQ
<> Relop NT
>= Relop GE
Transition diagrams
 An intermediate step in the construction of a lexical analyser ,
we first convert pattern into flowcharts called transition
diagram(TD).

 TD have a collection of nodes or circles called states.

 Each states represents a condition that could occur during the


process of scanning the input looking for a lexeme that matches
one of several patterns.

 Edges are directed from one state of the TD to another.

 Each edge is labelled by a symbol or set of symbols.


 Assume all our TD are deterministic, meaning that there is
never more than one edge out of a given state with a given
symbol among its labels.
 Certain states are said to be accepting or final states. These
states indicate that lexeme has been found. We always indicate
an accepting state by a double circle.
 If there is an action to be taken , typically returning s token and
attribute value to the parser . We shall attach that action to the
accepting state.
 In addition , if it is necessary to retract the forward pointer one
position (i.e the lexeme does not include the symbol that got us
to accepting state), then we shall additionally place a * near
that accepting state.
 One state is designated the start state or initial state, it is
indicated by an edge, labelled start entering from no where.
Finite Automata
 Regular expressions = specification
 Finite automata = implementation

 A finite automaton consists of


 An input alphabet 
 A set of states S
 A start state n
 A set of accepting states F  S
 A set of transitions state input state

35
Finite Automata

 Transition
s1 a s2
 Is read
In state s1 on input “a” go to state s2

 If end of input
 If in accepting state => accept, othewise => reject
 If no transition possible => reject

36
Finite Automata State Graphs

 A state

• The start state

• An accepting state

a
• A transition

37
1.A Simple Example

 A finite automaton that accepts only “1”

 A finite automaton accepts a string if we can follow


transitions labeled with the characters in the string
from the start to some accepting state

38
2. Another Simple Example

 A finite automaton accepting any number of 1’s


followed by a single 0
 Alphabet: {0,1}

 Check that “1110” is accepted but “110…” is not

39
3. Another Example
 Alphabet {0,1}
 What language does this recognize?

1 0

0 0

1
1

40
Epsilon Moves

 Another kind of transition: -moves



A B

• Machine can move from state A to state B


without reading input

41
Deterministic and Nondeterministic Automata
 Deterministic Finite Automata (DFA)
 One transition per input per state
 No -moves
 Nondeterministic Finite Automata (NFA)
 Can have multiple transitions for one input in a given state
 Can have -moves
 Finite automata have finite memory
 Need only to encode the current state

42
Execution of Finite Automata

 A DFA can take only one path through the state graph
 Completely determined by input

 NFAs can choose


 Whether to make -moves
 Which of multiple transitions for a single input to take

43
Acceptance of NFAs

 An NFA can get into multiple states


1

0 1

• Input: 1 0 1

• Rule: NFA accepts if it can get in a final state

44
NFA vs. DFA (1)

 NFAs and DFAs recognize the same set of languages


(regular languages)

 DFAs are easier to implement


 There are no choices to consider

45
NFA vs. DFA (2)

 For a given language the NFA can be simpler than the


DFA
1
0 0
NFA
0

1 0
0 0
DFA
1
1

• DFA can be exponentially larger than NFA

46
Regular Expressions to Finite Automata

 High-level sketch

NFA

Regular
expressions DFA

Lexical Table-driven
Specification Implementation of DFA

47
[Link] diagrams for relational operator (VVIMP)
 Transition diagram for relop, we must retract state 4,
has * to indicate that we must retract the input one
position.
Architecture of a transition-diagram-based lexical analyzer for relational
operator
TOKEN getRelop()
{
TOKEN retToken = new (RELOP)
while (1) { /* repeat character processing until a
return or failure occurs */
switch(state) {
case 0: c= nextchar();
if (c == ‘<‘) state = 1;
else if (c == ‘=‘) state = 5;
else if (c == ‘>’) state = 6;
else fail(); /* lexeme is not a relop */
break;
case 1: c=nextchar();
If(c==“=‘”) state=2;
Else if(c==“>”) state=3;
Else state= 4;
Break;
Case 2: [Link]=LE;
return(retToken);
break;
case 3: [Link]=NE;
return(retToken);
break;
Case 4: retract();
[Link] = LT;
return(retToken);
Break;
Case 5: [Link]=EQ;
return(retToken);
break;
Case 6: c=nextchar();
if(c==“=“) state=7;
else state=8;
Case 7:[Link]=GE;
return(retToken);
break;
case 8: retract();
[Link] = GT;
}
 2. Transition diagram for reserved words and identifiers.

Keywords like if,then,else,char,const etc are reserved words, so they are not
identifier even though they look like identifier. Above diagram work for both
identifier and reserved words.

Install the reserved word in the symbol table initially.

A field of the symbol table entry indicates that these string are never ordinary
identifier and tells which token they represent.
When we find an identifier ,a call to installID place it in the
symbol table if it is not already there and returns a pointer to the
symbol table entry for the lexeme found.

Any identifier not in the symbol table during lexical analysis


cannot be a reserved word, so it is id.
Program segment for identifier or reserverd
{starting in state 0}
If the next character is a letter then
Advance the input;
{now in state 1}
While the next character is a letter or digit do
Advance the input; {stay in state 1}
End while;
{go to state 2, without advancing the input}
Accept;
Else
{error or other cases}
End if
3. Transition diagram for unsigned numbers.
4. Transition diagram for whitespace
 Write separate transition diagram for each keyword If, then,
case, char, continue.
Write transition diagram for C comment lines.
 {state 1}
 If the next character is “/” then
 Advance the input ; {state 2;}
 If the next character is “*” then
 Advance the input ; {state 3;}
 Done:=false;
 While not done do
 While the next input character is not “*” do
 Advance the input;
 End while;
 Advance the input; {state 4;} other *

 / 2 * * /
1 2 3 other 4 5
2
 While the next input character is “*” do
 Advance the input;
 End while;
 If the next input character is “/” then
 Done:=true;
 End if
 Advance the input;
 End while;
 Accept;{state 5;}
 Else { other processing}
 End if;
 Else {other processing}
 End if;
Lexical Analyzer Generator - Lex

Lex Source program Lexical [Link].c


lex.l Compiler

[Link].c
C [Link]
compiler

Sequence
Input stream [Link]
of tokens
Structure of Lex programs

Syntax:

declarations
%%
translation rules Pattern {Action}
%%
auxiliary functions
Example for relational operator
%{
Int installID() {/* funtion to install the
/* definitions of manifest constants
lexeme, whose first character is
LT, LE, EQ, NE, GT, GE, pointed to by yytext, and whose
IF, THEN, ELSE, ID, NUMBER, RELOP */ length is yyleng, into the symbol
%} table and return a pointer thereto
*/
/* regular definitions }
delim [ \t\n]
ws {delim}+ Int installNum() { /* similar to
installID, but puts numerical
letter [A-Za-z]
constants into a separate table */
digit [0-9]
}
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}

Thank You

You might also like