0% found this document useful (0 votes)

39 views46 pages

Lexical Analysis in Compiler Design

The document discusses lexical analysis, which breaks source code into tokens. It defines tokens, lexemes, and patterns. The lexical analyzer identifies lexemes in the source code and maps them to tokens consisting of a name and attribute value. These tokens are then passed to the parser for syntax analysis. The document contrasts lexical analysis with parsing and outlines other tasks of a lexical analyzer like removing comments and tracking line numbers.

Uploaded by

Daniel Bido Rasa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views46 pages

Lexical Analysis in Compiler Design

Uploaded by

Daniel Bido Rasa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

Principles of Compiler Design

(SENG471)

Chapter Two
Lexical Analysis

1
Objective
At the end of this session students will be able to:
 Understand the basic roles of lexical analyzer (LA): Lexical Analysis Versus Parsing,

Tokens , Patterns, and Lexemes, Attributes for Tokens and Lexical Error.
 Understand the specification of Tokens: Strings and Languages, Operations on

Languages, Regular Expressions, Regular Definitions and Extensions of Regular

Expressions

 Understand the generation of Tokens: Transition Diagrams, Recognition of Reserved

Words and Identifiers, Completion of the Running Example, Architecture of a

Transition-Diagram-Based Lexical Analysis.

 Understand the basics of Automata: Nondeterministic Finite Automata (NFA) Versus

Deterministic Finite Automata(DFA), and conversion of an NFA to a DFA.

2
Introduction
 The lexical analysis phase of compilation breaks the text file (program) into smaller

chunks called tokens.

 A token describes a pattern of characters having same meaning in the source

program. (such as identifiers, operators, keywords, numbers, delimiters and so on)

The lexical analysis phase of the compiler is often called tokenization

 The lexical analyzer (LA) reads the stream of characters making up the source

program and groups the characters into meaningful sequences called lexemes.

 For each lexeme, the lexical analyzer produces as output a token of the form {token-

name, attribute-value} that it passes on to the subsequent phase, syntax analysis.

Where,
 token- name is an abstract symbol that is used during syntax analysis , and
3
 attribute-value points to an entry in the symbol table for this token.
Example.
 For example, suppose a source program contains the assignment statement

position = initial + rate * 60

 The characters in this assignment could be grouped into the following lexemes and

mapped into the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token {id, 1}, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position.
 The symbol-table entry for an identifier holds information about the identifier,

such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this

token needs no attribute-value, the second component is omitted.

 We could have used any abstract symbol such as assign for the token-name , but for

notational convenience we have chosen to use the lexeme itself as the name of the
4
abstract symbol.
Contd.

4. + is a lexeme that is mapped into the token { + }.

5. rate is a lexeme that is mapped into the token { id, 3 }, where 3 points to the

symbol-table entry for rate.

6. * is a lexeme that is mapped into the token { * }.

7. 60 is a lexeme that is mapped into the token { 60 }.

Blanks separating the lexemes would be discarded by the lexical analyzer.

5
Contd. Parser uses tokens
produced by the LA to
create a tree-like
intermediate representation
that depicts the
grammatical structure of
 Ittheuses
token
thestream.
syntax tree and
the information in the
symbol table to check the
source program for
semantic consistency with
the language definition.
 An important part of
semantic analysis is type
In the process of translating a checking, where the
source program into target code, a compiler checks that each
compiler may construct one or operator has matching
more intermediate representations, operands
which are easy to produce and
easy to translate into the target The machine-independent code-
machine optimization phase attempts to
improve the intermediate code so
that better target code will result.
Usually better means faster, but
other objectives may be desired,
6
such as shorter code, or target
code that consumes less power.
The Role of Lexical Analyzer

Source token To semantic

Lexical
program Parser analysis
Analyzer
getNextToken

Symbol
Table

 It is common for the lexical analyzer to interact with the symbol table.

When the lexical analyzer discovers a lexeme constituting an identifier, it

needs to enter that lexeme into the symbol table.

In some cases, information regarding the kind of identifier may be read from

the symbol table by the lexical analyzer to assist it in determining the proper
7
token it must pass to the parser.
Other Tasks of an LA
 Since LA is the part of the compiler that reads the source text, it may perform

certain other tasks besides identification of lexemes such as:

1. Stripping out comments and whitespace (blank, newline, tab, and perhaps
other characters that are used to separate tokens in the input).
2. Correlating error messages generated by the compiler with the source program.
 For instance, the LA may keep track of the number of newline characters

seen, so it can associate a line number with each error message.

3. If the source program uses a macro-preprocessor, the expansion of macros may
also be performed by the lexical analyzer.
 Sometimes, lexical analyzers are divided into a cascade of two processes:

A. Scanning consists of the simple processes that do not require tokenization of

the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
B. Lexical analysis proper is the more complex portion, where the scanner
8
produces the sequence of tokens as output (tokenization)
Lexical Analysis Versus Parsing
 There are a number of reasons why the analysis portion of a compiler is normally

separated into lexical analysis and parsing (syntax analysis) phases:

1. Simplicity of Design:- is the most important consideration that often allows us to

simplify compilation tasks.

For example, a parser that had to deal with comments and whitespace as syntactic
units would be considerably more complex than one that can assume comments
and whitespace have already been removed by the lexical analyzer.
If we are designing a new language, separating lexical and syntactic concerns can
lead to a cleaner overall language design.
2. Compiler efficiency is improved:- A separate lexical analyzer allows us to apply

specialized techniques that serve only the lexical task, not the job of parsing.
In addition, specialized buffering techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced:- Input-device-specific peculiarities can be
9
restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters in the source program that matches the pattern for a
token.
Example: -5, i, sum, 100, for ; int 10 + -
Token:- is a pair consisting of a token name
 One andfor
token aneach
optional attribute
keyword. value. for a keyword is the
The pattern
same as the keyword itself.
Example:
 Tokens for the operators , either individually or in classes
Identifiers = { i, sum } such as the token in comparison operators
 One token representing all identifiers.
int_Constatnt = { 10, 100, -5 }  One or more tokens representing constants, such as numbers
Oppr = { +, - } and literal strings .
 Tokens for each punctuation symbol, such as left and right
rev_Words = { for, int } parentheses, comma, and semicolon
Separators = { ; , , }

Pattern is a rule describing the set of lexemes that can represent a particular token in source

program
It is a description of the form that the lexemes of a token may take.
10 In the case of a keyword as a token, the pattern is just the sequence of characters
Contd.
 One efficient but complex brute-force approach is to read character by character

to check if each one follows the right sequence of a token.

 The below example shows a partial code for this brute-force lexical analyzer.

 What we’d like is a set of tools that will allow us to easily create and modify a

lexical analyzer that has the same run-time efficiency as the brute-force
method.

if (cThe first tool that we use to attack this problem is Deterministic Finite Automata

= nextchar() == ‘c’) {
if (c = nextchar() == ‘l’) {
(DFA).
// code to handle the rest of either “class” or any identifier that starts with
“cl”
} else if (c == ‘a’) {
// code to handle the rest of either “case” or any identifier that starts with
“ca”
} else {
// code to handle any identifier that starts with c
11 }
} else if ( c = …..) {
Contd.

“Literal String”

12
Consideration for a simple design of Lexical Analyzer
Lexical Analyzer can allow source program to be
1. Free-Format Input:- the alignment of lexeme should not be necessary in determining the
correctness of the source program such restriction put extra load on Lexical Analyzer
2. Blanks Significance:- Simplify the task of identifying tokens
E.g. int a indicates <int is keyword> <a is identifier>
inta indicates <inta is Identifier>
3. Keyword must be reserved:- Keywords should be reserved otherwise LA will have to
predict whether the given lexeme should be treated as a keyword or as identifier
E.g. if then then then =else;
else else = then;
The above statements are misleading as then and else keywords are not reserved.
Approaches to implementation
 Use assembly language- Most efficient but most difficult to implement

 Use high level languages like C- Efficient but difficult to implement

13  Use tools like Lex, flex, javacc… Easy to implement but not as efficient as the first
Lexical Errors

 Are primarily of two kinds.

1. Lexemes whose length exceed the bound specified by the language.

 Most languages have bound on the precision of numeric constants.

 A constant whose bound exceeds this bound is a lexical error.

2. Illegal character in the program

 Characters like ~, , ,  occurring in a given programming

language (but not within a string or comment) are lexical errors.

3. Un terminated strings or comments.

14
Handling Lexical Errors
 It is hard for a LA to tell , without the aid of other components, that there is a

source-code error.
 For instance, if the string fi is encountered for the first time in a java program

in the context : fi ( a == 2*4+3 ) a lexical analyzer cannot tell whether fi is a

misspelling of the keyword if or an undeclared function identifier.

 However, suppose a situation arises in which the lexical analyzer is unable to

proceed because none of the patterns for tokens matches any prefix of the

remaining input.
 The simplest recovery strategy is "panic mode" recovery.

 We delete successive characters from the remaining input , until the lexical

analyzer can find a well-formed token at the beginning of what input is left.

15
 This recovery technique may confuse the parser, but in an interactive computing
Contd.

 Other possible error-recovery actions are:

1. Insert a missing character into the remaining input.

2. Replacing an incorrect character by a correct character

3. Transpose two adjacent characters(such as , fi=>if).

4. Deleting an extraneous character

5. Pre-scanning

16
Input Buffering
 There are some ways that the task of reading the source program can be

speeded
 This task is made difficult by the fact that we often have to look one or more

characters beyond the next lexeme before we can be sure we have the right
lexeme
 In C language: we need to look after -, = or < to decide what token to return

 We shall introduce a two-buffer scheme that handles large look-aheads safely

 We then consider an improvement involving sentinels that saves time checking

for the ends of buffers

 Because of the amount of time taken to process characters and the large

number of characters that must be processed during compilation of a large

source program, specialized buffering techniques have been developed to
17
reduce the amount of overhead to process a single input character
Contd.

 Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096

bytes
 Using one system read command, we can read N characters into a buffer,

rather than using one system call per character

 If fewer than N characters remain in the input file, then a special character,

represented by eof, marks the end of the source file and is different from any

possible character of the source program

 Two pointers to the input are maintained:

1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose

extent we are attempting to determine
18
2. Pointer forward scans ahead until a pattern match is found
Contd.
 Once the lexeme is determined, forward is set to the character at its right end (involves

retracting)
 Then, after the lexeme is recorded as an attribute value of the token returned to the

parser; lexemeBegin is set to the character immediately after the lexeme just
found
 Advancing forward requires that we first test whether we have reached the end of one

of the buffers, and if so, we must reload the other buffer from the input, and move
forward to the beginning of the newly loaded buffer

Sentinels
 If we use the previous scheme, we must check each time we advance forward, that

we have not moved off one of the buffers; if we do, then we must also reload the
other buffer
 Thus, for each character read, we must make two tests: one for the end of the
19
buffer, and one to determine which character is read
Contd.

Fig. Sentinels at the end of each buffer

 The sentinel is a special character that Switch (*forward++) {

case eof:
cannot be part of the source program, if (forward is at end of first buffer) {
reload second buffer;
and a natural choice is the character
forward = beginning of second buffer;
eof }
else if {forward is at end of second buffer) {
 Note that eof retains its use as a reload first buffer;\
forward = beginning of first buffer;
marker for the end of the entire
}
input else /* eof within a buffer marks the end of
input */
 Any eof that appears other than at the terminate lexical analysis;
break;
end of buffer means the input is at an cases for the other characters;
20 end }
Specification of Tokens
 Regular expressions are an important notation for specifying lexeme patterns.

 While they cannot express all possible patterns, they are very effective in specifying

those types of patterns that we actually need for tokens.

 In theory of compilation regular expressions are used to formalize the

specification of tokens

 Regular expressions are means for specifying regular languages

Strings and Languages

 An alphabet is any finite set of symbols. Typical examples of symbols are letters ,

digits, and punctuation.

 The set { 0, 1 } is the binary alphabet.

 ASCII is an important example of an alphabet ; it is used in many software

21
systems.
Contd.
 A string over an alphabet is a finite sequence of symbols drawn from that

alphabet .
 In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
 The length of a string s, usually written |s|, is the number of occurrences of

symbols in s.
 For example, banana is a string of length six.

 The empty string, denoted Ɛ, is the string of length zero.

 A language is any countable set of strings over some fixed alphabet.

 Abstract languages like  , the empty set, or {Ɛ} , the set containing only the

empty string, are languages under this definition.

 So too are the set of all syntactically well-formed java programs and the set

of all grammatically correct English sentences, although the latter two languages
22
are difficult to specify exactly.
Terms for Parts of Strings
 The following string-related terms are commonly used:

1. A prefix of string S is any string obtained by removing zero or more symbols from the
end of s.
 For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from
the beginning of s.
 For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
 For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those, prefixes,
suffixes, and substrings, respectively, of s that are not Ɛ or not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily

23 consecutive positions of s.

Operations on Languages
 The following string-related terms are commonly used:

 In lexical analysis, the most important operations on languages are union,

concatenation, and closure, which are defined formally below:

1. Union (U):- is the familiar operation on sets.
Example Union of L and M, L U M = {s|s is in L or s is in M}
2. Concatenation :- is all strings formed by taking a string from the first language and a
string from the second language, in all possible ways , and concatenating them.
Example Concatenation of L and M, LM = {st|s is in L and t is in M}
3. Kleene closure:- the kleene closure of a language L, denoted by L*, is the set of strings
you get by concatenating L zero, or more times.
Example Kleene closure of L,
Note that:- L0 , the "concatenation of L zero times," is defined to be {Ɛ} , and inductively, Li is
Li-1 L.
4. Positive closure:- is the positive closure of a language L, denoted by L+, is the same as
24
the Kleene closure, but without the term L0 . That is, Ɛ will not be in L+ unless it is L in
Contd.
Let L be the set of letters {A, B , . . . , Z , a, b, . . . , z} and let D be the set of digits {0, 1, . . . 9}.
We may think of L and D in two, essentially equivalent, ways.
 One way is that L and D are, respectively, the alphabets of uppercase and lowercase

letters and of digits.

 The second way is that L and D are languages, all of whose strings happen to be of

length one.
 Here are some other languages that can be constructed from languages L and D, using

the above operators:

1. L U D is the set of letters and digits - strictly speaking the language with 62
strings of length one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one
digit.

3. L4 is the set of all 4-letter strings.

25
4. L* is the set of all strings of letters, including Ɛ, the empty string.
Regular Expressions
 In theory of compilation regular expressions are used to formalize the specification of

tokens
 Regular expressions are means for specifying regular languages
Example: ID  letter_(letter|digit)*
letter A|B|…|Z|a|b|…|z|_
digit 0|1|2|…|9
 Each regular expression is a pattern specifying the form of strings
 The regular expressions are built recursively out of smaller regular expressions,

using the following two rules:

R1:- Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is , the language whose sole member
is the empty string.
R2:- If a is a symbol in ∑ then a is a regular expression, L(a) = {a}, that is , the language
with one string, of length one , with a in its one position.
Note:- By convention, we use italics for symbols, and boldface for their corresponding
regular expression.
26
 Each regular expression r denotes a language L(r), which is also defined
Contd.
 There are four parts to the induction whereby larger regular expressions are

built from smaller ones.

 Suppose r and s are regular expressions denoting languages L(r) and L(s) ,
respectively.
1. (r) |( s) is a regular expression denoting the language L(r) U L(s) .
2. (r) (s) is a regular expression denoting the language L(r)L (s) .
3. (r )* is a regular expression denoting (L (r) )*.
4. (r) is a regular expression denoting L(r) .
 This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
 Regular expressions often contain unnecessary pairs of parentheses .

 We may drop certain pairs of parentheses if we adopt the conventions that :

a. The unary operator * has highest precedence and is left associative.

b. Concatenation has second highest precedence and is left associative.
c. | has lowest precedence and is left associative.
27
Example:- (a) |( (b) *(c)) can be represented by a|b*c. Both expressions denote the
Contd.
 A language that can be defined by a regular expression is called a regular set.
 If two regular expressions r and s denote the same regular set , we say they are

equivalent and write r = s.

 For instance, (a|b) = (b|a).
 There are a number of algebraic laws for regular expressions; each law asserts that

expressions of two different forms are equivalent.

LAW DESCRIPTION
r|s = s|r | is commutative
r|(s|t) = (r|s)|t |is associative
r(st) = (rs)t Concatenation is associative
r(s|t) = rs|rt; Concatenation distributes over |
(s|t)r= sr|tr
Ɛr=rƐ=r Ɛ is the identity for concatenation
r* = (r|Ɛ)* Ɛ is guaranteed in a closure
r** = r* * is idempotent

28
Table: The algebraic laws that hold for arbitrary regular expressions r, s,
Regular Definitions
 If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of

definitions of the form : Where

d1  r1 1. Each di is a new symbol, not in ∑ and not
d2  r2 the same as any other of the d's, and
… 2. Each ri is a regular expression over the
dn  rn alphabet ∑ U { d1 ,d2 , ... ,di-1}.
Examples
(a) Regular Definition for Java Identifiers (b) Regular Definition for Java statements
stmt  if expr then stmt
ID  letter(letter|digit)*
| if expr then stmt else stmt
letter A|B|…|Z|a|b|…|z|_
|Ɛ
digit 0|1|2|…|9 expr  term relop term
(c) Regular Definition for unsigned numbers
| term
such as 5280 , 0.0 1 234, 6. 336E4, or 1.
89E-4 term  id
digit 0|1|2|…|9 | number
digits digit digit*
optFrac . digits|Ɛ
29
optExp (E(+|-|Ɛ) digits|Ɛ
Extensions of Regular Expressions
 Many extensions have been added to regular expressions to enhance their ability to

specify string patterns.

 Here are few notational extensions:

1. One or more instances(+):- is a unary postfix operator that represents the positive
closure of a regular expression and its language. That is , if r is a regular
expression, then (r)+ denotes the language (L( r) ) + .
1. The operator + has the same precedence and associativity as the operator *.
2. Two useful algebraic laws, r*=r+|Ɛ and r+=rr*=r*r.
2. Zero or one instance(?):- r? is equivalent to r|Ɛ , or put another way, L(r?) = L(r) U
{Ɛ}.
 The ? operator has the same precedence and associativity as * and +.

3. Character classes:- A regular expression a1|a2|···| an , where the ai's are each
symbols of the alphabet , can be replaced by the shorthand [al a2 ··· an].
Example:
30
 [abc] is shorthand for a| b | c, and
Example
 Using the above short hands we can rewrite the regular expression for examples a and c

in slide number 29 as follows :

Examples(a) Regular Definition for Java Identifiers (c) Regular Definition for unsigned numbers
ID  letter(letter|digit)* such as 5280 , 0.01234, 6.336E4, or 1.89E-
4
letter [A-Za-z_] digit [0-9]
digit [0-9] digits digit+
number  digits (. digits)? (E [+ -]? digits)?
Exercises
1. Consult the language reference manuals to determine
A. The sets of characters that form the input alphabet (excluding those that may
only appear in character strings or comments) ,
B. The lexical form of numerical constants, and
C. The lexical form of identifiers , for C++ and java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b) *a
B. ((Ɛ|a)b * )*
C. (a|b) *a (a|b) (a|b)
31 D. a* ba*ba*ba*.
E. (aa|bb) * ((ab |ba) (aa| bb) * (ab|b a) (aa|bb )*)*
Contd.

3. Write regular definitions for the following languages:

A. All strings of lowercase letters that contain the five vowels in order.

B. All strings of lowercase letters in which the letters are in ascending

lexicographic order.

C. All strings of binary digits with no repeated digit s.

D. All strings of binary digits with at most one repeated digit.

E. All strings of a ' s and b's where every a is preceded by b.

F. All strings of a 's and b's that contain substring abab.

32
Recognition of Regular Expressions
1. Starting point is the language grammar to understand the tokens:
stmt  if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr  term relop term
| term
term  id
| number

2. The next step is to formalize the patterns: 3. We also need to handle whitespaces:
digit  [0-9]
ws  (blank | tab | newline)+
Digits  digit+
number  digit(.digits)? (E[+-]? Digit)?
letter  [A-Za-z_]
id  letter (letter|digit)*
If  if
Then  then
Else  else
33
Relop  < | > | <= | >= | = | <>
Transition Diagram
Lexemes Token Names Attribute Value
Any ws - -
if if -
then then -
else else -
Any id id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE

Fig. Transition diagram for relop

Table: Tokens, their patterns, and attribute values

34
Fig. Transition diagram for reserved words and identifiers
Contd.

Fig. Transition diagram for unsigned numbers

Fig. Transition diagram for white spaces where delim

represents one or more whitespace characters

35
Finite Automata
 The lexical analyzer tools use finite automata, at the heart of the transition, to convert the

input program into a lexical analyzer .

 These are essentially graphs, like transition diagrams, with a few differences:

1. Finite automata are recognizers ; they simply say "yes" or "no" about each
possible input string.
2. Finite automata come in two flavors :
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of
their edges . A symbol can label several edges out of the same state, and
Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state, and for each
symbol of its input alphabet exactly one edge with that symbol leaving
that state.
 Both deterministic and nondeterministic finite automata are capable of recognizing

the same languages .

36  In fact these languages are exactly the same languages , called the regular
Finite Automata State Graphs
A state Example 1: A finite automaton that accepts
only “1” 1

The start state

(Initial State)
Example 2: A finite automaton accepting any
number of 1’s followed by a single 0
Accepting state
(Final State) 1

0
a
A Transition

Q: Check that “1110” is accepted but “110…”

is not?
37
Contd.
Question: What language does this recognize?

0 0

1
1

38
Nondeterministic Finite Automata (NFA)
 A nondeterministic finite automaton (NFA) consists of:

1. A finite set of states S.

2. A set of input symbols ∑, the input alphabet. We assume that Ɛ, which stands
for the empty string, is never a member of ∑ .
3. A transition function that gives, for each state, and for each symbol in ∑ U {Ɛ
} a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state) .

5. A set of states F, a subset of S, that is distinguished as the accepting states

(or final states) .
 We can represent either an NFA or DFA by a transition graph, where the nodes

are states and the labeled edges represent the transition function.
 There is an edge labeled a from state s to state t if and only if t is one of

the next states for state s and input a.

 This graph is very much like a transition diagram, except:

1. The same symbol can label edges from one state to several different states,
39
and
Contd.
 An NFA can get into multiple states

b a
Q1 Q2 Q3

b
Input: a b a

Rule: NFA accepts if it can get in a final state

40
Transition Tables
 We can also represent an NFA by a transition table, whose rows correspond to states,

and whose columns correspond to the input symbols and Ɛ.

 The entry for a given state and input is the value of the transition function applied

to those arguments.
 If the transition function has no information about that state-input pair, we put  in

the table for the pair.

Example:- The transition table for the NFA on the pervious slide is represented as:

The transition table has the advantage that Input a b Ɛ

we can easily find the transitions on a given State
state and input. Q1 Q1 { Q1 , Q2 } 
Its disadvantage is that it takes a lot of
Q2 Q3  
space, when the input alphabet is large, yet
most states do not have any moves on most Q3   
of the input symbols.
41
Acceptance of Input Strings by Automata
 An NFA accepts input string x if and only if there is some path in the transition graph

from the start state to one of the accepting(Final) states, such that the symbols along

the path spell out x.

 Note that Ɛ labels along the path are effectively ignored, since the empty string

does not contribute to the string constructed along the path.

Example:- Strings aaba and bbbba are accepted by the NFA in slide 40.

 The language defined (or accepted) by an NFA is the set of strings labeling some path

from the start to an accepting state.

 As was mentioned above, the NFA of slide # 40 defines the same language as does

the regular expression (a|b)* ba, that is, all strings from the alphabet {a, b} that

42 end in ba.
Deterministic Finite Automata (DFA)
 A deterministic finite automaton (DFA) is a special case of an NFA where:

1. There are no moves on input Ɛ , and

2. For each state S and input symbol a, there is exactly one edge out of s labeled a.

 If we are using a transition table to represent a DFA, then each entry is a single state.

 we may therefore represent this state without the curly braces that we use to form

sets.

 While the NFA is an abstract representation of an algorithm to recognize the strings of a

certain language, the DFA is a simple, concrete algorithm for recognizing strings.

 It is fortunate indeed that every regular expression and every NFA can be converted to a

DFA accepting the same language, because it is the DFA that we really implement or

simulate when building lexical analyzers.

43
 DFA recognize strings:- When a string is fed into a DFA, if the DFA recognizes the
Contd.
 A DFA is a collection of states and transitions:- Given the input string, the transitions

tell us how to move among the states.

 One of the states is denoted as the initial state, and a subset of the states are final

states.
 We start from the initial state, move from state to state via the transitions, and

check to see if we are in the final state when we have checked each character in the
string.
 If we are, then the string is accepted, otherwise, the string is rejected

 A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where

 Q is a finite set of states

 ∑ is a finite set called the alphabet

 δ is a total function from (Q x ∑) to Q known as transition function (a function

44
that takes a state and a symbol as inputs and returns a state)
Example
b b
1) q1 c q2 a q4

c c
a b

c q3 a A DFA that can accept the strings which begin

with a or b, or begin with c and contain at most
b
one a.

45 A DFA accepting (a|b) * abb

Thank
Thank You
You ...
...

Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
48 pages
Lexical Error Handling Techniques
No ratings yet
Lexical Error Handling Techniques
56 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
55 pages
Chapter 2-Lexical Analysis
No ratings yet
Chapter 2-Lexical Analysis
52 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
56 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
59 pages
CD - CH2 - Lexical Analysis
No ratings yet
CD - CH2 - Lexical Analysis
58 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
67 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
59 pages
Chapter Two
No ratings yet
Chapter Two
13 pages
UNIT-1 Compiler Design
No ratings yet
UNIT-1 Compiler Design
38 pages
Lexical Analyzer Design and Functions
No ratings yet
Lexical Analyzer Design and Functions
15 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
23 pages
Lexical Analysis in Compilers Explained
No ratings yet
Lexical Analysis in Compilers Explained
12 pages
Lec 7
No ratings yet
Lec 7
14 pages
4 Lexical Analysis
No ratings yet
4 Lexical Analysis
60 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
10 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
22 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
14 pages
Lexical Analyzer in Compiler Design
No ratings yet
Lexical Analyzer in Compiler Design
16 pages
Language Processors and Lexical Analysis
No ratings yet
Language Processors and Lexical Analysis
85 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
67 pages
Lexical Analyzer and Tokenization Overview
No ratings yet
Lexical Analyzer and Tokenization Overview
16 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
107 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
10 pages
Lexical Analyzer Output Explained
No ratings yet
Lexical Analyzer Output Explained
16 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
13 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
16 pages
Role of Lexical Analyzer in Compilers
No ratings yet
Role of Lexical Analyzer in Compilers
38 pages
Compiler Design: Lexical Analysis Overview
No ratings yet
Compiler Design: Lexical Analysis Overview
51 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
52 pages
Understanding Lexical Analysis Concepts
No ratings yet
Understanding Lexical Analysis Concepts
6 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
16 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
37 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
52 pages
Compiler Construction: Lexical Analysis Basics
No ratings yet
Compiler Construction: Lexical Analysis Basics
33 pages
RMM TOCC Unit 1 Part 2 Lect02 Lexical Analysis
No ratings yet
RMM TOCC Unit 1 Part 2 Lect02 Lexical Analysis
68 pages
Lexical Analyzer Overview and Functions
No ratings yet
Lexical Analyzer Overview and Functions
56 pages
Compiler Lexical Analyzer
No ratings yet
Compiler Lexical Analyzer
16 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
44 pages
Unit 2
No ratings yet
Unit 2
20 pages
Language Processing and Lexical Analysis
No ratings yet
Language Processing and Lexical Analysis
60 pages
Overview of Lexical Analysis in Compilers
No ratings yet
Overview of Lexical Analysis in Compilers
114 pages
CSC 409 Part 2 New
No ratings yet
CSC 409 Part 2 New
8 pages
Lexical Analyzer Design and Tools
No ratings yet
Lexical Analyzer Design and Tools
20 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
14 pages
Understanding Lexical Analysis Basics
No ratings yet
Understanding Lexical Analysis Basics
14 pages
Lexical Analyzer in Compiler Design
No ratings yet
Lexical Analyzer in Compiler Design
26 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
10 pages
Compiler Design: Lexical Analysis & Phases
No ratings yet
Compiler Design: Lexical Analysis & Phases
51 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
22 pages
CD Chapter 1
No ratings yet
CD Chapter 1
28 pages
Regular Expressions and Definitions Explained
No ratings yet
Regular Expressions and Definitions Explained
61 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
14 pages
DFA in Lexical Analysis of Compilers
No ratings yet
DFA in Lexical Analysis of Compilers
24 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
17 pages
CD Unit-1 Notes
No ratings yet
CD Unit-1 Notes
12 pages
Past Simple vs. Past Progressive Lesson
0% (1)
Past Simple vs. Past Progressive Lesson
2 pages
Linguistic Concepts and Morphology Overview
No ratings yet
Linguistic Concepts and Morphology Overview
16 pages
PSC Exam Question Paper 2024
No ratings yet
PSC Exam Question Paper 2024
459 pages
Understanding Nouns and Their Types
No ratings yet
Understanding Nouns and Their Types
43 pages
Present Simple Worksheet for Kids
100% (1)
Present Simple Worksheet for Kids
3 pages
Christmas Holiday English Worksheet
No ratings yet
Christmas Holiday English Worksheet
6 pages
Deixis in SpongeBob Movie Analysis
No ratings yet
Deixis in SpongeBob Movie Analysis
15 pages
Sentence Improvement Exercises
No ratings yet
Sentence Improvement Exercises
15 pages
Introduction to English Linguistics Course
No ratings yet
Introduction to English Linguistics Course
60 pages
Zulu Language Grammar Guide
No ratings yet
Zulu Language Grammar Guide
7 pages
Raksit Nanthatanti's Academic Records
No ratings yet
Raksit Nanthatanti's Academic Records
5 pages
Riyad Sir's English Exam Guide
No ratings yet
Riyad Sir's English Exam Guide
8 pages
Exclamatory & Imperative Speech Change
50% (4)
Exclamatory & Imperative Speech Change
8 pages
Understanding Sentence Structures
No ratings yet
Understanding Sentence Structures
24 pages
English Language Scheme of Work 2020
No ratings yet
English Language Scheme of Work 2020
10 pages
Comprehensive English Grammar Guide
No ratings yet
Comprehensive English Grammar Guide
65 pages
NLTK and spaCy: NLP Tools Overview
No ratings yet
NLTK and spaCy: NLP Tools Overview
24 pages
Setswana-English Dictionary Overview
No ratings yet
Setswana-English Dictionary Overview
678 pages
Noun and Its Kinds
No ratings yet
Noun and Its Kinds
3 pages
Grade 7 English Summative Test Guide
No ratings yet
Grade 7 English Summative Test Guide
3 pages
Understanding the Accusative Case in German
No ratings yet
Understanding the Accusative Case in German
3 pages
Cause and Effect
No ratings yet
Cause and Effect
8 pages
UK English Spellchecker in MS Word
No ratings yet
UK English Spellchecker in MS Word
2 pages
Chat GPT 2
No ratings yet
Chat GPT 2
35 pages
Bilal Bentahar Course Schedule 2018-2019
No ratings yet
Bilal Bentahar Course Schedule 2018-2019
40 pages
Past Participles: Exercises & Answers
No ratings yet
Past Participles: Exercises & Answers
6 pages
Functions of Nouns Explained
No ratings yet
Functions of Nouns Explained
92 pages
Analyzing Diction and Sentence Structure
No ratings yet
Analyzing Diction and Sentence Structure
3 pages
Grade 7 Digital Life Lesson Plan
No ratings yet
Grade 7 Digital Life Lesson Plan
2 pages
IFP English Course: Simple Present & Adverbs
No ratings yet
IFP English Course: Simple Present & Adverbs
37 pages