0% found this document useful (0 votes)
39 views46 pages

Lexical Analysis in Compiler Design

The document discusses lexical analysis, which breaks source code into tokens. It defines tokens, lexemes, and patterns. The lexical analyzer identifies lexemes in the source code and maps them to tokens consisting of a name and attribute value. These tokens are then passed to the parser for syntax analysis. The document contrasts lexical analysis with parsing and outlines other tasks of a lexical analyzer like removing comments and tracking line numbers.

Uploaded by

Daniel Bido Rasa
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views46 pages

Lexical Analysis in Compiler Design

The document discusses lexical analysis, which breaks source code into tokens. It defines tokens, lexemes, and patterns. The lexical analyzer identifies lexemes in the source code and maps them to tokens consisting of a name and attribute value. These tokens are then passed to the parser for syntax analysis. The document contrasts lexical analysis with parsing and outlines other tasks of a lexical analyzer like removing comments and tracking line numbers.

Uploaded by

Daniel Bido Rasa
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Principles of Compiler Design

(SENG471)

Chapter Two
Lexical Analysis

1
Objective
At the end of this session students will be able to:
 Understand the basic roles of lexical analyzer (LA): Lexical Analysis Versus Parsing,

Tokens , Patterns, and Lexemes, Attributes for Tokens and Lexical Error.
 Understand the specification of Tokens: Strings and Languages, Operations on

Languages, Regular Expressions, Regular Definitions and Extensions of Regular


Expressions

 Understand the generation of Tokens: Transition Diagrams, Recognition of Reserved

Words and Identifiers, Completion of the Running Example, Architecture of a

Transition-Diagram-Based Lexical Analysis.

 Understand the basics of Automata: Nondeterministic Finite Automata (NFA) Versus

Deterministic Finite Automata(DFA), and conversion of an NFA to a DFA.

2
Introduction
 The lexical analysis phase of compilation breaks the text file (program) into smaller

chunks called tokens.

 A token describes a pattern of characters having same meaning in the source

program. (such as identifiers, operators, keywords, numbers, delimiters and so on)

The lexical analysis phase of the compiler is often called tokenization


 The lexical analyzer (LA) reads the stream of characters making up the source

program and groups the characters into meaningful sequences called lexemes.

 For each lexeme, the lexical analyzer produces as output a token of the form {token-

name, attribute-value} that it passes on to the subsequent phase, syntax analysis.

Where,
 token- name is an abstract symbol that is used during syntax analysis , and
3
 attribute-value points to an entry in the symbol table for this token.
Example.
 For example, suppose a source program contains the assignment statement

position = initial + rate * 60


 The characters in this assignment could be grouped into the following lexemes and

mapped into the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token {id, 1}, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position.
 The symbol-table entry for an identifier holds information about the identifier,

such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this

token needs no attribute-value, the second component is omitted.


 We could have used any abstract symbol such as assign for the token-name , but for

notational convenience we have chosen to use the lexeme itself as the name of the
4
abstract symbol.
Contd.

4. + is a lexeme that is mapped into the token { + }.

5. rate is a lexeme that is mapped into the token { id, 3 }, where 3 points to the

symbol-table entry for rate.

6. * is a lexeme that is mapped into the token { * }.

7. 60 is a lexeme that is mapped into the token { 60 }.

Blanks separating the lexemes would be discarded by the lexical analyzer.

5
Contd. Parser uses tokens
produced by the LA to
create a tree-like
intermediate representation
that depicts the
grammatical structure of
 Ittheuses
token
thestream.
syntax tree and
the information in the
symbol table to check the
source program for
semantic consistency with
the language definition.
 An important part of
semantic analysis is type
In the process of translating a checking, where the
source program into target code, a compiler checks that each
compiler may construct one or operator has matching
more intermediate representations, operands
which are easy to produce and
easy to translate into the target The machine-independent code-
machine optimization phase attempts to
improve the intermediate code so
that better target code will result.
Usually better means faster, but
other objectives may be desired,
6
such as shorter code, or target
code that consumes less power.
The Role of Lexical Analyzer

Source token To semantic


Lexical
program Parser analysis
Analyzer
getNextToken

Symbol
Table

 It is common for the lexical analyzer to interact with the symbol table.

When the lexical analyzer discovers a lexeme constituting an identifier, it

needs to enter that lexeme into the symbol table.

In some cases, information regarding the kind of identifier may be read from

the symbol table by the lexical analyzer to assist it in determining the proper
7
token it must pass to the parser.
Other Tasks of an LA
 Since LA is the part of the compiler that reads the source text, it may perform

certain other tasks besides identification of lexemes such as:


1. Stripping out comments and whitespace (blank, newline, tab, and perhaps
other characters that are used to separate tokens in the input).
2. Correlating error messages generated by the compiler with the source program.
 For instance, the LA may keep track of the number of newline characters

seen, so it can associate a line number with each error message.


3. If the source program uses a macro-preprocessor, the expansion of macros may
also be performed by the lexical analyzer.
 Sometimes, lexical analyzers are divided into a cascade of two processes:

A. Scanning consists of the simple processes that do not require tokenization of


the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
B. Lexical analysis proper is the more complex portion, where the scanner
8
produces the sequence of tokens as output (tokenization)
Lexical Analysis Versus Parsing
 There are a number of reasons why the analysis portion of a compiler is normally

separated into lexical analysis and parsing (syntax analysis) phases:


1. Simplicity of Design:- is the most important consideration that often allows us to

simplify compilation tasks.


For example, a parser that had to deal with comments and whitespace as syntactic
units would be considerably more complex than one that can assume comments
and whitespace have already been removed by the lexical analyzer.
If we are designing a new language, separating lexical and syntactic concerns can
lead to a cleaner overall language design.
2. Compiler efficiency is improved:- A separate lexical analyzer allows us to apply

specialized techniques that serve only the lexical task, not the job of parsing.
In addition, specialized buffering techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced:- Input-device-specific peculiarities can be
9
restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters in the source program that matches the pattern for a
token.
Example: -5, i, sum, 100, for ; int 10 + -
Token:- is a pair consisting of a token name
 One andfor
token aneach
optional attribute
keyword. value. for a keyword is the
The pattern
same as the keyword itself.
Example:
 Tokens for the operators , either individually or in classes
Identifiers = { i, sum } such as the token in comparison operators
 One token representing all identifiers.
int_Constatnt = { 10, 100, -5 }  One or more tokens representing constants, such as numbers
Oppr = { +, - } and literal strings .
 Tokens for each punctuation symbol, such as left and right
rev_Words = { for, int } parentheses, comma, and semicolon
Separators = { ; , , }

Pattern is a rule describing the set of lexemes that can represent a particular token in source

program
It is a description of the form that the lexemes of a token may take.
10 In the case of a keyword as a token, the pattern is just the sequence of characters
Contd.
 One efficient but complex brute-force approach is to read character by character

to check if each one follows the right sequence of a token.


 The below example shows a partial code for this brute-force lexical analyzer.

 What we’d like is a set of tools that will allow us to easily create and modify a

lexical analyzer that has the same run-time efficiency as the brute-force
method.

if (cThe first tool that we use to attack this problem is Deterministic Finite Automata

= nextchar() == ‘c’) {
if (c = nextchar() == ‘l’) {
(DFA).
// code to handle the rest of either “class” or any identifier that starts with
“cl”
} else if (c == ‘a’) {
// code to handle the rest of either “case” or any identifier that starts with
“ca”
} else {
// code to handle any identifier that starts with c
11 }
} else if ( c = …..) {
Contd.

“Literal String”

12
Consideration for a simple design of Lexical Analyzer
Lexical Analyzer can allow source program to be
1. Free-Format Input:- the alignment of lexeme should not be necessary in determining the
correctness of the source program such restriction put extra load on Lexical Analyzer
2. Blanks Significance:- Simplify the task of identifying tokens
E.g. int a indicates <int is keyword> <a is identifier>
inta indicates <inta is Identifier>
3. Keyword must be reserved:- Keywords should be reserved otherwise LA will have to
predict whether the given lexeme should be treated as a keyword or as identifier
E.g. if then then then =else;
else else = then;
The above statements are misleading as then and else keywords are not reserved.
Approaches to implementation
 Use assembly language- Most efficient but most difficult to implement

 Use high level languages like C- Efficient but difficult to implement


13  Use tools like Lex, flex, javacc… Easy to implement but not as efficient as the first
Lexical Errors

 Are primarily of two kinds.

1. Lexemes whose length exceed the bound specified by the language.

 Most languages have bound on the precision of numeric constants.

 A constant whose bound exceeds this bound is a lexical error.

2. Illegal character in the program

 Characters like ~, , ,  occurring in a given programming

language (but not within a string or comment) are lexical errors.

3. Un terminated strings or comments.

14
Handling Lexical Errors
 It is hard for a LA to tell , without the aid of other components, that there is a

source-code error.
 For instance, if the string fi is encountered for the first time in a java program

in the context : fi ( a == 2*4+3 ) a lexical analyzer cannot tell whether fi is a

misspelling of the keyword if or an undeclared function identifier.


 However, suppose a situation arises in which the lexical analyzer is unable to

proceed because none of the patterns for tokens matches any prefix of the

remaining input.
 The simplest recovery strategy is "panic mode" recovery.

 We delete successive characters from the remaining input , until the lexical

analyzer can find a well-formed token at the beginning of what input is left.

15
 This recovery technique may confuse the parser, but in an interactive computing
Contd.

 Other possible error-recovery actions are:

1. Insert a missing character into the remaining input.

2. Replacing an incorrect character by a correct character

3. Transpose two adjacent characters(such as , fi=>if).

4. Deleting an extraneous character

5. Pre-scanning

16
Input Buffering
 There are some ways that the task of reading the source program can be

speeded
 This task is made difficult by the fact that we often have to look one or more

characters beyond the next lexeme before we can be sure we have the right
lexeme
 In C language: we need to look after -, = or < to decide what token to return

 We shall introduce a two-buffer scheme that handles large look-aheads safely

 We then consider an improvement involving sentinels that saves time checking

for the ends of buffers


 Because of the amount of time taken to process characters and the large

number of characters that must be processed during compilation of a large


source program, specialized buffering techniques have been developed to
17
reduce the amount of overhead to process a single input character
Contd.

 Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096

bytes
 Using one system read command, we can read N characters into a buffer,

rather than using one system call per character


 If fewer than N characters remain in the input file, then a special character,

represented by eof, marks the end of the source file and is different from any

possible character of the source program


 Two pointers to the input are maintained:

1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose


extent we are attempting to determine
18
2. Pointer forward scans ahead until a pattern match is found
Contd.
 Once the lexeme is determined, forward is set to the character at its right end (involves

retracting)
 Then, after the lexeme is recorded as an attribute value of the token returned to the

parser; lexemeBegin is set to the character immediately after the lexeme just
found
 Advancing forward requires that we first test whether we have reached the end of one

of the buffers, and if so, we must reload the other buffer from the input, and move
forward to the beginning of the newly loaded buffer

Sentinels
 If we use the previous scheme, we must check each time we advance forward, that

we have not moved off one of the buffers; if we do, then we must also reload the
other buffer
 Thus, for each character read, we must make two tests: one for the end of the
19
buffer, and one to determine which character is read
Contd.

Fig. Sentinels at the end of each buffer

 The sentinel is a special character that Switch (*forward++) {


case eof:
cannot be part of the source program, if (forward is at end of first buffer) {
reload second buffer;
and a natural choice is the character
forward = beginning of second buffer;
eof }
else if {forward is at end of second buffer) {
 Note that eof retains its use as a reload first buffer;\
forward = beginning of first buffer;
marker for the end of the entire
}
input else /* eof within a buffer marks the end of
input */
 Any eof that appears other than at the terminate lexical analysis;
break;
end of buffer means the input is at an cases for the other characters;
20 end }
Specification of Tokens
 Regular expressions are an important notation for specifying lexeme patterns.

 While they cannot express all possible patterns, they are very effective in specifying

those types of patterns that we actually need for tokens.

 In theory of compilation regular expressions are used to formalize the

specification of tokens

 Regular expressions are means for specifying regular languages

Strings and Languages

 An alphabet is any finite set of symbols. Typical examples of symbols are let­ters ,

digits, and punctuation.

 The set { 0, 1 } is the binary alphabet.

 ASCII is an important example of an alphabet ; it is used in many software

21
systems.
Contd.
 A string over an alphabet is a finite sequence of symbols drawn from that

alphabet .
 In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
 The length of a string s, usually written |s|, is the number of occurrences of

symbols in s.
 For example, banana is a string of length six.

 The empty string, denoted Ɛ, is the string of length zero.

 A language is any countable set of strings over some fixed alphabet.

 Abstract languages like  , the empty set, or {Ɛ} , the set containing only the

empty string, are languages under this definition.


 So too are the set of all syntactically well-formed java programs and the set

of all grammatically correct English sentences, although the latter two languages
22
are difficult to specify exactly.
Terms for Parts of Strings
 The following string-related terms are commonly used:

1. A prefix of string S is any string obtained by removing zero or more symbols from the
end of s.
 For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from
the beginning of s.
 For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
 For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those, prefixes,
suffixes, and substrings, respectively, of s that are not Ɛ or not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily

23 consecutive positions of s.

Operations on Languages
 The following string-related terms are commonly used:

 In lexical analysis, the most important operations on languages are union,

concatenation, and closure, which are defined formally below:


1. Union (U):- is the familiar operation on sets.
Example Union of L and M, L U M = {s|s is in L or s is in M}
2. Concatenation :- is all strings formed by taking a string from the first language and a
string from the second language, in all possible ways , and concatenating them.
Example Concatenation of L and M, LM = {st|s is in L and t is in M}
3. Kleene closure:- the kleene closure of a language L, denoted by L*, is the set of strings
you get by concatenating L zero, or more times.
Example Kleene closure of L,
Note that:- L0 , the "concatenation of L zero times," is defined to be {Ɛ} , and inductively, Li is
Li-1 L.
4. Positive closure:- is the positive closure of a language L, denoted by L+, is the same as
24
the Kleene closure, but without the term L0 . That is, Ɛ will not be in L+ unless it is L in
Contd.
Let L be the set of letters {A, B , . . . , Z , a, b, . . . , z} and let D be the set of digits {0, 1, . . . 9}.
We may think of L and D in two, essentially equivalent, ways.
 One way is that L and D are, respectively, the alphabets of uppercase and lowercase

letters and of digits.


 The second way is that L and D are languages, all of whose strings happen to be of

length one.
 Here are some other languages that can be constructed from languages L and D, using

the above operators:


1. L U D is the set of letters and digits - strictly speaking the language with 62
strings of length one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one
digit.

3. L4 is the set of all 4-letter strings.


25
4. L* is the set of all strings of letters, including Ɛ, the empty string.
Regular Expressions
 In theory of compilation regular expressions are used to formalize the specification of

tokens
 Regular expressions are means for specifying regular languages
Example: ID  letter_(letter|digit)*
letter A|B|…|Z|a|b|…|z|_
digit 0|1|2|…|9
 Each regular expression is a pattern specifying the form of strings
 The regular expressions are built recursively out of smaller regular expres­sions,

using the following two rules:


R1:- Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is , the language whose sole member
is the empty string.
R2:- If a is a symbol in ∑ then a is a regular expression, L(a) = {a}, that is , the language
with one string, of length one , with a in its one position.
Note:- By convention, we use italics for symbols, and boldface for their corresponding
regular expression.
26
 Each regular expression r denotes a language L(r), which is also defined
Contd.
 There are four parts to the induction whereby larger regular expressions are

built from smaller ones.


 Suppose r and s are regular expressions denoting languages L(r) and L(s) ,
respectively.
1. (r) |( s) is a regular expression denoting the language L(r) U L(s) .
2. (r) (s) is a regular expression denoting the language L(r)L (s) .
3. (r )* is a regular expression denoting (L (r) )*.
4. (r) is a regular expression denoting L(r) .
 This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
 Regular expressions often contain unnecessary pairs of parentheses .

 We may drop certain pairs of parentheses if we adopt the conventions that :

a. The unary operator * has highest precedence and is left associative.


b. Concatenation has second highest precedence and is left associative.
c. | has lowest precedence and is left associative.
27
Example:- (a) |( (b) *(c)) can be represented by a|b*c. Both expressions denote the
Contd.
 A language that can be defined by a regular expression is called a regular set.
 If two regular expressions r and s denote the same regular set , we say they are

equivalent and write r = s.


 For instance, (a|b) = (b|a).
 There are a number of algebraic laws for regular expressions; each law asserts that

expressions of two different forms are equivalent.


LAW DESCRIPTION
r|s = s|r | is commutative
r|(s|t) = (r|s)|t |is associative
r(st) = (rs)t Concatenation is associative
r(s|t) = rs|rt; Concatenation distributes over |
(s|t)r= sr|tr
Ɛr=rƐ=r Ɛ is the identity for concatenation
r* = (r|Ɛ)* Ɛ is guaranteed in a closure
r** = r* * is idempotent

28
Table: The algebraic laws that hold for arbitrary regular expressions r, s,
Regular Definitions
 If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of

definitions of the form : Where


d1  r1 1. Each di is a new symbol, not in ∑ and not
d2  r2 the same as any other of the d's, and
… 2. Each ri is a regular expression over the
dn  rn alphabet ∑ U { d1 ,d2 , ... ,di-1}.
Examples
(a) Regular Definition for Java Identifiers (b) Regular Definition for Java statements
stmt  if expr then stmt
ID  letter(letter|digit)*
| if expr then stmt else stmt
letter A|B|…|Z|a|b|…|z|_

digit 0|1|2|…|9 expr  term relop term
(c) Regular Definition for unsigned numbers
| term
such as 5280 , 0.0 1 234, 6. 336E4, or 1.
89E-4 term  id
digit 0|1|2|…|9 | number
digits digit digit*
optFrac . digits|Ɛ
29
optExp (E(+|-|Ɛ) digits|Ɛ
Extensions of Regular Expressions
 Many extensions have been added to regular expressions to enhance their ability to

specify string patterns.


 Here are few notational extensions:

1. One or more instances(+):- is a unary postfix operator that represents the positive
closure of a regular expression and its language. That is , if r is a regular
expression, then (r)+ denotes the language (L( r) ) + .
1. The operator + has the same precedence and associativity as the operator *.
2. Two useful algebraic laws, r*=r+|Ɛ and r+=rr*=r*r.
2. Zero or one instance(?):- r? is equivalent to r|Ɛ , or put another way, L(r?) = L(r) U
{Ɛ}.
 The ? operator has the same precedence and associativity as * and +.

3. Character classes:- A regular expression a1|a2|···| an , where the ai's are each
symbols of the alphabet , can be replaced by the shorthand [al a2 ··· an].
Example:
30
 [abc] is shorthand for a| b | c, and
Example
 Using the above short hands we can rewrite the regular expression for examples a and c

in slide number 29 as follows :


Examples(a) Regular Definition for Java Identifiers (c) Regular Definition for unsigned numbers
ID  letter(letter|digit)* such as 5280 , 0.01234, 6.336E4, or 1.89E-
4
letter [A-Za-z_] digit [0-9]
digit [0-9] digits digit+
number  digits (. digits)? (E [+ -]? digits)?
Exercises
1. Consult the language reference manuals to determine
A. The sets of characters that form the input alphabet (excluding those that may
only appear in character strings or comments) ,
B. The lexical form of numerical constants, and
C. The lexical form of identifiers , for C++ and java programming languages.
2. Describe the languages denoted by the following regular ex­pressions:
A. a(a|b) *a
B. ((Ɛ|a)b * )*
C. (a|b) *a (a|b) (a|b)
31 D. a* ba*ba*ba*.
E. (aa|bb) * ((ab |ba) (aa| bb) * (ab|b a) (aa|bb )*)*
Contd.

3. Write regular definitions for the following languages:

A. All strings of lowercase letters that contain the five vowels in order.

B. All strings of lowercase letters in which the letters are in ascending

lexicographic order.

C. All strings of binary digits with no repeated digit s.

D. All strings of binary digits with at most one repeated digit.

E. All strings of a ' s and b's where every a is preceded by b.

F. All strings of a 's and b's that contain substring abab.

32
Recognition of Regular Expressions
1. Starting point is the language grammar to understand the tokens:
stmt  if expr then stmt
| if expr then stmt else stmt

expr  term relop term
| term
term  id
| number

2. The next step is to formalize the patterns: 3. We also need to handle whitespaces:
digit  [0-9]
ws  (blank | tab | newline)+
Digits  digit+
number  digit(.digits)? (E[+-]? Digit)?
letter  [A-Za-z_]
id  letter (letter|digit)*
If  if
Then  then
Else  else
33
Relop  < | > | <= | >= | = | <>
Transition Diagram
Lexemes Token Names Attribute Value
Any ws - -
if if -
then then -
else else -
Any id id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE

Fig. Transition diagram for relop


Table: Tokens, their patterns, and attribute values

34
Fig. Transition diagram for reserved words and identifiers
Contd.

Fig. Transition diagram for unsigned numbers

Fig. Transition diagram for white spaces where delim


represents one or more whitespace characters

35
Finite Automata
 The lexical analyzer tools use finite automata, at the heart of the transition, to convert the

input program into a lexical analyzer .


 These are essentially graphs, like transition diagrams, with a few differences:

1. Finite automata are recognizers ; they simply say "yes" or "no" about each
possible input string.
2. Finite automata come in two flavors :
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of
their edges . A symbol can label several edges out of the same state, and
Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state, and for each
symbol of its input alphabet exactly one edge with that symbol leaving
that state.
 Both deterministic and nondeterministic finite automata are capable of rec­ognizing

the same languages .


36  In fact these languages are exactly the same languages , called the regular
Finite Automata State Graphs
A state Example 1: A finite automaton that accepts
only “1” 1

The start state


(Initial State)
Example 2: A finite automaton accepting any
number of 1’s followed by a single 0
Accepting state
(Final State) 1

0
a
A Transition

Q: Check that “1110” is accepted but “110…”


is not?
37
Contd.
Question: What language does this recognize?

0 0

1
1

38
Nondeterministic Finite Automata (NFA)
 A nondeterministic finite automaton (NFA) consists of:

1. A finite set of states S.


2. A set of input symbols ∑, the input alphabet. We assume that Ɛ, which stands
for the empty string, is never a member of ∑ .
3. A transition function that gives, for each state, and for each symbol in ∑ U {Ɛ
} a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state) .

5. A set of states F, a subset of S, that is distinguished as the accepting states


(or final states) .
 We can represent either an NFA or DFA by a transition graph, where the nodes

are states and the labeled edges represent the transition function.
 There is an edge labeled a from state s to state t if and only if t is one of

the next states for state s and input a.


 This graph is very much like a transition diagram, except:

1. The same symbol can label edges from one state to several different states,
39
and
Contd.
 An NFA can get into multiple states

b a
Q1 Q2 Q3

b
Input: a b a

Rule: NFA accepts if it can get in a final state

40
Transition Tables
 We can also represent an NFA by a transition table, whose rows correspond to states,

and whose columns correspond to the input symbols and Ɛ.


 The entry for a given state and input is the value of the transition function applied

to those arguments.
 If the transition function has no information about that state-input pair, we put  in

the table for the pair.


Example:- The transition table for the NFA on the pervious slide is represented as:

The transition table has the advantage that Input a b Ɛ


we can easily find the transitions on a given State
state and input. Q1 Q1 { Q1 , Q2 } 
Its disadvantage is that it takes a lot of
Q2 Q3  
space, when the input alphabet is large, yet
most states do not have any moves on most Q3   
of the input symbols.
41
Acceptance of Input Strings by Automata
 An NFA accepts input string x if and only if there is some path in the transition graph

from the start state to one of the accepting(Final) states, such that the symbols along

the path spell out x.


 Note that Ɛ labels along the path are effectively ignored, since the empty string

does not contribute to the string constructed along the path.

Example:- Strings aaba and bbbba are accepted by the NFA in slide 40.

 The language defined (or accepted) by an NFA is the set of strings labeling some path

from the start to an accepting state.

 As was mentioned above, the NFA of slide # 40 defines the same language as does

the regular expression (a|b)* ba, that is, all strings from the alphabet {a, b} that

42 end in ba.
Deterministic Finite Automata (DFA)
 A deterministic finite automaton (DFA) is a special case of an NFA where:

1. There are no moves on input Ɛ , and

2. For each state S and input symbol a, there is exactly one edge out of s labeled a.

 If we are using a transition table to represent a DFA, then each entry is a single state.

 we may therefore represent this state without the curly braces that we use to form

sets.

 While the NFA is an abstract representation of an algorithm to recognize the strings of a

certain language, the DFA is a simple, concrete algorithm for recognizing strings.

 It is fortunate indeed that every regular expression and every NFA can be converted to a

DFA accepting the same language, because it is the DFA that we really implement or

simulate when building lexical analyzers.


43
 DFA recognize strings:- When a string is fed into a DFA, if the DFA recognizes the
Contd.
 A DFA is a collection of states and transitions:- Given the input string, the transitions

tell us how to move among the states.


 One of the states is denoted as the initial state, and a subset of the states are final

states.
 We start from the initial state, move from state to state via the transitions, and

check to see if we are in the final state when we have checked each character in the
string.
 If we are, then the string is accepted, otherwise, the string is rejected

 A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where

 Q is a finite set of states

 ∑ is a finite set called the alphabet

 δ is a total function from (Q x ∑) to Q known as transition function (a function


44
that takes a state and a symbol as inputs and returns a state)
Example
b b
1) q1 c q2 a q4

c c
a b

c q3 a A DFA that can accept the strings which begin


with a or b, or begin with c and contain at most
b
one a.

2)

45 A DFA accepting (a|b) * abb


Thank
Thank You
You ...
...

You might also like