Lexical Analysis in Compiler Design
Lexical Analysis in Compiler Design
(SENG471)
Chapter Two
Lexical Analysis
1
Objective
At the end of this session students will be able to:
Understand the basic roles of lexical analyzer (LA): Lexical Analysis Versus Parsing,
Tokens , Patterns, and Lexemes, Attributes for Tokens and Lexical Error.
Understand the specification of Tokens: Strings and Languages, Operations on
2
Introduction
The lexical analysis phase of compilation breaks the text file (program) into smaller
program and groups the characters into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces as output a token of the form {token-
Where,
token- name is an abstract symbol that is used during syntax analysis , and
3
attribute-value points to an entry in the symbol table for this token.
Example.
For example, suppose a source program contains the assignment statement
1. position is a lexeme that would be mapped into a token {id, 1}, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position.
The symbol-table entry for an identifier holds information about the identifier,
2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this
notational convenience we have chosen to use the lexeme itself as the name of the
4
abstract symbol.
Contd.
5. rate is a lexeme that is mapped into the token { id, 3 }, where 3 points to the
5
Contd. Parser uses tokens
produced by the LA to
create a tree-like
intermediate representation
that depicts the
grammatical structure of
Ittheuses
token
thestream.
syntax tree and
the information in the
symbol table to check the
source program for
semantic consistency with
the language definition.
An important part of
semantic analysis is type
In the process of translating a checking, where the
source program into target code, a compiler checks that each
compiler may construct one or operator has matching
more intermediate representations, operands
which are easy to produce and
easy to translate into the target The machine-independent code-
machine optimization phase attempts to
improve the intermediate code so
that better target code will result.
Usually better means faster, but
other objectives may be desired,
6
such as shorter code, or target
code that consumes less power.
The Role of Lexical Analyzer
Symbol
Table
It is common for the lexical analyzer to interact with the symbol table.
In some cases, information regarding the kind of identifier may be read from
the symbol table by the lexical analyzer to assist it in determining the proper
7
token it must pass to the parser.
Other Tasks of an LA
Since LA is the part of the compiler that reads the source text, it may perform
specialized techniques that serve only the lexical task, not the job of parsing.
In addition, specialized buffering techniques for reading input characters can
speed up the compiler significantly.
3. Compiler portability is enhanced:- Input-device-specific peculiarities can be
9
restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
A lexeme is a sequence of characters in the source program that matches the pattern for a
token.
Example: -5, i, sum, 100, for ; int 10 + -
Token:- is a pair consisting of a token name
One andfor
token aneach
optional attribute
keyword. value. for a keyword is the
The pattern
same as the keyword itself.
Example:
Tokens for the operators , either individually or in classes
Identifiers = { i, sum } such as the token in comparison operators
One token representing all identifiers.
int_Constatnt = { 10, 100, -5 } One or more tokens representing constants, such as numbers
Oppr = { +, - } and literal strings .
Tokens for each punctuation symbol, such as left and right
rev_Words = { for, int } parentheses, comma, and semicolon
Separators = { ; , , }
Pattern is a rule describing the set of lexemes that can represent a particular token in source
program
It is a description of the form that the lexemes of a token may take.
10 In the case of a keyword as a token, the pattern is just the sequence of characters
Contd.
One efficient but complex brute-force approach is to read character by character
What we’d like is a set of tools that will allow us to easily create and modify a
lexical analyzer that has the same run-time efficiency as the brute-force
method.
if (cThe first tool that we use to attack this problem is Deterministic Finite Automata
= nextchar() == ‘c’) {
if (c = nextchar() == ‘l’) {
(DFA).
// code to handle the rest of either “class” or any identifier that starts with
“cl”
} else if (c == ‘a’) {
// code to handle the rest of either “case” or any identifier that starts with
“ca”
} else {
// code to handle any identifier that starts with c
11 }
} else if ( c = …..) {
Contd.
“Literal String”
12
Consideration for a simple design of Lexical Analyzer
Lexical Analyzer can allow source program to be
1. Free-Format Input:- the alignment of lexeme should not be necessary in determining the
correctness of the source program such restriction put extra load on Lexical Analyzer
2. Blanks Significance:- Simplify the task of identifying tokens
E.g. int a indicates <int is keyword> <a is identifier>
inta indicates <inta is Identifier>
3. Keyword must be reserved:- Keywords should be reserved otherwise LA will have to
predict whether the given lexeme should be treated as a keyword or as identifier
E.g. if then then then =else;
else else = then;
The above statements are misleading as then and else keywords are not reserved.
Approaches to implementation
Use assembly language- Most efficient but most difficult to implement
14
Handling Lexical Errors
It is hard for a LA to tell , without the aid of other components, that there is a
source-code error.
For instance, if the string fi is encountered for the first time in a java program
proceed because none of the patterns for tokens matches any prefix of the
remaining input.
The simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input , until the lexical
analyzer can find a well-formed token at the beginning of what input is left.
15
This recovery technique may confuse the parser, but in an interactive computing
Contd.
5. Pre-scanning
16
Input Buffering
There are some ways that the task of reading the source program can be
speeded
This task is made difficult by the fact that we often have to look one or more
characters beyond the next lexeme before we can be sure we have the right
lexeme
In C language: we need to look after -, = or < to decide what token to return
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes
Using one system read command, we can read N characters into a buffer,
represented by eof, marks the end of the source file and is different from any
retracting)
Then, after the lexeme is recorded as an attribute value of the token returned to the
parser; lexemeBegin is set to the character immediately after the lexeme just
found
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move
forward to the beginning of the newly loaded buffer
Sentinels
If we use the previous scheme, we must check each time we advance forward, that
we have not moved off one of the buffers; if we do, then we must also reload the
other buffer
Thus, for each character read, we must make two tests: one for the end of the
19
buffer, and one to determine which character is read
Contd.
While they cannot express all possible patterns, they are very effective in specifying
specification of tokens
An alphabet is any finite set of symbols. Typical examples of symbols are letters ,
21
systems.
Contd.
A string over an alphabet is a finite sequence of symbols drawn from that
alphabet .
In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of
symbols in s.
For example, banana is a string of length six.
Abstract languages like , the empty set, or {Ɛ} , the set containing only the
of all grammatically correct English sentences, although the latter two languages
22
are difficult to specify exactly.
Terms for Parts of Strings
The following string-related terms are commonly used:
1. A prefix of string S is any string obtained by removing zero or more symbols from the
end of s.
For example, ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from
the beginning of s.
For example, nana, banana, and Ɛ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
For instance, banana, nan, and Ɛ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those, prefixes,
suffixes, and substrings, respectively, of s that are not Ɛ or not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
23 consecutive positions of s.
Operations on Languages
The following string-related terms are commonly used:
length one.
Here are some other languages that can be constructed from languages L and D, using
tokens
Regular expressions are means for specifying regular languages
Example: ID letter_(letter|digit)*
letter A|B|…|Z|a|b|…|z|_
digit 0|1|2|…|9
Each regular expression is a pattern specifying the form of strings
The regular expressions are built recursively out of smaller regular expressions,
28
Table: The algebraic laws that hold for arbitrary regular expressions r, s,
Regular Definitions
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of
1. One or more instances(+):- is a unary postfix operator that represents the positive
closure of a regular expression and its language. That is , if r is a regular
expression, then (r)+ denotes the language (L( r) ) + .
1. The operator + has the same precedence and associativity as the operator *.
2. Two useful algebraic laws, r*=r+|Ɛ and r+=rr*=r*r.
2. Zero or one instance(?):- r? is equivalent to r|Ɛ , or put another way, L(r?) = L(r) U
{Ɛ}.
The ? operator has the same precedence and associativity as * and +.
3. Character classes:- A regular expression a1|a2|···| an , where the ai's are each
symbols of the alphabet , can be replaced by the shorthand [al a2 ··· an].
Example:
30
[abc] is shorthand for a| b | c, and
Example
Using the above short hands we can rewrite the regular expression for examples a and c
A. All strings of lowercase letters that contain the five vowels in order.
lexicographic order.
32
Recognition of Regular Expressions
1. Starting point is the language grammar to understand the tokens:
stmt if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr term relop term
| term
term id
| number
2. The next step is to formalize the patterns: 3. We also need to handle whitespaces:
digit [0-9]
ws (blank | tab | newline)+
Digits digit+
number digit(.digits)? (E[+-]? Digit)?
letter [A-Za-z_]
id letter (letter|digit)*
If if
Then then
Else else
33
Relop < | > | <= | >= | = | <>
Transition Diagram
Lexemes Token Names Attribute Value
Any ws - -
if if -
then then -
else else -
Any id id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE
34
Fig. Transition diagram for reserved words and identifiers
Contd.
35
Finite Automata
The lexical analyzer tools use finite automata, at the heart of the transition, to convert the
1. Finite automata are recognizers ; they simply say "yes" or "no" about each
possible input string.
2. Finite automata come in two flavors :
A. Nondeterministic finite automata (NFA) have no restrictions on the labels of
their edges . A symbol can label several edges out of the same state, and
Ɛ, the empty string, is a possible label.
B. Deterministic finite automata (DFA) have, for each state, and for each
symbol of its input alphabet exactly one edge with that symbol leaving
that state.
Both deterministic and nondeterministic finite automata are capable of recognizing
0
a
A Transition
0 0
1
1
38
Nondeterministic Finite Automata (NFA)
A nondeterministic finite automaton (NFA) consists of:
are states and the labeled edges represent the transition function.
There is an edge labeled a from state s to state t if and only if t is one of
1. The same symbol can label edges from one state to several different states,
39
and
Contd.
An NFA can get into multiple states
b a
Q1 Q2 Q3
b
Input: a b a
40
Transition Tables
We can also represent an NFA by a transition table, whose rows correspond to states,
to those arguments.
If the transition function has no information about that state-input pair, we put in
from the start state to one of the accepting(Final) states, such that the symbols along
Example:- Strings aaba and bbbba are accepted by the NFA in slide 40.
The language defined (or accepted) by an NFA is the set of strings labeling some path
As was mentioned above, the NFA of slide # 40 defines the same language as does
the regular expression (a|b)* ba, that is, all strings from the alphabet {a, b} that
42 end in ba.
Deterministic Finite Automata (DFA)
A deterministic finite automaton (DFA) is a special case of an NFA where:
2. For each state S and input symbol a, there is exactly one edge out of s labeled a.
If we are using a transition table to represent a DFA, then each entry is a single state.
we may therefore represent this state without the curly braces that we use to form
sets.
certain language, the DFA is a simple, concrete algorithm for recognizing strings.
It is fortunate indeed that every regular expression and every NFA can be converted to a
DFA accepting the same language, because it is the DFA that we really implement or
states.
We start from the initial state, move from state to state via the transitions, and
check to see if we are in the final state when we have checked each character in the
string.
If we are, then the string is accepted, otherwise, the string is rejected
A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where
c c
a b
2)