0% found this document useful (0 votes)

5 views20 pages

Unit 2

The document discusses lexical analysis, the first step in compiler design, which involves transforming source code into tokens while removing whitespace and comments. It covers the role, advantages, and disadvantages of lexical analyzers, as well as input buffering techniques, token specification, and finite automata implications. Additionally, it introduces the Lex tool for generating lexical analyzers and outlines the structure of a Lex program.

Uploaded by

k4yb9vh0mk

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views20 pages

Unit 2

Uploaded by

k4yb9vh0mk

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Unit - 2

Lexical Analysis

2.1 Role of a Lexical analyzer

The first step in the compiler design is LEXICAL ANALYSIS. A Lexer

takes the modified source code that is written in the form of
sentences. In other words, it allows you to transform a sequence
of characters into a sequence of tokens. This syntax is split into a
series of tokens by the lexical analyzer. It eliminates any extra
space or comment written in the source code.

Lexical analyzer programs are called lexical analyzers. A lexer

includes a tokenizer or scanner. If the lexical analyzer detects that
the token is invalid, an error is created. It reads character streams
from the source code, checks for legal tokens, and when it needs
the syntax analyzer, transfers the data to the syntax analyzer

Role of lexical analyzer

The following tasks are carried out by the Lexical Analyzer:

● Helps to classify tokens in the table of symbols

● Removes white spaces and the source program's comments

● Correlates with the source software error messages

● Helps you extend the macros if they are located in the source
program

● Read the source program's input characters

Advantages of lexical analyzer

● Programs such as compilers that can use the parsed data
from a programr's code to construct a compiled binary
executable code use the lexical analyzer technique.

● Web browsers use it to format and view a web page with the
assistance of JavaScript, HTML, CSS Script, CSS Script, and CSS
Script parsed data.

● A separate lexical analyzer allows you to create a specialised

and potentially more powerful processor for the assignment.

Disadvantages of lexical analyzer

● You need to spend considerable time reading and separating

the source program in the form of tokens.

● In contrast to the PEG or EBNF law, certain regular

expressions are very difficult to understand.

● It takes more time to build and debug the lexer and its token
descriptions

● To generate the lexer tables and create the tokens, additional

runtime overhead is required.

Key takeaway :

● The first step in the compiler design is LEXICAL ANALYSIS.

● Lexical analyzer programs are called lexical analyzers.

● It allows you to transform a sequence of characters into a

sequence of tokens.

2.2 Input Buffering

● One or more characters must be looked up past the next

lexeme to guarantee that a suitable lexeme is found.
● A two-buffer framework is thus implemented to safely control
large look aheads.

● Techniques for speeding up the lexical analyzer method have

been introduced, such as the use of sentinels to mark the end
of the buffer.

For the implementation of a lexical analyzer, there are three general

approaches:

(i) To create the lexical analyzer from a regular expression-based

specification by using a lexical-analyzer generator, such as the lex
compiler. The generator offers routines for reading and buffering the
input in this context.

(ii) Using the I/O facilities of that language to interpret the input by
writing the lexical analyzer in a traditional systems-programming
language.

(iii) Through writing the lexical analyzer and handling the reading of
input directly in assembly language.

1. Buffer Pair

Specialized buffering strategies have been developed to

reduce the amount of overhead needed to process an input
character due to a significant amount of time usage in moving
characters.
The buffer pairs used to store the input data are shown in
below Figure :

Fig 1: buffer pair

Scheme

● It consists of two buffers, each consisting of an N-

character size that is alternatively reloaded.

● N-Number of characters, e.g., 4096, on one disc block.

● Using one device read instruction, N characters are read

from the input file to the buffer.

● If the number of characters is less than N, an eof is

inserted at the end.

Pointers

● LexemeBegin and forward are maintained by two

pointers.

● The beginning of the lexeme leads to the beginning of

the present lexeme, which has yet to be identified.
● Forward scans ahead before a match is found for a
pattern.

● Once a lexeme is found, immediately after the lexeme

that is just found, lexeme start is set to the character,
and forward is set to the character at its right end.

● The existing lexeme is the character set that is between

two pointers.

Disadvantages of the scheme

● Most of the time, this scheme works well, but the

amount of looking ahead is reduced.

● In situations where the distance that the forward pointer

must travel is more than the length of the buffer, this
restricted lookahead can make it difficult to recognise
tokens.

● Until the character follows the correct parenthesis, it

cannot decide if DECLARE is a keyword or an array name.

2. Sentinels

● Time the forward pointer is shifted in the previous

system, a check is performed to ensure that one half of
the buffer has not moved off. If it is finished, then it is
appropriate to reload the other half.

● Hence, with each advance of the forward pointer, the

ends of the buffer halves need two checks.
Test 1: For the buffer end.

Test 2: Deciding which character is being read.

● By expanding each buffer half to carry a sentinel

character at the end, the use of sentinel lowers the two
checks to one.

● A sentinel is a unique character that cannot be used in

the source software. (The character of eof is used as a
sentinel).

Fig 2: sentinels

Advantages

● It conducts only one test most of the time to see if a

forward pointer points to an eof.

● This conducts further checks only when it hits the end of

the buffer half or eof.

● The average number of tests per input character is very

close to 1. since N input characters are encountered
between eofs.
Key takeaway :

● Techniques for speeding up the lexical analyzer method have

been introduced, such as the use of sentinels to mark the end
of the buffer.

● A two-buffer framework is thus implemented to safely control

large look aheads.

2.3 Specification and recognition of tokens

Token :

It is said that lexemes are a character sequence (alphanumeric) in a

token. For each lexeme to be identified as a valid token, there are
some predefined rules. These rules, by means of a pattern, are
described by grammar rules. A pattern describes what a token can
be, and by means of regular expressions, these patterns are
described.

Keywords, constants, identifiers, sequences, numbers, operators

and punctuation symbols can be used as tokens in the programming
language.

Example : the variable declaration line in the C language

int num = 60 ;

contains the tokens: int (keyword) , num (identifiers) = (operator) ,

60 (constant) and ; (symbol)

Specification of token :

Now let's explain how the following words are used in language
theory.
1. Alphabets : Any finite set of symbols

● {0,1} is a set of binary alphabets,

● {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal

alphabets,

● {a-z, A-Z} is a set of English language alphabets.

2. Strings : A string is called any finite sequence of alphabets.

3. Special Symbol : The following symbols are used in a

standard high-level language:

Arithmetic Addition(+), Subtraction(-),

Symbols Multiplication(*), Division(/)
Punctuation Comma(,), Semicolon(;), Dot(.)
Assignment =
Special +=, -=, *=, /=
Assignment
Comparison ==, !=. <. <=. >, >=
Preprocessor #

4. Language : A language over any finite set of alphabets is

known as a finite set of strings.

5. Longest Match Rule : When the lexical analyzer reads the

source code, it checks the code letter by letter and concludes
that a phrase is completed when it finds a white space,
operator symbol, or special symbols.

6. Operation : The different language operations are:

● The Union is written in two languages, L and M as , L U

M = {s | s is in L or s is in M}

● Two languages, L and M, are concatenated as, LM = {st

| s is in L and t is in M}

● A Language L's Kleene Closure is written as , L* = Zero

or more occurrence of language L .

7. Notations : If r and s are regular expressions that denote the

L(r) and L(s) languages, then

● Union : L(r)UL(s)

● Concatenation : L(r)L(s)

● Kleene Closure : (L(r))*

8. Representing valid tokens of a language in regular

expression : If x is an expression that is normal, then:

● x* implies the occurrence of x at zero or more.

● x+ implies one or more instances of x.

9. Finite automata : Finite Automata is a state machine that

takes as input a string of symbols and changes its state
accordingly.

If the input string is processed successfully and the automatic

has reached its final state, it will be accepted.

The Finite Automata Mathematical Model consists of :

● Q - Finite set of state

● Σ - Finite set of input symbol

● q0 - One start state

● qf - set of final state

● δ - Transition function

The transition function (δ) maps the finite set of state (Q) to a finite
set of input symbols (Σ), Q × Σ ➔ Q

Recognition of tokens

Reading input characters in the code and creating tokens is the key
task of lexical analysis.

The Lexical Analyzer checks the program's entire source code. One
by one, it recognises each token. Scanners are typically only
implemented when requested by a parser to generate tokens. And
here's how it works.

1) 'Get next token' is a command that is sent to the lexical

analyzer from the parser.

2) The lexical analyzer scans the input until it identifies the next
token upon obtaining this instruction.

3) It returns the Parser token.

Fig 3: lexical analyzer

When generating these tokens, the Lexical Analyzer skips white

spaces and comments. If there is an error, the Lexical Analyzer
compares the error with the source file and the line number.

Key takeaway :

● It is said that lexemes are a character sequence

(alphanumeric) in a token.

● A pattern describes what a token can be, and by means of

regular expressions, these patterns are described.

● Reading input characters in the code and creating tokens is

the key task of lexical analysis.

2.4 Finite Automata Implications

Finite Automata(FA) is the simplest pattern recognition machine.

The finite automatic or finite state machine is an abstract machine
with five components or tuples. It has a set of states and rules that
switch from one state to another, but it depends on the input
symbol that is added. It is essentially an abstract digital machine
model.

Some important features of general automation are shown in the

figure below.

Fig 4: finite automata

The figure above shows the following automated features:

1. Input
2. Output
3. State of Automata
4. State relation
5. Output relation

The following are Finite Automata:

● Q - Finite set of state

● Σ - Finite set of input symbol

● q0 - One start state

● qf - set of final state

● δ - Transition function

A formal computer specification is

{ Q, Σ, q, F, δ }

Finite automata model :

Input tape and finite control can be represented by Finite Automata.

● Input tape : It's a linear tape that has a certain number of

cells. In each cell, each input symbol is positioned.

● Finite control : On receiving unique feedback from the input

tape, the finite control determines the next condition. The tape
reader reads the cells from left to right, one by one, and only
one input symbol is read at a time.

Fig 5: finite automata model

Types of automata

The FA is divided into two types:

1. DFA (Deterministics finite automata)

In a DFA, the computer goes to only one state for a specific

input character. For every input symbol, a transition function is
specified for every state. Null (or ε) movement is also not
permitted in DFA, i.e. DFA does not change state without any
input characters.

DFA has five tuples {Q, ∑, q0, F, δ}

Q: set of all states

∑: finite set of input symbol where δ: Q x ∑ →Q

q0: initial state

F: final state

δ: Transition function

One important thing to remember is that, for a pattern, there

can be several possible DFAs. As a rule, a DFA with a minimum
number of states is favoured.

2. NFA (Non-deterministics finite automata)

NDFA applies to Finite Non-Deterministic Automata. It is used

for a single input to pass through any number of states. NDFA
embraces the NULL step, indicating that without reading the
symbols, it can change the state.

Like the DFA, NDFA also has five states. NDFA, however, has
distinct transformation features.
NDFA's transition role may be described as:

δ: Q x ∑ →2Q

Key takeaway :

● Finite Automata(FA) is the simplest pattern recognition

machine.

● The finite automatic or finite state machine is an abstract

machine with five components or tuples.

● In a DFA, the computer goes to only one state for a specific

input character.

● NFA is used for a single input to pass through any number of

states.

2.5 Designing a lexical analyzer generator

● Lexical Analyzer Generator introduces a tool called Lex, which enables

a lexical analyzer to be described by defining regular expressions to
define token patterns.

● The lex tool input is the lex language

● A program written in the lex language can compile and generate a C

code named [Link].c through the lex compiler, i.e.

Fig 6: lex compiler

Now, as always, C code is compiled by the C compiler and produces a
file called [Link], i.e.

Fig 7: c compiler

The output of the C compiler is a working lexical analyser that can take
an input character stream and generate a token stream, i.e.,

Fig 8: generator of lexical analyzer

● Lex is a lexical analyzer generating program. It is used to produce the

YACC parser. A software that transforms an input stream into a
sequence of tokens is the lexical analyzer. It reads the input stream
and, by implementing the lexical analyzer in the C program, produces
the source code as output.
Fig 9: lexical analyzer generator

If a token is considered invalid by the lexical analyzer, it produces an error.

The lexical analyzer works with the syntax analyzer in close collaboration. It
reads character streams from the source code, checks for legal tokens, and,
when necessary, passes the information to the syntax analyzer.

Structure of lex program

A Lex program has the following form

declarations

transition rules

auxiliary functions

Variable declarations, manifest constants (LT, LE, GT, GE) and standard
definitions are included in the declaration section.

The basic type of rule for transition is : pattern{Action}

Where the pattern is a regular expression and action determines what action
should be taken when it matches the pattern.
Second, the auxiliary function includes whatever extra features are used in
the actions.

You can separately compile these functions and load them with the lexical
analyzer.

LEX Program for Tokens

ID, NUMBNER, RELOP, IF, THEN,

ELSE

% } || declaration of variables

delim [ space\t\n ]

ws [ delim ] +

letter [A - Z - a - z ]

digit [ 0 - 9 ]

id letter (letter / digit)

Number { digit } + ( \ . { digit } + ) ?

E [ +- ] ? {digit +} +)?

*transition rule * |

{ ws } { No Action }

If { return (IF) ; }
{ id } { return (ID) ; Install - ID () ; }

1 { number } { return (number); Install-num (); }

“<” { return (RELOP, LT) }

“<=” { return (RELOP, LE) }

“=” { return (RELOP, EQ) }

|* Auxiliary Section *|

install-ID ()

install-num ()

Key takeaway :

● The lex tool input is the lex language.

● Lexical Analyzer Generator introduces a tool called Lex, which enables

a lexical analyzer to be described by defining regular expressions to
define token patterns.

● Lex is a lexical analyzer generating program.

● It is used to produce the YACC parser.

References :

1. “Compiler Construction”, Dhamdere, Mc-Millan.

2. “Compilers - Principles, Techniques and Tools”, A.V. Aho, R. Shethi

and [Link], Addison

Wesley Publishing Company.

3. “Compiler Construction”, Barret, Bates, Couch, Galgotia.

4. “Unix Programming”, Pepkin Pike.

Lexical Analysis and Input Buffering
No ratings yet
Lexical Analysis and Input Buffering
37 pages
Regular Expressions and Definitions Explained
No ratings yet
Regular Expressions and Definitions Explained
61 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
52 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
17 pages
Understanding Lexical Analyzers
No ratings yet
Understanding Lexical Analyzers
24 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
10 pages
Unit 2 - Lexical Analyzer
No ratings yet
Unit 2 - Lexical Analyzer
34 pages
Input Buffering in Lexical Analysis
No ratings yet
Input Buffering in Lexical Analysis
26 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
55 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
102 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
67 pages
Lexical Analysis: Tokens and Recognition
No ratings yet
Lexical Analysis: Tokens and Recognition
58 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
20 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
17 pages
Compiler Design Principles Overview
No ratings yet
Compiler Design Principles Overview
27 pages
Language Processors and Lexical Analysis
No ratings yet
Language Processors and Lexical Analysis
85 pages
Lexical Analyzer Design Overview
No ratings yet
Lexical Analyzer Design Overview
43 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
68 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
11 pages
Lexical Analyzer Overview and Functions
No ratings yet
Lexical Analyzer Overview and Functions
56 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
48 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
62 pages
Chapter Two
No ratings yet
Chapter Two
13 pages
CD Module2
No ratings yet
CD Module2
23 pages
Finite Automata and Lexical Analysis Guide
0% (1)
Finite Automata and Lexical Analysis Guide
13 pages
CD Unit Test 1 Notes
No ratings yet
CD Unit Test 1 Notes
14 pages
CD Unit-1 Notes
No ratings yet
CD Unit-1 Notes
12 pages
4 Lexical Analysis
No ratings yet
4 Lexical Analysis
60 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
22 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
6 pages
Lexical Analysis and Input Buffering
No ratings yet
Lexical Analysis and Input Buffering
23 pages
Token Specification in Compiler Design
No ratings yet
Token Specification in Compiler Design
10 pages
Language Processing and Lexical Analysis
No ratings yet
Language Processing and Lexical Analysis
60 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
5 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
63 pages
Compiler Notes 8
No ratings yet
Compiler Notes 8
22 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
71 pages
Lexical Analyzer Design and Functions
No ratings yet
Lexical Analyzer Design and Functions
15 pages
Compiler Lexical Analysis Explained
No ratings yet
Compiler Lexical Analysis Explained
45 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
63 pages
Lexical Analysis in Compilers
No ratings yet
Lexical Analysis in Compilers
6 pages
Lexical Analysis Errors in Compiler Design
100% (1)
Lexical Analysis Errors in Compiler Design
37 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
16 pages
CD - Unit2
No ratings yet
CD - Unit2
12 pages
CD - Unit2
No ratings yet
CD - Unit2
15 pages
Lexical Analysis and Transition Diagrams
No ratings yet
Lexical Analysis and Transition Diagrams
10 pages
Token Specification in Compiler Design
No ratings yet
Token Specification in Compiler Design
9 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
62 pages
CD - CH2 - Lexical Analysis
No ratings yet
CD - CH2 - Lexical Analysis
58 pages
EContent 11 2025 08 12 13 15 41 Unit2 SSPDF 2025 07 15 07 41 15
No ratings yet
EContent 11 2025 08 12 13 15 41 Unit2 SSPDF 2025 07 15 07 41 15
126 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
46 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
59 pages
Lexical Analysis in Compilers Explained
No ratings yet
Lexical Analysis in Compilers Explained
36 pages
Lexical Analysis in Compilers Explained
No ratings yet
Lexical Analysis in Compilers Explained
12 pages
Lexical Analyzer in Compiler Design
No ratings yet
Lexical Analyzer in Compiler Design
30 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
26 pages
Complier Construction
No ratings yet
Complier Construction
6 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
59 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
5 pages
TOC Lab Manual for Binary Programs
No ratings yet
TOC Lab Manual for Binary Programs
25 pages
CS3501 Compiler Design Overview
No ratings yet
CS3501 Compiler Design Overview
18 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
25 pages
chapter3-Thomson's Construction ^L0 RE to DFA（2022版）
No ratings yet
chapter3-Thomson's Construction ^L0 RE to DFA（2022版）
47 pages
Automata Theory and Formal Languages
0% (1)
Automata Theory and Formal Languages
4 pages
CS3452 Theory of Computation Notes
No ratings yet
CS3452 Theory of Computation Notes
48 pages
TC - W2025 (3160704) (GTURanker)
No ratings yet
TC - W2025 (3160704) (GTURanker)
3 pages
PDA Construction for Given Grammar
No ratings yet
PDA Construction for Given Grammar
23 pages
Theory of Computation Overview
No ratings yet
Theory of Computation Overview
162 pages
CST 301 Formal Languages Exam Key
No ratings yet
CST 301 Formal Languages Exam Key
4 pages
Understanding Finite Automata: DFA & NFA
No ratings yet
Understanding Finite Automata: DFA & NFA
42 pages
Operating Systems & Automata Syllabus
No ratings yet
Operating Systems & Automata Syllabus
4 pages
DFA and NFA Examples with Solutions
90% (10)
DFA and NFA Examples with Solutions
23 pages
T.Y.B.Sc. Computer Science Syllabus 2021
No ratings yet
T.Y.B.Sc. Computer Science Syllabus 2021
60 pages
Bio-Inspired Algorithms for Intrusion Detection
No ratings yet
Bio-Inspired Algorithms for Intrusion Detection
20 pages
Regular Expression for Specific Strings
No ratings yet
Regular Expression for Specific Strings
6 pages
B.Tech CSE AI & ML Curriculum Overview
No ratings yet
B.Tech CSE AI & ML Curriculum Overview
21 pages
Finite Automata and Formal Languages Overview
No ratings yet
Finite Automata and Formal Languages Overview
64 pages
Automata Exercises for Computer Science
No ratings yet
Automata Exercises for Computer Science
3 pages
FA to Regular Expression Conversion Guide
No ratings yet
FA to Regular Expression Conversion Guide
3 pages
DFA Simulation and Transition Diagram
No ratings yet
DFA Simulation and Transition Diagram
17 pages
ATCD Assignments and Exam Dates
No ratings yet
ATCD Assignments and Exam Dates
13 pages
B.Tech CSE Syllabus: Digital Electronics & Data Structures
No ratings yet
B.Tech CSE Syllabus: Digital Electronics & Data Structures
22 pages
Understanding Mealy Machines
No ratings yet
Understanding Mealy Machines
5 pages
Finite Automata and Regular Languages Quiz
No ratings yet
Finite Automata and Regular Languages Quiz
8 pages
SPPU Third Year CS & Design Syllabus
No ratings yet
SPPU Third Year CS & Design Syllabus
96 pages
Rakudo
No ratings yet
Rakudo
17 pages
Kleene's Theorem and NFA Explained
No ratings yet
Kleene's Theorem and NFA Explained
27 pages
Adaptive Game Theory with Genetic Automata
No ratings yet
Adaptive Game Theory with Genetic Automata
7 pages
Formal Languages and Automata Theory Notes
No ratings yet
Formal Languages and Automata Theory Notes
122 pages