0% found this document useful (0 votes)
5 views20 pages

Unit 2

The document discusses lexical analysis, the first step in compiler design, which involves transforming source code into tokens while removing whitespace and comments. It covers the role, advantages, and disadvantages of lexical analyzers, as well as input buffering techniques, token specification, and finite automata implications. Additionally, it introduces the Lex tool for generating lexical analyzers and outlines the structure of a Lex program.

Uploaded by

k4yb9vh0mk
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views20 pages

Unit 2

The document discusses lexical analysis, the first step in compiler design, which involves transforming source code into tokens while removing whitespace and comments. It covers the role, advantages, and disadvantages of lexical analyzers, as well as input buffering techniques, token specification, and finite automata implications. Additionally, it introduces the Lex tool for generating lexical analyzers and outlines the structure of a Lex program.

Uploaded by

k4yb9vh0mk
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Unit - 2

Lexical Analysis

2.1 Role of a Lexical analyzer

The first step in the compiler design is LEXICAL ANALYSIS. A Lexer


takes the modified source code that is written in the form of
sentences. In other words, it allows you to transform a sequence
of characters into a sequence of tokens. This syntax is split into a
series of tokens by the lexical analyzer. It eliminates any extra
space or comment written in the source code.

Lexical analyzer programs are called lexical analyzers. A lexer


includes a tokenizer or scanner. If the lexical analyzer detects that
the token is invalid, an error is created. It reads character streams
from the source code, checks for legal tokens, and when it needs
the syntax analyzer, transfers the data to the syntax analyzer

Role of lexical analyzer

The following tasks are carried out by the Lexical Analyzer:

● Helps to classify tokens in the table of symbols

● Removes white spaces and the source program's comments

● Correlates with the source software error messages

● Helps you extend the macros if they are located in the source
program

● Read the source program's input characters

Advantages of lexical analyzer


● Programs such as compilers that can use the parsed data
from a programr's code to construct a compiled binary
executable code use the lexical analyzer technique.

● Web browsers use it to format and view a web page with the
assistance of JavaScript, HTML, CSS Script, CSS Script, and CSS
Script parsed data.

● A separate lexical analyzer allows you to create a specialised


and potentially more powerful processor for the assignment.

Disadvantages of lexical analyzer

● You need to spend considerable time reading and separating


the source program in the form of tokens.

● In contrast to the PEG or EBNF law, certain regular


expressions are very difficult to understand.

● It takes more time to build and debug the lexer and its token
descriptions

● To generate the lexer tables and create the tokens, additional


runtime overhead is required.

Key takeaway :

● The first step in the compiler design is LEXICAL ANALYSIS.

● Lexical analyzer programs are called lexical analyzers.

● It allows you to transform a sequence of characters into a


sequence of tokens.

2.2 Input Buffering

● One or more characters must be looked up past the next


lexeme to guarantee that a suitable lexeme is found.
● A two-buffer framework is thus implemented to safely control
large look aheads.

● Techniques for speeding up the lexical analyzer method have


been introduced, such as the use of sentinels to mark the end
of the buffer.

For the implementation of a lexical analyzer, there are three general


approaches:

(i) To create the lexical analyzer from a regular expression-based


specification by using a lexical-analyzer generator, such as the lex
compiler. The generator offers routines for reading and buffering the
input in this context.

(ii) Using the I/O facilities of that language to interpret the input by
writing the lexical analyzer in a traditional systems-programming
language.

(iii) Through writing the lexical analyzer and handling the reading of
input directly in assembly language.

1. Buffer Pair

Specialized buffering strategies have been developed to


reduce the amount of overhead needed to process an input
character due to a significant amount of time usage in moving
characters.
The buffer pairs used to store the input data are shown in
below Figure :

Fig 1: buffer pair

Scheme

● It consists of two buffers, each consisting of an N-


character size that is alternatively reloaded.

● N-Number of characters, e.g., 4096, on one disc block.

● Using one device read instruction, N characters are read


from the input file to the buffer.

● If the number of characters is less than N, an eof is


inserted at the end.

Pointers

● LexemeBegin and forward are maintained by two


pointers.

● The beginning of the lexeme leads to the beginning of


the present lexeme, which has yet to be identified.
● Forward scans ahead before a match is found for a
pattern.

● Once a lexeme is found, immediately after the lexeme


that is just found, lexeme start is set to the character,
and forward is set to the character at its right end.

● The existing lexeme is the character set that is between


two pointers.

Disadvantages of the scheme

● Most of the time, this scheme works well, but the


amount of looking ahead is reduced.

● In situations where the distance that the forward pointer


must travel is more than the length of the buffer, this
restricted lookahead can make it difficult to recognise
tokens.

● Until the character follows the correct parenthesis, it


cannot decide if DECLARE is a keyword or an array name.

2. Sentinels

● Time the forward pointer is shifted in the previous


system, a check is performed to ensure that one half of
the buffer has not moved off. If it is finished, then it is
appropriate to reload the other half.

● Hence, with each advance of the forward pointer, the


ends of the buffer halves need two checks.
Test 1: For the buffer end.

Test 2: Deciding which character is being read.

● By expanding each buffer half to carry a sentinel


character at the end, the use of sentinel lowers the two
checks to one.

● A sentinel is a unique character that cannot be used in


the source software. (The character of eof is used as a
sentinel).

Fig 2: sentinels

Advantages

● It conducts only one test most of the time to see if a


forward pointer points to an eof.

● This conducts further checks only when it hits the end of


the buffer half or eof.

● The average number of tests per input character is very


close to 1. since N input characters are encountered
between eofs.
Key takeaway :

● Techniques for speeding up the lexical analyzer method have


been introduced, such as the use of sentinels to mark the end
of the buffer.

● A two-buffer framework is thus implemented to safely control


large look aheads.

2.3 Specification and recognition of tokens

Token :

It is said that lexemes are a character sequence (alphanumeric) in a


token. For each lexeme to be identified as a valid token, there are
some predefined rules. These rules, by means of a pattern, are
described by grammar rules. A pattern describes what a token can
be, and by means of regular expressions, these patterns are
described.

Keywords, constants, identifiers, sequences, numbers, operators


and punctuation symbols can be used as tokens in the programming
language.

Example : the variable declaration line in the C language

int num = 60 ;

contains the tokens: int (keyword) , num (identifiers) = (operator) ,


60 (constant) and ; (symbol)

Specification of token :

Now let's explain how the following words are used in language
theory.
1. Alphabets : Any finite set of symbols

● {0,1} is a set of binary alphabets,

● {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal


alphabets,

● {a-z, A-Z} is a set of English language alphabets.

2. Strings : A string is called any finite sequence of alphabets.

3. Special Symbol : The following symbols are used in a


standard high-level language:

Arithmetic Addition(+), Subtraction(-),


Symbols Multiplication(*), Division(/)
Punctuation Comma(,), Semicolon(;), Dot(.)
Assignment =
Special +=, -=, *=, /=
Assignment
Comparison ==, !=. <. <=. >, >=
Preprocessor #

4. Language : A language over any finite set of alphabets is


known as a finite set of strings.

5. Longest Match Rule : When the lexical analyzer reads the


source code, it checks the code letter by letter and concludes
that a phrase is completed when it finds a white space,
operator symbol, or special symbols.

6. Operation : The different language operations are:

● The Union is written in two languages, L and M as , L U


M = {s | s is in L or s is in M}

● Two languages, L and M, are concatenated as, LM = {st


| s is in L and t is in M}

● A Language L's Kleene Closure is written as , L* = Zero


or more occurrence of language L .

7. Notations : If r and s are regular expressions that denote the


L(r) and L(s) languages, then

● Union : L(r)UL(s)

● Concatenation : L(r)L(s)

● Kleene Closure : (L(r))*

8. Representing valid tokens of a language in regular


expression : If x is an expression that is normal, then:

● x* implies the occurrence of x at zero or more.

● x+ implies one or more instances of x.

9. Finite automata : Finite Automata is a state machine that


takes as input a string of symbols and changes its state
accordingly.

If the input string is processed successfully and the automatic


has reached its final state, it will be accepted.

The Finite Automata Mathematical Model consists of :

● Q - Finite set of state


● Σ - Finite set of input symbol

● q0 - One start state

● qf - set of final state

● δ - Transition function

The transition function (δ) maps the finite set of state (Q) to a finite
set of input symbols (Σ), Q × Σ ➔ Q

Recognition of tokens

Reading input characters in the code and creating tokens is the key
task of lexical analysis.

The Lexical Analyzer checks the program's entire source code. One
by one, it recognises each token. Scanners are typically only
implemented when requested by a parser to generate tokens. And
here's how it works.

1) 'Get next token' is a command that is sent to the lexical


analyzer from the parser.

2) The lexical analyzer scans the input until it identifies the next
token upon obtaining this instruction.

3) It returns the Parser token.


Fig 3: lexical analyzer

When generating these tokens, the Lexical Analyzer skips white


spaces and comments. If there is an error, the Lexical Analyzer
compares the error with the source file and the line number.

Key takeaway :

● It is said that lexemes are a character sequence


(alphanumeric) in a token.

● A pattern describes what a token can be, and by means of


regular expressions, these patterns are described.

● Reading input characters in the code and creating tokens is


the key task of lexical analysis.

2.4 Finite Automata Implications

Finite Automata(FA) is the simplest pattern recognition machine.


The finite automatic or finite state machine is an abstract machine
with five components or tuples. It has a set of states and rules that
switch from one state to another, but it depends on the input
symbol that is added. It is essentially an abstract digital machine
model.

Some important features of general automation are shown in the


figure below.

Fig 4: finite automata

The figure above shows the following automated features:

1. Input
2. Output
3. State of Automata
4. State relation
5. Output relation

The following are Finite Automata:

● Q - Finite set of state

● Σ - Finite set of input symbol

● q0 - One start state


● qf - set of final state

● δ - Transition function

A formal computer specification is

{ Q, Σ, q, F, δ }

Finite automata model :

Input tape and finite control can be represented by Finite Automata.

● Input tape : It's a linear tape that has a certain number of


cells. In each cell, each input symbol is positioned.

● Finite control : On receiving unique feedback from the input


tape, the finite control determines the next condition. The tape
reader reads the cells from left to right, one by one, and only
one input symbol is read at a time.

Fig 5: finite automata model


Types of automata

The FA is divided into two types:

1. DFA (Deterministics finite automata)

In a DFA, the computer goes to only one state for a specific


input character. For every input symbol, a transition function is
specified for every state. Null (or ε) movement is also not
permitted in DFA, i.e. DFA does not change state without any
input characters.

DFA has five tuples {Q, ∑, q0, F, δ}

Q: set of all states

∑: finite set of input symbol where δ: Q x ∑ →Q

q0: initial state

F: final state

δ: Transition function

One important thing to remember is that, for a pattern, there


can be several possible DFAs. As a rule, a DFA with a minimum
number of states is favoured.

2. NFA (Non-deterministics finite automata)

NDFA applies to Finite Non-Deterministic Automata. It is used


for a single input to pass through any number of states. NDFA
embraces the NULL step, indicating that without reading the
symbols, it can change the state.

Like the DFA, NDFA also has five states. NDFA, however, has
distinct transformation features.
NDFA's transition role may be described as:

δ: Q x ∑ →2Q

Key takeaway :

● Finite Automata(FA) is the simplest pattern recognition


machine.

● The finite automatic or finite state machine is an abstract


machine with five components or tuples.

● In a DFA, the computer goes to only one state for a specific


input character.

● NFA is used for a single input to pass through any number of


states.

2.5 Designing a lexical analyzer generator

● Lexical Analyzer Generator introduces a tool called Lex, which enables


a lexical analyzer to be described by defining regular expressions to
define token patterns.

● The lex tool input is the lex language

● A program written in the lex language can compile and generate a C


code named [Link].c through the lex compiler, i.e.

Fig 6: lex compiler


Now, as always, C code is compiled by the C compiler and produces a
file called [Link], i.e.

Fig 7: c compiler

The output of the C compiler is a working lexical analyser that can take
an input character stream and generate a token stream, i.e.,

Fig 8: generator of lexical analyzer

● Lex is a lexical analyzer generating program. It is used to produce the


YACC parser. A software that transforms an input stream into a
sequence of tokens is the lexical analyzer. It reads the input stream
and, by implementing the lexical analyzer in the C program, produces
the source code as output.
Fig 9: lexical analyzer generator

If a token is considered invalid by the lexical analyzer, it produces an error.


The lexical analyzer works with the syntax analyzer in close collaboration. It
reads character streams from the source code, checks for legal tokens, and,
when necessary, passes the information to the syntax analyzer.

Structure of lex program

A Lex program has the following form

declarations

%%

transition rules

%%

auxiliary functions

Variable declarations, manifest constants (LT, LE, GT, GE) and standard
definitions are included in the declaration section.

The basic type of rule for transition is : pattern{Action}

Where the pattern is a regular expression and action determines what action
should be taken when it matches the pattern.
Second, the auxiliary function includes whatever extra features are used in
the actions.

You can separately compile these functions and load them with the lexical
analyzer.

LEX Program for Tokens

%{

ID, NUMBNER, RELOP, IF, THEN,

ELSE

% } || declaration of variables

delim [ space\t\n ]

ws [ delim ] +

letter [A - Z - a - z ]

digit [ 0 - 9 ]

id letter (letter / digit)

Number { digit } + ( \ . { digit } + ) ?

E [ +- ] ? {digit +} +)?

%%

*transition rule * |

{ ws } { No Action }

If { return (IF) ; }
{ id } { return (ID) ; Install - ID () ; }

1 { number } { return (number); Install-num (); }

“<” { return (RELOP, LT) }

“<=” { return (RELOP, LE) }

“=” { return (RELOP, EQ) }

%%

|* Auxiliary Section *|

install-ID ()

install-num ()

Key takeaway :

● The lex tool input is the lex language.

● Lexical Analyzer Generator introduces a tool called Lex, which enables


a lexical analyzer to be described by defining regular expressions to
define token patterns.

● Lex is a lexical analyzer generating program.


● It is used to produce the YACC parser.

References :

1. “Compiler Construction”, Dhamdere, Mc-Millan.

2. “Compilers - Principles, Techniques and Tools”, A.V. Aho, R. Shethi


and [Link], Addison

Wesley Publishing Company.

3. “Compiler Construction”, Barret, Bates, Couch, Galgotia.

4. “Unix Programming”, Pepkin Pike.

You might also like