0% found this document useful (0 votes)
13 views24 pages

Understanding Lexical Analysis Basics

Uploaded by

Jayesh Wagh
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views24 pages

Understanding Lexical Analysis Basics

Uploaded by

Jayesh Wagh
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Lexical Analysis

Acknowledgement
• Alfred V Aho, Monica S. Lam, Ravi Sethi,
Jeffrey D Ullman- “Compilers- Principles,
Techniques and Tools”

Girish Kumar Patnaik 2


Lexical Analysis
• Main task of the lexical analyzer is to read
the input characters of the source program,
group them into lexemes, and produce as
output a sequence of tokens for each lexeme
in the source program.
• The stream of tokens is sent to the parser for
syntax analysis.

Girish Kumar Patnaik 3


The Role of t he Lexical
Analyzer
Token,
Source Lexical tokenval
Program Parser
Analyzer
Get next
token

error error

Symbol Table

Girish Kumar Patnaik 4


The Role of the Lexical Analyzer
• Other tasks besides identification of
lexemes
– stripping out comments and whitespace
– Correlating error messages generated by the
compiler with the source program
• line number with each error message
• copy of the source program with the error messages
inserted

Girish Kumar Patnaik 5


Lexical Analysis Versus Parsing
• Simplicity of design is the most important
consideration
– Separation of lexical and syntactic analysis
– For example, deal with comments and whitespace
• Compiler efficiency is improved
– specialized techniques can be applied
• Compiler portability is enhanced
– Input-device-specific peculiarities can be restricted to
the lexical analyzer
Girish Kumar Patnaik 6
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and
an optional attribute value.
• A pattern is a description of the form that the
lexemes of a token may take.
• A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of
that token.

Girish Kumar Patnaik 7


Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that
make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
Girish Kumar Patnaik 8
Tokens, Patterns, and Lexemes

Girish Kumar Patnaik 9


Attributes of Tokens

y := 31 + 28*x Lexical analyzer

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

token
tokenval
Parser
(token attribute)
Girish Kumar Patnaik 10
Attributes of Tokens

Girish Kumar Patnaik 11


Lexical Errors
• The simplest recovery strategy is "panic mode"
recovery. We delete successive characters from
the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what
input is left.
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.

Girish Kumar Patnaik 12


Exercises
Divide the following C + + program:

f loat limitedSquare (x) f loat x {


/* returns x-squared , but never more than 1 00 */
return (x<=- 10 . 0 / / x>=1 0 . 0) ? 1 00 : x*x ;
}
into appropriate lexemes.

Girish Kumar Patnaik 13


Input Buffering
• Need to look one or more characters
beyond the next lexeme before we can be
sure we have the right lexeme.
• Single-character operators like - , =, or <
could also be the beginning of a two-
character operator like - > , ==, or <=.

Girish Kumar Patnaik 14


Input Buffering: Buffer Pairs

• Each buffer is of the same size N


• Read N characters into a buffer
• Pointer “lexemeBegin”, marks the beginning of the current
lexeme
• Pointer “forward” scans ahead until a pattern match is found
• If reached the end of one of the buffers, then reload the other
buffer from the input
Girish Kumar Patnaik 15
Input Buffering: Sentinels

• For each character read, we make two tests:


– one for the end of the buffer,
– other to determine what character is read (the latter may be a
multiway branch)
• Combine the buffer-end test with the test for the current
character
• Each buffer hold a sentinel character at the end
• Sentinel is a special character that cannot be part of the
source program (eof) Girish Kumar Patnaik 16
The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications

Girish Kumar Patnaik 17


Creating a Lexical Analyzer
with Lex and Flex
lex
source lex or flex [Link].c
program compiler
lex.l

[Link].c C [Link]
compiler

input sequence
stream [Link] of tokens

Girish Kumar Patnaik 18


Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }

pn { actionn }
Girish Kumar Patnaik 19
Regular Expressions in Lex
x match the character x
\. match the character .
“string”match contents of string of characters
. match any character except newline
^ match beginning of a line
$ match the end of a line
[xyz] match one character x, y, or z (use \ to escape -)
[^xyz]match any character except x
[^xyz] x, y
y, and z
[a-z] match one of a to z
r* closure (match zero or more occurrences)
r+ positive closure (match one or more occurrences)
r? optional (match zero or one occurrence)
r1 r2 match r1 then r2 (concatenation)
r1|r2 match r1 or r2 (union)
(r) grouping
r1\r2 match r1 when followed by r2
{d} match the regular expression defined by d
Girish Kumar Patnaik 20
Example Lex Specification 1
Contains
%{ the matching
Translation #include <stdio.h> lexeme
%}
rules %%
[0-9]+ { printf(“%s\n”, yytext); }
.|\n { }
%% Invokes
main() the lexical
{ yylex(); analyzer
}

lex spec.l
gcc [Link].c -ll
./[Link] < spec.l
Girish Kumar Patnaik 21
Example Lex Specification 2
%{
#include <stdio.h> Regular
int ch = 0, wd = 0, nl = 0;
definition
Translation %}
delim [ \t]+
rules %%
\n { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim} { ch+=yyleng; wd++; }
. { ch++; }
%%
main()
{ yylex();
printf("%8d%8d%8d\n", nl, wd, ch);
}
Girish Kumar Patnaik 22
Example Lex Specification 3
%{
#include <stdio.h> Regular
%}
definitions
Translation digit [0-9]
letter [A-Za-z]
rules id {letter}({letter}|{digit})*
%%
{digit}+ { printf(“number: %s\n”, yytext); }
{id} { printf(“ident: %s\n”, yytext); }
. { printf(“other: %s\n”, yytext); }
%%
main()
{ yylex();
}

Girish Kumar Patnaik 23


Example Lex Specification 4
%{ /* definitions of manifest constants */
#define LT (256)

%}
delim [ \t\n]
ws {delim}+
letter [A-Za-z] Return
digit [0-9]
id {letter}({letter}|{digit})* token to
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
parser
{ws} { }
if {return IF;} Token
then {return THEN;}
else {return ELSE;}
attribute
{id} {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
“<“ {yylval = LT; return RELOP;}
“<=“ {yylval = LE; return RELOP;}
“=“ {yylval = EQ; return RELOP;}
“<>“ {yylval = NE; return RELOP;}
“>“ {yylval = GT; return RELOP;}
“>=“
%%
{yylval = GE; return RELOP;} Install yytext as
int install_id() identifier in symbol table
Girish Kumar Patnaik 24

You might also like