Lexical Analysis
Acknowledgement
• Alfred V Aho, Monica S. Lam, Ravi Sethi,
Jeffrey D Ullman- “Compilers- Principles,
Techniques and Tools”
Girish Kumar Patnaik 2
Lexical Analysis
• Main task of the lexical analyzer is to read
the input characters of the source program,
group them into lexemes, and produce as
output a sequence of tokens for each lexeme
in the source program.
• The stream of tokens is sent to the parser for
syntax analysis.
Girish Kumar Patnaik 3
The Role of t he Lexical
Analyzer
Token,
Source Lexical tokenval
Program Parser
Analyzer
Get next
token
error error
Symbol Table
Girish Kumar Patnaik 4
The Role of the Lexical Analyzer
• Other tasks besides identification of
lexemes
– stripping out comments and whitespace
– Correlating error messages generated by the
compiler with the source program
• line number with each error message
• copy of the source program with the error messages
inserted
Girish Kumar Patnaik 5
Lexical Analysis Versus Parsing
• Simplicity of design is the most important
consideration
– Separation of lexical and syntactic analysis
– For example, deal with comments and whitespace
• Compiler efficiency is improved
– specialized techniques can be applied
• Compiler portability is enhanced
– Input-device-specific peculiarities can be restricted to
the lexical analyzer
Girish Kumar Patnaik 6
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and
an optional attribute value.
• A pattern is a description of the form that the
lexemes of a token may take.
• A lexeme is a sequence of characters in the source
program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of
that token.
Girish Kumar Patnaik 7
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that
make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
Girish Kumar Patnaik 8
Tokens, Patterns, and Lexemes
Girish Kumar Patnaik 9
Attributes of Tokens
y := 31 + 28*x Lexical analyzer
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
token
tokenval
Parser
(token attribute)
Girish Kumar Patnaik 10
Attributes of Tokens
Girish Kumar Patnaik 11
Lexical Errors
• The simplest recovery strategy is "panic mode"
recovery. We delete successive characters from
the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what
input is left.
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
Girish Kumar Patnaik 12
Exercises
Divide the following C + + program:
f loat limitedSquare (x) f loat x {
/* returns x-squared , but never more than 1 00 */
return (x<=- 10 . 0 / / x>=1 0 . 0) ? 1 00 : x*x ;
}
into appropriate lexemes.
Girish Kumar Patnaik 13
Input Buffering
• Need to look one or more characters
beyond the next lexeme before we can be
sure we have the right lexeme.
• Single-character operators like - , =, or <
could also be the beginning of a two-
character operator like - > , ==, or <=.
Girish Kumar Patnaik 14
Input Buffering: Buffer Pairs
• Each buffer is of the same size N
• Read N characters into a buffer
• Pointer “lexemeBegin”, marks the beginning of the current
lexeme
• Pointer “forward” scans ahead until a pattern match is found
• If reached the end of one of the buffers, then reload the other
buffer from the input
Girish Kumar Patnaik 15
Input Buffering: Sentinels
• For each character read, we make two tests:
– one for the end of the buffer,
– other to determine what character is read (the latter may be a
multiway branch)
• Combine the buffer-end test with the test for the current
character
• Each buffer hold a sentinel character at the end
• Sentinel is a special character that cannot be part of the
source program (eof) Girish Kumar Patnaik 16
The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications
Girish Kumar Patnaik 17
Creating a Lexical Analyzer
with Lex and Flex
lex
source lex or flex [Link].c
program compiler
lex.l
[Link].c C [Link]
compiler
input sequence
stream [Link] of tokens
Girish Kumar Patnaik 18
Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Girish Kumar Patnaik 19
Regular Expressions in Lex
x match the character x
\. match the character .
“string”match contents of string of characters
. match any character except newline
^ match beginning of a line
$ match the end of a line
[xyz] match one character x, y, or z (use \ to escape -)
[^xyz]match any character except x
[^xyz] x, y
y, and z
[a-z] match one of a to z
r* closure (match zero or more occurrences)
r+ positive closure (match one or more occurrences)
r? optional (match zero or one occurrence)
r1 r2 match r1 then r2 (concatenation)
r1|r2 match r1 or r2 (union)
(r) grouping
r1\r2 match r1 when followed by r2
{d} match the regular expression defined by d
Girish Kumar Patnaik 20
Example Lex Specification 1
Contains
%{ the matching
Translation #include <stdio.h> lexeme
%}
rules %%
[0-9]+ { printf(“%s\n”, yytext); }
.|\n { }
%% Invokes
main() the lexical
{ yylex(); analyzer
}
lex spec.l
gcc [Link].c -ll
./[Link] < spec.l
Girish Kumar Patnaik 21
Example Lex Specification 2
%{
#include <stdio.h> Regular
int ch = 0, wd = 0, nl = 0;
definition
Translation %}
delim [ \t]+
rules %%
\n { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim} { ch+=yyleng; wd++; }
. { ch++; }
%%
main()
{ yylex();
printf("%8d%8d%8d\n", nl, wd, ch);
}
Girish Kumar Patnaik 22
Example Lex Specification 3
%{
#include <stdio.h> Regular
%}
definitions
Translation digit [0-9]
letter [A-Za-z]
rules id {letter}({letter}|{digit})*
%%
{digit}+ { printf(“number: %s\n”, yytext); }
{id} { printf(“ident: %s\n”, yytext); }
. { printf(“other: %s\n”, yytext); }
%%
main()
{ yylex();
}
Girish Kumar Patnaik 23
Example Lex Specification 4
%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim [ \t\n]
ws {delim}+
letter [A-Za-z] Return
digit [0-9]
id {letter}({letter}|{digit})* token to
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
parser
{ws} { }
if {return IF;} Token
then {return THEN;}
else {return ELSE;}
attribute
{id} {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
“<“ {yylval = LT; return RELOP;}
“<=“ {yylval = LE; return RELOP;}
“=“ {yylval = EQ; return RELOP;}
“<>“ {yylval = NE; return RELOP;}
“>“ {yylval = GT; return RELOP;}
“>=“
%%
{yylval = GE; return RELOP;} Install yytext as
int install_id() identifier in symbol table
Girish Kumar Patnaik 24
…