0% found this document useful (0 votes)

33 views23 pages

Understanding Lexical Analysis in Compilers

Uploaded by

Chaudhary Waqas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

33 views23 pages

Understanding Lexical Analysis in Compilers

Uploaded by

Chaudhary Waqas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Introduction of Lexical Analysis

Lexical Analysis is the first phase of the compiler also known as a scanner. It converts the High level input
program into a sequence of Tokens.

1. Lexical Analysis can be implemented with the Deterministic finite Automata.

2. The output is a sequence of tokens that is sent to the parser for syntax analysis

What is a
Token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages. Example of tokens:

 Type token (id, number, real, . . . )

 Punctuation tokens (IF, void, return, . . . )

 Alphabetic tokens (keywords)

Keywords; Examples-for, while, if etc.

Identifier; Examples-Variable name, function name, etc.

Operators; Examples '+', '++', '-' etc.

Separators; Examples ',' ';' etc

Example of Non-Tokens:

 Comments, preprocessor directive, macros, blanks, tabs, newline, etc.

Lexeme: The sequence of characters matched by a pattern to form the corresponding token or a
sequence of input characters that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .

How Lexical Analyzer Works?

1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical
analysis. This may include removing comments, whitespace, and other non-essential characters
from the input text.

2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is
usually done by matching the characters in the input text against a set of patterns or regular
expressions that define the different types of tokens.

3. Token classification: In this stage, the lexer determines the type of each token. For example, in a
programming language, the lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid according to the rules of
the programming language. For example, it might check that a variable name is a valid identifier,
or that an operator has the correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next stage
of compilation or interpretation.
 The lexical analyzer identifies the error with the help of the automation machine and the
grammar of the given language on which it is based like C, C++, and gives row number and
column number of the error.

Suppose we pass a statement through lexical analyzer – a = b + c; It will generate token

sequence like this: id=id+id; Where each id refers to it’s variable in the symbol table referencing
all details For example, consider the program

int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'

'a' '=' '10' ';' 'return' '0' ';' '}'

Above are the valid tokens. You can observe that we have omitted comments. As another example,
consider below printf statement.

There are 5 valid token in this printf statement. Exercise 1: Count number of tokens:

int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}

Answer: Total number of token: 27.

Exercise 2: Count number of tokens: int max(int i);

 Lexical analyzer first read int and finds it to be valid and accepts as token.

 max is read by it and found to be a valid function name after reading (

 int is also a token , then again I as another token and finally ;

Answer: Total number of tokens 7:

int, max, ( ,int, i, ), ;

 We can represent in the form of lexemes and tokens as under

Lexemes Tokens Lexemes Tokens

while WHILE a IDENTIEFIER

( LAPREN = ASSIGNMENT

a IDENTIFIER a IDENTIFIER

>= COMPARISON – ARITHMETIC

b IDENTIFIER 2 INTEGER

) RPAREN ; SEMICOLON

Advantages

1. Simplifies Parsing:Breaking down the source code into tokens makes it easier for computers to
understand and work with the code. This helps programs like compilers or interpreters to figure
out what the code is supposed to do. It’s like breaking down a big puzzle into smaller pieces,
which makes it easier to put together and solve.
2. Error Detection: Lexical analysis will detect lexical errors such as misspelled keywords or
undefined symbols early in the compilation process. This helps in improving the overall efficiency
of the compiler or interpreter by identifying errors sooner rather than later.

3. Efficiency: Once the source code is converted into tokens, subsequent phases of compilation or
interpretation can operate more efficiently. Parsing and semantic analysis become faster and
more streamlined when working with tokenized input.

Disadvantages

1. Limited Context: Lexical analysis operates based on individual tokens and does not consider the
overall context of the code. This can sometimes lead to ambiguity or misinterpretation of the
code’s intended meaning especially in languages with complex syntax or semantics.

2. Overhead: Although lexical analysis is necessary for the compilation or interpretation process, it
adds an extra layer of overhead. Tokenizing the source code requires additional computational
resources which can impact the overall performance of the compiler or interpreter.

3. Debugging Challenges: Lexical errors detected during the analysis phase may not always provide
clear indications of their origins in the original source code. Debugging such errors can be
challenging especially if they result from subtle mistakes in the lexical analysis process.

Compiler Design - Syntax Analysis

Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic
concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and
pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations
of the regular expressions. Regular expressions cannot check balancing tokens, such as parenthesis.
Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar, as depicted below:

It implies that every Regular Grammar is also context-free, but there exists some problems, which are
beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming
languages.

Context-Free Grammar

In this section, we will first see the definition of context-free grammar and introduce terminologies
used in parsing technology.

A context-free grammar has four components:

 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the
grammar.

 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.

 A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of
a non-terminal called the left side of the production, an arrow, and a sequence of tokens
and/or on- terminals, called the right side of the production.

 One of the non-terminals is designated as the start symbol (S); from where the production
begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.

Example

We take the problem of palindrome language, which cannot be described by means of Regular
Expression. That is, L = { w | w = wR } is not a regular language. But it can be described by means of
CFG, as illustrated below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }

Σ = { 0, 1 }

P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }

S={Q}

This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111, etc.

Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in the
code. The output of this phase is a parse tree.

This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a
parse tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovering strategies, which we will learn later in this chapter.

Derivation

A derivation is basically a sequence of production rules, in order to get the input string. During
parsing, we take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.

 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation

If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation

If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.

Example

Production rules:

E→E+E

E→E*E
E → id

Input string: id + id * id

The left-most derivation is:

E→E*E

E→E+E*E

E → id + E * E

E → id + id * E

E → id + id * id

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:

E→E+E

E→E+E*E

E → E + E * id

E → E + id * id

E → id + id * id

Parse Tree

A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived
from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.

We take the left-most derivation of a + b * c

The left-most derivation is:

E→E*E

E→E+E*E

E → id + E * E

E → id + id * E

E → id + id * id

Step 1:
E→E*E

Step 2:

E→E+E*E

Step 3:

E → id + E * E
Step 4:

E → id + id * E

Step 5:

E → id + id * id

In a parse tree:

 All leaf nodes are terminals.

 All interior nodes are non-terminals.

 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent
nodes.

Ambiguity

A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at
least one string.
Example

E→E+E

E→E–E

E → id

For the string id + id – id, the above grammar generates two parse trees:

The language generated by an ambiguous grammar is said to be inherently ambiguous. Ambiguity in

grammar is not good for a compiler construction. No method can detect and remove ambiguity
automatically, but it can be removed by either re-writing the whole grammar without ambiguity, or by
setting and following associativity and precedence constraints.

Associativity

If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, then the operand
will be taken by the left operator or if the operation is right-associative, the right operator will take the
operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:
id op (id op id)

For example, id ^ (id ^ id)

Precedence

If two different operators share a common operand, the precedence of operators decides which will
take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and
another corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily
removed. As in the previous example, mathematically * (multiplication) has precedence over +
(addition), so the expression 2+3*4 will always be interpreted as:

2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion

A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as
the left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down
parsers. Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So,
when the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge
when to stop parsing the left non-terminal and it goes into an infinite loop.

Example:

(1) A => Aα | β

(2) S => Aα | β

A => Sd

(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a
string of non-terminals.

(2) is an example of indirect-left recursion.

A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and the
parser may go into a loop forever.

Removal of Left Recursion

One way to remove left recursion is to use the following technique:

The production

A => Aα | β

is converted into following productions

A => βA'

A'=> αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left recursion.

Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.

START

Arrange non-terminals in some order like A1, A2, A3,…, An

for each i from 1 to n

for each j from 1 to i-1

replace each production of form Ai ⟹Aj𝜸

with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 |…| 𝜸

where Aj ⟹ δ1 | δ2|…| δn are current Aj productions

eliminate immediate left-recursion

END

Example
The production set

S => Aα | β

A => Sd

after applying the above algorithm, should become

S => Aα | β

A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'

A' => αdA' | ε

Now none of the production has either direct or indirect left recursion.

Left Factoring

If more than one grammar production rules has a common prefix string, then the top-down parser
cannot make a choice as to which of the production it should take to parse the string in hand.

Example

If a top-down parser encounters a production like

A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string as both productions are
starting from the same terminal (or non-terminal). To remove this confusion, we use a technique
called left factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we
make one production for each common prefixes and the rest of the derivation is added by new
productions.

Example

The above productions can be written as

A => αA'

A'=> β | 𝜸 | …

Now the parser has only one production per prefix which makes it easier to take decisions.

First and Follow Sets

An important part of parser table construction is to create first and follow sets. These sets can provide
the actual position of any terminal in the derivation. This is done to create the parsing table where the
decision of replacing T[A, t] = α with some production rule.
First Set

This set is created to know what terminal symbol is derived in the first position by a non-terminal. For
example,

α→tβ

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set

Look at the definition of FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.

 if α is a non-terminal and α → ℇ is a production, then FIRST(α) = { ℇ }.

 if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n and any FIRST(𝜸) contains t then t is in

FIRST(α).

First set can be seen as:

Follow Set

Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production

rules. We do not consider what the non-terminal can generate but instead, we see what would be the
next terminal symbol that follows the productions of a non-terminal.

Algorithm for calculating Follow set:

 if α is a start symbol, then FOLLOW() = $

 if α is a non-terminal and has a production α → AB, then FIRST(B) is in FOLLOW(A) except ℇ.

 if α is a non-terminal and has a production α → AB, where B ℇ, then FOLLOW(A) is in

FOLLOW(α).

Follow set can be seen as: FOLLOW(α) = { t | S αt}

Limitations of Syntax Analyzers

Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers
are responsible for the validity of a token supplied by the syntax analyzer. Syntax analyzers have the
following drawbacks -

 it cannot determine if a token is valid,

 it cannot determine if a token is declared before it is being used,

 it cannot determine if a token is initialized before it is being used,

 it cannot determine if an operation performed on a token type is valid or not.

These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
Explained Other way ---

Lexical Analysis

Lexical Analysis is the first phase of a compiler, where the source code is analyzed to identify meaningful
elements called tokens. It is often referred to as the "scanning" phase, as it involves reading the source
code sequentially and breaking it down into basic units.

Example:

Consider a simple arithmetic expression:

int sum = a + b * 5;

During lexical analysis, this code will be broken down into the following tokens:

 int, sum, =, a, +, b, *, 5, ;

Role of Lexical Analyzer

The lexical analyzer (also known as a lexer or scanner) is responsible for reading the input source code
and converting it into a stream of tokens. Its roles include:

1. Tokenization: Breaking the input string into meaningful symbols or tokens.

o Example: For the expression a + b, the tokens generated could be a, +, b.

2. Removing White Spaces and Comments: The lexer typically removes unnecessary elements like
white spaces, line breaks, and comments that do not contribute to the code's meaning.

3. Error Reporting: If the lexer encounters an invalid token (e.g., an unrecognized character), it
reports an error and may provide the location of the issue.

o Example: If the source code contains an invalid character like # in a language where # is
not defined, the lexer will flag this as an error.

4. Handling Literals and Identifiers: The lexer distinguishes between different types of tokens such
as keywords, identifiers, literals (e.g., numbers, strings), and operators.

Tokens, Lexemes, and Patterns

 Token: A token is a pair consisting of a token name and an optional attribute value. It represents
a class of lexemes.

o Example: The keyword int is a token representing the data type integer.

 Lexeme: A lexeme is a sequence of characters in the source code that matches the pattern for a
token and is identified by the lexical analyzer.
o Example: In the code int sum = 5;, the lexemes are int, sum, =, 5, and ;.

 Pattern: A pattern is a rule that specifies the structure of lexemes that can be recognized as a
particular token.

o Example: The pattern for an identifier might be [a-zA-Z_][a-zA-Z0-9_]*, which matches

any string starting with a letter or underscore, followed by any number of letters, digits,
or underscores.

Regular Expressions and Finite Automata

 Regular Expressions: Regular expressions are used to specify the patterns for tokens in a formal
and concise manner. They define the set of strings that match a particular pattern.

o Example: The regular expression [0-9]+ matches one or more digits, which can be used
to recognize integer literals.

 Finite Automata: Finite automata (FA) are abstract machines used to recognize patterns
described by regular expressions. There are two types of finite automata:

1. Deterministic Finite Automaton (DFA): A DFA has a finite number of states and
transitions between them based on input symbols. It can be used to recognize a regular
language.

 Example: A DFA for recognizing binary numbers (0 or 1) might have states that
transition between accepting or rejecting based on the input bit.

2. Non-deterministic Finite Automaton (NFA): An NFA allows multiple transitions for the
same input symbol or no transition at all. NFAs can be converted to DFAs for practical
implementation.

 Example: An NFA for recognizing the pattern a*b could have multiple paths
where the a* part can accept any number of a characters followed by b.

Summary

 Lexical Analysis involves breaking down source code into tokens.

 The lexical analyzer performs tokenization, removes white spaces and comments, reports errors,
and distinguishes between different types of tokens.

 Tokens are basic units, lexemes are the actual character sequences, and patterns define what
lexemes should look like.

 Regular expressions and finite automata are tools used to define and recognize patterns in the
source code.

You said:

generte automation for d((a+bd)c)*

To generate an automation (finite automaton) for the regular expression d((a+bd)c)*, we'll break down
the regular expression and construct the corresponding deterministic finite automaton (DFA). Here's a
step-by-step explanation:

Regular Expression Breakdown

1. d: This requires that the string starts with the character d.

2. (a + bd): This indicates that after d, there can be either a or bd.

3. c: This c follows whatever was chosen in (a + bd).

4. *: This asterisk * means that the pattern (a + bd)c can repeat zero or more times.

Steps to Create DFA

1. Start State:

o Begin with a start state q0.

2. Transition for d:

o From q0, if the input is d, move to state q1.

3. Handling (a + bd):

o From state q1, if the input is a, transition to state q2.

o Alternatively, if the input is b, transition to state q3.

o From state q3, if the input is d, transition to state q4.

4. Transition for c:

o From state q2, if the input is c, transition back to state q1 (because the pattern can
repeat).

o From state q4, if the input is c, transition back to state q1.

5. Accept State:

o The accepting state should be q1 because the pattern can repeat, so q1 is where the
string is accepted if it ends with c.

DFA Representation in English

1. Start at q0.

2. If the input is d, move to q1.

3. If at q1:
o If the input is a, move to q2.

o If the input is b, move to q3.

4. If at q3 and the input is d, move to q4.

5. If at q2 or q4 and the input is c, move back to q1.

6. Accept the string if it ends in state q1.

Summary of Transitions

 q0 → q1 on input d

 q1 → q2 on input a

 q1 → q3 on input b

 q3 → q4 on input d

 q2 → q1 on input c

 q4 → q1 on input c

Final DFA

The DFA will have states q0, q1, q2, q3, q4, where q0 is the start state and q1 is both the accepting and
repeating state. The transitions will allow the string to be accepted if it follows the pattern d((a+bd)c)*.

You said:

Lexical analyzer examples?

A lexical analyzer (or lexer) is a component of a compiler or interpreter that processes the input source
code and converts it into a sequence of tokens. Here are some examples to illustrate how a lexical
analyzer works:

Example 1: Simple Arithmetic Expression

Source Code:

int result = a + b * 5;

Lexical Analysis Process:

1. Input: int result = a + b * 5;

2. Output Tokens:

o int → Keyword

o result → Identifier

o = → Assignment Operator

o a → Identifier

o + → Addition Operator

o b → Identifier

o * → Multiplication Operator

o 5 → Numeric Literal

o ; → Semicolon

Explanation:

 The lexical analyzer reads the code character by character.

 It recognizes keywords like int, operators like =, +, *, and the semicolon ;.

 It also identifies identifiers like result, a, b, and the numeric literal 5.

Example 2: Variable Declaration and Assignment

Source Code:

name = ""

Lexical Analysis Process:

1. Input: name = ""

2. Output Tokens:

o name → Identifier

o = → Assignment Operator

o "" → String Literal

Explanation:

 The lexer identifies name as an identifier and the equal sign = as an assignment operator.

 The string "" is recognized as a string literal.

Example 3: Conditional Statement

Source Code:

java

if (x > 10) {

y = x * 2;

Lexical Analysis Process:

1. Input: if (x > 10) { y = x * 2; }

2. Output Tokens:
 if → Keyword
 ( → Left Parenthesis
 x → Identifier
 > → Relational Operator
 10 → Numeric Literal
 ) → Right Parenthesis
 { → Left Brace
 y → Identifier
 = → Assignment Operator
 x → Identifier
 → Multiplication Operator
 2 → Numeric Literal
 ; → Semicolon
 } → Right Brace

Explanation:

 The lexical analyzer processes the if statement, identifying keywords (if), operators (>, =, *),
literals (10, 2), and symbols ((, ), {, }, ;).

 Each component is categorized into its respective token type.

Example 4: Function Definition

Source Code:

def add(a, b):

return a + b

Lexical Analysis Process:

1. Input: def add(a, b): return a + b

2. Output Tokens:
 def → Keyword
 add → Identifier
 ( → Left Parenthesis
 a → Identifier
 , → Comma
 b → Identifier
 ) → Right Parenthesis
 : → Colon
 return → Keyword
 a → Identifier
 + → Addition Operator
 b → Identifier

Explanation:

 The lexer identifies function-related keywords (def, return), identifiers (add, a, b), and
operators/symbols ((, ), ,, :, +).

Summary

In each example, the lexical analyzer reads the input code, removes irrelevant characters like white
spaces or comments, and produces a sequence of tokens. These tokens are then passed on to the next
stage of compilation or interpretation (usually syntax analysis).

Common questions

A lexical analyzer, also known as a scanner, is the first phase of a compiler. It processes a high-level input program and converts it into a sequence of tokens, which are the input for the parser in the syntax analysis phase . The lexical analyzer removes non-tokens like whitespace and comments, identifying elements such as keywords, identifiers, and operators as tokens . This streamlining facilitates the parsing process, as the parser can then focus on validating the sequence of tokens according to the grammar rules without being distracted by extraneous characters .

Derivation in syntax analysis involves applying production rules to transform the start symbol into a sentence of a language. In left-most derivation, the left-most non-terminal is replaced first in each step, leading to a left-sentential form . Conversely, right-most derivation replaces the right-most non-terminal first, producing a right-sentential form . The impact on the parse tree is significant, as different derivation approaches result in different tree structures and can affect the understanding of associativity and operator precedence . For example, parse trees illustrate whether an operation should be executed before another due to operator precedence, which is visualized through the tree’s structure with deeper sub-trees having higher precedence .

Left recursion in grammars poses a problem for top-down parsers, which attempt to expand the left-most non-terminal first. If a grammar is left-recursive, the parser may fall into an infinite loop as it repeatedly attempts to expand a non-terminal with itself at the left-most position . To resolve left recursion without altering the language, we transform the grammar. For direct left recursion, a common method involves changing a production of the form A → Aα | β into A → βA' and A' → αA' | ε, effectively eliminating the immediate left recursion . Indirect left recursion can also be resolved systematically by reordering non-terminals and applying similar transformations, ensuring that the resulting grammar is free of left recursion while generating the same strings .

Associativity and precedence play critical roles in resolving operator expression ambiguities in grammar, guiding how operators are grouped in expressions . Associativity defines the order of operations for operators of the same precedence level; for instance, left-associative operators like addition and multiplication are grouped left-to-right (e.g., a + b + c is evaluated as (a + b) + c). Precedence dictates the order in which operations are performed when different operators appear together. In the expression 2 + 3 * 4, precedence rules ensure multiplication occurs before addition, leading to 2 + (3 * 4) rather than (2 + 3) * 4 . These rules provide a systematic way to handle potential parsing ambiguities where multiple parse trees might be possible, ensuring consistent evaluation and interpretation of expressions without ambiguity .

Deterministic finite automata (DFAs) are used in lexical analysis because they provide an efficient method for implementing pattern recognition required to tokenize input text . DFAs are state machines that process input strings one character at a time, transitioning between states based on the input. This enables the lexical analyzer to match strings against patterns defining tokens such as keywords, identifiers, and operators . The deterministic nature ensures that for any given input, the DFA can unambiguously determine the state transitions, thereby efficiently recognizing token boundaries and accurately generating a stream of tokens . This precise control over the input processing enhances the overall speed and reliability of the tokenization process in compilers and interpreters .

Token validation ensures that each recognized token matches the rules specified by the programming language's grammar, confirming their syntactic validity before being passed to the parser . During this stage, the lexical analyzer checks that variable names adhere to identifier naming conventions, keywords are recognized correctly, and that operators are valid according to language syntax . This validation process is crucial to ensure that the parsed tokens do not lead to syntax errors in later compilation stages due to incorrect token identification. By maintaining fidelity to the language's syntactic rules, the lexical analyzer aids in preventing errors that the parser would otherwise need to handle, streamlining the parsing process .

Ambiguity in grammar is problematic because it leads to multiple parse trees for the same string, which can cause inconsistent interpretations of code . This situation is undesirable in compiler construction because it complicates parsing and affects the correctness and predictability of the compiler's output . To address ambiguity, grammar can be rewritten to exclude ambiguous constructs or by defining operator precedence and associativity clearly . Although no automatic method exists for removing ambiguity, these strategies can substantially mitigate potential difficulties by ensuring that the language grammar has a unique parse tree for each valid string .

Left factoring is a grammar transformation technique used to make top-down parsers more effective when they encounter grammar productions with common prefixes in rules . It is necessary because without left factoring, a top-down parser may face difficulties in choosing which production rule to apply, as it relies on lookahead to make decisions. Left factoring involves rewriting the grammar such that each decision point has a separate production rule for common prefixes. For example, if the production is A → αβ | αγ, it is rewritten as A → αA' and A' → β | γ, resolving ambiguity and simplifying decisions made by the parser .

During syntax analysis, a parser's main challenge in error recovery is to continue processing the input code despite encountering errors, aiming to report as many errors as possible in a single pass without getting stuck . This challenge arises because syntax errors can disrupt the orderly flow of parsing, potentially leading to cascading errors that obscure the actual source of the problem. To overcome these challenges, parsers employ strategies such as panic mode error recovery, which skips tokens until a synchronizing token is found, and phrase level recovery, where the parser attempts to replace or insert a minimal number of tokens to repair the error . These strategies help the parser to regain a correct parsing state quickly, allowing it to continue parsing the remainder of the input and identify additional errors .

First and follow sets are fundamental in constructing parsing tables, which are used in syntax analysis to determine parsing actions. The first set of a non-terminal is the set of terminals that begin the strings derivable from the non-terminal . The follow set, on the other hand, contains terminals that can appear immediately to the right of the non-terminal in some sentential form. Parsing tables use these sets to decide the production rule to apply for a given non-terminal and terminal in the input . When constructing an LL(1) parsing table, entries are determined using the first set for non-terminals and the follow set for handling ε-productions, which ensures that the parser can decide next actions based solely on the current input and stack symbols .

UNIT-1 Compiler Design
No ratings yet
UNIT-1 Compiler Design
38 pages
Understanding Context-Free Grammar in Compilers
No ratings yet
Understanding Context-Free Grammar in Compilers
18 pages
Lexical Analyzer and Syntax Analyzer
No ratings yet
Lexical Analyzer and Syntax Analyzer
9 pages
Module 5
No ratings yet
Module 5
78 pages
Compiler Design Phases Overview
No ratings yet
Compiler Design Phases Overview
117 pages
Compiler Construction Tools Overview
No ratings yet
Compiler Construction Tools Overview
5 pages
Lexical Analyzer and Tokenization Overview
No ratings yet
Lexical Analyzer and Tokenization Overview
16 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
46 pages
Lexical Analysis in Compilers Explained
No ratings yet
Lexical Analysis in Compilers Explained
12 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
10 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
10 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
56 pages
Compiler Phases Explained with Examples
No ratings yet
Compiler Phases Explained with Examples
16 pages
Lexical Error Handling Techniques
No ratings yet
Lexical Error Handling Techniques
56 pages
Lexical vs Syntax Analysis Explained
No ratings yet
Lexical vs Syntax Analysis Explained
3 pages
Chapter Two
No ratings yet
Chapter Two
13 pages
CD - CH2 - Lexical Analysis
No ratings yet
CD - CH2 - Lexical Analysis
58 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
107 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
55 pages
CD Chapter 1
No ratings yet
CD Chapter 1
28 pages
DFA in Lexical Analysis of Compilers
No ratings yet
DFA in Lexical Analysis of Compilers
24 pages
Language Processors and Lexical Analysis
No ratings yet
Language Processors and Lexical Analysis
85 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
59 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
22 pages
Chapter 2-Lexical Analysis
No ratings yet
Chapter 2-Lexical Analysis
52 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
48 pages
Compiler Parser Overview and Techniques
No ratings yet
Compiler Parser Overview and Techniques
126 pages
CD Chapter 1
No ratings yet
CD Chapter 1
28 pages
Compiler Design: Lexical Analysis Overview
No ratings yet
Compiler Design: Lexical Analysis Overview
28 pages
Lexical Analyzer in Compiler Design
No ratings yet
Lexical Analyzer in Compiler Design
26 pages
Phases of Compiler Compilation Process
No ratings yet
Phases of Compiler Compilation Process
21 pages
Lexical Analyzer in Compiler Design
No ratings yet
Lexical Analyzer in Compiler Design
16 pages
Compiler Notes 8
No ratings yet
Compiler Notes 8
22 pages
Chapter - 6 - Automata Theory and Compiler
No ratings yet
Chapter - 6 - Automata Theory and Compiler
32 pages
Lexical Analyzer Output Explained
No ratings yet
Lexical Analyzer Output Explained
16 pages
CD Unit-1 Notes
No ratings yet
CD Unit-1 Notes
12 pages
Front-End Compiler Design Overview
No ratings yet
Front-End Compiler Design Overview
20 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
35 pages
Lexical and Syntax Analysis in Compilers
No ratings yet
Lexical and Syntax Analysis in Compilers
72 pages
Introduction to Compiler Design
No ratings yet
Introduction to Compiler Design
30 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
42 pages
4 Lexical Analysis
No ratings yet
4 Lexical Analysis
60 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
16 pages
Compiler Design: Lexical Analysis Overview
No ratings yet
Compiler Design: Lexical Analysis Overview
51 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
14 pages
CD - Ch.1
No ratings yet
CD - Ch.1
28 pages
CD Chap 1
No ratings yet
CD Chap 1
28 pages
Compiler Design: Phases and Analysis
No ratings yet
Compiler Design: Phases and Analysis
50 pages
Compiler Design: Lexical & Syntax Analysis
No ratings yet
Compiler Design: Lexical & Syntax Analysis
31 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
37 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
150 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
52 pages
Compiler Structure and Lexical Analysis
No ratings yet
Compiler Structure and Lexical Analysis
5 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
59 pages
Compiler Construction Overview and Phases
No ratings yet
Compiler Construction Overview and Phases
36 pages
Language Processing and Lexical Analysis
No ratings yet
Language Processing and Lexical Analysis
60 pages
Lexical Analysis in Compiler Design
No ratings yet
Lexical Analysis in Compiler Design
67 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
5 pages
Understanding Lexical Analysis in Compilers
No ratings yet
Understanding Lexical Analysis in Compilers
22 pages
Financial Awareness Guide for Exams
No ratings yet
Financial Awareness Guide for Exams
64 pages
T.Y.B.Sc. Java Practical Exam Guide
No ratings yet
T.Y.B.Sc. Java Practical Exam Guide
30 pages
DC Generator Operation and Types
No ratings yet
DC Generator Operation and Types
30 pages
Interhandel Case Overview
No ratings yet
Interhandel Case Overview
11 pages
Dumangas: History and Haw-as Festival
No ratings yet
Dumangas: History and Haw-as Festival
10 pages
DLP Miel
No ratings yet
DLP Miel
9 pages
Robots' Last Wish in a Wasteland
No ratings yet
Robots' Last Wish in a Wasteland
16 pages
Technical Qualifications Timeline 2022/23
No ratings yet
Technical Qualifications Timeline 2022/23
2 pages
Concepts of Genetics 12th Edition Masteringgenetics William S. Klug - Ebook
100% (3)
Concepts of Genetics 12th Edition Masteringgenetics William S. Klug - Ebook
192 pages
Primary Heating Systems Overview
No ratings yet
Primary Heating Systems Overview
18 pages
Dairy Nutrition Guidelines and Insights
No ratings yet
Dairy Nutrition Guidelines and Insights
9 pages
Writing A Reply Letter To Linda Liu (Marking Scheme)
No ratings yet
Writing A Reply Letter To Linda Liu (Marking Scheme)
2 pages
OffendES: Spanish Corpus for Offensive Language
No ratings yet
OffendES: Spanish Corpus for Offensive Language
13 pages
Collection of E Books On Hypnotherapy and Silva Mind Control
No ratings yet
Collection of E Books On Hypnotherapy and Silva Mind Control
2 pages
DMCPA80FL Media Receiver Manual
No ratings yet
DMCPA80FL Media Receiver Manual
30 pages
Learning Plan for Student Engagement
No ratings yet
Learning Plan for Student Engagement
4 pages
Brutal Intentions by Lilith Vincent
94% (17)
Brutal Intentions by Lilith Vincent
200 pages
Understanding Cell Properties and Functions
No ratings yet
Understanding Cell Properties and Functions
23 pages
Grade 10 Science Paper 2 Test 2021
No ratings yet
Grade 10 Science Paper 2 Test 2021
1 page
English Test on Canada and Ukraine
No ratings yet
English Test on Canada and Ukraine
4 pages
AI Tools Enhance Academic Efficiency
No ratings yet
AI Tools Enhance Academic Efficiency
16 pages
GST Invoice for Curtain Purchase
No ratings yet
GST Invoice for Curtain Purchase
1 page
Benefits and Challenges of Tourism
No ratings yet
Benefits and Challenges of Tourism
9 pages
Comprehensive Latin Phrases List
No ratings yet
Comprehensive Latin Phrases List
91 pages
IT Fundamentals Course Overview
No ratings yet
IT Fundamentals Course Overview
2 pages
Acutonics: Healing with Tuning Forks
100% (3)
Acutonics: Healing with Tuning Forks
13 pages
Tutorial Sweet Owlie
100% (1)
Tutorial Sweet Owlie
11 pages
SQL Operations on Loan Database
No ratings yet
SQL Operations on Loan Database
2 pages
Code of Conduct for Court Personnel
No ratings yet
Code of Conduct for Court Personnel
6 pages
Hilti PROFIS Engineering Anchor Design
No ratings yet
Hilti PROFIS Engineering Anchor Design
9 pages

Understanding Lexical Analysis in Compilers

Uploaded by

Understanding Lexical Analysis in Compilers

Uploaded by

Introduction of Lexical Analysis

1. Lexical Analysis can be implemented with the Deterministic finite Automata.

 Type token (id, number, real, . . . )

 Punctuation tokens (IF, void, return, . . . )

 Alphabetic tokens (keywords)

Keywords; Examples-for, while, if etc.

Identifier; Examples-Variable name, function name, etc.

Operators; Examples '+', '++', '-' etc.

Separators; Examples ',' ';' etc

 Comments, preprocessor directive, macros, blanks, tabs, newline, etc.

How Lexical Analyzer Works?

Suppose we pass a statement through lexical analyzer – a = b + c; It will generate token

All the valid tokens are:

'a' '=' '10' ';' 'return' '0' ';' '}'

Answer: Total number of token: 27.

 max is read by it and found to be a valid function name after reading (

 int is also a token , then again I as another token and finally ;

Answer: Total number of tokens 7:

int, max, ( ,int, i, ), ;

 We can represent in the form of lexemes and tokens as under

Lexemes Tokens Lexemes Tokens

while WHILE a IDENTIEFIER

>= COMPARISON – ARITHMETIC

Compiler Design - Syntax Analysis

CFG, on the other hand, is a superset of Regular Grammar, as depicted below:

A context-free grammar has four components:

 Deciding the non-terminal which is to be replaced.

 Deciding the production rule, by which, the non-terminal will be replaced.

The left-most derivation is:

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:

We take the left-most derivation of a + b * c

The left-most derivation is:

 All leaf nodes are terminals.

 All interior nodes are non-terminals.

 In-order traversal gives original input string.

The language generated by an ambiguous grammar is said to be inherently ambiguous. Ambiguity in

it will be evaluated as:

For example, (id + id) + id

For example, id ^ (id ^ id)

These methods decrease the chances of ambiguity in a language or its grammar.

(2) is an example of indirect-left recursion.

Removal of Left Recursion

One way to remove left recursion is to use the following technique:

is converted into following productions

Arrange non-terminals in some order like A1, A2, A3,…, An

for each i from 1 to n

for each j from 1 to i-1

replace each production of form Ai ⟹Aj𝜸

with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 |…| 𝜸

where Aj ⟹ δ1 | δ2|…| δn are current Aj productions

eliminate immediate left-recursion

after applying the above algorithm, should become

A' => αdA' | ε

If a top-down parser encounters a production like

The above productions can be written as

First and Follow Sets

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set

Look at the definition of FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.

 if α is a non-terminal and α → ℇ is a production, then FIRST(α) = { ℇ }.

 if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n and any FIRST(𝜸) contains t then t is in

First set can be seen as:

Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production

Algorithm for calculating Follow set:

 if α is a start symbol, then FOLLOW() = $

 if α is a non-terminal and has a production α → AB, then FIRST(B) is in FOLLOW(A) except ℇ.

 if α is a non-terminal and has a production α → AB, where B ℇ, then FOLLOW(A) is in

Follow set can be seen as: FOLLOW(α) = { t | S *αt*}

Limitations of Syntax Analyzers

 it cannot determine if a token is valid,

 it cannot determine if a token is declared before it is being used,

 it cannot determine if a token is initialized before it is being used,

 it cannot determine if an operation performed on a token type is valid or not.

Consider a simple arithmetic expression:

Follow set can be seen as: FOLLOW(α) = { t | S αt}