0% found this document useful (0 votes)
8 views17 pages

Compiler Phases and Lexical Analysis Guide

The document provides an overview of compilers, including their phases, such as lexical and syntax analysis, and the concepts of tokens and lexemes. It explains the differences between single-pass and multi-pass compilers, as well as the bootstrapping process in compiler design. Additionally, it covers tools like LEX and YACC for generating lexical analyzers and parsers, respectively.

Uploaded by

alkamansi03
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views17 pages

Compiler Phases and Lexical Analysis Guide

The document provides an overview of compilers, including their phases, such as lexical and syntax analysis, and the concepts of tokens and lexemes. It explains the differences between single-pass and multi-pass compilers, as well as the bootstrapping process in compiler design. Additionally, it covers tools like LEX and YACC for generating lexical analyzers and parsers, respectively.

Uploaded by

alkamansi03
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT I

Compiler
 Compiler is a translator program that translates a program written in (High Level
Language)) the source program and translates it into an equivalent program
in (Machine Level Language) the target program.
 As an important part of a compiler is error showing to the programmer.
Assembler
 Programs known as assembler were written to automate the translation of assembly
language into machine language
language.
The input to an assembler program is called source program,, the output is a machine
language translation
Phases of a Compiler
 A compiler operates in phases. A phase is a logically interrelated operation that
takes source program in one representation and produces output in another
representation.

Tokens and lexemes


In lexical analysis, tokens and lexemes are key concepts. A token is a category, like a
keyword or identifier, representing units of meaning. A pattern defines the structure that
matches a token, such as a regular expression for an identifier. A lexeme is the actual
sequence of characters in the source code that matches a token’s pattern.
In programming, a token is the smallest unit of meaningful data; it may be an identifier,
keyword, operator, or symbol. A token represents a series or sequence of characters that
cannot be decomposed further. In languages such as C, some examples of tokens would
include:
 Keywords : Those reserved words in C like ` int `, ` char `, ` float `, ` const `, ` goto `,
etc.
 Identifiers: Names of variables and user-defined functions.
 Operators : ` + `, ` – `, ` * `, ` / `, etc.
 Delimiters /Punctuators: Symbols used such as commas ” , ” semicolons ” ; ” braces
` {} `.

Example 1:
int a = 10; //Input Source code
Tokens
int (keyword), a(identifier), =(operator), 10(constant) and ;(punctuation-semicolon)
Answer – Total number of tokens = 5

Example 2:
int main() {
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Welcome to GeeksforGeeks!");
return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', ' "Welcome to GeeksforGeeks!" ',
')', ';', 'return', '0', ';', '}'

Answer – Total number of tokens = 14

Lexeme
A lexeme is a sequence of source code that matches one of the predefined patterns and
thereby forms a valid token. For example, in the expression `x + 5`, both `x` and `5` are
lexemes that correspond to certain tokens. These lexemes follow the rules of the language
in order for them to be recognized as valid tokens.
A lexeme is an actual character sequence forming a specific instance of a token, such as
num.

Lexeme- A lexeme is a string of character that is the lowest level syntactic unit in the
programming language.

Token- The token is a syntactic category that forms a class of lexemes that means which
class the lexeme belong is it a keyword or identifier or anything else. One of the major
tasks of the lexical analyzer is to create a pair of lexemes and tokens, that is to collect all the
characters.
Let us take an example:-
if(y<= t)
y=y-3;

Token Lexeme
KEYWORD if
LEFT PARENTHESIS (
IDENTIFIER y
< =COMPARISON =
IDENTIFIER t
RIGHT PARENTHESIS )

IDENTIFIER y

ASSGNMENT =

IDENTIFIER y

ARITHMATIC -

INTEGER 3

SEMICOLON ;

Example
sum = 3 + 2;
Tokenized and represented by the following table:
Lexeme Token category
------------------------------
sum | Identifier
= | Assignment operator
3 | Integer literal
+ | Addition operator
2 | Integer literal
; | End of statement

Lexical Analysis
 Lexical analysis is the first phase of a compiler. It takes the source code from user and
produces the output in the form of tokens. The lexical analyzer breaks these syntaxes
into a series of tokens, by removing any whitespace or comments in the source code.

x = a+b*50
x -> Identifier- (id, 1)
= -> Operator - Assignment
a -> Identifier- (id, 2)
+ -> Operator - Binary Addition
b -> Identifier- (id, 3)
* -> Operator - Multiplication
50 -> Constant - Integer

Now, the final tokenized expression is given below.


(id, 1) = (id, 2) + (Id, 3)*50

Syntax Analysis
 The second stage of translation is called syntax analysis or parsing. In this
phase expressions, statements, declarations are identified by using the results of
lexical analysis.
 A Syntax analyzer creates the syntactic structure of the given program.
Compiler Pass
A Compiler pass refers to the traversal of a compiler through the entire
program. Compiler passes are of two types Single Pass Compiler, and Two Pass
Compiler or Multi-Pass Compiler. These are explained as follows.
Types of Compiler Pass
1. Single Pass Compiler
If we combine or group all the phases of compiler design in a single module (unit) known
as a single pass compiler.

Single Pass Compiler: In the above diagram, there are all 6 phases are grouped in a single
module, some points of the single pass compiler are as:
 A one-pass/single-pass compiler is a type of compiler that passes through the part of
each compilation unit exactly once.
 Single pass compiler is faster and smaller than the multi-pass compiler.
 A disadvantage of a single-pass compiler is that it is less efficient in comparison with
the multipass compiler.
 A single pass compiler is one that processes the input exactly once, so going directly
from lexical analysis to code generator, and then going back for the next read.
Note: Single pass compiler almost never done, early Pascal compiler did this as an
introduction.
Problems with Single Pass Compiler
 We can not optimize very well due to the context of expressions are limited.
 As we can’t back up and process, it again sogrammar should be limited or simplified.
 Command interpreters such as bash/sh/tcsh can be considered Single pass compilers,
but they also execute entries as soon as they are processed.

Two-Pass compiler or Multi-Pass compiler


A Two pass/multi-pass Compiler is a type of compiler that processes the source
code or abstract syntax tree of a program multiple times. In multi-pass Compiler, we divide
phases into two or more passes as:

First Pass is referred as


 Front end
 Analytic part
 Platform independent
Second Pass is referred as
 Back end
 Synthesis Part
 Platform Dependent
Problems that can be Solved With Multi
Multi-Pass Compiler

First Pass: If we want to design a compiler for a different programming language for the
same machine. In this case for each programming language, there is a requirement to make
the Front end/first pass for each of them and only one Back end/second pass as:

Second Pass: If we want to design a compiler for the same programming language for
different machines/systems. In this case, we make different Back end for different
Machine/system and make only one Front end for the same programming language as:
Conclusion
In conclusion, the choice between a single-pass and a two-pass compiler depends on
specific requirements and trade-offs. Single-pass compilers are faster and more compact but
have limitations in optimization and grammar complexity. Two-pass compilers, on the
other hand, offer greater flexibility for different programming languages and machine
systems but come at the cost of additional processing. Understanding these distinctions is
crucial for designing effective compilers.

Bootstrapping

Bootstrapping is an important technique in compiler design, where a basic compiler is


used to create a more advanced version of itself. This process helps in building compilers
for new programming languages and improving the ones already in use. By starting with a
simple compiler, bootstrapping allows gradual improvements and makes the compiler more
efficient over time.
 Bootstrapping relies on the idea of a self-compiling compiler, where each iteration
improves the compiler’s ability to handle more complex code.
 It simplifies the development cycle, allowing incremental improvements and faster
deployment of more robust compilers.
 Many successful programming languages, including C and Java, have used
bootstrapping techniques during their development.

A compiler can be used three distinct languages.


Source Language (S): This is the input language provided to the compiler. It represents the
code that needs to be compiled.
Target Language (T): This is the output language produced by the compiler after the
compilation process. It is the language that the source code is translated into.
Implementation Language (I): This is the language used to design and implement the
compiler itself.

The T-Diagram Representation


The relationship between these languages can be visually represented using T-diagrams.
The T-diagram illustrates how the source language (S) is transformed into the target
language (T) through the implementation language (I).
A compiler can be represented using a T Diagram. In this diagram, the source language of
the compiler is positioned at the top-left, the target language (the language produced by the
compiler) is placed at the top-right, and the language in which the compiler is implemented
is shown at the bottom.
Working of Bootstrapping
Bootstrapping is the process of creating compilers. It involves a methodology where a
slightly more complicated compiler is created using a simple language (such as assembly
language). This slightly more complicated compiler, in turn, is used to create an even more
advanced compiler, and this process continues until the desired result is achieved.
Here’s a step-by-step look at how bootstrapping works in compiler design:

Step 1: Start with a Basic Compiler


The first step is to create a basic compiler that can handle the most essential features of a
programming language. This simple compiler is often written in assembly language or
machine language to make it easier to build.

Step 2: Use the Basic Compiler to Create a More Advanced Version


Once the basic compiler is ready, it is used to compile a more advanced version of itself.
This new version can handle more complex features, like better error checking and
optimizations.

Step 3: Gradually Improve the Compiler


With each new version, the compiler becomes more capable. The process is repeated, and
each iteration adds more features, making the compiler stronger and more efficient.
For example, let’s assume a compiler which takes C language as input and generates an
assembly language as an output.
1. To generate this compiler we first write a compiler for a small subset of C language i.e.
C0 in assembly language. Subset of C means C language with less functionality.
Here in the T diagram, source language is subset of C (C0), target language is Assembly
language and implementation language is also Assembly language.
2. Then using the C0 language, we create a compiler for the C language. This compiler C0,
takes C language as source language and generates assembly language as target language as
shown below.

By the help bootstrapping we generated compiler for C language in C language itself i.e.
self-compiling compiler.

Cross-compiler
A cross compiler is a compiler capable of creating executable code for a platform other
than the one on which the compiler is running.
Across compiler is a compiler that runs on one m
machine
achine and produces object code for another
machine. For example, a compiler that runs on a windows 7 PC but generates code that runs
on Android Smartphone is a cross compiler.
The cross compiler is used to implement the compiler which is characterized by three distinct
languages:
Source Language (S)
Target Language (T)
Implementation Language (I)

LEX in Compiler Design


It is a tool or software which automatically generates a lexical analyzer (finite Automata). It
takes as its input a LEX source program and produces lexical Analyzer as its output. Lexical
Analyzer will convert the input string entered by the user into tokens as its output.

Lex is a tool or a computer program that generates Lexical Analyzers (converts the stream
of characters into tokens). The Lex tool itself is a compiler. The Lex compiler takes the
input and transforms that input into input patterns.
Function of Lex
1. In the first step the source code which is in the Lex language having the file name ‘File.l’
gives as input to the Lex Compiler commonly known as Lex to get the output as [Link].c.
2. After that, the output [Link].c will be used as input to the C compiler which gives the
output in the form of an ‘[Link]’ file, and finally, the output file [Link] will take the stream of
character and generates tokens as output.
Lex File Format: A Lex program consists of three parts and is separated by % delimiters:-

Declarations
%%
Translation rules
%%
Auxiliary procedures

Declarations: The declarations include declarations of variables.

Transition rules: These rules consist of Pattern and Action.

Auxiliary procedures: The Auxilary section holds auxiliary functions used in the actions.

For example:
declaration
number[0-9]
%%
translation
if {return (IF);}
%%
auxiliary function
int numberSum()

Example
/*** Definition section ***/

%{
/* C code to be copied verbatim */
#include <stdio.h>
%}

%%
/*** Rules section ***/

/* [0-9]+ matches a string of one or more digits */


[0-9]+ {
/* yytext is a string containing the matched text. */
printf("Saw an integer: %s\n", yytext);
}

.|\n { /* Ignore all other characters. */ }

%%
/*** C Code section ***/

int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
}

YACC in Compiler Design


YACC (Yet Another Compiler Compiler) is a tool used to generate a parser. YACC translates
a given Context Free Grammar (CFG) specifications (input in input_file.y) into a C
implementation ([Link].c) of a corresponding push down automaton (i.e., a finite state machine
with a stack). This C program when compiled yields an executable parser.
 An input with a .y extension is given.
 The file is invoked, and 2 files, [Link].h and [Link].c, are created. These files contain long
codes implementing the LARl (1) Parser for grammar.
 This file then provides yyparse.y, which tries to parse a valid sentence successfully.
The structure of YACC programs
A YACC program consists of three sections: Declarations, Rules and Auxiliary functions.
(Note the similarity with the structure of LEX programs).

DECLARATIONS

%%

RULES

%%

AUXILIARY FUNCTIONS
Examples of DFA and NFA
Example
Design a FA with ∑ = {0, 1} accepts those string which starts with 1 and ends with 0.
The FA will have a start state q0 from which only the edge with input 1 will go to the next
state.

Example
Design FA with ∑ = {0, 1} accepts even number of 0's and even number of 1's.
Solution:
This FA will consider four different sstages
tages for input 0 and input 1. The stages could be:

Example
Design FA with ∑ = {0, 1} accepts the set of all strings with three consecutive 0's.
Solution:
The strings that
at will be generated for this particular languages are 000, 0001, 1000, 10001, ....
in which 0 always appears in a clump of 3. The transition graph is as follows:

Example
Design a FA with ∑ = {0, 1} accepts the strings with an even numb
number
er of 0's followed by
single 1.
Solution:
The DFA can be shown by a transition diagram as:
Design an NFA with ∑ = {0, 1} accepts all string in which the third symbol from the right
end is always 0.
Solution:

Thus we get the third symbol from the right end as '0' always. The NFA can be:

Example
Convert
ert the given NFA to DFA.

Solution: For the given transition diagram we will first construct the transition table.

DFA
Example
Convert the given NFA to DFA.

DFA

You might also like