0% found this document useful (0 votes)
6 views15 pages

Nondeterministic Finite Automata in Compilers

compiler design

Uploaded by

shyamsingh841442
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views15 pages

Nondeterministic Finite Automata in Compilers

compiler design

Uploaded by

shyamsingh841442
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Compiler Design

The compiler is software that converts a program written in a high-level language (Source Language)
to a low-level language (Object/Target/Machine Language/0, 1’s).

Compilers and Translators

Compilers and translators are essential components in the development and execution of computer
programs. They serve to convert human-readable code (source code) written in high-level
programming languages into machine code that the computer's hardware can execute.

Key Concepts:

1. Compiler:

o A compiler is a type of translator that reads an entire high-level language program


(like C, C++, or Java) and converts it into machine code in one go.

o It produces an executable file (e.g., .exe or .out) that can be run independently.

o The compilation process involves several stages:

 Lexical Analysis: Breaking the source code into tokens.

 Syntax Analysis: Checking for syntactical correctness (grammar rules).

 Semantic Analysis: Ensuring that the program's logic makes sense.

 Optimization: Improving the efficiency of the code.

 Code Generation: Producing the final machine code.

2. Translator:

o A translator is a broader term that refers to any software that converts code from
one form to another. It includes:

 Compilers (convert high-level languages to machine code).

 Interpreters (execute the source code line by line, without producing an


intermediate machine code).

 Assemblers (convert assembly language into machine code).

 Decompilers (convert machine code back into a high-level language).

Need for Translators

1. Language Barrier:

o Computers only understand machine code, which is represented in binary (0s and
1s). High-level languages like Python, Java, or C++ are easier for humans to write and
understand but need to be translated into machine code for the computer to
execute the instructions.

2. Efficiency:
o Translators like compilers optimize the code, ensuring that it runs as efficiently as
possible by improving resource usage (memory, CPU time).

3. Error Detection:

o Translators help catch errors during the conversion process. Compilers provide
detailed error messages when the source code contains syntax or semantic issues.

4. Portability:

o High-level languages are designed to be portable, meaning the same program can
run on different hardware or operating systems. Translators allow this by converting
the same source code into machine code specific to the target system.

5. Ease of Development:

o High-level languages provide features like abstraction, object-oriented programming,


and type safety, which make development easier and faster. Translators enable the
execution of programs written in these advanced languages by converting them into
machine-understandable formats.

Types of Translators

1. Compiler:

o Translates the entire program at once.

o Faster execution but slower development cycle (since the whole program must be
compiled before running).

2. Interpreter:

o Translates and executes code line by line.

o Slower execution but faster for debugging and development.

3. Assembler:

o Converts assembly language (a low-level programming language) into machine code.

o Assembly language is closer to machine code and provides more control over
hardware.

4. Linker:

o Combines multiple object files generated by the compiler into a single executable.

Stages of Compiler Design


1. Lexical Analysis: The first stage of compiler design is lexical analysis, also known as scanning.
In this stage, the compiler reads the source code character by character and breaks it down
into a series of tokens, such as keywords, identifiers, and operators. These tokens are then
passed on to the next stage of the compilation process.

2. Syntax Analysis: The second stage of compiler design is syntax analysis, also known as
parsing. In this stage, the compiler checks the syntax of the source code to ensure that it
conforms to the rules of the programming language. The compiler builds a parse tree, which
is a hierarchical representation of the program’s structure, and uses it to check for syntax
errors.

3. Semantic Analysis: The third stage of compiler design is semantic analysis. In this stage, the
compiler checks the meaning of the source code to ensure that it makes sense. The compiler
performs type checking, which ensures that variables are used correctly and that operations
are performed on compatible data types. The compiler also checks for other semantic errors,
such as undeclared variables and incorrect function calls.

4. Code Generation: The fourth stage of compiler design is code generation. In this stage, the
compiler translates the parse tree into machine code that can be executed by the computer.
The code generated by the compiler must be efficient and optimized for the target platform.

5. Optimization: The final stage of compiler design is optimization. In this stage, the compiler
analyzes the generated code and makes optimizations to improve its performance. The
compiler may perform optimizations such as constant folding, loop unrolling, and function
inlining.

Overall, compiler design is a complex process that involves multiple stages and requires a deep
understanding of both the programming language and the target platform. A well-designed compiler
can greatly improve the efficiency and performance of software programs, making them more useful
and valuable for users.

Compiler

 Cross Compiler that runs on a machine ‘A’ and produces a code for another machine ‘B’. It is
capable of creating code for a platform other than the one on which the compiler is running.

 Source-to-source Compiler or transcompiler or transpiler is a compiler that translates source


code written in one programming language into the source code of another programming
language.
Lexical Analysis
Lexical Analysis is the first phase of the compiler also known as a scanner. It converts the High level
input program into a sequence of Tokens.

1. Lexical Analysis can be implemented with the Deterministic finite Automata.

2. The output is a sequence of tokens that is sent to the parser for syntax analysis

What is
a Token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages. Example of tokens:

 Type token (id, number, real, . . . )

 Punctuation tokens (IF, void, return, . . . )

 Alphabetic tokens (keywords)

Keywords; Examples-for, while, if etc.

Identifier; Examples-Variable name, function name, etc.

Operators; Examples '+', '++', '-' etc.

Separators; Examples ',' ';' etc

Example of Non-Tokens:

 Comments, preprocessor directive, macros, blanks, tabs, newline, etc.

Lexeme: The sequence of characters matched by a pattern to form the corresponding token or a
sequence of input characters that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .

How Lexical Analyzer Works?

1. Input preprocessing: This stage involves cleaning up the input text and preparing it for lexical
analysis. This may include removing comments, whitespace, and other non-essential
characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is
usually done by matching the characters in the input text against a set of patterns or regular
expressions that define the different types of tokens.

3. Token classification: In this stage, the lexer determines the type of each token. For example,
in a programming language, the lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid according to the rules
of the programming language. For example, it might check that a variable name is a valid
identifier, or that an operator has the correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next
stage of compilation or interpretation.

 The lexical analyzer identifies the error with the help of the automation machine and the
grammar of the given language on which it is based like C, C++, and gives row number and
column number of the error.

Suppose we pass a statement through lexical analyzer – a = b + c; It will generate token


sequence like this: id=id+id; Where each id refers to it’s variable in the symbol table
referencing all details For example, consider the program

int main()

// 2 variables

int a, b;
a = 10;

return 0;

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'

'a' '=' '10' ';' 'return' '0' ';' '}'

Advantages

1. Simplifies Parsing: Breaking down the source code into tokens makes it easier for computers
to understand and work with the code. This helps programs like compilers or interpreters to
figure out what the code is supposed to do. It’s like breaking down a big puzzle into smaller
pieces, which makes it easier to put together and solve.

2. Error Detection: Lexical analysis will detect lexical errors such as misspelled keywords or
undefined symbols early in the compilation process. This helps in improving the overall
efficiency of the compiler or interpreter by identifying errors sooner rather than later.

3. Efficiency: Once the source code is converted into tokens, subsequent phases of compilation
or interpretation can operate more efficiently. Parsing and semantic analysis become faster
and more streamlined when working with tokenized input.

Disadvantages

1. Limited Context: Lexical analysis operates based on individual tokens and does not consider
the overall context of the code. This can sometimes lead to ambiguity or misinterpretation of
the code’s intended meaning especially in languages with complex syntax or semantics.

2. Overhead: Although lexical analysis is necessary for the compilation or interpretation


process, it adds an extra layer of overhead. Tokenizing the source code requires additional
computational resources which can impact the overall performance of the compiler or
interpreter.

3. Debugging Challenges: Lexical errors detected during the analysis phase may not always
provide clear indications of their origins in the original source code. Debugging such errors
can be challenging especially if they result from subtle mistakes in the lexical analysis
process.
Input Buffering in Compiler Design
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers
begin ptr(bp) and forward ptr(fp) to keep track of the pointer of the input scanned.

Input buffering is an important concept in compiler design that refers to the way in which the
compiler reads input from the source code. In many cases, the compiler reads input one character at
a time, which can be a slow and inefficient process. Input buffering is a technique that allows the
compiler to read input in larger chunks, which can improve performance and reduce overhead.

1. The basic idea behind input buffering is to read a block of input from the source code into a
buffer, and then process that buffer before reading the next block. The size of the buffer can
vary depending on the specific needs of the compiler and the characteristics of the source
code being compiled. For example, a compiler for a high-level programming language may
use a larger buffer than a compiler for a low-level language, since high-level languages tend
to have longer lines of code.

2. One of the main advantages of input buffering is that it can reduce the number of system
calls required to read input from the source code. Since each system call carries some
overhead, reducing the number of calls can improve performance. Additionally, input
buffering can simplify the design of the compiler by reducing the amount of code required to
manage input.

However, there are also some potential disadvantages to input buffering. For example, if the size of
the buffer is too large, it may consume too much memory, leading to slower performance or even
crashes. Additionally, if the buffer is not properly managed, it can lead to errors in the output of the
compiler.

ring can reduce the number of system calls required to read input from the source code, which can
improve performance.
Input buffering can simplify the design of the compiler by reducing the amount of code required to
manage input.

Advantages:

Input buffering can reduce the number of system calls required to read input from the source code,
which can improve performance.
Input buffering can simplify the design of the compiler by reducing the amount of code required to
manage input.

Disadvantages:

If the size of the buffer is too large, it may consume too much memory, leading to slower
performance or even crashes.
If the buffer is not properly managed, it can lead to errors in the output of the compiler.
Overall, the advantages of input buffering generally outweigh the disadvantages when used
appropriately, as it can improve performance and simplify the compiler design.
What are Tokens?

A token is the smallest individual element of a program that is meaningful to the compiler. It cannot
be further broken down. Identifiers, strings, keywords, etc., can be the example of the token. In the
lexical analysis phase of the compiler, the program is converted into a stream of tokens.

Different Types of Tokens

There can be multiple types of tokens. Some of them are-

1. Keywords

Keywords are words reserved for particular purposes and imply a special meaning to the compilers.
The keywords must not be used for naming a variable, function, etc.

2. Identifier

The names given to various components in the program, like the function's name or variable's name,
etc., are called identifiers. Keywords cannot be identifiers.

3. Operators

Operators are different symbols used to perform different operations in a programming language.

4. Punctuations

Punctuations are special symbols that separate different code elements in a programming language.

Consider the following line of code in C++ language -

int x = 45;

The above statement has multiple tokens, which are-

Keywords: int

Identifier: x , 45

Operators: =

Punctuators: ;

Specification of Token

In compiler design, there are three specifications of token-

1. String

2. Language

3. Regular Expressions

1. Strings

Strings are a finite set of symbols or characters. These symbols can be a digit or an alphabet. There is
also an empty string which is denoted by ε.

Operations on String
The operations that can be performed on a string are-

1. Prefix

The prefix of String S is any string that is extracted by removing zero or more characters from the end
of string S. For example, if the String is "NINJA", the prefix can be "NIN" which is obtained by
removing "JA" from that String. A string is a prefix in itself.

Proper prefixes are special types of prefixes that are not equal to the String itself or equal to ε. We
obtain it by removing at least one character from the end of the String.

2. Suffix

The suffix of string S is any string that is extracted by removing any number of characters from the
beginning of string S. For example, if the String is "NINJA", the suffix can be "JA," which is obtained
by removing "NIN" from that String. A string is a suffix of itself.

Proper suffixes are special types of suffixes that are not equal to the String itself or equal to ε. It is
obtained by removing at least one character from the beginning of the String.

3. Substring

A substring of a string S is any string obtained by removing any prefixes and suffixes of that String.
For example, if the String is "AYUSHI," then the substring can be "US," which is formed by removing
the prefix "AY" and suffix "HI." Every String is a substring of itself.

Proper substrings are special types that are not equal to the String itself or equal to ε. It is obtained
by removing at least one prefix or suffix from the String.

4. Subsequence

The subsequence of the String is a string obtained by eliminating zero or more symbols from the
String. The symbols that are removed need not be consecutive. For example, if the String is
"NINJAMANIA," then a subsequence can be "NIJAANIA," which is produced by removing "N" and
"M."

Proper subsequences are special subsequences that are not equal to the String itself or equal to ε. It
is obtained by removing at least one symbol from the String.

5. Concatenation

Concatenation is defined as the addition of two strings. For example, if we have two strings S=" Cod"
and T=" ing," then the concatenation ST would be "Coding."

2. Language

A language can be defined as a finite set of strings over some symbols or alphabets.

Operations on Language

The following operations are performed on a language in the lexical analysis phase-

1. Union

Union is one of the most common operations we perform on a set. In terms of languages also, it will
hold a similar meaning.
Suppose there are two languages, L and S. Then the union of these two languages will be

L ∪ S will be equal to { x | x belongs to either L or S }

For example If L = {a, b} and S = {c, d}Then L ∪ S = {a, b, c, d}

2. Concatenation

Concatenation links two languages by linking the strings from one language to all the strings of the
other language.

If there are two languages, L and S, then the concatenation of L and S will be LS equal to { ls |
where l belongs to L and s belongs to S }.

For example, there are two languages, L and S, such that { L`, L"} is the set of strings belonging to
language L and { S,` S, "S`"} is the set of strings belonging to language S.

Then the concatenation of L and S will be LS will be {L'S`, L'S ", L``S`, L``S "}

3. Kleene closure

Kleene closure of a language L is denoted by L*provides a set of all the strings that can be obtained
by concatenating L zero or more times.

If L = {a, b}

then L* = {ε, a, b, aa, bb, aaa, bbb, …}

Positive Closure

L+ denotes the Positive closure of a language L and provides a set of all the strings that can be
obtained by concatenating L one or more times.

If L = {a, b}

then L+ = {a, b, aa, bb, aaa, bbb, …}

3. Regular Expression

Regular expressions are strings of characters that define a searching pattern with the help of which
we can form a language, and each regular expression represents a language.

A regular expression r can denote a language L(r) which can be built recursively over the smaller
regular expression by following some rules.

Writing Regular Expressions

Following symbols are used very frequently to write regular expressions

 The asterisk symbol ( * ): It is used in our regular expression to instruct the compiler that the
symbol that preceded the * symbol can be repeated any number of times in the pattern. For
example, if the expression is ab*c, then it gives the following string- ac, abc, abbc, abbbc,
abbbbbc.. and so on.

 The plus symbol ( + ): It is used in our regular expression to tell the compiler that the symbol
that preceded + can be repeated one or more times in the pattern. For example, if the
expression is ab+c, then it gives the following string- abc, abbc, abbbc, abbbbbc.. and so on.
 Wildcard Character ( . ): The. A symbol, also known as the wildcard character, is a character
in our regular expression that can be replaced by another character.

 Character Class: It is a way of representing multiple characters. For example, [a – z] denotes


the regular expression a | b | c | d | ….|z.

The following rules are used to define a regular expression r over some alphabet Σ and the languages
denoted by these regular expressions.

 It ε is a regular expression that denotes a language L(ε). The language L(ε) has a set of strings
{ε} which means that this language has a single string which is the empty String.

 If there is a symbol 'a' in Σ, then 'a' is a regular expression that denotes a language L(a). The
language L(a) = {a}, i.e., the language has only one String of length, and the String holds 'a' in
the first position.

 Consider the two regular expressions r and s then:

r|s denotes the language L(r) ∪ L(s).

(r) (s) denotes the language L(r) ⋅ L(s).

(r)* denotes the language (L(r))*.

(r)+ denotes the language L(r)

Finite Automata

Finite Automata(FA) is the simplest machine to recognize [Link] is used to characterize a Regular
Language, for example: /baa+!/.
Also it is used to analyze and recognize Natural language Expressions. The finite automata or finite
state machine is an abstract machine that has five elements or tuples. It has a set of states and rules
for moving from one state to another but it depends upon the applied input symbol. Based on the
states and the set of rules the input string can be either accepted or rejected. Basically, it is an
abstract model of a digital computer that reads an input string and changes its internal state
depending on the current input symbol. Every automaton defines a language i.e. a set of strings it
accepts. The following figure shows some essential features of general automation.

Figure: Features of Finite Automata

The above figure shows the following features of automata:

1. Input

2. Output

3. States of automata

4. State relation

5. Output relation

A Finite Automata consists of the following:

Q : Finite set of states.

Σ : set of Input Symbols.

q : Initial state.

F : set of Final States.

δ : Transition Function.

The formal specification of the machine is

{ Q, Σ, q, F, δ }

FA is characterized into two types:

1) Deterministic Finite Automata (DFA):

DFA consists of 5 tuples {Q, Σ, q, F, δ}.

Q : set of all states.

Σ : set of input symbols. ( Symbols which machine takes as input )


q : Initial state. ( Starting state of a machine )

F : set of final state.

δ : Transition Function, defined as δ : Q X Σ --> Q.

In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also in DFA null (or ϵ ) move is not allowed, i.e., DFA
cannot change state without any input character.

For example, construct a DFA which accept a language of all strings ending with ‘a’.
Given: Σ = {a,b}, q = {q0}, F={q1}, Q = {q0, q1}

First, consider a language set of all the possible acceptable strings in order to construct an accurate
state transition diagram.
L = {a, aa, aaa, aaaa, aaaaa, ba, bba, bbbaa, aba, abba, aaba, abaa}
Above is simple subset of the possible acceptable strings there can many other strings which ends
with ‘a’ and contains symbols {a,b}.

Fig 1. State Transition Diagram for DFA with Σ = {a, b}

Strings not accepted are,


ab, bb, aab, abbb, etc.

State transition table for above automaton,

State\Symbol a b

q0 q1 q0

q1 q1 q0

One important thing to note is, there can be many possible DFAs for a pattern. A DFA with a
minimum number of states is generally preferred.

2) Nondeterministic Finite Automata(NFA): NFA is similar to DFA except following additional


features:

1. Null (or ϵ) move is allowed i.e., it can move forward without reading symbols.

2. Ability to transmit to any number of states for a particular input.

However, these above features don’t add any power to NFA. If we compare both in terms of power,
both are equivalent.
Due to the above additional features, NFA has a different transition function, the rest is the same as
DFA.

δ: Transition Function

δ: Q X (Σ U ϵ ) --> 2 ^ Q.

As you can see in the transition function is for any input including null (or ϵ), NFA can go to any state
number of states. For example, below is an NFA for the above problem.

Fig 2. State Transition Diagram for NFA with Σ = {a, b}

State Transition Table for above Automaton,

State\Symbol a b

q0 {q0,q1} q0

q1 φ φ

One important thing to note is, in NFA, if any path for an input string leads to a final state, then the
input string is accepted. For example, in the above NFA, there are multiple paths for the input string
“00”. Since one of the paths leads to a final state, “00” is accepted by the above NFA.

Some Important Points:

 Justification:

Since all the tuples in DFA and NFA are the same except for one of the tuples, which is Transition
Function (δ)

In case of DFA

δ : Q X Σ --> Q

In case of NFA

δ : Q X Σ --> 2Q

Now if you observe you’ll find out Q X Σ –> Q is part of Q X Σ –> 2Q.

On the RHS side, Q is the subset of 2Q which indicates Q is contained in 2Q or Q is a part of 2Q,
however, the reverse isn’t true. So mathematically, we can conclude that every DFA is NFA but not
vice-versa. Yet there is a way to convert an NFA to DFA, so there exists an equivalent DFA for every
NFA.
1. Both NFA and DFA have the same power and each NFA can be translated into a DFA.

2. There can be multiple final states in both DFA and NFA.

3. NFA is more of a theoretical concept.

4. DFA is used in Lexical Analysis in Compiler.

5. If the number of states in the NFA is N then, its DFA can have maximum 2N number of states.

You might also like