Lecture 1

The document provides an overview of compiler design, detailing the parts of a compiler including the preprocessor, lexical analyzer, parser, optimizer, and back end. It explains the role of the lexical analyzer in tokenizing input and the parser's function in generating parse trees from tokens. Additionally, it discusses formal grammars, language representation, and basic string operations relevant to lexical analysis.

Uploaded by

Nourhan Magdy

Chapter one

Basic Concepts

1.1 The Parts of a Compiler

A compiler is defined as a program or group of programs that
translates one language into another. In this case, the source code
of a high-level computer language is translated into assembly
language.
The assembler, linker, and so forth are not considered to be part
of the compiler.
The structure of a typical four-pass compiler is shown in Figure
1.1.
1- The preprocessor is the first pass.
2- The second pass is the heart of the compiler. It is made up
of a lexical analyzer, parser, and code generator, and it
translates the source code into an intermediate language
that is much like assembly language.
3- The third pass is the optimizer, which improves the quality
of the generated intermediate code.
4- The fourth pass, the back end, translates the optimized code
to real assembly language or some form of binary,
executable code.
Many compilers don't have preprocessors; others generate
assembly language in the second pass.

1.1.1 The Lexical Analyzer

A phase is an independent task used in the compilation process.
Typically, several phases are combined into a single pass. The
lexical analyzer phase of a compiler (often called a scanner or
tokenizer) translates the input into a form that's more useable by
the rest of the compiler.
The lexical analyzer looks at the input stream as a collection of
basic language elements called tokens.
That is, a token is an indivisible lexical unit. In C, keywords
like while or for are tokens, symbols like >, >=, and >> are
tokens, names and numbers are tokens, and so forth.
The original string that comprises the token is called a lexeme.
Note that there is not a one-to-one relationship between lexemes
and tokens: a name or number token can match many possible
lexemes, whereas a while token always matches a single lexeme.
The parser calls the lexical analyzer every time it needs a new
token, and the analyzer returns that token and the associated
lexeme.
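The token/lexeme distinction can be sketched in a few lines of Python. This is only an illustration, not the scanner of any particular compiler: the token names (KEYWORD, NUMBER, NAME, OP, SKIP) and the function name tokenize are assumptions made for the example. The generator hands back one (token, lexeme) pair each time the caller asks for the next one, much as a parser would request tokens on demand.

```python
import re

# Hypothetical token names. Order matters: longer operators such as ">>"
# and ">=" must be tried before ">", and keywords before general names.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:while|for)\b"),
    ("NUMBER",  r"\d+"),
    ("NAME",    r"[A-Za-z_]\w*"),
    ("OP",      r">>|>=|>"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (token, lexeme) pairs, one each time the consumer asks."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":       # white space produces no token
            yield match.lastgroup, match.group()


pairs = list(tokenize("while count >= 10"))
# pairs == [("KEYWORD", "while"), ("NAME", "count"), ("OP", ">="), ("NUMBER", "10")]
```

Here the NAME token could have matched any identifier, while the KEYWORD token for while matches exactly one lexeme.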

1.1.2 The Parser

To parse an English sentence is to break it up into its
component parts in order to analyze it grammatically. For
example, the sentence "Jane sees Spot run" is broken up into a
subject ("Jane") and a predicate ("sees Spot run"). The predicate
is broken up into a verb ("sees"), a direct object ("Spot"), and a
participle that modifies the direct object ("run").

A compiler performs this same process (of decomposing a
sentence into its component parts) in the parser phase, though it
usually represents the parsed sentence in a tree form rather than
as a sentence diagram.
The sentence diagram itself shows the syntactic relationships
between the parts of the sentence, so this kind of graph is
formally called a syntax diagram (or, if it's in tree form, a syntax
tree). You can expand the syntax tree to show the grammatical
structure as well as the syntactic structure. This second
representation is called a parse tree. A parse tree for our earlier
sentence diagram is shown in Figure 1.3. Syntax and parse trees
for the expression A*B+C*D are shown in Figure 1.4.

A tree structure is used here primarily because it's easy to
represent in a computer program, unlike a sentence diagram.
A parser is a group of subroutines that converts a token stream
into a parse tree, and a parse tree is a structural representation of
the sentence being parsed.
The parse tree represents the sentence in a hierarchical fashion,
moving from a general description of the sentence (at the root of
the tree) down to the specific sentence being parsed (the actual
tokens) at the leaves.
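To illustrate how easily a tree is represented in a program, the following sketch stores a syntax tree for A*B+C*D as nested tuples. The shape is an assumption based on the usual precedence of * over +, and the helper names inorder and show are made up for this example.

```python
# A syntax tree for A*B+C*D as nested tuples: (operator, left, right).
# Plain strings stand in for the operand tokens at the leaves.
tree = ("+", ("*", "A", "B"), ("*", "C", "D"))


def inorder(node):
    """Return the symbols of the tree in left-to-right input order."""
    if isinstance(node, str):
        return [node]
    operator, left, right = node
    return inorder(left) + [operator] + inorder(right)


def show(node, depth=0):
    """Return an indented rendering of the tree, one node per line."""
    if isinstance(node, str):
        return "  " * depth + node + "\n"
    operator, left, right = node
    return ("  " * depth + operator + "\n"
            + show(left, depth + 1) + show(right, depth + 1))


symbols = inorder(tree)
# symbols == ["A", "*", "B", "+", "C", "*", "D"]
```

Walking the tree in order recovers the original expression, while the nesting records that each multiplication is grouped before the addition.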

There are advantages and disadvantages to an intermediate-language
approach.
The main disadvantage is lack of speed.
A parser that goes straight from tokens to binary object code
will be very fast, since an extra stage to process the intermediate
code can often double the compile time.
The advantages, which are usually enough to justify the loss of
speed, are optimization and flexibility. A few optimizations,
such as simple constant folding (the evaluation of constant
expressions at compile time rather than run time), can be done in
the parser.
Most optimizations, however, are difficult, if not impossible, for
a parser to perform. Consequently, parsers for optimizing
compilers output an intermediate language that's easy for a
second pass to optimize.
Intermediate languages give you flexibility as well.
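As a rough illustration of constant folding, the sketch below folds constant subexpressions in a small expression tree. The tuple layout (operator, left, right) and the function name fold are assumptions for this example, and only + and * are handled.

```python
import operator

# Operators this sketch knows how to evaluate at "compile time".
FOLDABLE = {"+": operator.add, "*": operator.mul}


def fold(node):
    """Replace any subtree whose operands are both constants by its value."""
    if not isinstance(node, tuple):
        return node                      # a constant or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if op in FOLDABLE and isinstance(left, int) and isinstance(right, int):
        return FOLDABLE[op](left, right)
    return (op, left, right)


folded = fold(("+", ("*", 2, 3), "x"))
# folded == ("+", 6, "x")
```

The subexpression 2*3 is evaluated once during translation, so only the addition with x remains to be done at run time.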

1.2 Representing Computer Languages

1.2.1 Grammars and Parse Trees

The most common method used to describe a programming
language is a formal grammar.
Formal grammars are most often represented in a modified
Backus-Naur Form (also called Backus-Normal Form), BNF for
short. A strict BNF representation starts with a set of tokens,
called terminal symbols, and a set of definitions, called
nonterminal symbols.
The definitions create a system in which every legal structure in
the language can be represented. One operator is supported, the
::= operator, translated by the phrase "is defined as" or "goes to."
For example, the following BNF rule might start a grammar for
an English sentence:
sentence ::= subject predicate
A sentence is defined as a subject followed by a predicate. You
can also say "a sentence goes to a subject followed by a
predicate."
Each rule of this type is called a production.
The nonterminal to the left of the ::= is the left-hand side and
everything to the right of the ::= is the right-hand side of the
production.
The left-hand side of a production always consists of a single,
nonterminal symbol, and every nonterminal that's used on a
right-hand side must also appear on a left-hand side. All
symbols that don't appear on a left-hand side, such as the tokens
in the input language, are terminal symbols.
Grammars must be as flexible as possible, and one of the ways
to get that flexibility is to make the application of certain rules
optional.
A rule like this:
article → THE
says that THE is an article, and you can use that production like
this:
object → article noun
In English, an object is an article followed by a noun. A rule
like the foregoing requires that all nouns that comprise an object
be preceded by an article. But what if you want the article to
be optional? You can do this by saying that an article can either
be the word "the" or an empty string. The following is used to
do this:
article → THE | ε
The ε (pronounced "epsilon") represents an empty string. If the
THE token is present in the input, then the
article → THE
production is used. If it is not there, however, then the article
matches an empty string, and
article → ε
is used.
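The choice between the two article productions can be sketched as follows. The token names THE and NOUN and the function name parse_object are assumptions for illustration; the function reports which productions it applied.

```python
def parse_object(tokens):
    """Match  object -> article noun  with  article -> THE | epsilon.

    Returns the list of productions applied, or None if the tokens
    do not form an object.
    """
    applied = ["object -> article noun"]
    pos = 0
    if pos < len(tokens) and tokens[pos] == "THE":
        applied.append("article -> THE")
        pos += 1                                 # consume the THE token
    else:
        applied.append("article -> epsilon")     # consumes no input
    if tokens[pos:] == ["NOUN"]:
        return applied
    return None


with_article = parse_object(["THE", "NOUN"])
# with_article == ["object -> article noun", "article -> THE"]
without_article = parse_object(["NOUN"])
# without_article == ["object -> article noun", "article -> epsilon"]
```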
A grammar that recognizes a limited set of English sentences is
shown below:

An input sentence can be recognized using this grammar, with a
series of replacements, as follows:
(1) Start out with the topmost symbol in the grammar, the goal
symbol.
(2) Replace that symbol with one of its right-hand sides.
(3) Continue replacing nonterminals, always replacing the
leftmost nonterminal with its right-hand side, until there are no
more nonterminals to replace.
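The three steps above can be sketched in Python. The grammar used here is a hypothetical toy grammar chosen for the example (it is not the grammar table from the lecture), and the caller supplies which right-hand side to use at each step.

```python
# Hypothetical toy grammar: each nonterminal maps to its right-hand sides.
# Upper-case names are terminals (tokens); [] stands for epsilon.
GRAMMAR = {
    "sentence":  [["subject", "predicate"]],
    "subject":   [["NOUN"]],
    "predicate": [["VERB", "object"]],
    "object":    [["article", "NOUN"]],
    "article":   [["THE"], []],
}


def leftmost_derivation(goal, choices):
    """Start at the goal symbol and keep replacing the leftmost nonterminal.

    `choices` gives, in order, the index of the right-hand side to apply.
    Returns every intermediate form, starting with the goal symbol itself.
    """
    form = [goal]
    steps = [list(form)]
    choices = iter(choices)
    while any(symbol in GRAMMAR for symbol in form):
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        rhs = GRAMMAR[form[index]][next(choices)]
        form = form[:index] + rhs + form[index + 1:]
        steps.append(list(form))
    return steps


steps = leftmost_derivation("sentence", [0, 0, 0, 0, 0])
# steps[-1] == ["NOUN", "VERB", "THE", "NOUN"]
```

Choosing index 1 at the last step applies the epsilon production instead, giving NOUN VERB NOUN.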

1.2.2 An Expression Grammar

Table 1.1 shows a grammar that recognizes a list of one or more
statements, each of which is an arithmetic expression followed
by a semicolon. Statements are made up of a series of
semicolon-delimited expressions, each comprising a series of
numbers separated either by asterisks (for multiplication) or plus
signs (for addition).
Note that the grammar is recursive. For example, Production 2
has statements on both the left- and right-hand sides. There's
also third-order recursion in Production 8, since it contains an
expression, but the only way to get to it is through Production 3,
which has an expression on its left-hand side. This last recursion
is made clear if you make a few algebraic substitutions in the
grammar. You can substitute the right-hand side of Production 6
in place of the reference to term in Production 4, yielding
expression →factor
and then substitute the right-hand side of Production 8 in place
of the factor:
expression → ( expression )
In Production 3, expression appears both on the left-hand side
and at the far left of the right-hand side.
This property is called left recursion, and certain parsers can't
handle left-recursive productions. They just loop forever,
repetitively replacing the leftmost symbol in the right-hand side
with the entire right-hand side.
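The runaway expansion can be made visible with a small sketch. It assumes Production 3 reads expression → expression + term, which is inferred from the prose rather than copied from Table 1.1, and it caps the number of replacements so that the demonstration terminates.

```python
def expand_leftmost(rhs, limit=5):
    """Repeatedly replace a leading `expression` by the whole right-hand side.

    No input is ever consumed, so without `limit` this would never stop
    for a left-recursive right-hand side.
    """
    form = list(rhs)
    history = [" ".join(form)]
    for _ in range(limit):
        if form[0] != "expression":
            break                       # leftmost symbol is no longer recursive
        form = list(rhs) + form[1:]
        history.append(" ".join(form))
    return history


history = expand_leftmost(["expression", "+", "term"], limit=3)
# history[-1] == "expression + term + term + term + term"
```

Each replacement only makes the form longer and puts expression back at the far left, which is why such a parser makes no progress.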
You can understand the problem by considering how the parser
decides to apply a particular production when it is replacing a
nonterminal that has more than one right-hand side.
The simple case is evident in Productions 7 and 8. The parser
can choose which production to apply when it's expanding a
factor by looking at the next input symbol.
If this symbol is a number, then the parser applies
Production 7 and replaces the factor with a number. If the next
input symbol is an open parenthesis, the parser uses
Production 8. The choice between Productions 5 and 6 cannot
be solved in this way, however. In the case of Production 6, the
right-hand side of term starts with a factor which, in turn, starts
with either a number or left parenthesis. Consequently, the
parser would like to apply Production 6 when a term is being
replaced and the next input symbol is a number or left
parenthesis. Production 5 (the other right-hand side) starts with a
term, which can start with a factor, which can start with a
number or left parenthesis, and these are the same symbols that
were used to choose Production 6.
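The argument above can be checked mechanically by computing which input symbols can start each production. The sketch below assumes a reconstruction of Productions 3 to 8 inferred from the prose (Table 1.1 is not reproduced here, so the exact right-hand sides are assumptions), and the names first_sets and lookahead are made up for the example.

```python
# Assumed reconstruction of Productions 3-8; not copied from Table 1.1.
PRODUCTIONS = {
    3: ("expression", ["expression", "+", "term"]),
    4: ("expression", ["term"]),
    5: ("term", ["term", "*", "factor"]),
    6: ("term", ["factor"]),
    7: ("factor", ["NUMBER"]),
    8: ("factor", ["(", "expression", ")"]),
}
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS.values()}


def first_sets():
    """For each nonterminal, collect the tokens that can begin it.

    Iterates to a fixed point; this grammar has no epsilon productions,
    so only the first symbol of each right-hand side matters.
    """
    first = {name: set() for name in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODUCTIONS.values():
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def lookahead(number):
    """Tokens that would make the parser pick the given production."""
    head = PRODUCTIONS[number][1][0]
    return set(first_sets()[head]) if head in NONTERMINALS else {head}


conflict = lookahead(5) & lookahead(6)
# conflict == {"NUMBER", "("}
```

Productions 7 and 8 have disjoint lookahead sets, so one symbol of input decides between them, while Productions 5 and 6 share both symbols and cannot be told apart this way.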

Chapter Two

Lexical Analysis

2.1 Languages

An alphabet is any finite set of symbols. For example, the
ASCII character set is an alphabet; the set {'0', '1'} is a binary
alphabet. A string or word is a sequence of alphabetic symbols.
In practical terms, a string is an array of characters.
There is also the special case of an empty string, represented by
the symbol ε (pronounced epsilon). In C, the '\0' is not part of
the input alphabet. As a consequence, it can be used as an
end-of-string marker because it cannot be confused with any of the
characters in the string itself, all of which are part of the input
alphabet. An empty string in C can then be represented by an
array containing a single '\0' character. Note, here, that there's
an important difference between ε, an empty string, and a null
string. The former is an array containing the end-of-string
marker. The latter is represented by a NULL pointer, a pointer
that doesn't point anywhere. In other words, there is no array
associated with a null string.
A language is a set of strings that can be formed from the input
alphabet. A sentence is a sequence of the strings that comprise a
language. The ordering of strings within the sentence is defined
by a collection of syntactic rules called a grammar. The lexical
analyzer doesn't understand meaning.
A prefix is a string composed of the characters remaining after
zero or more symbols have been deleted from the end of a
string: "in" is a prefix of "inconsequential". Officially, ε is a
prefix of every string.
A suffix is a string formed by deleting zero or more symbols
from the front of a string. "ible" is a suffix of
"incomprehensible". The suffix is what's left after you've
removed a prefix. ε is a suffix of every string.
A substring is what's left when you remove both a suffix and
prefix: "age" is a substring of "unmanageable". Note that
suffixes and prefixes are substrings. Also ε, the empty string, is
a substring of every string.
A proper prefix, suffix, or substring of the string x has at least
one element and it is not the same as x. That is, it can't be ε, and
it can't be identical to the original string.
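These definitions translate directly into short predicates. The sketch below uses Python's empty string "" to stand for ε; the function names are chosen for this example. It also includes a check for the sub-sequence relation, which is defined in the next paragraph.

```python
def is_prefix(p, s):
    """p is s with zero or more symbols deleted from the end."""
    return s.startswith(p)


def is_suffix(p, s):
    """p is s with zero or more symbols deleted from the front."""
    return s.endswith(p)


def is_substring(p, s):
    """p is s with a prefix and a suffix removed."""
    return p in s


def is_proper(p, s):
    """Proper, as defined above: not the empty string and not s itself."""
    return p != "" and p != s


def is_subsequence(p, s):
    """p is s with zero or more (not necessarily adjacent) symbols deleted."""
    remaining = iter(s)
    # `symbol in remaining` advances the iterator past the match, so the
    # symbols of p must be found in s in the same order.
    return all(symbol in remaining for symbol in p)
```

For instance, is_prefix("in", "inconsequential") and is_subsequence("ssss", "Mississippi") both hold, while is_proper("", "abc") does not.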
A sub-sequence of a string is formed by deleting zero or more
symbols from the string. The symbols don't have to be
contiguous, so "iiii" and "ssss" are both sub-sequences of
"Mississippi".
Several useful operations can be performed on strings. The
concatenation of two strings is formed by appending all
characters of one string to the end of another string. The
concatenation of "fire" and "water" is "firewater". The empty
string, ε, can be concatenated to any other string without
modifying it.
If you look at concatenation as a sort of multiplication, then
exponentiation makes sense. An expression like x^n represents
x, repeated n times. You could define a language consisting of
the eight legal octal digits with the following:
L(octal) = { 0, 1, 2, 3, 4, 5, 6, 7 }
and then you could specify a three-digit octal number with
L(octal)^3.
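Concatenation and exponentiation of languages can be sketched with Python sets of strings. The function names concatenate and power are assumptions for the example, and the zeroth power is taken to be the language containing only the empty string, which is the usual convention.

```python
from itertools import product


def concatenate(first, second):
    """Every string of `first` followed by every string of `second`."""
    return {a + b for a, b in product(first, second)}


def power(language, n):
    """The language concatenated with itself n times; power 0 is {""}."""
    result = {""}
    for _ in range(n):
        result = concatenate(result, language)
    return result


OCTAL = set("01234567")
three_digit = power(OCTAL, 3)
# len(three_digit) == 512, and "017" is a member
```

There are 8 * 8 * 8 = 512 three-digit octal strings, and a string such as "018" is excluded because 8 is not in the alphabet.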
The exponentiation process can be generalized into the closure
operations. If L is a language, then the Kleene closure of L is L
repeated zero or more times. This operation is usually
represented as L*. In the case of a language comprised of a
single character, L* is that character repeated zero or more
times. If the language elements are strings rather than single
characters, L* is those strings repeated zero or more times. For
example, L(octal)* is zero or more octal digits. If L(v1) is a
language comprised of the string Va and L(v2) is a language
comprised of the string Voom, then L(v1)* · L(v2)
describes all of the following strings:
Voom VaVoom VaVaVoom VaVaVaVoom etc.
The positive closure of L is L repeated one or more times,
usually denoted L+.
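Because a closure describes infinitely many strings, a program can only enumerate them lazily or test membership. The sketch below does both for the Va/Voom example, using a generator and regular expressions from Python's re module; the names va_voom, in_kleene, and in_positive are made up for this example.

```python
import re
from itertools import count, islice


def va_voom():
    """Yield the strings of L(v1)* · L(v2) in order of increasing length."""
    for repeats in count(0):            # zero or more copies of "Va"
        yield "Va" * repeats + "Voom"


first_four = list(islice(va_voom(), 4))
# first_four == ["Voom", "VaVoom", "VaVaVoom", "VaVaVaVoom"]


def in_kleene(text):
    """Membership in L(v1)* · L(v2): zero or more Va, then Voom."""
    return re.fullmatch(r"(?:Va)*Voom", text) is not None


def in_positive(text):
    """Membership in L(v1)+ · L(v2): one or more Va, then Voom."""
    return re.fullmatch(r"(?:Va)+Voom", text) is not None
```

The only difference between the two closures shows up on the shortest string: in_kleene("Voom") is true, but in_positive("Voom") is false.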
Since languages are sets of strings, most of the standard set
operations can be applied to them. The most useful of these is
union, denoted with the ∪ operator. For example, if letters is a
language containing all 26 letters [denoted by L(letters)] and
digits is a set containing all 10 digits [denoted by L(digits)], then
L(letters) ∪ L(digits) is the set of alphanumeric characters.
Union is the equivalent of a logical OR operator.
Other set operations (like intersection) are, of course, possible,
but have less practical application.
Given the following two languages:
L(digit) = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
L(alpha) = { a, b, c, ... , z }
you can say:
L(digit)+ is a decimal constant in C (one or more digits).
L(digit)* is an optional decimal constant (zero or more digits).
L(alpha) ∪ L(digit) is the set of alphanumeric characters.
(L(alpha) ∪ L(digit))* is any number of alphanumeric
characters.
L(alpha) · (L(alpha) ∪ L(digit))* is a C identifier.
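These language expressions map closely onto regular expressions. The following sketch uses Python's re module, with union written as | and the closures as * and +. It keeps the simplified lower-case alphabet used in the text (identifiers in real C also allow upper-case letters and underscores), and the names is_decimal and is_identifier are assumptions for the example.

```python
import re

DIGIT = "[0-9]"                               # L(digit)
ALPHA = "[a-z]"                               # L(alpha)
ALNUM = f"(?:{ALPHA}|{DIGIT})"                # union of L(alpha) and L(digit)

DECIMAL = re.compile(f"{DIGIT}+")             # L(digit)+
IDENTIFIER = re.compile(f"{ALPHA}{ALNUM}*")   # L(alpha) followed by alphanumerics*


def is_decimal(text):
    """True if text is one or more digits."""
    return DECIMAL.fullmatch(text) is not None


def is_identifier(text):
    """True if text is a letter followed by any number of letters or digits."""
    return IDENTIFIER.fullmatch(text) is not None
```

For example, is_identifier("x1") holds while is_identifier("1x") does not, because the first symbol must come from L(alpha).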
