1. Illustrate the various phases of a compiler.
Show the translation for an
assignment: Position = initial + rate * 60. Clearly indicate the output of
each phase
A compiler is a system program that translates a high-level language
program into an equivalent machine-level program.
This translation is carried out in a sequence of well-defined phases.
1. Lexical Analysis (Scanner)
• Converts the source program into a sequence of tokens
• Removes white spaces and comments
• Creates entries in the symbol table
Input: Source program
Output: Token stream
Example tokens:
<id, position> <=> <id, initial> <+> <id, rate> <*> <num, 60>
2. Syntax Analysis (Parser)
• Checks whether the token sequence follows grammar rules
• Constructs a parse tree / syntax tree
• Detects syntax errors
Input: Tokens
Output: Parse tree
3. Semantic Analysis
• Checks meaning of statements
• Verifies:
o Type compatibility
o Variable declarations
o Scope rules
Input: Parse tree
Output: Annotated syntax tree
4. Intermediate Code Generation
• Produces an intermediate representation (IR)
• Machine-independent
• Common forms:
o Three-address code
o Quadruples
o Triples
Input: Annotated syntax tree
Output: Intermediate code
5. Code Optimization
• Improves intermediate code for:
o Speed
o Memory usage
• Does not change program meaning
Examples:
• Constant folding
• Dead code elimination
• Strength reduction
6. Target Code Generation
• Converts optimized IR into machine-dependent code
• Produces assembly or machine instructions
Input: Optimized IR
Output: Target machine code
Supporting Activities
• Symbol Table Management
• Error Handling
Translation of the Assignment Statement
Statement
Position = initial + rate * 60
1. Lexical Analysis Output
The source statement is broken into tokens.
<id, position>
<=>
<id, initial>
<+>
<id, rate>
<*>
<num, 60>
2. Syntax Analysis Output (Parse Tree)
=
/ \
Position +
/ \
initial *
/\
rate 60
This confirms the statement is syntactically correct.
3. Semantic Analysis Output
• Check:
o position, initial, rate are declared
o Types are compatible (e.g., integer / float)
Annotated tree (types attached):
rate : int
60 : int
rate * 60 : int
initial : int
initial + (rate * 60) : int
position : int
✔ No semantic errors.
4. Intermediate Code Generation
Three-Address Code
t1 = rate * 60
t2 = initial + t1
position = t2
5. Code Optimization
In this example, no further optimization is required.
(Already minimal form.)
6. Target Code Generation (Sample)
Assembly-like Code
LOAD rate
MUL 60
STORE t1
LOAD initial
ADD t1
STORE position
Final Summary Table
Compiler Phase Output
Lexical Analysis Tokens
Syntax Analysis Parse tree
Semantic Analysis Annotated tree
Intermediate Code Three-address code
Optimization Optimized IR
Compiler Phase Output
Target Code Machine / assembly code
2. Explain language processing system phases in detail with suitable
diagram
A Language Processing System is a collection of programs that translate a
program written in a high-level language into machine-executable code
and then execute it.
It bridges the gap between human-readable languages and machine
language.
Phases of the Language Processing System
Preprocessor
Purpose:
Prepares the source program before actual compilation.
Functions:
• Macro expansion (#define)
• File inclusion (#include)
• Conditional compilation (#if, #ifdef)
• Removal of comments
Input: High-level source program
Output: Modified source program (expanded code)
Compiler
Purpose:
Translates the high-level language into assembly or intermediate code.
The compiler itself works in phases:
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Intermediate code generation
• Code optimization
• Target code generation
Input: Preprocessed source code
Output: Assembly / object code (with possible errors reported)
Assembler
Purpose:
Converts assembly language into machine-level object code.
Functions:
• Converts mnemonics into opcodes
• Resolves symbolic addresses
• Produces object file
Input: Assembly code
Output: Object (machine) code
Linker
Purpose:
Combines multiple object files into a single executable file.
Functions:
• Links library functions
• Resolves external references
• Produces executable file
Input: Object files
Output: Linked executable file
Loader
Purpose:
Loads the executable program into main memory and starts execution.
Functions:
• Allocates memory
• Loads program into RAM
• Transfers control to the program
Input: Executable file
Output: Program execution
Interpreter (Alternative to Compiler)
• Translates and executes line by line
• Does not produce object code
• Slower than compiler
• Used in scripting languages
In some systems, interpreter replaces compiler + assembler stages.
Summary Table
Phase Input Output Function
Preprocessor Source code Expanded code Macro handling
Expanded Assembly/Object
Compiler Translation
code code
Assembly Opcode
Assembler Object code
code generation
Linker Object files Executable Linking
Loader Executable Running program Execution
3. List the difference between compiler interpreter and assembler
Feature Compiler Interpreter Assembler
High-level
Input High-level Assembly
language (C,
Language language language
Java, etc.)
Machine / object No permanent Machine
Output
code machine code (object) code
Translates
Translation Translates entire Translates line
instruction by
Method program at once by line
instruction
Feature Compiler Interpreter Assembler
Slow (interprets
Execution Fast (after
during Very fast
Speed compilation)
execution)
Error Reports errors Reports errors Reports syntax
Reporting after compilation line by line errors
Memory More (stores
Less Moderate
Requirement object code)
Object code can No reusable Object code
Reusability
be reused code reusable
Difficult (errors Easy (step-by-
Debugging Moderate
shown together) step)
GCC, Java
Examples Python, Ruby MASM, NASM
Compiler
Part of Alternative to
Phase in LPS After compiler
compilation compiler
4. Explain recognize the tokens for the below given lexems.
i. Identifier
ii. keywords
iii. Relational operator
iv. Unsigned number
v. Whitespace
A lexeme is a sequence of characters in the source program.
A token is a category (type) to which a lexeme belongs.
Lexical analyzers recognize tokens using patterns, usually expressed as
regular expressions.
i. Identifier
Meaning
Identifiers are names used to identify:
• variables
• functions
• arrays
• user-defined entities
Rules
• Must start with a letter
• Followed by letters or digits
• Cannot start with a digit
• Cannot be a keyword
Regular Expression
(letter)(letter | digit)*
Examples
x
sum
total1
Position
Token form
<id, lexeme>
ii. Keywords
Meaning
Keywords are reserved words with predefined meaning in a language.
Rules
• Fixed set of words
• Cannot be used as identifiers
Examples
if, while, int, float, return, for
Recognition method
• First recognized as identifier
• Then matched against keyword table
Token form
<keyword, if>
iii. Relational Operators
Meaning
Used to compare two operands.
Common relational operators
< <= > >= == !=
Regular Expressions
< | <= | > | >= | == | !=
Examples
a<b
x >= y
Token form
<relop, <>
<relop, <=>
iv. Unsigned Number
Meaning
A number without sign (+ or −).
Types
• Integer
• Real (optional, depending on syllabus)
Regular Expression (Integer)
digit+
Regular Expression (Real)
digit+ . digit+
Examples
10
345
60
3.14
Token form
<num, value>
v. Whitespace
Meaning
Characters used to separate tokens.
Includes
• Space
• Tab
• New line
Representation
blank | tab | newline
Regular Expression
( blank | tab | newline )+
Handling
• Ignored by lexical analyzer
• Not passed to parser
Summary Table (Very Useful in Exam)
Token Pattern / Rule Example
Identifier letter(letter digit)*
Keyword Fixed words while
Relational operator <, <=, >, >=, ==, != a <= b
Unsigned number digit+ 60
Whitespace space, tab, newline (ignored)
5. Demonstrate the concept of input buffering and sentinels in lexical
analysis
Input buffering is a technique used by the lexical analyzer to efficiently read
characters from the source program by minimizing the number of
input/output operations.
Instead of reading one character at a time from disk (which is slow),
characters are read in blocks (buffers).
Why input buffering is needed
• Disk access is slow
• Lexical analysis needs frequent lookahead
• Reading one character at a time reduces performance
Input buffering improves speed and efficiency.
Two-Buffer Scheme (Standard Method)
In this method:
• Two buffers of equal size are used
• While one buffer is being processed, the other is filled
• This avoids frequent disk reads
Sentinel Concept
What is a sentinel?
A sentinel is a special character (usually EOF) placed at the end of each
buffer.
Why sentinels are used
Without sentinel:
• Every character read needs a boundary check
• Increases overhead
With sentinel:
• No need to check buffer end every time
• When forward reaches sentinel → buffer is refilled
Sentinels eliminate explicit end-of-buffer checks.
How it works (Step-by-Step)
1. Characters are read into Buffer 1
2. A sentinel (EOF) is placed at the end
3. forward pointer scans characters
4. When forward reaches sentinel:
o Buffer 2 is loaded
o Sentinel is placed at its end
5. Scanning continues seamlessly
Example
Source input:
position = rate * 60
Buffer content:
Buffer1: p o s i t i o n EOF
Buffer2: = r a t e * 6 0 EOF
Lexical analyzer:
• Reads identifiers, operators, numbers
• Switches buffers automatically using sentinel
Advantages of Input Buffering with Sentinels
• Reduces number of I/O operations
• Faster lexical analysis
• Simple control logic
• Efficient lookahead handling
Summary Table
Concept Purpose
Input Buffering Efficient character reading
Two-Buffer Scheme Continuous scanning
Sentinel Avoid buffer boundary checks
lexemeBegin Start of token
forward Scans characters
Example Input (Source Program)
rate = a1 + 25
Assume:
• Buffer size = 8 characters
• Sentinel symbol = EOF
• Two buffers are used
Step 1: Fill the two buffers
Buffer–1
Index: 0 1 2 3 4 5 6 7
Buffer1: r a t e = EOF
Buffer–2
Index: 0 1 2 3 4 5 6 7
Buffer2: a 1 + 2 5 EOF
Step 2: Initialize pointers
lexemeBegin → Buffer1[0]
forward → Buffer1[0]
Step 3: Scan first lexeme (rate)
forward position Character Action
Buffer1[0] r letter → continue
Buffer1[1] a letter → continue
Buffer1[2] t letter → continue
Buffer1[3] e letter → continue
Buffer1[4] space lexeme ends
Token recognized: <id, rate>
Now:
lexemeBegin → Buffer1[4]
forward → Buffer1[4]
Step 4: Skip whitespace
forward Character Action
B1[4] space ignore
B1[5] = operator found
Token: <=>
Update pointers:
lexemeBegin → Buffer1[6]
forward → Buffer1[6]
Step 5: Sentinel encountered
forward Character Action
B1[7] EOF switch buffer
Buffer–2 is already filled
Move forward to Buffer2[0]
forward → Buffer2[0]
Step 6: Scan identifier (a1)
forward Character Action
B2[0] space ignore
B2[1] a start identifier
B2[2] 1 continue
B2[3] space lexeme ends
Token: <id, a1>
Update:
lexemeBegin → Buffer2[3]
forward → Buffer2[3]
Step 7: Scan operator (+)
forward Character Action
B2[4] + operator
Token: <+>
Step 8: Scan number (25)
forward Character Action
B2[5] 2 digit
B2[6] 5 digit
B2[7] EOF number ends
Token: <num, 25>
Final Tokens Generated
<id, rate>
<=>
<id, a1>
<+>
<num, 25>
Key Observations (VERY IMPORTANT)
1. lexemeBegin marks start of token
2. forward scans characters
3. Sentinel (EOF) tells lexer to switch buffers
4. No explicit buffer-end checking is needed
6. Explain specification and recognition of tokens
In lexical analysis, the source program is converted into a sequence of tokens.
This process involves two important tasks:
1. Specification of tokens – defining what tokens look like
2. Recognition of tokens – identifying tokens from the input program
Token Specification
What is token specification?
Token specification defines the structure or pattern of tokens using formal
rules.
These rules describe how lexemes are formed.
Components involved
• Token: A category (identifier, keyword, operator, etc.)
• Lexeme: Actual character sequence in the program
• Pattern: Rule describing the form of lexemes
How tokens are specified?
Tokens are specified using Regular Expressions (REs).
Examples of token specification
Token Regular Expression
Identifier `letter(letter
Keyword Fixed strings (if, while, int)
Number digit+
Relational operator `<
Whitespace `(space
Example
For the statement:
count = rate * 60
Token specifications identify:
• count, rate → identifiers
• = , * → operators
• 60 → number
Token Recognition
What is token recognition?
Token recognition is the process of matching input characters against token
specifications to identify valid tokens.
It is performed by the lexical analyzer (scanner).
How tokens are recognized?
Token recognition uses:
• Finite Automata (FA)
• Transition diagrams
• Lexical analyzer generator tools (e.g., Lex)
Recognition process (step-by-step)
1. Input characters are read from source program
2. Characters are grouped into lexemes
3. Lexemes are matched against regular expressions
4. Corresponding tokens are generated
5. Whitespace and comments are ignored
Example of token recognition
Input:
x1 = 20
Recognized tokens:
<id, x1>
<=>
<num, 20>
Relationship between Specification and Recognition
• Specification tells what a token looks like
• Recognition tells how the token is identified in input
Specification is design-time, recognition is run-time.
Role of Lexical Analyzer
The lexical analyzer:
• Uses token specifications (REs)
• Recognizes tokens using finite automata
• Supplies tokens to the syntax analyzer
• Maintains symbol table entries
• Reports lexical errors
Advantages
• Simplifies syntax analysis
• Improves compiler efficiency
• Enables modular compiler design
• Allows use of automated tools (Lex)