100% found this document useful (1 vote)
6 views78 pages

Module 1

The document outlines the structure and phases of a compiler, including lexical analysis, syntax analysis, and code generation. It discusses the importance of symbol tables, intermediate representations, and error handling during compilation. Additionally, it highlights the challenges of compiler design, such as supporting multiple programming languages and architectures while ensuring efficiency and correctness.

Uploaded by

caspian.maniac
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
6 views78 pages

Module 1

The document outlines the structure and phases of a compiler, including lexical analysis, syntax analysis, and code generation. It discusses the importance of symbol tables, intermediate representations, and error handling during compilation. Additionally, it highlights the challenges of compiler design, such as supporting multiple programming languages and architectures while ensuring efficiency and correctness.

Uploaded by

caspian.maniac
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Dr. S.

Kiruthika
Assistant Professor
School of Computer Science and Engineering
VIT Chennai
§ Machine
§ Systems
§ Hardware vs Software
§ Application software vs System Software
§ Languages
§ Programing Languages
§ Compliers
§ Interpreters
§ Analysis – Front end
§ Breaks down source program into constituent pieces
§ Imposes a grammatical structure on them
§ Creates an intermediate representation of the source program (Syntax tree)
§ Checks for errors and issues diagnostic messages
§ Stores necessary information in Symbol table

§ Synthesis - Backend
§ Constructs target program from intermediate representation
§ Requires Specialized techniques

4
§ Compiler operates in phases
§ Each phase transforms source program from one representation to another
§ All phases use the Symbol table
§ Machine Independent Code Optimization
§ May be present between front end and back end
§ Performs transformations on the intermediate representations
§ Optional

5
Character Stream

1 Lexical Analyzer
Token Stream
2 Syntax Analyzer
Syntax tree
3 Semantic Analyzer
Syntax tree
Symbol 4
Intermediate Code
Generator
Table
Intermediate Representation
Machine Independent
5 Code Optimizer
Intermediate Representation

6 Code Generator
Target Machine Code
Machine Dependent
7 Code Optimizer

Target Machine Code


6
§ Linear Analysis / Scanning
§ Stream of characters are read and grouped into meaningful sequences
called lexemes
§ For each lexeme the lexical analyzer produces a token of the form
<token-name, attribute-value>
§ Token-name : Abstract symbol used during syntax analysis
§ Attribute-value: Points to Symbol table entry
§ Input: position = initial + rate * 60

position = initial + rate * 60 ;


<id, 1> < = > <id, 2> <+> <id, 3> <*> <60>
=, +, * - Abstract Symbols 7
§ Syntax Analysis or Parsing
§ Uses first components of the tokens produced by the lexical analyzer to create a
tree-like intermediate representation
§ Syntax tree – interior node represents an operation and children represent
arguments

8
§ Hierarchical structure is expressed using recursive rules
§ Any identifier is an expr
§ Any number is an expr
§ If expr1 and expr2 are expressions
§ expr1 + expr2, expr1 * expr2; (expr1) are all expressions
§ If id1 is and identifier and exp1 is an expression id1 = exp1 is a statement
§ If exp1 is an expression and stmt2 is a statement
while (exp1) do stmt2
if (exp1) then stmt2 are all statements

9
§ Division between lexical and syntax analysis is arbitrary
§ Constructs of the source language can be used to decide
§ Non-recursive – lexical analysis
§ Recursive – begin-end statements; parenthesis matching etc.

10
assignment
statement
=
identifier expression
+
position expression expression
*
identifier expression expression
initial identifier number
rate 60

11
§ Compressed representation – Syntax tree
=
<id,1> +

<id, 2> *

§ In Parse tree <id, 3> 60


§ interior node: record with an operator field and 2 fields for children
§ Leaf: record with 2/more fields; one for token and other information about token

12
§ Ensures that the components of a program fit together
meaningfully
§ Gathers type information and checks for type compatibility
§ Permits coercions

=
<id,1> +

<id, 2> *

<id, 3> inttofloat

60
13
§ During Compilation, a compiler may construct one or more intermediate
representations (such as Syntax tree)
§ Intermediate representation
§ Explicit low-level or machine-like intermediate representation
§ Program for an abstract machine
§ Must be easy to produce and Easy to translate into target

§ 3-address code is a common form


§ Each instruction has at most 3 operands

14
§ Order of operations must be decided
§ Compiler must generate temporary names
§ Some instructions may have less than 3 operands

15
§ Example
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

16
§ Aims to produce better target code
§ Optimizing Compilers
§ Amount of optimization varies

§ Optimized code
t1 = id3 * 60.0
id1 = id2 + t1
-int to real conversion at compile time
-t3 is eliminated

17
§ Generates target code
§ Relocatable machine code / Assembly code
§ Memory locations are selected for variables
§ Assignment of registers is done
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1

18
§ Symbol table is a data structure which has a record for each identifier
and fields for the attribute of the identifier
§ Attributes Stored
§ Identifiers - Storage allocated, type, Scope
§ Procedures – Number and type of arguments, Passing by reference/ value, return
type
§ Lexical Analysis stage makes entries, other phases fill in details and
refer

19
§ All phases may encounter errors
§ Lexical
§ Syntax
§ Semantic

§ Must not stop on first error

20
§ Phases can be grouped as Front end and Back End
§ Front end – depends on source language independent of target
machine
§ Lexical Analysis to ICG phases + some code optimization and error reporting
§ Back end – Depends on target machine, independent of source
language
§ Includes Code optimization and generation

21
§ Need for Intermediate Code Generation

M languages

INTERMEDIATE CODE

N machines

22
§ Several compilation phases may form a single pass
§ Within a pass
§ Phases maybe interleaved
§ Syntax analyzer maybe in-charge; may invoke lexical analyzer when needed and then call
code generator

23
§ Desirable to reduce number of passes
§ Group several phases into a pass
§ Larger memory consumption
§ Some phases can be easily grouped
§ Grouping code generation and syntax analysis
§ Forward references and Back patching

24
§ Each phase of compiler
§ The challenges of compiler design which include supporting many programming languages,
paradigms, architectures, and operating systems while ensuring the compiler is bug-free and
generates fast, portable machine code.
§ The differences between static and dynamic error checking done by compilers and during
program execution.
§ The structure of an activation record which contains things like temporary values, local data,
saved machine status, access links, actual parameters and return values.
§ The importance of symbol tables in compilers for caching information, providing type checking
and ascertaining program semantics.
§ Compilation appears to be very simple, but there are many pitfalls
§ How are erroneous programs handled?
§ Design of programming languages has a big impact on the complexity of the compiler
§ M*N vs. M+N problem
§ Compilers are required for all the languages and all the machines
§ For M languages and N machines we need to develop M*N compilers
§ However, there is lot of repetition of work because of similar activities in the front ends and
back ends
§ Can we design only M front ends and N back ends, and some how link them to get all M*N
compilers?
§ The compiler should fit in the integrated development environment. This opens
many challenges in design e.g., appropriate information should be passed on to
the debugger in case of erroneous programs.
§ The compiler should find the erroneous line in the program and also make error
recovery possible. Some features of programming languages make compiler
design difficult, e.g., Algol68 is a very neat language with most good features.
But it could never get implemented because of the complexities in its compiler
design.
token
source lexical
analyzer parser
program
get next
token

symbol
table

31
§ Read input and produce a sequence of tokens
§ Can act as a sub-routine for the parser
§ get_next_token

§ Secondary tasks
§ Strips blanks and new line characters
§ Correlate error messages

§ Maybe split into scanning and lexical analysis phases

32
§ Token is a pair consisting of a token name and an optional attribute
value.
§ A pattern is a rule describing a set of lexemes that can represent a
particular token
§ A lexeme is a sequence of characters in the source program that
matches the pattern for a token

33
Token Sample Lexemes Informal Description of Pattern
const const const
if if if
relation <, <=, =, < >, >, >= < or <= or = or < > or >= or >
id pi, count, D2 letter followed by letters and digits
num 3.1416, 0, 6.02E23 any numeric constant
literal “core dumped” any characters between “ and “ except

Actual values are critical. Info is :


Classifies
Pattern [Link] in symbol table
[Link] to parser
34
§ Separation of lexical and syntactic analysis phases
§ Simpler Conceptual Model
§ A parser including the conventions for comments and white space is
significantly more complex
§ Increases Compiler Efficiency
§ Specialized buffering techniques for reading input characters and
processing tokens

§ Separation Promotes Portability.


§ Input alphabet peculiarities and other device-specific anomalies can be
restricted to the lexical analyzer.

35
§ Alignment of statements; free format input
§ Treatment of blanks
§ DO 5 I = 1.25; DO 5 I = 1,25

§ Reserved Keywords
§ IF THEN THEN THEN = ELSE; ELSE ELSE = THEN

36
§ Attributes provide additional information about tokens
§ Tokens influence parsing decision; the attributes influence
the translation of tokens after the parse
§ Lexical analyzers usually provide a single attribute per
token (pointer into symbol table)

37
§ Example: E = M * C ** 2
§ <id, pointer to symbol-table entry for R>
§ <assign_op >
§ <id, pointer to symbol-table entry for M>
§ <mult_op >
§ <id, pointer to symbol-table entry for C>
§ <exp_op >
§ <num, integer value 2>

38
§ Many errors may not be detected at the lexical analysis
stage alone
§ Example: fi (a == b)
§ Errors are found when Lexical analyzer is unable to
proceed because none of the patterns for tokens matches
a prefix of remaining input.

39
§ Alphabet – any finite set of symbols (e.g. ASCII, binary
alphabet, or a set of tokens)
§ String – A finite sequence of symbols drawn from an
alphabet
§ Length of a string |s|
§ Empty String

§ Language – A set of strings over a fixed alphabet


§ Other terms relating to strings: prefix; suffix; substring;
proper prefix, suffix, or substring (non-empty, not entire
string); subsequence

42
§ A language, L, is simply any set of strings over a fixed alphabet.

Alphabet Languages
{0,1} {0,10,100,1000,100000…}
{0,1,00,11,000,111,…}
{a,b,c} {abc,aabbcc,aaabbbccc,…}
{A, … ,Z} {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…9, { All legal PASCAL progs}
+,-,…,<,>,…}

Special Languages: Æ - EMPTY LANGUAGE


Î - contains Î string only
43
§ Given String: banana
§ Prefix : ban, banana
§ Suffix : ana, banana
§ Substring : nan, ban, ana, banana
§ Subsequence: bnan, nn
§ Proper Prefix and Suffix

44
§Concatenation
§xy; sÎ = Îs = s; Î - identity for concatenation
§s0 = Î if i > 0 si = si-1s

45
OPERATION DEFINITION
union of L and M L È M = {s | s is in L or s is in M}
written L È M
concatenation of L LM = {st | s is in L and t is in M}
and M written LM
¥
Kleene closure of L L*=  Li
written L* i =0

L* denotes “zero or more concatenations of “ L


positive closure of L ¥

written L+ L+= L
i =1
i

L+ denotes “one or more concatenations of “ L

46
L = {A, B, C, D } D = {1, 2, 3}
§ L È D = {A, B, C, D, 1, 2, 3 }
§ LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
§ L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
§ L* = { All possible strings of L plus Î }
§ L+ = L* - Î
§ L (L È D )
§ L (L È D )*

47
§ A Regular Expression is a Set of Rules / Techniques for
Constructing Sequences of Symbols (Strings) From an
Alphabet.

§ Let S Be an Alphabet, r a Regular Expression, then L(r) is


the Language That is Characterized by the Rules of r

48
§ Defined over an alphabet Σ
§ ε represents {ε}, the set containing the empty string
§ If a is a symbol in Σ, then a is a regular expression denoting {a}, the
set containing the string a
§ If r and s are regular expressions denoting the languages L(r) and
L(s), then:
§ (r)|(s) is a regular expression denoting L(r)U L(s)
§ (r)(s) is a regular expression denoting L(r)L(s)
§ (r)* is a regular expression denoting (L(r))*
§ (r) is a regular expression denoting L(r)

§ Precedence: * (left associative), then concatenation (left associative),


then | (left associative)
49
Alphabet = {a, b}

1. a|b denotes {a, b}


2. (a|b)(a|b) denotes {ab, aa, ba, bb}
3. a* denotes {Î, a, aa, …}
4. (a|b)* - Strings of a’s and b’s
5. a|a*b – a followed by zero/more a’s followed by b

50
AXIOM DESCRIPTION
r|s=s|r | is commutative
r | (s | t) = (r | s) | t | is associative
(r s) t = r (s t) concatenation is associative
r(s|t)=rs|rt
(s|t)r=sr|tr concatenation distributes over |

Îr = r
Î Is the identity element for concatenation
rÎ = r
r* = ( r | Î )* relation between * and Î
r** = r* * is idempotent
Regular Set – Language defined by a regular expression
If two regular expressions denote the same regular set they are said to be 51
equivalent.
§ Names maybe given to regular expressions; these names can be
used like symbols
§ Let S is an alphabet of basic symbols. The regular definition is a
sequence of definitions of the form
d1® r1
d2® r2
...
dn® rn

Where, each di is a distinct name, and each ri is a regular


expression over the symbols in S È {d1, d2, …, di-1 }

52
§ Example 1:
§ letter à A|B|…|Z|a|b|…|z
§ digit à 0|1|…|9
§ id à letter (letter | digit)*

§ Example 2
§ digit ® 0 | 1 | 2 | … | 9
§ digits ® digit digit*
§ optional_fraction ® . digits | Î
§ optional_exponent ® ( E ( + | -| Î) digits) | Î
§ num ® digits optional_fraction optional_exponent

53
§ Shorthand
§ One or more instances: r+ denotes rr*
§ Zero or one Instance: r? denotes r|ε
§ Character classes: [a-z] denotes [a|b|…|z]

r* = r+ | ε and r+ = rr* = r*r

54
§ digit ® 0 | 1 | 2 | … | 9
§ digits ® digit+
§ optional_fraction ® (. digits ) ?
§ optional_exponent ® ( E ( + | -) ? digits) ?
§ num ® digits optional_fraction optional_exponent

55
§ Cannot describe balanced or nested constructs
§ Example, all valid strings of balanced parentheses
§ This can be done with CFG

§ Cannot describe repeated strings


§ Example: {wcw|w is a string of a’s and b’s}

56
§ Example Grammar:
stmt ® if expr then stmt
|if expr then stmt else stmt |Î
expr ® term relop term | term
term ® id | num

if ® if
then ® then
else ® else
relop ® < | <= | > | >= | = | <>
id ® letter ( letter | digit )*
num ® digit + (. digit + ) ? ( E(+ | -) ? digit + ) ?

57
§ Lexical Analyzer recognizes keywords and lexemes for relop, id
and num
§ Strips white space – No token is returned
blank ® blank
tab ® tab
newline ® newline
delim ® blank | tab | newline
ws ® delim+
§ Goal: To construct a lexical analyzer that isolates the lexeme for
next token and produces as output a pair having token and
attribute value

58
Regular Token Attribute-Value
Expression
ws - -
if if -
then then -
else else -
id id pointer to table entry
num num Exact value
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
59
>= relop GE
§ Transition Diagrams (TD) are used to represent the tokens
§ As characters are read, the relevant TDs are used to attempt to
match lexeme to a pattern
§ Each TD has:
§ States : Represented by Circles
§ Actions : Represented by Arrows between states
§ Start State : Beginning of a pattern (Arrowhead)
§ Final State(s) : End of pattern (Concentric Circles)

§ Each TD is Deterministic
§ States have actions which are executed when flow of control
reaches that state.

60
§ Additional characters that have been read must be
retracted

>=: start > = RTN(GE)


0 6 7

other
8 * RTN(GT)

§ If there are several transition diagrams; when failure


occurs on one path; retract FP to start state and activate
next
§ Failure in all states – Lexical error

61
start < =
0 1 2 return(relop, LE)
>
3 return(relop, NE)
other

= 4 *
return(relop, LT)

5 return(relop, EQ)
>

=
6 7 return(relop, GE)
other
8 *
return(relop, GT)
62
letter or digit

start letter other *


9 10 11

return( get_token(), install_id())

Either returns ptr or “0” if reserved

delim

start delim other *


28 29 30

63
Direct method in black board
§ Widely used to specify lexical analyzers for a variety of languages
§ Tool - Lex Compiler
§ Input: Lex language
§ Actions corresponding to REs can also be specified
§ Can be used even when Lex compiler is not available

65
66
§ Create Specification in lex language – lex.l
§ Run lex.l through compiler to produce [Link].c
§ Run [Link].c through C compiler to produce [Link]

67
Declarations
%%
Translation rules
%%
Auxiliary procedures

68
§ Declarations Section: Variables, manifest constants and regular definitions
§ Manifest constant – identifier that is declared to represent a constant

69
§ Translation Rules
p1 {action1}
p2 {action 2}
p3 {action 3}
§ Actions are written in C language / any other language

70
§ Auxiliary Procedures
§ Contains additional routines needed by the actions
§ Maybe compiled separately and loaded along with the lexical
analyzer

71
§ Lexical Analyzer created by Lex works under the control of
the Parser
§ When activated searches for longest lexeme and executes
matching action
§ On completion control is transferred to parser / continues with
other lexemes
§ Returns a token to the parser; to pass attribute value yyval is set

72
73
§ Manifest constants
§ Declaration is surrounded by %{ and %}
§ Directly placed into [Link].c

§ Auxiliary procedures
§ Directly placed into [Link].c

74
§ Regular Definitions
§ delim – tab or newline or blank
§ ws – one or more delimiters
§ Surrounded by braces
§ [A-Za-z] – Character class
§ Definition of id and number use meta symbols
§ { } – grouping
§ ? – zero / more
§ \. – gives natural meaning for dot (otherwise represents all characters except \n)
§ \- : minus
§ Can also surround with quotes

75
76
§ No action on white space
§ On if, then, else - Token is returned
§ For relational operators
§ yylval contains code for particular operator
§ Returns token – RELOP

77
§ For identifier:
§ yylval is set to value returned by install_id

§ install_id
§ Looks in symbol table for lexeme matched by pattern id; creates new entry if needed
§ Lexeme is extracted using variables yytext and yyleng

§ Numbers are also treated similarly

78
79
§ Longest matching prefix is chosen (< / <=)
§ Two matches – precedence defined by order of definition

80
§ Lookahead maybe required by the Lexical Analyzer
§ DO 5 I = 1.25 and DO 5 I = 1, 25
§ Lex r1/r2 – matches r1 only if followed by r2
§ r2 restricts match
§ DO/({letter|{digit})* = ({letter}|{digit})*,
§ Matches keyword only if ident = ident, follows

81
§ FORTRAN IF(I, J) = 3
§ Array Assignment
§ IF condition

§ IF/ \( .* \) {letter} – indicates IF is a keyword

82

You might also like