0% found this document useful (0 votes)
8 views63 pages

Context-Free Grammar and Analysis

The document discusses grammar in programming languages, focusing on context-free grammar (CFG) and context-sensitive analysis. It explains the components of grammar, including terminals, non-terminals, and the role of syntax in programming. Additionally, it covers concepts like attribute grammar, syntax-directed translation, and intermediate code generation, including three-address code and its applications in compiling source code.

Uploaded by

KAVINKUMAR VS
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views63 pages

Context-Free Grammar and Analysis

The document discusses grammar in programming languages, focusing on context-free grammar (CFG) and context-sensitive analysis. It explains the components of grammar, including terminals, non-terminals, and the role of syntax in programming. Additionally, it covers concepts like attribute grammar, syntax-directed translation, and intermediate code generation, including three-address code and its applications in compiling source code.

Uploaded by

KAVINKUMAR VS
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Context sensitive

Analysis
Grammar

Grammar contains set of production rules


which makes the strings of language.

A Grammar is a 4-tuple such that-


where : G=(V,T,S,P)
Terminals
• Terminal symbols are those which are the constituents of the sentence
generated using a grammar.
• Terminal symbols are denoted by using small case letters such as a, b,
c etc.

Non-Terminals
• Non-Terminal symbols are those which take part in the generation of
the sentence but are not part of it.
• Non-Terminal symbols are also called as auxiliary
symbols or variables.
• Non-Terminal symbols are denoted by using capital letters such as A,
B, C etc.
Syntax
Syntax refers to the rules that define the structure of a
language. Syntax in computer programming means the rules
that control the structure of the symbols, punctuation, and
words of a programming language.

 Lexical analyzer can identify tokens with the help of


regular expressions and pattern rules.

 Regular expressions cannot check balancing tokens,


such as parenthesis.

 So for ensuring correct syntax in the code we use


CFG(Context free grammar)
CFG-Context Free Grammar
CFG is a formal grammar which is used to generate
all possible patterns of strings in a given formal
language. Context-free grammar G can be defined by
four tuples as:
G = (V, T, P, S)

V is a variable (non terminals).

T is a set of terminals.

P is a set of rules, P: V→ (V ∪ T)*, i.e., the left-hand


sides of the production rule. P does have any right
context or left context.

S is the start symbol.


Regular Grammar - Regular
Language(Language that can
be expressed in regular
expression)

Context Free Grammar –


Regular + Non-Regular
Language
CFG-Example

Regular language-CFG

Non-Regular Language -CFG


Context senistive
Analysis
Analysis of the source code further to scanning and parsing is known to
be “context sensitive analysis”.

There is a level of correctness that is deeper than grammar.

To generate code, we need to understand its meaning.


Example
Attribute Grammar
Attribute grammar is a special form of context-free grammar where
some additional information (attributes) are appended to one or
more of its non-terminals in order to provide context-sensitive
information.
Attribute
Grammar

Special form of context-free grammar where some additional information


(attributes) are appended to one or more of its non-terminals in order to
provide context-sensitive information.

Each attribute has well-defined domain of values, such as integer, float,


character, string, and expressions.
Example

E ⟶ E + T

Simple example of CFG.


Example

E ⟶ E + T
{[Link] ⟶ [Link] + [Link] }

This lower part of the CFG contains The values of non-terminals E and T
the semantic rules that specify how are added together and the result
the grammar should be interpreted. is copied to the non-terminal E.
Better Examples

Consider the string


Syntax 4 + 2 * 3
Semantics

E0 ⟶ E1 + T [Link] := [Link] +
[Link]
E->T
[Link] := [Link]
T0->T1*F
[Link] := [Link]*[Link]
T->F
[Link] := [Link]
F->(E)
[Link] := [Link]
F->Constant
[Link] := [Link]

There is one attribute, Value, and it is a synthesized


attribute. The value of this attribute can be calculated
by one pass up the parse tree.
Application Example

Real A, B, Q
Integer X, Y, Z

Declaration Type List

List Variable

List List, Variable

Type Real | Integer

Adding Attributes,

Declaration -> Type List [Link] := [Link]

List -> Variables [Link] := [Link]

List0 -> List1, Variable [Link] := [Link]


[Link] := [Link]

Type -> Real | Integer [Link] := LexValue(Type)


Ad Hoc Syntax Directed
Translation
CONTEXT FREE GRAMMAR
CFG is a formal grammar which is used to generate all
possible patterns of strings in a given formal language.
Context-free grammar G can be defined by four tuples
as:
G = (V, T, P, S)
◦ V is a variable (non terminals).

◦ T is a set of terminals.

◦ P is a set of rules, P: V→ (V ∪ T)*, i.e., the left-hand


sides of the production rule. P does have any right
◦ context or left context.

◦ S is the start symbol.


TERMINALS
• Terminal symbols are those which are the
constituents of the sentence generated using a
grammar.
• Terminal symbols are denoted by using small case
letters such as a, b, c etc.
NON -TERMINALS
• Non-Terminal symbols are those which take part
in the generation of the sentence but are not part
of it.
• Non-Terminal symbols are also called as auxiliary
symbols or variables.
• Non-Terminal symbols are denoted by using
capital letters such as A, B, C etc.
Attribute Grammar
Attribute grammar is a special form of context-
free grammar where some additional
information (attributes) are appended to one or
more of its non-terminals in order to provide
context-sensitive information. Each attribute
has well-defined domain of values, such as
integer, float, character, string, and
expressions.
Context Sensitive Analysis
◦Analysis of the source code further to
scanning and parsing is known to be “context
sensitive analysis”.
◦There is a level of correctness that is deeper
than grammar.
◦To generate code, we need to understand its
meaning.
Syntax Directed Translation

Syntax Directed Translation has


augmented rules to the grammar
that facilitate semantic analysis.
SDT involves passing information
bottom-up and/or top-down to the
parse tree in form of attributes
attached to the nodes
◦ This is a grammar to syntactically validate an
expression having additions and multiplications in it.
◦ Syntax-directed translation rules use
◦ lexical values of nodes,
◦ constants
◦ attributes associated with the non-terminals in their
definitions.
SDT are augmented rules to a CFG that
associate

1)set of attributes to every node of the


grammar and

2) a set of translation rules to every


production rule using attributes, constants,
and lexical values.
SDT VS SDD
AD-HOC SYNTAX-DIRECTED
TRANSLATION

In ad-hoc syntax-directed translation the


actions required for context sensitive analysis
are incorporated into the process of parsing a
context free grammar. This is in contrast to the
attribute grammar approach where we modify
the grammar.
SET UP
In this scheme, the compiler writer provides snippets of
code that execute at parse time.
Each snippet, or action, is directly tied to a production in
the grammar.
Each time the parser recognizes that it is at a particular
place in the grammar, the corresponding action is
invoked.
The compiler writer has complete control over when the
actions execute.
To implement this in a recursive-descent parser, the
compiler writer adds the appropriate code to the parsing
routines.
Intermediate Code Generation
• Intermediate codes are machine independent codes, but they are close
to machine instructions.
• The given program in a source language is converted to an
equivalent program in an intermediate language by the intermediate
code generator.
• Intermediate language can be many different languages, and the
designer of the compiler decides this intermediate language.
– syntax trees can be used as an intermediate language.
– postfix notation can be used as an intermediate language.
– three-address code (Quadraples) can be used as an intermediate language
• we will use quadraples to discuss intermediate code generation
• quadraples are close to machine instructions, but they are not actual machine instructions.
– some programming languages have well defined intermediate languages.
• java – java virtual machine
• prolog – warren abstract machine
• In fact, there are byte-code emulators to execute instructions in these intermediate languages
Three-Address Code (Quadraples)
• A quadraple is:
x := y op z
where x, y and z are names, constants or compiler-generated
temporaries; op is any operator.

• But we may also the following notation for quadraples (much better
notation because it looks like a machine code instruction)
op y,z,x
apply operator op to y and z, and store the result in x.

• We use the term “three-address code” because each statement usually


contains three addresses (two for operands, one for the result).
Three-Address Statements
Binary Operator: op y,z,result or result := y op z
where op is a binary arithmetic or logical operator. This binary operator is applied to
y and z, and the result of the operation is stored in result.
Ex: add a,b,c
gt a,b,c
addr a,b,c
addi a,b,c

Unary Operator: op y,,result or result := op y


where op is a unary arithmetic or logical operator. This unary operator is applied to y,
and the result of the operation is stored in result.
Ex: uminus a,,c
not a,,c
inttoreal a,,c
Three-Address Statements (cont.)
Move Operator: mov y,,result or result := y
where the content of y is copied into result.
Ex: mov a,,c
movi a,,c
movr a,,c

Unconditional Jumps: jmp ,,L or goto L


We will jump to the three-address code with the label L, and the execution continues
from that statement.
Ex: jmp ,,L1 // jump to L1
jmp ,,7 // jump to the statement 7
Three-Address Statements (cont.)
Conditional Jumps: jmprelop y,z,L or if y relop z goto L
We will jump to the three-address code with the label L if the result of y relop z is
true, and the execution continues from that statement. If the result is false, the execution
continues from the statement following this conditional jump statement.
Ex: jmpgt y,z,L1 // jump to L1 if y>z
jmpgte y,z,L1 // jump to L1 if y>=z
jmpe y,z,L1 // jump to L1 if y==z
jmpne y,z,L1 // jump to L1 if y!=z

Our relational operator can also be a unary operator.


jmpnz y,,L1 // jump to L1 if y is not zero
jmpz y,,L1 // jump to L1 if y is zero
jmpt y,,L1 // jump to L1 if y is true
jmpf y,,L1 // jump to L1 if y is false
Three-Address Statements (cont.)
Procedure Parameters: param x,, or param x
Procedure Calls: call p,n, or call p,n
where x is an actual parameter, we invoke the procedure p with n parameters.
Ex: param x1,,
param x2,,
 p(x1,...,xn)
param xn,,
call p,n,

f(x+1,y)  add x,1,t1 SWAP(Y+2,Z) add y,2,T1


param t1,, param T1;
param y,, Param Z
call f,2, CALL SWAP,2
Three-Address Statements (cont.)
Indexed Assignments:
move y[i],,x or x := y[i]
move x,,y[i] or y[i] := x

Address and Pointer Assignments:


moveaddr y,,x or x := &y
movecont y,,x or x := *y
Syntax-Directed Translation into Three-Address
Code
S  id := E [Link] = [Link] || gen(‘mov’ [Link] ‘,,’ [Link])
E  E1 + E2 [Link] = newtemp();
[Link] = [Link] || [Link] || gen(‘add’ [Link] ‘,’ [Link] ‘,’
[Link])
E  E1 * E2 [Link] = newtemp();
[Link] = [Link] || [Link] || gen(‘mult’ [Link] ‘,’ [Link] ‘,’
[Link])
E  - E1 [Link] = newtemp();
[Link] = [Link] || gen(‘uminus’ [Link] ‘,,’ [Link])
E  ( E1 ) [Link] = [Link];
[Link] = [Link]
E  id [Link] = [Link];
[Link] = null
Syntax-Directed Translation (cont.)
S  while E do S1 [Link] = newlabel();
[Link] = newlabel();
[Link] = gen([Link] “:”) || [Link] ||
gen(‘jmpf’ [Link] ‘,,’ [Link]) || [Link] ||
gen(‘jmp’ ‘,,’ [Link]) ||
gen([Link] ‘:”)
S  if E then S1 else S2 [Link] = newlabel();
[Link] = newlabel();
[Link] = [Link] ||
gen(‘jmpf’ [Link] ‘,,’ [Link]) || [Link] ||
gen(‘jmp’ ‘,,’ [Link]) ||
gen([Link] ‘:”) || [Link] ||
gen([Link] ‘:”)
Translation Scheme to Produce Three-Address Code

S  id := E { p= lookup([Link]);
if (p is not nil) then emit(‘mov’ [Link] ‘,,’ p)
else error(“undefined-variable”) }
E  E1 + E2 { [Link] = newtemp();
emit(‘add’ [Link] ‘,’ [Link] ‘,’ [Link]) }
E  E1 * E2 { [Link] = newtemp();
emit(‘mult’ [Link] ‘,’ [Link] ‘,’ [Link]) }
E  - E1 { [Link] = newtemp();
emit(‘uminus’ [Link] ‘,,’ [Link]) }
E  ( E1 ) { [Link] = [Link]; }
E  id { p= lookup([Link]);
if (p is not nil) then [Link] = [Link]
else error(“undefined-variable”) }
Translation Scheme with Locations
S  id := { [Link] = [Link] } E
{ p = lookup([Link]);
if (p is not nil) then { emit([Link] ‘mov’ [Link] ‘,,’ p); [Link]=[Link]+1 }
else { error(“undefined-variable”); [Link]=[Link] } }

E  { [Link] = [Link] } E1 + { [Link] = [Link] } E2


{ [Link] = newtemp(); emit([Link] ‘add’ [Link] ‘,’ [Link] ‘,’ [Link]); [Link]=[Link]+1 }

E  { [Link] = [Link] } E1 + { [Link] = [Link] } E2


{ [Link] = newtemp(); emit([Link] ‘mult’ [Link] ‘,’ [Link] ‘,’ [Link]); [Link]=[Link]+1 }

E  - { [Link] = [Link] } E1
{ [Link] = newtemp(); emit([Link] ‘uminus’ [Link] ‘,,’ [Link]); [Link]=[Link]+1 }

E  ( E1 ) { [Link] = [Link]; [Link]=[Link]+1 }

E  id { [Link] = [Link]; p= lookup([Link]);


if (p is not nil) then [Link] = [Link]
else error(“undefined-variable”) }
Boolean Expressions
E  { [Link] = [Link] } E1 and { [Link] = [Link] } E2
{ [Link] = newtemp(); emit([Link] ‘and’ [Link] ‘,’ [Link] ‘,’ [Link]); [Link]=[Link]+1 }

E  { [Link] = [Link] } E1 or { [Link] = [Link] } E2


{ [Link] = newtemp(); emit([Link] ‘and’ [Link] ‘,’ [Link] ‘,’ [Link]); [Link]=[Link]+1 }

E  not { [Link] = [Link] } E1


{ [Link] = newtemp(); emit([Link] ‘not’ [Link] ‘,,’ [Link]); [Link]=[Link]+1 }

E  { [Link] = [Link] } E1 relop { [Link] = [Link] } E2


{ [Link] = newtemp();
emit([Link] [Link] [Link] ‘,’ [Link] ‘,’ [Link]); [Link]=[Link]+1 }
Translation Scheme(cont.)
S  while { [Link] = [Link] } E do
{ emit([Link] ‘jmpf’ [Link] ‘,,’ ‘NOTKNOWN’);
[Link]=[Link]+1; } S1
{ emit([Link] ‘jmp’ ‘,,’ [Link]);
[Link]=[Link]+1;
backpatch([Link],[Link]); }

S  if { [Link] = [Link] } E then


{ emit([Link] ‘jmpf’ [Link] ‘,,’ ‘NOTKNOWN’);
[Link]=[Link]+1; } S1 else
{ emit([Link] ‘jmp’ ‘,,’ ‘NOTKNOWN’);
[Link]=[Link]+1;
backpatch([Link],[Link]); } S2
{ [Link]=[Link];
backpatch([Link],[Link]); }
Three Address Codes - Example
x:=1; 01: mov 1,,x
y:=x+10; 02: add x,10,t1
while (x<y) {  03: mov t1,,y
x:=x+1; 04: lt x,y,t2
if (x%2==1) then y:=y+1; 05: jmpf t2,,17
else y:=y-2; 06: add x,1,t3
} 07: mov t3,,x
08: mod x,2,t4
09: eq t4,1,t5
10: jmpf t5,,14
11: add y,1,t6
12: mov t6,,y
13: jmp ,,16
14: sub y,2,t7
15: mov t7,,y
16: jmp ,,4
17:
Arrays
• Elements of arrays can be accessed quickly if the elements are stored in a block of
consecutive locations.

A one-dimensional array A:

… …

baseA low i width

baseA is the address of the first location of the array A,


width is the width of each array element.
low is the index of the first array element

location of A[i]  baseA+(i-low)*width


Arrays (cont.)
baseA+(i-low)*width
can be re-written as i*width + (baseA-low*width)

should be computed at run-time can be computed at compile-time

• So, the location of A[i] can be computed at the run-time by evaluating


the formula i*width+c where c is (baseA-low*width) which is evaluated
at compile-time.

• Intermediate code generator should produce the code to evaluate this


formula i*width+c (one multiplication and one addition operation).
Two-Dimensional Arrays
• A two-dimensional array can be stored in
– either row-major (row-by-row) or
– column-major (column-by-column).
• Most of the programming languages use row-major method.

• Row-major representation of a two-dimensional array:

row1 row2 rown


Two-Dimensional Arrays (cont.)
• The location of A[i1,i2] is

baseA+ ((i1-low1)*n2+i2-low2)*width
baseA is the location of the array A.
low1 is the index of the first row
low2 is the index of the first column
n2 is the number of elements in each row
width is the width of each array element
• Again, this formula can be re-written as
((i1*n2)+i2)*width + (baseA-((low1*n1)+low2)*width)

should be computed at run-time can be computed at compile-time


Multi-Dimensional Arrays
• In general, the location of A[i1,i2,...,ik] is
(( ... ((i1*n2)+i2) ...)*nk+ik)*width + (baseA-((...((low1*n1)+low2)...)*nk+lowk)*width)

• So, the intermediate code generator should produce the codes to


evaluate the following formula (to find the location of A[i 1,i2,...,ik]) :

(( ... ((i1*n2)+i2) ...)*nk+ik)*width + c

• To evaluate the (( ... ((i1*n2)+i2) ...)*nk+ik portion of this formula, we


can use the recurrence equation:
e1 = i 1
em = em-1 * nm + im
Translation Scheme for Arrays
• If we use the following grammar to calculate addresses of array
elements, we need inherited attributes.
L  id | id [ Elist ]
Elist  Elist , E | E

• Instead of this grammar, we will use the following grammar to calculate


addresses of array elements so that we do not need inherited attributes
(we will use only synthesized attributes).
L  id | Elist ]
Elist  Elist , E | id [ E
Translation Scheme for Arrays (cont.)
S  L := E { if ([Link] is null) emit(‘mov’ [Link] ‘,,’ [Link])
else emit(‘mov’ [Link] ‘,,’ [Link] ‘[‘ [Link] ‘]’) }
E  E1 + E2 { [Link] = newtemp();
emit(‘add’ [Link] ‘,’ [Link] ‘,’ [Link]) }

E  ( E1 ) { [Link] = [Link]; }

EL { if ([Link] is null) [Link] = [Link])


else { [Link] = newtemp();
emit(‘mov’ [Link] ‘[‘ [Link] ‘]’ ‘,,’
[Link]) } }
Translation Scheme for Arrays (cont.)
L  id { [Link] = [Link]; [Link] = null; }
L  Elist ]
{ [Link] = newtemp(); [Link] = newtemp();
emit(‘mov’ c([Link]) ‘,,’ [Link]);
emit(‘mult’ [Link] ‘,’ width([Link]) ‘,’ [Link]) }
Elist  Elist1 , E
{ [Link] = [Link] ; [Link] = newtemp(); [Link] = [Link] + 1;
emit(‘mult’ [Link] ‘,’ limit([Link],[Link]) ‘,’ [Link]);
emit(‘add’ [Link] ‘,’ [Link] ‘,’ [Link]); }
Elist  id [ E
{[Link] = [Link] ; [Link] = [Link]; [Link] = 1; }
Translation Scheme for Arrays – Example1
• A one-dimensional double array A : 5..100
 n1=95 width=8 (double) low1=5

• Intermediate codes corresponding to x := A[y]

mov c,,t1 // where c=baseA-(5)*8


mult y,8,t2
mov t1[t2],,t3
mov t3,,x
Translation Scheme for Arrays – Example2
• A two-dimensional int array A : 1..10x1..20
 n1=10 n2=20 width=4 (integers) low1=1 low2=1

• Intermediate codes corresponding to x := A[y,z]

mult y,20,t1
add t1,z,t1
mov c,,t2 // where c=baseA-(1*20+1)*4
mult t1,4,t3
mov t2[t3],,t4
mov t4,,x
Translation Scheme for Arrays – Example3
• A three-dimensional int array A : 0..9x0..19x0..29
 n1=10 n2=20 n3=30 width=4 (integers) low1=0 low2=0 low3=0
• Intermediate codes corresponding to x := A[w,y,z]
mult w,20,t1
add t1,y,t1
mult t1,30,t2
add t2,z,t2
mov c,,t3 // where c=baseA-((0*20+0)*30+0)*4
mult t2,4,t4
mov t3[t4],,t5
mov t5,,x
Declarations
PMD
M€ { offset=0 }
DD;D
D  id : T { enter([Link],[Link],offset); offset=offset+[Link] }
T  int { [Link]=int; [Link]=4 }
T  real { [Link]=real; [Link]=8 }
T  array[num] of T1 { [Link]=array([Link],[Link]);
[Link]=[Link]*[Link] }
T  ↑ T1 { [Link]=pointer([Link]); [Link]=4 }

where enter crates a symbol table entry with given values.


Nested Procedure Declarations
• For each procedure we should create a symbol table.
mktable(previous) – create a new symbol table where previous is the parent symbol
table of this new symbol table
enter(symtable,name,type,offset) – create a new entry for a variable in the given
symbol table.
enterproc(symtable,name,newsymbtable) – create a new entry for the procedure in
the symbol table of its parent.
addwidth(symtable,width) – puts the total width of all entries in the symbol table
into the header of that table.

• We will have two stacks:


– tblptr – to hold the pointers to the symbol tables
– offset – to hold the current offsets in the symbol tables in tblptr stack.
Nested Procedure Declarations
P  M D { addwidth(top(tblptr),top(offset)); pop(tblptr); pop(offset) }
M€ { t=mktable(nil); push(t,tblptr); push(0,offset) }
DD;D
D  proc id N D ; S
{ t=top(tblptr); addwidth(t,top(offset));
pop(tblptr); pop(offset);
enterproc(top(tblptr),[Link],t) }
D  id : T { enter(top(tblptr),[Link],[Link],top(offset));
top(offset)=top(offset)+[Link] }
N € { t=mktable(top(tblptr)); push(t,tblptr); push(0,offset) }
Symbol table
Symbol table
Symbol table
Hash table
 A hash table is an array with an index range: 0 to table size – 1.

 These entries are pointers pointing to the names of the symbol


table.

 To search for a name we use a hash function that will result in an


integer between 0 to table size – 1.

 Insertion and lookup can be made very fast – O(1)

 A hash table uses a hash function to compute an index, also


called a hash code, into an array of buckets or slots, from
which the desired value can be found.

 During lookup, the key is hashed and the resulting hash indicates
where the corresponding value is stored. Hash table. Type.
Unordered associative array.
Hash table
Hash table
 Hashtable stores key/value pair in hash table.

 In Hashtable we specify an object that is used as a key, and the value we


want to associate to that key.

 The key is then hashed, and the resulting hash code is used as the index at
which the value is stored within the table.

 Hash functions (hashing algorithms) used in computer cryptography are known


as "cryptographic hash functions".

 Examples of such functions are SHA-256 and SHA3-256, which transform


arbitrary input to 256-bit output.

 Hash functions are used for data integrity and often in combination with
digital signatures.

 With a good hash function, even a 1-bit change in a message will produce a
different hash (on average, half of the bits change).

 With digital signatures, a message is hashed and then the hash itself is signed.
Hash table
 Hash tables utilize hashing to form a data structure.

 Hash tables use an associative method to store data by using what is known as
a key-value lookup system.

 All that means is that, in a hash table, keys are mapped to unique values.

 This system of organizing data results in a very fast way to find data efficiently.

Hash functions exhibit these three properties:


 They are “collision-free.” This means that no two input hashes should map to the
same output hash.

 They can be hidden. It should be difficult to guess the input value for a hash
function from its output.

 They should be puzzle-friendly.

 Types – refer data structure notes….

You might also like