302
302
What is YACC? YACC (Yet Another Compiler Compiler) is a parser generator tool used to automatically create syntax analyzers (parsers).
LEX (also called Lexical Analyzer Generator) is a tool used to automatically generate a lexical analyzer (tokenizer/scanner). It takes a set of It takes a context-free grammar (CFG) as input and produces a C program that parses input based on that grammar. YACC usually works
patterns (usually regular expressions) and produces a C program that can identify tokens such as keywords, identifiers, numbers, operators, together with LEX: LEX → generates the lexical analyzer (tokenizer). YACC → generates the syntax analyzer.
etc. Output: YACC produces a file [Link].c containing the parsing code. YACC File (expr.y) — Parser + evaluator
LEX converts: Patterns → Token recognizing code Structure of a YACC Program %{
Its output is usually compiled with yacc (for parsing) or used alone. A YACC program has three sections, just like LEX: #include <stdio.h>
Structure of a LEX Program %{ #include <stdlib.h>
A LEX program has three sections: How to Run This LEX Program Declarations (C code) int yylex();
%{ 1. Save the file %} void yyerror(char *s);
/* Declarations (C code) */ tokenize.l %% %}
%} 2. Run lex Grammar rules and actions %token NUMBER
%% lex tokenize.l %% %%
PATTERN ACTION 3. Compile Additional C code (main function, etc.) expr:
PATTERN ACTION Simple Example YACC Program expr '+' expr { printf("%d\n", $1 + $3); }
gcc [Link].c -o tokenize
%% Task: Write a YACC program to evaluate arithmetic | expr '-' expr { printf("%d\n", $1 - $3); }
4. Run
/* User-defined helper functions */ expressions like: | expr '*' expr { printf("%d\n", $1 * $3); }
./tokenize < sample.c
LEX Program to Tokenize a Simple C Program 3+5*2 | expr '/' expr { printf("%d\n", $1 / $3); }
▶ Example Input (sample.c)
This program tokenizes: This example pairs with a simple LEX scanner. | '(' expr ')' { $$ = $2; }
int x = 10;
Keywords Identifiers Numbers LEX File (expr.l) — Tokenizer for numbers and | NUMBER { $$ = $1; }
if (x > 5) {
Operators Braces, brackets, semicolons operators ;
x = x + 1;
Whitespace & comments %{ %%
}
Complete LEX Program #include "[Link].h" void yyerror(char *s) {
%{ %} printf("Error: %s\n", s);
▶ Example Output
#include <stdio.h> %% }
#include <string.h> [0-9]+ { yylval = atoi(yytext); return NUMBER; } int main() {
KEYWORD: int
%} [+\-*/()] { return yytext[0]; } printf("Enter expression: ");
IDENTIFIER: x
/* Definitions */ [ \t\n] {;} yyparse();
OPERATOR: =
KEYWORD int|float|char|if|else|for|while|return . {;} return 0;
NUMBER: 10
IDENT [a-zA-Z_][a-zA-Z0-9_]* %% }
SEMICOLON: ;
NUMBER [0-9]+ KEYWORD: if
OP \+|\-|\*|\/|\=|==|!=|<=|>=|<|> LPAREN: (
WS [ \t\n]+ IDENTIFIER: x
%% OPERATOR: >
{KEYWORD} { printf("KEYWORD: %s\n", yytext); } NUMBER: 5
{IDENT} { printf("IDENTIFIER: %s\n", yytext); } RPAREN: )
{NUMBER} { printf("NUMBER: %s\n", yytext); } LBRACE: {
"(" { printf("LPAREN: (\n"); } IDENTIFIER: x What is a Syntax-Directed Definition (SDD)?
")" { printf("RPAREN: )\n"); } OPERATOR: = A Syntax-Directed Definition (SDD) is a formal way of specifying the semantic rules associated with a grammar. It combines: 1.
"{" { printf("LBRACE: {\n"); } IDENTIFIER: x Context-Free Grammar (CFG) 2. Attributes attached to grammar symbols 3. Semantic rules that compute attribute values
"}" { printf("RBRACE: }\n"); } OPERATOR: + SDD is used in compilers for: Type checking
";" { printf("SEMICOLON: ;\n"); } NUMBER: 1 Intermediate code generation Evaluation of expressions Handling declarations
{OP} { printf("OPERATOR: %s\n", yytext); } SEMICOLON: ; Example of an attribute: [Link] = [Link] + [Link]
{WS} { /* ignore spaces */ } RBRACE: } Types of Attributes
. { printf("UNKNOWN: %s\n", yytext); } Type Meaning
%% Synthesized attribute Computed from children → parent
int main() { Inherited attribute Passed from parent / siblings → child
yylex(); 1. S-Attributed Definition An SDD is S-attributed if all attributes are synthesized attributes only.
return 0; Means: Attribute values always flow bottom-up (from children to parent).
} Suitable for bottom-up parsing (LR).
Example: Evaluate Arithmetic Expression
Grammar: E→E+T E→T T → digit
SDD: E → E1 + T { [Link] = [Link] + [Link] } E→T { [Link] = [Link] }
T → digit { [Link] = [Link] }
All attributes (value) are synthesized → S-attributed.
Example Evaluation Input: 3 + 5 [Link] = 3 [Link] = 5 [Link] = 3 + 5 = 8
2. L-Attributed Definition An SDD is L-attributed if inherited attributes are allowed, but with restrictions:
For a production A → X1 X2 … Xn
An attribute of Xi may depend on: 1. Attributes of A 2. Attributes of X1 … X(i−1) (left siblings)
3. Synthesized attributes of Xi This allows left-to-right attribute evaluation. Suitable for top-down parsing (LL). Example: Type
Inheritance in Declarations
Grammar: D → T L T → int T → float L → L , id L → id
Attributes:
[Link] — synthesized [Link] — inherited [Link] — synthesized for the list of identifiers
SDD:
D→TL { [Link] = [Link] } T → int { [Link] = "int" }
T → float { [Link] = "float" } L → L1 , id { [Link] = [Link]; print([Link], [Link]); }
L → id { print([Link], [Link]); }
Here: [Link] is inherited from its parent D [Link] is inherited from left sibling L
Thus this SDD is L-attributed.
Example Output
Input: int a, b, c
Output: a : int b : int c : int
I built the canonical collection of LR(0) item sets (states). There are 9 states (numbered 0..8). I omit the full item lists here for brevity and Y = num / den (num) /
\ /
show the transitions used to form the parsing table. ASCII DAG (operands → operator)
\ /
Transitions used (shift / goto): To make the structure clearer,
\ /
here's a more explicit labelled DAG: [ / ] <-- root: Y = num / den
Leaves: a b c d e |
Nodes: Y
n1 = (*) [inputs: a, b] (a*b)
n2 = (+) [inputs: d, e] (d+e) <-- shared
n3 = (+) [inputs: n1, c] (a*b + c)
num = (-) [inputs: n3, n2] ((a*b + c) - (d+e))
n4 = (+) [inputs: a, b] (a + b)
den = (-) [inputs: n4, n2] ((a + b) - (d+e))
root = (/) [inputs: num, den] Y
Note: n4 = (+) [a,b] is distinct from n1 — n1 is a*b, n4 is a+b.
Optimized three-address code produced from this DAG
(only compute d+e once)
t1 = a * b
t2 = d + e # computed once, reused
t3 = t1 + c
t4 = t3 - t2 # numerator
t5 = a + b
t6 = t5 - t2 # denominator
Y = t4 / t6
If you did not use a DAG (no CSE), d+e might be computed twice; DAG avoids that.
What is TAC (Three-Address Code)? Define LALR Parser LALR (Look-Ahead LR) parser stands for: Look-Ahead LR Parser
Three-Address Code (TAC) is an intermediate representation (IR) used by compilers in the intermediate code generation phase. Each TAC It is a bottom-up parser used in compiler design. LALR is obtained by merging states of the canonical LR(1) parser that have identical
instruction has at most three addresses (operands): two for source operands and one for result. LR(0) cores, and uniting their lookaheads.
General form: x = y op z Why LALR is used? Has almost the same power as LR(1). Uses much fewer states, similar to SLR. Most programming languages are
Where: x → result y, z → operands op → operator parsed using LALR (e.g., YACC, Bison).
TAC is close to machine instructions but still independent of actual CPU architecture, making optimization easier. LALR advantages More powerful than SLR Smaller parsing table than canonical LR(1)
Features of TAC Only one operator per statement Uses temporary variables (t1, t2, …) Detects many conflicts not detected by SLR
Easier for applying optimizations (constant folding, dead code elimination, etc.) Steps of Designing an SLR Parser
Machine-independent representation To design an SLR(1) parser for a grammar, follow these steps:
Example: a = b * -c + d Step 1: Augment the Grammar
TAC form: t1 = -c t2 = b * t1 t3 = t2 + d a = t3 If grammar is: S→A A → aA | b
Different Types of TAC There are four standard formats of three-address code: Augmented grammar: S' → S S→A A → aA | b
[Link] Instructions The new start symbol is S'.
General form: x = y op z x = op y (unary) x=y (copy) Step 2: Construct LR(0) Items Each production gets a dot (•) marking the parser’s position.
Examples: t1 = a + b t2 = -t1 x = t2 Example LR(0) items: S' → •S S → •A A → •aA A → •b
[Link] Jumps Used for decision making / if-else. Step 3: Compute Canonical Collection of LR(0) Item Sets
General form: if x relop y goto L Repeatedly apply: Closure(I) Add productions whose LHS appears immediately after a dot.
Examples: if t1 > t2 goto L1 if x == 0 goto L3 Goto(I, X) Move the dot over symbol X.
[Link] Jumps Simple jumps to another label. This produces states: I0, I1, I2, ..., In
Format: goto L Step 4: Construct SLR Parsing Table
Example: goto L2 Two tables are built: ACTION table For terminals: shift, reduce, accept, error.
[Link] Flow Instructions These include labels and procedure calls. GOTO table For non-terminals: transitions between states.
(a) Label Instructions L1: Rules:
(b) Function Call / Return param x call p, n return x ACTION table If item is [A → α • a β], with ‘a’ terminal → shift. If item is [A → α•], with ‘A ≠ S'` → reduce A→α for all
5. Array Access Instructions Two types: load and store. FOLLOW(A). If item is [S' → S•] → accept.
x = a[i] // load a[i] = x // store GOTO table From state i on non-terminal A, goto state j → GOTO[i, A] = j.
[Link] Instructions Used for referencing and dereferencing. Constructing an LALR Parsing Table (General Method)
x = &y x = *y *y = x LALR tables are built from LR(1) items, but many states have identical LR(0) cores.
[Link] Assignment x = y[i] x[i] = y LALR reduces table size by merging such states.
Step 1: Build Canonical LR(1) Item Sets Each LR(1) item has a lookahead:
Example: A → a•B , {$}
Compute Closure() and Goto() as in LR(1).
Step 2: Find LR(0) Cores of LR(1) States Two states may look like:
I5: A → B• , {a} C → D• , {b}
I8: A → B• , {c} C → D• , {d}
LR(0) cores are: A → B• C → D•
Phases of a Compiler A compiler typically has two major parts: Identical → Merge them.
1. Front End – Analysis (checks the code) Step 3: Merge LR(1) States with Same LR(0) Core Merged state lookaheads become the union.
2. Back End – Synthesis (generates machine code) After merging, you obtain LALR item sets: J0, J1, ..., Jm
1. Lexical Analysis (Scanning) Step 4: Build ACTION & GOTO Tables Same as LR(1), but based on merged LALR states.
Breaks the source code into tokens. Tokens are smallest meaningful units: keywords, identifiers, numbers, operators, etc. Uses a Because lookaheads were merged, LALR resolves many SLR conflicts:
symbol table to store identifiers. shift/reduce
Example: reduce/reduce
int a = 10; → tokens: int, a, =, 10, ; Example (Small Grammar) Let’s take the grammar:
2. Syntax Analysis (Parsing) S→L=R S→R L → *R L → id R→L
Checks if tokens follow the grammar rules of the language. Produces a Parse Tree / Syntax Tree. This is the standard LALR textbook grammar.
Detects syntax errors. If you want, I can construct: Canonical LR(1) states Merging into LALR states
3. Semantic Analysis Complete parsing table
Checks meaning of the program. But it is long—so I will generate it only if you confirm.
Validates: Type checking Variable declaration checking Function argument matching
Produces an annotated syntax tree.
4. Intermediate Code Generation (ICG)
Converts the syntax tree into Intermediate Representation (IR). IR is easier to optimize and portable across machines. Example IR: t1
=a+b
5. Code Optimization
Improves IR to make the program faster and smaller. Removes redundancy, improves loops, etc.
6. Target Code Generation
Converts optimized IR into machine code / assembly code. Allocates registers, selects instructions.
7. Symbol Table & Error Handler (Auxiliary Parts)
Symbol Table stores variable names, functions, types. Error Handler reports and recovers from errors in all phases.
Neat & Labelled Diagram of Compiler Phases +----------------------+
+---------------------------+ | Intermediate Code |
| Source Program | | Generator |
+-------------+-------------+ +----------+-----------+
+----------------------+ +----------------------+
| Lexical Analyzer | | Code Optimizer |
| (Token Generation) | +----------+-----------+
+----------+-----------+ +----------------------+
+----------------------+ | Target Code Generator|
| Syntax Analyzer | | (Machine Code) |
| (Parsing) | +----------+-----------+
+----------+-----------+ +---------------------------+
+----------------------+ | Target Program |
| Semantic Analyzer | +---------------------------+
| (Type/Meaning Check) |
+----------+-----------+ *Symbol Table & Error Handler work throughout all phases*