0% found this document useful (0 votes)
4 views2 pages

302

LEX is a lexical analyzer generator that produces a C program to identify tokens from patterns, while YACC is a parser generator that creates syntax analyzers from context-free grammar. Both tools work together, with LEX handling tokenization and YACC managing parsing. Additionally, the document discusses Syntax-Directed Definitions (SDD), SLR parsers, Directed Acyclic Graphs (DAG) for optimization, and Three-Address Code (TAC) in compiler design.

Uploaded by

MrScotch
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views2 pages

302

LEX is a lexical analyzer generator that produces a C program to identify tokens from patterns, while YACC is a parser generator that creates syntax analyzers from context-free grammar. Both tools work together, with LEX handling tokenization and YACC managing parsing. Additionally, the document discusses Syntax-Directed Definitions (SDD), SLR parsers, Directed Acyclic Graphs (DAG) for optimization, and Three-Address Code (TAC) in compiler design.

Uploaded by

MrScotch
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

What is LEX? What is YACC?

What is YACC? YACC (Yet Another Compiler Compiler) is a parser generator tool used to automatically create syntax analyzers (parsers).
LEX (also called Lexical Analyzer Generator) is a tool used to automatically generate a lexical analyzer (tokenizer/scanner). It takes a set of It takes a context-free grammar (CFG) as input and produces a C program that parses input based on that grammar. YACC usually works
patterns (usually regular expressions) and produces a C program that can identify tokens such as keywords, identifiers, numbers, operators, together with LEX: LEX → generates the lexical analyzer (tokenizer). YACC → generates the syntax analyzer.
etc. Output: YACC produces a file [Link].c containing the parsing code. YACC File (expr.y) — Parser + evaluator
LEX converts: Patterns → Token recognizing code Structure of a YACC Program %{
Its output is usually compiled with yacc (for parsing) or used alone. A YACC program has three sections, just like LEX: #include <stdio.h>
Structure of a LEX Program %{ #include <stdlib.h>
A LEX program has three sections: How to Run This LEX Program Declarations (C code) int yylex();
%{ 1. Save the file %} void yyerror(char *s);
/* Declarations (C code) */ tokenize.l %% %}
%} 2. Run lex Grammar rules and actions %token NUMBER
%% lex tokenize.l %% %%
PATTERN ACTION 3. Compile Additional C code (main function, etc.) expr:
PATTERN ACTION Simple Example YACC Program expr '+' expr { printf("%d\n", $1 + $3); }
gcc [Link].c -o tokenize
%% Task: Write a YACC program to evaluate arithmetic | expr '-' expr { printf("%d\n", $1 - $3); }
4. Run
/* User-defined helper functions */ expressions like: | expr '*' expr { printf("%d\n", $1 * $3); }
./tokenize < sample.c
LEX Program to Tokenize a Simple C Program 3+5*2 | expr '/' expr { printf("%d\n", $1 / $3); }
▶ Example Input (sample.c)
This program tokenizes: This example pairs with a simple LEX scanner. | '(' expr ')' { $$ = $2; }
int x = 10;
Keywords Identifiers Numbers LEX File (expr.l) — Tokenizer for numbers and | NUMBER { $$ = $1; }
if (x > 5) {
Operators Braces, brackets, semicolons operators ;
x = x + 1;
Whitespace & comments %{ %%
}
Complete LEX Program #include "[Link].h" void yyerror(char *s) {
%{ %} printf("Error: %s\n", s);
▶ Example Output
#include <stdio.h> %% }
#include <string.h> [0-9]+ { yylval = atoi(yytext); return NUMBER; } int main() {
KEYWORD: int
%} [+\-*/()] { return yytext[0]; } printf("Enter expression: ");
IDENTIFIER: x
/* Definitions */ [ \t\n] {;} yyparse();
OPERATOR: =
KEYWORD int|float|char|if|else|for|while|return . {;} return 0;
NUMBER: 10
IDENT [a-zA-Z_][a-zA-Z0-9_]* %% }
SEMICOLON: ;
NUMBER [0-9]+ KEYWORD: if
OP \+|\-|\*|\/|\=|==|!=|<=|>=|<|> LPAREN: (
WS [ \t\n]+ IDENTIFIER: x
%% OPERATOR: >
{KEYWORD} { printf("KEYWORD: %s\n", yytext); } NUMBER: 5
{IDENT} { printf("IDENTIFIER: %s\n", yytext); } RPAREN: )
{NUMBER} { printf("NUMBER: %s\n", yytext); } LBRACE: {
"(" { printf("LPAREN: (\n"); } IDENTIFIER: x What is a Syntax-Directed Definition (SDD)?
")" { printf("RPAREN: )\n"); } OPERATOR: = A Syntax-Directed Definition (SDD) is a formal way of specifying the semantic rules associated with a grammar. It combines: 1.
"{" { printf("LBRACE: {\n"); } IDENTIFIER: x Context-Free Grammar (CFG) 2. Attributes attached to grammar symbols 3. Semantic rules that compute attribute values
"}" { printf("RBRACE: }\n"); } OPERATOR: + SDD is used in compilers for: Type checking
";" { printf("SEMICOLON: ;\n"); } NUMBER: 1 Intermediate code generation Evaluation of expressions Handling declarations
{OP} { printf("OPERATOR: %s\n", yytext); } SEMICOLON: ; Example of an attribute: [Link] = [Link] + [Link]
{WS} { /* ignore spaces */ } RBRACE: } Types of Attributes
. { printf("UNKNOWN: %s\n", yytext); } Type Meaning
%% Synthesized attribute Computed from children → parent
int main() { Inherited attribute Passed from parent / siblings → child
yylex(); 1. S-Attributed Definition An SDD is S-attributed if all attributes are synthesized attributes only.
return 0; Means: Attribute values always flow bottom-up (from children to parent).
} Suitable for bottom-up parsing (LR).
Example: Evaluate Arithmetic Expression
Grammar: E→E+T E→T T → digit
SDD: E → E1 + T { [Link] = [Link] + [Link] } E→T { [Link] = [Link] }
T → digit { [Link] = [Link] }
All attributes (value) are synthesized → S-attributed.
Example Evaluation Input: 3 + 5 [Link] = 3 [Link] = 5 [Link] = 3 + 5 = 8
2. L-Attributed Definition An SDD is L-attributed if inherited attributes are allowed, but with restrictions:
For a production A → X1 X2 … Xn
An attribute of Xi may depend on: 1. Attributes of A 2. Attributes of X1 … X(i−1) (left siblings)
3. Synthesized attributes of Xi This allows left-to-right attribute evaluation. Suitable for top-down parsing (LL). Example: Type
Inheritance in Declarations
Grammar: D → T L T → int T → float L → L , id L → id
Attributes:
[Link] — synthesized [Link] — inherited [Link] — synthesized for the list of identifiers
SDD:
D→TL { [Link] = [Link] } T → int { [Link] = "int" }
T → float { [Link] = "float" } L → L1 , id { [Link] = [Link]; print([Link], [Link]); }
L → id { print([Link], [Link]); }
Here: [Link] is inherited from its parent D [Link] is inherited from left sibling L
Thus this SDD is L-attributed.
Example Output
Input: int a, b, c
Output: a : int b : int c : int

What is an SLR parser?


SLR (Simple LR) is an LR parsing method that builds a deterministic bottom-up parser using LR(0) items plus FOLLOW sets to decide reduce
actions. What is a DAG (in compiler optimization)?
Key idea: construct the canonical collection of LR(0) item sets (states), then build an ACTION/GOTO table. When a state contains a A DAG (Directed Acyclic Graph) for an expression is a graph representation where:
completed item A → α ., SLR places a reduce A→α action on every terminal in FOLLOW(A). Shift actions come from transitions on terminals. Leaves are operands (variables or constants).
The parser is SLR(1) if this table has no conflicts (no shift/reduce or reduce/reduce conflicts). Interior nodes are operators (like +, -, *, /).
Steps for designing an SLR parser Edges point from operand nodes to operator nodes (i.e., data flows into operators).
1. Augment the grammar: add S' → S (S' is new start symbol). The graph is acyclic and merges identical subexpressions into a single node — so it makes common subexpressions explicit and avoids
2. Form LR(0) items: items are productions with a dot · indicating parser position, e.g. A → α · β. recomputation.
3. Compute closure(I) for an item set I: whenever A → α · B β is in I and B is nonterminal, add B → · γ for all B → γ. Repeat until no new DAGs are used for expression optimization (eliminate common subexpressions, reduce temporaries) and for generating optimized three-
items. address code.
4. Compute goto(I, X): for each grammar symbol X, move the dot over X in all items in I where it appears, then take closure of the result. Build the DAG for a b c d e
5. Construct the canonical collection of item sets (states) by starting with closure of S' → · S and repeatedly computing goto for all symbols. Y = \frac{(a*b + c - (d+e))}{(a + b - (d+e))} \ / | \ /
6. Build ACTION / GOTO table: Stepwise decomposition [*] | [+] <-- n2 = d+e

If goto(I, a) = J for terminal a, set ACTION[I, a] = shift J. n1 = a * b | | \


| | \
If A → α · (completed) is in state I and A ≠ S', then for every terminal a in FOLLOW(A), set ACTION[I, a] = reduce A → α. n2 = d + e ← common subexpression
[ + ] <------/ [-] <-- den = (a+b)
If S' → S · is in I, set ACTION[I, $] = accept. (used both numerator and denominator)
- (d+e)
If goto(I, A) = J for nonterminal A, set GOTO[I, A] = J. n3 = n1 + c (n3) (from c) ^
7. Detect conflicts: if any ACTION cell requires two different actions (shift & reduce, or reduce & reduce), the grammar is not SLR(1). num = n3 - n2 (numerator) | |
8. If no conflicts, table yields an SLR(1) parser. n4 = a + b | |
Canonical LR(0) states (summary) den = n4 - n2 (denominator) [-] /

I built the canonical collection of LR(0) item sets (states). There are 9 states (numbered 0..8). I omit the full item lists here for brevity and Y = num / den (num) /
\ /
show the transitions used to form the parsing table. ASCII DAG (operands → operator)
\ /
Transitions used (shift / goto): To make the structure clearer,
\ /
here's a more explicit labelled DAG: [ / ] <-- root: Y = num / den
Leaves: a b c d e |
Nodes: Y
n1 = (*) [inputs: a, b] (a*b)
n2 = (+) [inputs: d, e] (d+e) <-- shared
n3 = (+) [inputs: n1, c] (a*b + c)
num = (-) [inputs: n3, n2] ((a*b + c) - (d+e))
n4 = (+) [inputs: a, b] (a + b)
den = (-) [inputs: n4, n2] ((a + b) - (d+e))
root = (/) [inputs: num, den] Y
Note: n4 = (+) [a,b] is distinct from n1 — n1 is a*b, n4 is a+b.
Optimized three-address code produced from this DAG
(only compute d+e once)
t1 = a * b
t2 = d + e # computed once, reused
t3 = t1 + c
t4 = t3 - t2 # numerator
t5 = a + b
t6 = t5 - t2 # denominator
Y = t4 / t6
If you did not use a DAG (no CSE), d+e might be computed twice; DAG avoids that.
What is TAC (Three-Address Code)? Define LALR Parser LALR (Look-Ahead LR) parser stands for: Look-Ahead LR Parser
Three-Address Code (TAC) is an intermediate representation (IR) used by compilers in the intermediate code generation phase. Each TAC It is a bottom-up parser used in compiler design. LALR is obtained by merging states of the canonical LR(1) parser that have identical
instruction has at most three addresses (operands): two for source operands and one for result. LR(0) cores, and uniting their lookaheads.
General form: x = y op z Why LALR is used? Has almost the same power as LR(1). Uses much fewer states, similar to SLR. Most programming languages are
Where: x → result y, z → operands op → operator parsed using LALR (e.g., YACC, Bison).
TAC is close to machine instructions but still independent of actual CPU architecture, making optimization easier. LALR advantages More powerful than SLR Smaller parsing table than canonical LR(1)
Features of TAC Only one operator per statement Uses temporary variables (t1, t2, …) Detects many conflicts not detected by SLR
Easier for applying optimizations (constant folding, dead code elimination, etc.) Steps of Designing an SLR Parser
Machine-independent representation To design an SLR(1) parser for a grammar, follow these steps:
Example: a = b * -c + d Step 1: Augment the Grammar
TAC form: t1 = -c t2 = b * t1 t3 = t2 + d a = t3 If grammar is: S→A A → aA | b
Different Types of TAC There are four standard formats of three-address code: Augmented grammar: S' → S S→A A → aA | b
[Link] Instructions The new start symbol is S'.
General form: x = y op z x = op y (unary) x=y (copy) Step 2: Construct LR(0) Items Each production gets a dot (•) marking the parser’s position.
Examples: t1 = a + b t2 = -t1 x = t2 Example LR(0) items: S' → •S S → •A A → •aA A → •b
[Link] Jumps Used for decision making / if-else. Step 3: Compute Canonical Collection of LR(0) Item Sets
General form: if x relop y goto L Repeatedly apply: Closure(I) Add productions whose LHS appears immediately after a dot.
Examples: if t1 > t2 goto L1 if x == 0 goto L3 Goto(I, X) Move the dot over symbol X.
[Link] Jumps Simple jumps to another label. This produces states: I0, I1, I2, ..., In
Format: goto L Step 4: Construct SLR Parsing Table
Example: goto L2 Two tables are built: ACTION table For terminals: shift, reduce, accept, error.
[Link] Flow Instructions These include labels and procedure calls. GOTO table For non-terminals: transitions between states.
(a) Label Instructions L1: Rules:
(b) Function Call / Return param x call p, n return x ACTION table If item is [A → α • a β], with ‘a’ terminal → shift. If item is [A → α•], with ‘A ≠ S'` → reduce A→α for all
5. Array Access Instructions Two types: load and store. FOLLOW(A). If item is [S' → S•] → accept.
x = a[i] // load a[i] = x // store GOTO table From state i on non-terminal A, goto state j → GOTO[i, A] = j.
[Link] Instructions Used for referencing and dereferencing. Constructing an LALR Parsing Table (General Method)
x = &y x = *y *y = x LALR tables are built from LR(1) items, but many states have identical LR(0) cores.
[Link] Assignment x = y[i] x[i] = y LALR reduces table size by merging such states.
Step 1: Build Canonical LR(1) Item Sets Each LR(1) item has a lookahead:
Example: A → a•B , {$}
Compute Closure() and Goto() as in LR(1).
Step 2: Find LR(0) Cores of LR(1) States Two states may look like:
I5: A → B• , {a} C → D• , {b}
I8: A → B• , {c} C → D• , {d}
LR(0) cores are: A → B• C → D•
Phases of a Compiler A compiler typically has two major parts: Identical → Merge them.
1. Front End – Analysis (checks the code) Step 3: Merge LR(1) States with Same LR(0) Core Merged state lookaheads become the union.
2. Back End – Synthesis (generates machine code) After merging, you obtain LALR item sets: J0, J1, ..., Jm
1. Lexical Analysis (Scanning) Step 4: Build ACTION & GOTO Tables Same as LR(1), but based on merged LALR states.
Breaks the source code into tokens. Tokens are smallest meaningful units: keywords, identifiers, numbers, operators, etc. Uses a Because lookaheads were merged, LALR resolves many SLR conflicts:
symbol table to store identifiers. shift/reduce
Example: reduce/reduce
int a = 10; → tokens: int, a, =, 10, ; Example (Small Grammar) Let’s take the grammar:
2. Syntax Analysis (Parsing) S→L=R S→R L → *R L → id R→L
Checks if tokens follow the grammar rules of the language. Produces a Parse Tree / Syntax Tree. This is the standard LALR textbook grammar.
Detects syntax errors. If you want, I can construct: Canonical LR(1) states Merging into LALR states
3. Semantic Analysis Complete parsing table
Checks meaning of the program. But it is long—so I will generate it only if you confirm.
Validates: Type checking Variable declaration checking Function argument matching
Produces an annotated syntax tree.
4. Intermediate Code Generation (ICG)
Converts the syntax tree into Intermediate Representation (IR). IR is easier to optimize and portable across machines. Example IR: t1
=a+b
5. Code Optimization
Improves IR to make the program faster and smaller. Removes redundancy, improves loops, etc.
6. Target Code Generation
Converts optimized IR into machine code / assembly code. Allocates registers, selects instructions.
7. Symbol Table & Error Handler (Auxiliary Parts)
Symbol Table stores variable names, functions, types. Error Handler reports and recovers from errors in all phases.
Neat & Labelled Diagram of Compiler Phases +----------------------+
+---------------------------+ | Intermediate Code |
| Source Program | | Generator |
+-------------+-------------+ +----------+-----------+
+----------------------+ +----------------------+
| Lexical Analyzer | | Code Optimizer |
| (Token Generation) | +----------+-----------+
+----------+-----------+ +----------------------+
+----------------------+ | Target Code Generator|
| Syntax Analyzer | | (Machine Code) |
| (Parsing) | +----------+-----------+
+----------+-----------+ +---------------------------+
+----------------------+ | Target Program |
| Semantic Analyzer | +---------------------------+
| (Type/Meaning Check) |
+----------+-----------+ *Symbol Table & Error Handler work throughout all phases*

From state 0: on 0 → state 3, on 1 → state 2, on S → state 1


Top-Down Parsing From state 2: on 0 → state 5, on 1 → state 2, on S → state 6
Top-down parsing starts from the start symbol of the grammar and tries to derive the input string by expanding productions from left to From state 3: on 0 → state 3, on 1 → state 2, on S → state 6
right. From state 4: on 1 → state 7
It predicts what the input should be and proceeds from the root of the parse tree downwards. From state 5: on 0 → state 3, on 1 → state 2, on S → state 6
How it works? Begins with Start symbol (S) Applies grammar rules to expand nonterminals From state 6: on 0 → state 8
Continues until the input string is produced Builds the parse tree from top (root) to leaves (Other states have no outgoing transitions on grammar symbols.)
Popular Top-Down Parsers Completed items (where a reduce or accept would be placed):
1. Recursive Descent Parser State 1 contains S' → S . → ACCEPT on $.
2. LL(1) Parser / Predictive Parser State 5 contains S → 1 0 . → reduce by production 3 (S → 1 0).
Characteristics Easy to implement No need to look at entire input Cannot handle left recursion Works only for a State 7 contains S → 1 S 1 . → reduce by production 2 (S → 1 S 1).
subset of grammars (LL grammars) State 8 contains S → 0 S 0 . → reduce by production 1 (S → 0 S 0).
Example We need FOLLOW(S) to place reduce actions.
Grammar: E→T+E|T T → id Compute FOLLOW(S):
If input is: id + id From S' → S we have $ ∈ FOLLOW(S).
The parser expands: E ⇒ T + E ⇒ id + E ⇒ id + T ⇒ id + id From S → 0 S 0, S is followed by 0 → 0 ∈ FOLLOW(S).
Bottom-Up Parsing From S → 1 S 1, S is followed by 1 → 1 ∈ FOLLOW(S). So FOLLOW(S) = { 0, 1, $ }.
Bottom-up parsing starts from the input string and attempts to reduce it to the start symbol by applying grammar productions in reverse. SLR parsing table (ACTION / GOTO)
It builds the parse tree from leaves to the root. Columns: terminals 0, 1, $ (ACTION); nonterminal S (GOTO).
How it works? Begins with the input tokens Combines symbols to form nonterminals using reductions Continues until the start Entries: sN = shift to state N; rk = reduce by production k; acc = accept; blank = error.
symbol S is produced State | ACTION | GOTO
Popular Bottom-Up Parsers |0 1 $ | S
1. Shift–Reduce Parser ----------------------------------------
2. LR Parsers 0 | s3 s2 () | 1
SLR(1) LALR(1) Canonical LR(1) 1 |() () acc | ()
Characteristics More powerful (handles large class of grammars) Can handle left recursion 2 | s5 s2 () | 6
More complex to build Used in real compilers (YACC, Bison) 3 | s3 s2 () | 6
Example 4 |() s7 () | ()
Input: id + id 5 | r3 r3 r3 | 6
Parser performs actions: shift id reduce id → T reduce T → E shift + 6 | s8 () () | ()
shift id reduce id → T reduce E + T → E 7 | r2 r2 r2 | ()
Finally reduces to start symbol: E. 8 | r1 r1 r1 | ()
Difference Between Top-Down and Bottom-Up Parsing Interpretation of the reduce entries:
r1 = reduce by production 1 (S → 0 S 0)
Feature Top-Down Parsing Bottom-Up Parsing r2 = reduce by production 2 (S → 1 S 1)
r3 = reduce by production 3 (S → 1 0)
Direction Builds tree root → leaves Builds tree leaves → root
Start From Start symbol Input string
Technique Prediction Reduction
Handles Left Recursion ❌ No ✔ Yes
Power Less powerful More powerful
Example Parsers LL(1), predictive LR, SLR, LALR
Implementation Simpler More complex

You might also like