Questions
a) Describe with an illustration a symbol table in relation to lexical analysis
b) Describe the formation and storage of the symbol table entry by the lexical analyzer
c) Explain the role of the lexical analyser in creating and updating the symbol table
d) Explain the concept of input buffering during the lexical analysis phase of a compiler.
Why is input buffering necessary? In your explanation, discuss the concept of reading
ahead and how it assists token recognition. Use appropriate illustrations, including a
diagram showing the input buffer and the movement of pointers, to support your
explanation.
e) Lexical analysis is carried out using a systematic scanning strategy. Analyse how this
scanning approach enables the lexical analyser to accurately determine the end of a
lexeme. In your discussion, clearly explain the mechanism used to detect lexeme
boundaries, especially in cases where a sequence of characters could match more than
one possible token (e.g., relational operators or keywords versus identifiers). Support
your explanation with a well-chosen example.
f) Compare single and double buffer schemes in lexical analysis and discuss the
advantages and disadvantages of each of these
Solutions
a) Lexical analysis is the process of breaking a program’s source code into small meaningful units
called tokens. Lexical analysis relies on symbol tables to efficiently recognize tokens.
A symbol table is a data structure used to store information about identifiers
Illustration of a symbol table.
Name Token Type Scope
Int keyword Local
X Identifier Global
; Separator global
3.14 Constant local
Reference: Aho et al., Compilers: Principles, Techniques, and Tools, Chapter 2.
b) Formation of a Symbol Table Entry
During lexical analysis, the lexical analyzer scans the source program character by character to
identify tokens.
When it encounters an identifier (for example a variable name), the following steps occur:
I. Token Recognition: The lexical analyzer recognizes a sequence of characters as an
identifier according to language rules.
II. Searching the Symbol Table: The lexical analyzer checks whether the identifier already
exists in the symbol table. If it exists, the analyzer returns a pointer/reference to that
entry and if it does not exist, a new entry is created.
III. Creating the Entry: If the identifier is new, the lexical analyzer creates a symbol table
entry containing important attributes such as: Identifier name, Token type (identifier),
Scope information, etc.
IV. Insertion into the Symbol Table: The newly created entry is inserted into the symbol
table so that other phases of the compiler can access it.
The lexical analyzer then returns a token along with a pointer to the symbol table entry to the
parser.
Symbol table entries are stored using efficient data structures to allow fast searching and
insertion. Most commonly used is the Hash Table
c) The role of a lexical analyzer in creating and updating a symbol table
1. Token Identification: The lexical analyzer reads the source code character by character
and groups them into meaningful sequences called tokens.
2. Insertion of Identifiers: Upon identifying a new identifier, the lexical analyzer inserts it
into the symbol table
3. Avoiding Redundancy: Before inserting an identifier, the lexical analyzer checks whether
it already exists in the symbol table within the current scope. If it does, it avoids
duplication.
4. Supporting Semantic Analysis: By populating the symbol table with accurate
information about identifiers, the lexical analyzer enables later stages of compilation
especially semantic analysis to perform type checking, scope resolution, and error
detection effectively.
5. Efficient Lookup and Update: The lexical analyzer facilitates operations such as insert(),
lookup() and update() on the symbol table, ensuring that identifier information is
accessible and modifiable throughout the compilation process
d) Input Buffering is a technique used to speed up reading characters from the source file.
Instead of fetching one character at a time, the compiler reads a block of characters into
memory (the buffer). The lexical analyzer then processes characters directly from this buffer,
which is much faster.
Why is Input Buffering Necessary?
I. Efficiency: Reading data in blocks reduces the number of disk accesses.
II. Look ahead (reading ahead): Sometimes the scanner needs to peek at upcoming
characters to decide what token is being formed.
Reading Ahead and Token Recognition
Reading ahead means the scanner doesn’t just look at the current character, it may need to
check the next one (or more).
Illustration showing input buffer and movement of pointers
Input Buffer
The buffer holds a block of characters from the source code for example
Int count=1%...
Each character is stored in a cell for easy scanning.
Pointers
lexeme_begin: Points to the start of the current token (I in int).
forward: Moves ahead to find the end of the token (=in count=).
Token Recognition
As forward moves, the characters between lexeme_begin and forward form a lexeme. Once a
valid token is recognized (like int, count, or =), the compiler processes it. Then lexeme_begin is
reset to the next token’s start.
e) The mechanism for detecting lexeme boundaries is primarily driven by a finite state machine
(FSM) or deterministic finite automaton (DFA), which models the recognition of token patterns.
As the scanner reads each character, it transitions between states based on the input.
A lexeme ends when:
The FSM reaches an accepting state (indicating a valid token) and the next character
would cause a transition to an invalid or error state for the current pattern.
No further valid extension is possible, adhering to the "maximal munch" rule (also called
the longest-match principle). This rule prioritizes the longest possible valid lexeme
starting from the current position to avoid prematurely truncating tokens.
Explicit delimiters like spaces, punctuation, or newlines intervene, which are often
ignored (as whitespace) or treated as separate tokens.
In cases of ambiguity (where a character sequence could match multiple possible
tokens) the lexer resolves this using the longest-match rule combined with rule
priorities.
Lookahead (peeking at one or more subsequent characters without consuming them) is often
employed to disambiguate without backtracking, ensuring efficiency.
This approach prevents errors like splitting a compound operator into separate tokens or
misclassifying a keyword as an identifier. Keywords are typically handled by first matching the
pattern for identifiers (e.g., letter followed by letters/digits), then checking the resulting string
against a keyword table. If it matches a reserved keyword, it's tokenized as such; otherwise, it's
an identifier.
Example: Disambiguating Relational Operators
Consider a programming language like C or Java, where relational operators include >, >=, ==,
and =. Suppose the input stream is x >= y == z;. The lexer starts at 'x' (matches identifier
pattern), ends at space (boundary), then skips whitespace.
At '>', the FSM enters a state for potential '>' or '>=;
It looks ahead to the next character '='. Since '>=' is a longer valid match (per maximal
munch), it consumes both, tokenizing >= as a single relational operator token rather than
separate '>' and '='.
After skipping whitespace, at 'y' (identifier), ends at space.
At '=', the FSM checks ahead: next is another '=', so it matches the longer == (equality
operator) instead of single '=' (assignment).
If the input were x > y = z;, the '>' would be tokenized alone (no following '='), and '='
would be assignment (no following '=').
Without longest-match and lookahead, the lexer might incorrectly split >= into '>' and '=',
leading to syntax errors later. This scanning strategy ensures accurate boundaries by greedily
extending matches while respecting language rules, making the process deterministic and
efficient.
f) Comparison of Single Buffer and Double Buffer Schemes in Lexical Analysis
1. Single Buffer Scheme: The source program is read into one memory buffer. The lexical
analyzer scans characters from this buffer to form tokens. When the buffer becomes empty, it is
refilled with the next block of characters from the input file.
Advantages:
- Simple to implement.
- Requires less memory since only one buffer is used.
Disadvantages:
- When the buffer is exhausted, the lexical analyzer must pause while the buffer is refilled.
- It may be difficult to handle lexemes that span across the boundary of two input blocks.
- Less efficient because frequent I/O operations may interrupt token recognition.
2. Double Buffer Scheme: Two buffers are used instead of one. Each buffer holds a block of
input characters. While the lexical analyzer scans characters from one buffer, the other buffer
can be refilled with new input. Special markers called sentinels are often placed at the end of
each buffer to signal when the analyzer should switch to the other buffer.
Advantages:
- More efficient because the next buffer can be loaded while scanning continues.
- Reduces the number of input/output interruptions.
- Makes it easier to handle lexemes that cross buffer boundaries.
- Supports faster lexical analysis.
Disadvantages:
- Slightly more complex to implement.
- Requires more memory because two buffers are maintained.