MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
MODULE -2
SYLLABUS:
Architectures for Programmable Digital Signal – Processing Devices: Introduction, Basic Architectural
Features, DSP Computational Building Blocks, Bus Architecture and Memory, Data Addressing
Capabilities, Address Generation Unit, Programmability and Program Execution, Features for External
Interfacing
TEXT, REFERENCE & ADDITIONAL REFERENCE BOOKS:
BOOK TITLE/AUTHORS/PUBLICATION /Web links/Channel
T-1 “Digital Signal Processing”, Avatar Singh and S. Srinivasan, Thomson Learning,
2004
“Digital Signal Processing: A practical approach”, Ifeachor E. C., Jervis B.
AR-1
W Pearson-Education, PHI/ 2002
AR-2 “Digital Signal Processors”, B Venkataramani and M Bhaskar TMH, 2002
AR-3 “Architectures for Digital Signal Processing”, Peter Pirsch John Weily, 2007
Architectures for Programmable Digital Signal Processing Devices
2.1 Basic Architectural Features
A programmable DSP device should provide instructions similar to a conventional microprocessor
The instruction set of a typical DSP device should include the following,
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc
b. Logical operations such as AND, OR, NOT, XOR etc
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include,
a. On chip registers to store immediate results
b. On chip memories to store signal samples (RAM)
c. On chip memories to store filter coefficients (ROM)
Problem: Investigate the basic features that should be provided in the DSP architecture to be used to
implement the following Nth order FIR filter
𝑁−1
𝑦(𝑛) = ∑ 𝑥(𝑛 − 𝑖). ℎ(𝑛)
𝑖=0
n=0,1,2………………
In order to implement the above operation in a DSP, the architecture requires the following features
i. A RAM to store the signal samples x (n)
ii. A ROM to store the filter coefficients h (n)
iii. An MAC unit to perform Multiply and Accumulate operation
iv. An accumulator to store the result immediately
v. A signal pointer to point the signal sample in the memory
vi. A coefficient pointer to point the filter coefficient in the memory
Dept. of ECE CEC Page 1
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
vii. A counter to keep track of the count
viii. A shifter to shift the input samples appropriately
2.2 DSP Computational Building Blocks
Each computational block of the DSP should be optimized for functionality and speed
The design should be sufficiently general so that it can be easily integrated with other blocks to
implement overall DSP systems
2.2.1 Multipliers
The advent of single chip multipliers paved the way for implementing DSP functions on a VLSI chip
Parallel multipliers replaced the traditional shift and add multipliers
Parallel multipliers take a single processor cycle to fetch and execute the instruction and to store the
result and are also called as Array multipliers
The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The numbers of bits used to represent the operands decide the accuracy and the dynamic range of the
multiplier
Whereas speed is decided by the architecture employed
If the multipliers are implemented using hardware, the speed of execution will be very high but the
circuit complexity will also increases considerably
Thus there should be a trade-off between the speed of execution and the circuit complexity
Hence the choice of the architecture normally depends on the application
Parallel Multipliers
Consider the multiplication of two unsigned numbers A and B
Let A be represented using m bits as (Am-1 Am-2 …….. A1 A0) and B be represented using n bits as
(Bn-1 Bn-2 …….. B1 B0)
Then the product of these two numbers is given by,
Fig 2.1 The 4x4 binary multiplication
This operation can be implemented paralleling using Braun multiplier whose hardware structure is as
shown in the figure below:
Dept. of ECE CEC Page 2
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.2 Braun Multiplier for a 4x4 Multiplication
Multipliers for Signed Numbers
Generally signed numbers are represented in 2’s complement form
In the Braun multiplier the sign of the numbers are not considered into account
In order to implement a multiplier for signed numbers, additional hardware is required to modify the
Braun multiplier
The modified multiplier is called as Baugh-Wooley multiplier
Consider two signed numbers A and B,
Product P=Pm+n-1........P1P0
If we need to multiply two 4 bit numbers A3A2A1A0 x B3B2B1B0 with A3 and B3 as sign bits
Steps are as follows:
Step 1: Multiply A2 A1 A0 x B2 B1 B0 to give product p5-p0
Step 2: Multiply sign bits A3 B3 and add them at p6 position
Step 3: Multiply the sign bit A3 with B2 B1 B0 to get A3B2 A3B1 A3 B0 and add them at positions p5, p4 and
p3
The addition is performed after obtaining 2’s complement of A3B2 A3B1 A3 B0 which is equal to
subtracting A3B2 A3B1 A3 B0
Step 4: Multiply the sign bit B3 with A2 A1 A0 to get B3A2 B3A1 B3 A0 and add them to position p5, p4 and
p3
Steps 1 to 4 give the total product of two signed numbers
Dept. of ECE CEC Page 3
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Using 2’s complement the subtraction can be expressed as additions and hence resulting the
multiplier consist of only adders
Example: Multiply 2 numbers -3 and 6
Step 1: -3 = A3A2A1Ao = 1101 and 6 = B3B2B1Bo = 0110
Multiplication of
A2A1A0xB2B1B0 →101x110
000
101
101
p5p4p3p2p1p0 → 011110
Step 2: Sign bits; A3B3 = 1 x 0 =0→ add them at P6 (Subscripts of A3B3 =3 + 3 = 6) to get p6p5p4p3p2p1p0
= 0011110
Step 3:
Multiplication of A3 with B2 B1 B0 → 1x110 =110
Subtract this at p5p4p3 gives p6 p5 p4p3p2p1p0 →0011110
p5p4p3→ 1110
B=1 1101110
Borrow = 1 implies p7,s p8 etc are all 1 (sign extension)
Step 4: Multiply B3 with A2A1A0 → 0 x 101 = 000
Subtract this at p5p4p3 (since zero; p6 po remains same)
Hence
p7p6p5p4p3p2p1p0 = 1110 1110 = EEh = -18h
The Baugh Wooley multiplier is as shown below:
Dept. of ECE CEC Page 4
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.3 Structure of Baugh wooley multiplier for signed number multiplication of two 4-bit
signed numbers
Speed
Conventional Shift and Add technique of multiplication requires n cycles to perform the
multiplication of two n bit numbers
Whereas in parallel multipliers the time required will be the longest path delay in the combinational
circuit used
As DSP applications generally require very high speed, it is desirable to have multipliers operating at
the highest possible speed by having parallel implementation
Bus Widths
Consider the multiplication of two n bit numbers X and Y
The product Z can be atmost 2n bits long
In order to perform the whole operation in a single execution cycle, we require two buses of width n
bits each to fetch the operands X and Y and a bus of width 2n bits to store the result Z to the memory
Although this performs the operation faster, it is not an efficient way of implementation as it is
expensive
Many alternatives for the above method have been proposed
One such method is to use the program bus itself to fetch one of the operands after fetching the
instruction, thus requiring only one bus to fetch the operands
And the result Z can be stored back to the memory using the same operand bus
But the problem with this is the result Z is 2n bits long whereas the operand bus is just n bits long
We have two alternatives to solve this problem,
a. Use the n bits operand bus and save Z at two successive memory locations
Although it stores the exact value of Z in the memory, it takes two cycles to store the result
b. Discard the lower n bits of the result Z and store only the higher order n bits into the memory
It is not applicable for the applications where accurate result is required
Dept. of ECE CEC Page 5
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.4 Transfer of data from memory to hardware multiplier using one data bus
Fig 2.5 Three bus structure
Fig 2.6 Two bus structure
Another alternative can be used for the applications where speed is not a major concern
In which latches are used for inputs and outputs thus requiring a single bus to fetch the operands and
to store the result (shown below)
Fig 2.7 A Multiplier with Input and Output Latches
2.2.2 Shifters
Shifters are used to either scale down or scale up operands or the results
The following scenarios give the necessity of a shifter
a. While performing the addition of N numbers each of n bits long, the sum can grow up to n+log2N bits
long
If the accumulator is of n bits long, then an overflow error will occur
This can be overcome by using a shifter to scale down the operand by an amount of log2N
Dept. of ECE CEC Page 6
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
b. Similarly while calculating the product of two n bit numbers, the product can grow up to 2n bits long
Generally the lower n bits get neglected and the sign bit is shifted to save the sign of the product
c. Finally in case of addition of two floating-point numbers, one of the operands has to be shifted
appropriately to make the exponents of two numbers equal
From the above cases it is clear that, a shifter is required in the architecture of a DSP
Problem 1: It is required to find the sum of 64, 16 bit numbers. How many bits should the
accumulator have so that the sum can be computed without the occurrence of overflow error or loss of
accuracy?
Solution: The sum of 64, 16 bit numbers can grow up to (16+ log2 64) =22 bits long. Hence the accumulator
should be 22 bits long in order to avoid overflow error from occurring
Problem 2: In the previous problem, it is decided to have an accumulator with only 16 bits but shift
the numbers before the addition to prevent overflow, by how many bits should each number be
shifted?
Solution: As the length of the accumulator is fixed, the operands have to be shifted by an amount of log 2 64
= 6 bits prior to addition operation, in order to avoid the condition of overflow
Problem 3: If all the numbers in the previous problem are fixed point integers, what is the actual sum
of the numbers?
Solution: The actual sum can be obtained by shifting the result by 6 bits towards left side after the sum
being computed. Therefore
Actual Sum= Accumulator content X 2 6
Barrel Shifters
In conventional microprocessors, normal shift registers are used for shift operation
As it requires one clock cycle for each shift, it is not desirable for DSP applications, which generally
involves more shifts
In other words, for DSP applications as speed is the crucial issue, several shifts are to be
accomplished in a single execution cycle
This can be accomplished using a barrel shifter, which connects the input lines representing a word
to a group of output lines with the required shifts determined by its control inputs
For an input of length n, log2 n control lines are required
And an additional control line is required to indicate the direction of the shift
The block diagram of a typical barrel shifter is as shown in figure below
Dept. of ECE CEC Page 7
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.8 A Barrel Shifter
Fig 2.9 Implementation of a 4 bit Shift Right Barrel Shifter
Figure above depicts the implementation of a 4 bit shift right barrel shifter
Shift to right by 0, 1, 2 or 3 bit positions can be controlled by setting the control inputs appropriately
Problem 4: A Barrel Shifter is to be designed with 16 inputs for left shifts from 0 to 15 bits. How many
control lines are required to implement the shifter?
Solution: As the numbers of bits used to represent the input are 16, log2 16=4 control inputs are required
2.2.3 Multiply and Accumulate (MAC) Unit
Most of the DSP applications require the computation of the sum of the products of a series of
successive multiplications
Dept. of ECE CEC Page 8
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
In order to implement such functions a special unit called a Multiply and Accumulate (MAC) unit is
required
A MAC consists of a multiplier and a special register called Accumulator
MACs are used to implement the functions of the type A+BC
A typical MAC unit is as shown in the figure below
Fig 2.10 A MAC Unit
Although addition and multiplication are two different operations, they can be performed in parallel
By the time the multiplier is computing the product, accumulator can accumulate the product of the
previous multiplications
Thus if N products are to be accumulated, N-1 multiplications can overlap with N-1 additions
During the very first multiplication, accumulator will be idle and during the last accumulation,
multiplier will be idle
Dept. of ECE CEC Page 9
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Thus N+1 clock cycles are required to compute the sum of N products
Problem 5: If a sum of 256 products is to be computed using a pipelined MAC unit, and if the MAC
execution time of the unit is 100nsec, what will be the total time required to complete the operation?
Solution: As N=256 in this case, MAC unit requires N+1=257execution cycles
As the single MAC execution time is 100nsec, the total time required will be,
(257*100nsec)=25.7μsec
Overflow and Underflow
While designing a MAC unit, attention has to be paid to the word sizes encountered at the input of
the multiplier and the sizes of the add/subtract unit and the accumulator, as there is a possibility of
overflow and underflows
Overflow/underflow can be avoided by using any of the following methods viz
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
a. Shifters
Shifters can be provided at the input of the MAC to normalize the data and at the output to
denormalize the same
b. Guard bits
As the normalization process does not yield accurate result, it is not desirable for some applications
In such cases we have another alternative by providing additional bits called guard bits in the
accumulator so that there will not be any overflow error
Here the add/subtract unit also has to be modified appropriately to manage the additional bits of the
accumulator
Problem 6: Consider a MAC unit whose inputs are 16 bit numbers. If 256 products are to be summed
up in this MAC, how many guard bits should be provided for the accumulator to prevent overflow
condition from occurring?
Solution: As it is required to calculate the sum of 256, 16 bit numbers, the sum can be as long as (16+ log2
256)=24 bits
Hence the accumulator should be capable of handling these 32 bits Thus the guard bits required will
be (24-16)= 8 bits
The block diagram of the modified MAC after considering the guard or extension bits is as shown in
the figure below
Dept. of ECE CEC Page 10
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.11 MAC Unit with Guard Bits
Problem 7: What should be the minimum width of the accumulator in a DSP device that receives 10
bit A/D samples and is required to add 64 of them without causing an overflow?
Solution: As it is required to calculate the sum of 64, 10 bit numbers, the sum can be as long as (10+ log 2
64) =16 bits
Hence the accumulator should be capable of handling these 16 bits Thus the guard bits required will
be (16-10)= 6 bits
c. Saturation Logic
Overflow/ underflow will occur if the result goes beyond the most positive number or below the
least negative number the accumulator can handle
It can be resolved by loading the accumulator with the most positive number which it can handle at
the time of overflow
The least negative number that it can handle at the time of underflow
This method is called as saturation logic
A schematic diagram of saturation logic is as shown in figure below
In saturation logic, as soon as an overflow or underflow condition is satisfied the accumulator will be
loaded with the most positive or least negative number overriding the result computed by the MAC
unit
Fig 2.12 A Schematic Diagram of the Saturation Logic
Dept. of ECE CEC Page 11
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.13 A Schematic Diagram of the Saturation Logic
Examples of overflow and underflow are illustrated below
Overflow
+3 0011
+2 0010
+5 0101 Here Cin= 0 = Cout, Hence no overflow
+5 0101
+4 0100
+9 1001 Here Cin= 1; Cout,=0, Cin≠ Cout Hence overflow
Underflow
-3 1101
-2 1110
-5 1 1101 Here Cin= 1; Cout,=1, Cin= Cout=1 Hence no underflow
-5 1011
+2 0010
-3 1101 Here Cin= 0 = Cout, Hence no underflow
-5 1011
-4 1100
-9 10111 Here Cin= 0; Cout= 1, Cin≠ Cout Hence underflow
2.2.4 Arithmetic and Logic Unit
A typical DSP device should be capable of handling arithmetic instructions like ADD, SUB, INC,
DEC etc and logical operations like AND, OR , NOT, XOR etc
The block diagram of a typical ALU for a DSP is as shown in the figure below
It consists of status flag register, register file and multiplexers
Status Flags: ALU includes circuitry to generate status flags after arithmetic and logic operations -
flags include sign, zero, carry and overflow
Dept. of ECE CEC Page 12
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.14 Arithmetic Logic Unit of a DSP
Overflow Management: Depending on the status of overflow and sign flags, the saturation logic
can be used to limit the accumulator content
Register File: Instead of moving data in and out of the memory during the operation, for better
speed, a large set of general purpose registers are provided to store the intermediate results
2.3 Bus Architecture and Memory
Conventional microprocessors use Von Neumann architecture for memory management wherein the
same memory is used to store both the program and data (Fig below)
Fig 2.15 Von Neumann Architecture
Although this architecture is simple, it takes more number of processor cycles for the execution of
a single instruction as the same bus is used for both data and program
In order to increase the speed of operation, separate memories were used to store program and data
and a separate set of data and address buses have been given to both memories, the architecture
called as Harvard Architecture
It is as shown in figure below
Fig 2.16 Harvard Architecture
Although the usage of separate memories for data and the instruction speeds up the processing, it
will not completely solve the problem
Dept. of ECE CEC Page 13
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
As many of the DSP instructions require more than one operand, use of a single data memory leads
to the fetch the operands one after the other, thus increasing the delay of processing
This problem can be overcome by using two separate data memories for storing operands separately,
thus in a single clock cycle both the operands can be fetched together
Although the architecture improves the speed of operation, it requires more hardware and
interconnections, thus increasing the cost and complexity of the system
Fig 2.17 Harvard Architecture with Dual Data Memory
Therefore there should be a trade off between the cost and speed while selecting memory
architecture for a DSP
2.3.1 On-chip Memory
In order to have a faster execution of the DSP functions, it is desirable to have some memory located
on chip
As dedicated buses are used to access the memory, on chip memories are faster
Speed and size are the two key parameters to be considered with respect to the on-chip memories
Speed
On-chip memories should match the speeds of the ALU operations in order to maintain the single
cycle instruction execution of the DSP
Size
In a given area of the DSP chip, it is desirable to implement as many DSP functions as possible
Thus the area occupied by the on-chip memory should be minimum so that there will be a scope for
implementing more number of DSP functions on- chip
2.3.2 Organization of On-chip Memories
Ideally whole memory required for the implementation of any DSP algorithm has to reside on-chip
so that the whole processing can be completed in a single execution cycle
Although it looks as a better solution, it consumes more space on chip, reducing the scope for
implementing any functional block on-chip, which in turn reduces the speed of execution and hence
some other alternatives required
The following are some other ways in which the on-chip memory can be organized
a. As many DSP algorithms require instructions to be executed repeatedly, the instruction can be stored in
the external memory, once it is fetched can reside in the instruction cache
b. The access times for memories on-chip should be sufficiently small so that it can be accessed more than
once in every execution cycle
Dept. of ECE CEC Page 14
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
c. On-chip memories can be configured dynamically so that they can serve different purpose at different
times
2.4 Data Addressing Capabilities
Data accessing capability of a programmable DSP device is configured by means of its addressing
modes
The summary of the addressing modes used in DSP is as shown in the table below
Table 2.1: DSP Addressing Modes
Addressing Operand Sample Format Operation
Mode
Immediate Immediate Value ADD #imm #imm +A →A
Register Register Contents ADD reg reg +A→ A
Direct Memory Address Register ADD mem mem+A →A
Indirect Memory contents with address ADD *addreg *addreg +A→ A
in register
1. Immediate Addressing Mode: In this addressing mode, data is included in the instruction itself
2. Register Addressing Mode: In this mode, one of the registers will be holding the data and the register
has to be specified in the instruction
3. Direct Addressing Mode: In this addressing mode, instruction holds the memory location of the operand
4. Indirect Addressing Mode: In this addressing mode, the operand is accessed using a pointer. A pointer is
generally a register, which holds the address of the location where the operands resides
Indirect addressing mode can be extended to inculcate automatic increment or decrement
capabilities, which has lead to the following addressing modes
Table 2.2: Indirect Addressing Modes
Addressing Mode Sample Format Operation
Post Increment ADD *addreg+ A← A + *addreg
addreg ←addreg+1
Post Decrement ADD *addreg- A← A + *addreg
addreg← addreg-1
Pre Increment ADD +*addreg addreg← addreg+1
A← A + *addreg
Pre Decrement ADD -*addreg addreg ←addreg-1
A ←A + *addreg
Post_Add_Offset ADD *addreg, offsetreg+ A ←A + *addreg
addreg ←addreg+offsetreg
Post_Sub_Offset ADD *addreg, offsetreg- A←A + *addreg
Dept. of ECE CEC Page 15
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
addreg←addreg-offsetreg
Pre_Add_Offset ADD offsetreg+,*addreg addreg ←addreg+offsetreg
A←A + *addreg
Pre_Sub_Offset ADD offsetreg-,*addreg addreg ←addreg-offsetreg
A ←A + *addreg
Problem 8: What are the memory addresses of the operands in each of the following cases of indirect
addressing modes? In each case, what will be the content of the addreg after the memory access?
Assume that the initial contents of the addreg and the offsetreg are 0200h and 0010h, respectively
a. ADD *addreg-
b. ADD +*addreg
c. ADD offsetreg+,*addreg
d. ADD *addreg,offsetreg-
Solution:
Instruction Addressing Operand Address addreg Content
Mode after Access
ADD *addreg- Post Decrement 0200h 0200-01=01FFh
ADD +*addreg Pre Increment 0200+01=0201h 0201h
ADD offsetreg+,*addreg Pre_Add_Offset 0200+0010=0210h 0210h
ADD *addreg,offsetreg- Post_Sub_Offset 0200h 0200-0010=01F0h
Problem 9: Identify the addressing modes of the operands in each of the following instructions
a. ADD #1234h
b. ADD 1234h
c. ADD *AR+
d. ADD offsetreg-,*AR
Solution:
Instruction Addressing Mode
ADD #1234h Immediate Addressing Mode
ADD 1234h Direct Addressing Mode
ADD *AR+ Post Increment Indirect Addressing Mode
ADD offsetreg-,*AR Pre-Sub_Offset Indirect Addressing Mode
2.5 Special Addressing Modes
For the implementation of some real time applications in DSP, normal addressing modes will not
completely serve the purpose
Thus some special addressing modes are required for such applications
Dept. of ECE CEC Page 16
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
1. Circular Addressing Mode
While processing the data samples coming continuously in a sequential manner, circular buffers are
used
In a circular buffer the data samples are stored sequentially from the initial location till the buffer
gets filled up
Once the buffer gets filled up, the next data samples will get stored once again from the initial
location
This process can go forever as long as the data samples are processed in a rate faster than the
incoming data rate
Circular Addressing mode requires three registers viz
a. Pointer registers to hold the current location (PNTR)
b. Start Address Register to hold the starting address of the buffer (SAR)
c. End Address Register to hold the ending address of the buffer (EAR)
There are four special cases in this addressing mode
They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR >EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
The buffer length in the first two case will be (EAR-SAR+1) whereas for the next two cases (SAR-
EAR+1)
The pointer updating algorithm for the circular addressing mode is as shown below
Four cases explained earlier are as shown in the figure below:
Dept. of ECE CEC Page 17
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Case iii) EAR<SAR and updated PNTR>SAR Case iv) EAR<SAR and updated PNTR<EAR
Problem 10: A DSP has a circular buffer with the start and the end addresses as 0200h and 020Fh
respectively. What would be the new values of the address pointer of the buffer if, in the course of
address computation, it gets updated to
a. 0212h
b. 01FCh
Solution: Buffer Length= (EAR-SAR+1)= 020F-0200+1=10h
a. New Address Pointer= Updated Pointer-buffer length = 0212-10=0202h
b. New Address Pointer= Updated Pointer+buffer length = 01FC+10=020Ch
Problem 11: Repeat the previous problem for SAR= 0210h and EAR=0201h
Solution: Buffer Length= (SAR-EAR+1)= 0210-0201+1=10h
c. New Address Pointer= Updated Pointer-buffer length = 0212-10=0202h
d. New Address Pointer= Updated Pointer+buffer length = 01FC+10=020Ch
2. Bit Reversed Addressing Mode
To implement FFT algorithms we need to access the data in a bit reversed manner
Hence a special addressing mode called bit reversed addressing mode is used to calculate the index
of the next data to be fetched
Dept. of ECE CEC Page 18
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
It works as follows. Start with index 0. The present index can be calculated by adding half the FFT
length to the previous index in a bit reversed manner, carry being propagated from MSB to LSB
Current index= Previous index+ B (1/2(FFT Size))
Compute the indices for an 8-point FFT using Bit reversed Addressing Mode
Start with index 0. Therefore the first index would be (000)
Next index can be calculated by adding half the FFT length, in this case it is (100) to the previous
index
i.e, Present Index= (000)+B (100)= (100)
Similarly the next index can be calculated as
Present Index= (100)+B (100)= (010)
The process continues till all the indices are calculated
The following table summarizes the calculation
Index in Binary BCD value Bit reversed index BCD value
000 0 000 0
001 1 100 4
010 2 010 2
011 3 110 6
100 4 001 1
101 5 101 5
110 6 011 3
111 7 111 7
2.6 Address Generation Unit
Address Generation Unit generate the address of the operands required to carry out the operation
They have to work fast in order to satisfy the timing constraints
As the address generation unit has to perform some mathematical operations in order to calculate the
operand address, it is provided with a separate ALU
Address generation typically involves one of the following operations
a. Getting value from immediate operand, register or a memory location
b. Incrementing/ decrementing the current address
c. Adding/subtracting the offset from the current address
d. Adding/subtracting the offset from the current address and generating new address according to circular
addressing mode
e. Generating new address using bit reversed addressing mode
The block diagram of a typical address generation unit is as shown in figure below
Dept. of ECE CEC Page 19
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Fig 2.18 Address Generation Unit
2.7 Programmability and Program Control
A programmable DSP device should provide the programming capability involving branching,
looping and subroutines
The implementation of repeat capability should be hardware based so that it can be programmed with
minimal or zero overhead
A dedicated register can be used as a counter
In a normal subroutine call, return address has to be stored in a stack thus requiring memory access
for storing and retrieving the return address, which in turn reduces the speed of operation
Hence a LIFO memory can be directly interfaced with the program counter
1. Program Control
Like microprocessors, DSP also requires a control unit to provide necessary control and timing
signals for the proper execution of the instructions
In microprocessors, the controlling is micro coded based where each instruction is divided into
microinstructions stored in micro memory called microstore as microcode
This microstore and microcode are not accessible by the programmer
The processor is designed such that only control unit can access them during the execution of
instructions
Advantage: Easy to design and implement as less hardware is required
Disadvantage: Low speed of execution of instruction, as execution of each instruction implies fetching the
microcode from microstore and decoding them
As this mechanism is slower, it is not applicable for DSP applications
Hence in DSP the controlling is hardwired base
Each instruction of the DSP acts as control lines to the hardwired control unit and hence resulting
control and timing signals are generated to complete the instruction decode and execution
Advantage: Instruction execution time is much faster as compared to microcoded design
Disadvantage: High hardware complexity; not easy to change the design of control unit to
incorporate additional features
Dept. of ECE CEC Page 20
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
2. Program Sequencer
It is a part of the control unit used to generate instruction addresses in sequence needed to access
instructions
It calculates the address of the next instruction to be fetched
The next address can be from one of the following sources
a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
The block diagram of a program sequencer is as shown in figure below
Fig 2.19 Program Sequencer
Program sequencer should have the following circuitry
a. PC has to be updated after every fetch
b. Counter to hold count in case of looping
c. A logic block to check conditions for conditional jump instructions
d. Condition logic-status flag
2.8 Speed Issues
Fast execution of algorithms is an essential requirement of a digital signal processing architecture
DSP architecture must include features that facilitate high speed of operation and large throughputs
Some of the features that can increase the execution speed of the DSP architecture are
1. Hardware Architecture
2. Parallelism
3. Pipelining
2.8.1 Hardware Architecture
Harvard architecture which separates the program and data memories with separate buses for each,
increases the speed of execution of programs considerably
Dual data memories with Individual buses for each help in access dual operands simultaneously
Multiple external memories require multiple buses external to the DSP in addition to being
expensive, external buses are slow for program access and execution
Dept. of ECE CEC Page 21
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
By providing on-chip memories and an instruction cache, program execution is speeded up
considerably
There are many techniques used in DSP architectures to increase their speed of operation
Two of these techniques are – Parallelism and Pipelining
2.8.2 Parallelism
High speed of operation can be achieved in DSP by the provision of Parallelism
Parallelism means several things done at a time
Functional units may operate in parallel and increase the throughput
For example, instead of the same arithmetic unit being used to do computations on data and address,
a separate address arithmetic unit can be provided to take care of address computations
This frees up the main arithmetic unit to concentrate on data computations alone and thereby
increases the throughput
Another example may be the provision of multiple memories and multiple buses to fetch an
instruction and operands simultaneously
Algorithms can perform more than one operation at the same time, such as adding while carrying out
a multiply, shifting while reading data from memory etc
The architecture should be such that instructions and data required for a computation are fetched
from the memory simultaneously
By parallelism, the following operations has to be performed in a single clock cycle
1. Fetch instructions and multiple data required for the computation
2. Shift data as they are fetched in order to accomplish scaling
3. Carryout a multiplication operation on the fetched data
4. Add the product to the previously computed result in the accumulator
5. Save the accumulator contents in the memory storage, if required, and
6. Compute new addresses for the instruction and data required for the next operation
2.8.3 Pipelining
An architectural feature to increase the speed of the DSP algorithm is pipelining
In a pipelined architecture, an instruction to be executed is broken into a number of steps
A separate unit of the architecture performs each of these steps
When the first of these units performs the first step on the current instruction
The second unit will be performing the second step on the previous instruction
The third unit will be performing the third step on the instruction prior to that, etc
If p steps were required to complete the execution of each instruction, it would take p units of time
for the complete execution of each instruction
Since all the units will work all the time, one output will flow out of the architecture at the end of
each time unit
The throughput is maintained as one instruction per unit time
Limitations
1. Dividing each instruction into steps takes equal amount of time to perform and design the
architectures
2. Extra time required at the start of algorithm execution results in pipeline latency
For example, Let us assume that the execution of an instruction can be broken into 5 steps:
Dept. of ECE CEC Page 22
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
1. Instruction fetch
2. Instruction decode
3. Operand fetch
4. Execute
5. Save the result
Figure shows how a pipelined processor will handle this
For simplicity, we will assume that all the steps take equal amounts of time
From the figure, we observe that the output corresponding to the first instruction is available after 5
units of time
Fig 2.20 Pipelining for speeding up the execution of an instruction
2.8.4 System level parallelism and pipelining
The parallelism and pipelining concepts can be extended to the implementation of DSP algorithms
Consider the example of an 8-tap (8 coefficients) FIR filter given by:
7
𝑦(𝑛) = ∑ ℎ(𝑖)𝑥(𝑛 − 𝑖)
𝑖=0
It can be realized using different methods
1. Implementation using a single MAC unit
2. Pipelined Implementation using eight multipliers and eight accumulators
3. Parallel Implementation using two MAC units
Dept. of ECE CEC Page 23
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Implementation using a single MAC unit
Fig 2.21 Single MAC implementation of an 8-tap FIR filter
If only one multiplier and accumulator is available, it must be used 8 times to compute the eight
product terms and find their sum
Each input sample is delayed from the previous sample by 8T
T is the time taken by the multiplier and accumulator to compute one product term and add it to the
previously accumulated sum in the accumulator
Input samples and the filter coefficients are fed to the multiplier through multiplexers
As each product term is generated, it is added to the previously accumulated sum in the MAC unit
Pipelined Implementation using eight multipliers and eight accumulators
Fig 2.22 Pipelined implementation of an 8-tap FIR filter using eight MACs
The implementation of the FIR filter of Eqn can be speeded up if more multipliers and
accumulators are available
Let us assume that there are eight multipliers and eight accumulators connected in a pipelined
structure, as shown in Figure
Each multiplier computes one product term and passes it on to the corresponding accumulator,
which in turn adds it to the summation passed on from the previous accumulator
Since all the multipliers and accumulators work all the time, a new output sample is generated once
every T units of time
Dept. of ECE CEC Page 24
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
This is the time required by the multiplier and accumulator to compute one product term and add it to
the sum passed on from the previous stage of the pipeline
This implementation can take in a new input sample once every T units of time and generate an
output sample at the same rate
In other words, this filter implementation works 8 times faster than the simple one MAC
implementation
Parallel Implementation using two MAC units
Fig 2.23 Parallel implementation of an 8-tap FIR filter using two MAC units
A third implementation of the FIR filter of Eqn is shown in Figure
This implementation uses two MAC units and an adder at the output
Dept. of ECE CEC Page 25
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
Each MAC computes four of the eight product terms
Input samples and the filter coefficients are fed to the MACs using multiplexers that are
controlled such that correct combinations of samples and the corresponding filter coefficients are fed
to the two MACs at any given time
If T time units are required to compute one pair of products and add them to the previously
accumulated sum in the MAC units, it will require 4T units of time to generate the final output by
adding the outputs of the two MACs
At this time, a new input sample can be applied to the filter for computation of the next output
sample
The speed of this implementation is 2 times that of one MAC implementation of Figure and one
fourth of that of the pipelined eight- multiplier, eight-accumulator implementation
The maximum rate at which input samples can be applied to this filter implementation is 2 times that
of the first implementation and one fourth that of the second
Table summarizes the performance of the three implementations described
It is possible to achieve higher speed implementation by the use of parallelism and/or pipelining
It increases the circuit complexity
Table 2.3 Performance Summary of Different Implementations of an 8-tap FIR filter
2.9 Features for external interfacing
It is important for a DSP device to be able to communicate with the outside world
The outside world provides the signal to be processed and receives the processed signal
Most of the peripherals used in conventional microprocessors are also needed in a DSP system
Peripherals include interfaces for interrupts, direct memory access, serial I/O, and parallel I/O
QUESTION BANK
Module -2 Question Bank
1 Explain the basic architectural features of DSP
2 Investigate the basic features of that should be provided in the DSP architecture to be used to
implement the Nth order FIR filter
3 Explain the Braun and Baugh Wooley multiplier with an example
4 Write a note on speed and bus widths in DSP Processors
5 Explain the Barrel shifter with necessary diagram
6 Explain the MAC unit with suitable diagram
7 Explain Overflow and Underflow and how it can be avoided
8 Explain the Saturation Logic with suitable diagram
9 Explain the Arithmetic and Logic unit of DSP with necessary diagram
10 Explain the Bus Architecture and Memory used in DSP with necessary diagram
Dept. of ECE CEC Page 26
MODULE-2 VII SEM DSP ALGORITHMS AND ARCHITECTURE [17EC751]
11 Explain the various Data addressing capabilities of DSP
12 Explain the Special addressing modes used in DSP
13 Explain the Address generation Unit of DSP with necessary diagram
14 Explain Programmability and Program control in DSP with necessary diagram
15 Explain Program Sequencer in DSP with necessary diagram
16 Discuss the speed issues in DSP
17 Explain the Parallelism and Pipelining with necessary diagram
18 Explain Implementation of 8-tap FIR filter with single MAC unit
19 Explain Implementation of 8-tap FIR filter with 8 MAC units
20 Explain Implementation of 8-tap FIR filter with 2 MAC units
Dept. of ECE CEC Page 27