An Efficient FPGA Implementation of IEEE
802.16e LDPC Encoder
Speaker: Chau-Yuan-Yu
Advisor: Mong-Kai Ku
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoder scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
Low-Density Parity-Check Code
Benefit of LDPC Codes.
Approaching Shannon limit
Low error floor
LDPC code is adopted by various standards
(e.g. DVB-S2, 802.11n, 802.16e)
Low-Density Parity-Check Code
Parity check matrix H is sparse
Very few 1s in each row and column
Null space of H is the codeword space
Valid Codeword
Low-Density Parity-Check Code
In (n, k) block codes, k-bit information data can be
encoded as n-bit codeword.
In systematic block codes, the information bits directly
exist in the bits of codeword.
Systematic Part Parity Part
Low-Density Parity-Check Code
General encoding of systematic linear block codes
Finding generator matrix G via H.
C = sG = [s | p]
Issues with LDPC codes
The size of G is very large.
G is not generally sparse.
Encoding complexity will be very high.
Structured LDPC Codes
Quasi-Cyclic LDPC Codes
In QC-LDPC, H can be partitioned into square sub-blocks of size z x
z.
Each sub-blocks can be Z x Z zero sub-block or identity matrix with
permutation.
Structured LDPC Codes
QC Codes With Dual-Diagonal Structure
In IEEE standards QC-LDPC Codes have Dual-Diagonal
parity structure.
We take 802.16e code rate matrix for example.
0 represent
identity matrix.
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
General Encoding for LDPC Codes
Richardson and Urbanke (RU) algorithm
Partition the H matrix into several sub-matrix.
In H, the part T is a low triangle matrix.
General Encoding for LDPC Codes
Richardson and Urbanke (RU) algorithm
p0 O(n+g2)
p1 O(n+g2)
Efficient Encoding for Dual-Diagonal LDPC Codes
A valid codeword c = [s|p] must satisfy
Replace by dual-diagonal matrix
Information bits Parity bits
Define lambda value as From equation, we obtained
Related Work (1) Sequential Encoding
Encoding scheme
One-way derivation
Step 1
Compute lambda value by doing
matrix operation x = HsS
Step 2
Determines parity vector P0 by
adding all the lambda value
Step 3
Rest of parity vector is obtained
by exploiting dual-diagonal
matrix T
Related Work (2) Arbitrary Bit-generation and Correction Encoding
In [1], an alternative encoding for standard matrix was presented.
Matrix will be modify by
parity portion of weight-3 A Q U
column set.
H can be sectorized into
three sub matrices
The information bit region A
The parity bit region Q
for bit-flipping operation
The parity bit region U
for non bit-flipping.
Replace with zero
cyclic shift
[1] C. Yoon, E. Choi, M. Cheong, and S.-K. Lee, "Arbitrary bit generation and correction technique for encoding QC-LDPC codes
with dual-diagonal parity structure," IEEE Wireless Communications and Networking Conference, (WCNC 2007), pp. 662-666,
March 2007.
Related Work (2) Arbitrary Bit-generation and Correction Encoding
Encoding scheme
Step 1 One-way derivation
Compute lambda value by doing
matrix operation x = As
Step 2
Set P0 as arbitrary binary
values. solve unknown parity
bits
Step 3
Computed correction vector f
from P0
Step 4
Add correction vector to parity
bits in region Q to correct them
Related Work (2) Arbitrary Bit-generation and Correction Encoding
Advantage
Low-complexity encoding
The number of addition required is less than RU scheme
Drawback
Can not directly applicable to standard code
Modifying matrix will decrease code performance
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
Better encoding scheme
Advantages of the encoding scheme
proposed in [2]
Low-complexity encoding
Can directly applicable to matrices defined in
IEEE standards without any modification
Achieve higher level parallelism
[3] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, "Efficient Encoding for Dual-Diagonal Structured LDPC Code Based on Parity bits Prediction and
Correction," IEEE Asia Pacific Conference on Circuits and Systems (APPCCAS), pp.1648-1651, Dec. 2008.
Better Encoding Scheme
Step 1
Set P0 as any binary vector
Correct prediction vector by f
Step 2
Compute lambda value by
doing matrix operation Hs
Step 3
[Forward Derivation]
Step 4
[Backward Derivation]
Step 5
Compute the P0 by adding
prediction parity vector
Step 6
Compute the correction vector f
Step 7
Compute P0 by adding prediction vector
Correct prediction parity by
adding f Compute correction vector f f = (P0)d
Better Encoding Scheme
Step 1
Set P0 as any binary vector. Reduce encoding delay !!
Step 2 Two-way derivation
Compute lambda value by
doing matrix operation Hs.
Step 3
[Forward Derivation]
Step 4
[Backward Derivation]
Step 5
Compute the P0 by adding
prediction parity vector.
Step 6
Compute the correction vector f.
Step 7
Correct prediction parity by
adding f.
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
LDPC Encoder Architecture
Based on the encoding scheme proposed bedore, we design both
parallel and serial architecture.
Parallel architecture
Achieve higher level parallelism
High-speed
Serial architecture
Parallel architecture
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Parallel architecture (Stage 1)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Benefit:
In this stage, matrix [Link] the input data
select the shift values is coming, it can work
and multiply specific immediately without all
value according to the input data are
the code length. coming.
[Link] the numbers
of barrel shifter.
Shifter Value Computation
Equation for computing shift value
Normal code rate :
Code rate 2 3 A code :
Two type of matrix implement result with multiple rate and length
Slice FFs LUTs CLK Total gate
(MHz) count
One matrix + 14,179 4,071 26,846 141.391 227,076
calculate IP
Using matrices to 41,409 12,078 76,977 165.591 635,691
save shifter value
Parallel architecture (Stage 2)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Divide the datas from This module used to save
matrix. the input data. These data
are used in barrel shifters.
Parallel architecture (Stage 3)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Lambda position = 3
These module are This module records the
used to circulated row position of the
shift the input data shifter values
Lambda position = 8
Lambda position = 11
Shifter value
Parallel architecture (Stage 4)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
According to the Computed the
lambda position, in lambda value by
this clock cycle 1, 2, accumulating the
5, 8, 9, 11 need to be shifted data after Kb
accumulated. clock cycle
Kb
Parallel architecture (Stage 5)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Computed the
prediction vector Pi
by equation
Parallel architecture (Stage 5)
P_0 <= acc_out0;
P_1 <= acc_out0 ^ acc_out1;
P_2 <= acc_out0 ^ acc_out1 ^ acc_out2;
P_3 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3;
P_4 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4;
P_5 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4 ^ acc_out5;
P_6 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7 ^ acc_out6;
P_7 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7;
P_8 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8;
P_9 <= acc_out11 ^ acc_out10 ^ acc_out9;
P_10 <= acc_out11 ^ acc_out10;
P_11 <= acc_out11;
For saving the hardware area, we use In code rate 21 / 3,
2, P_0 ~ P_3
P_11
one architecture to compute the are the prediction
P_8~P_11are the prediction
prediction values for four different
code rate.
Parallel architecture (Stage 5)
P_0 <= acc_out0;
P_1 <= acc_out0 ^ acc_out1;
P_2 <= acc_out0 ^ acc_out1 ^ acc_out2;
P_3 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3;
P_4 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4;
P_5 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4 ^ acc_out5;
P_6 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7 ^ acc_out6;
P_7 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7;
P_8 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8;
P_9 <= acc_out11 ^ acc_out10 ^ acc_out9;
P_10 <= acc_out11 ^ acc_out10;
P_11 <= acc_out11;
For saving the hardware area, we use In code rate 53 / 6,
4, P_0 ~ P_1
P_2
one architecture to compute the P_9~P_11 are the
P_10~P_11are theprediction
prediction
prediction values for four different vectors
code rate.
Parallel architecture (Stage 6)
Barrel shifter#1
divider
Prediction
Matrix Parity
Accumulator Correct
memory
Barrel shifter#6
Input data register lambda position
Step2: Step1:
Correct the other Pi. Compute the P0. In
Using the equation code rate = 1 / 2,
Pi= Pi^ P0 P0 = P5 ^ P6
Serial architecture (Stage 1)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
As the stage1 in 1 In the first Kb clock
parallel architecture. cycle, encoder order are
from top->middle and
3
3 down ->middle, column
2
by column
1
Serial architecture (Stage 1)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
1 2
3
Reason: In the last clock cycle,
[Link] the input encoder order are from
data left->right, row by row
[Link] the slice
3
1 2
Serial architecture (Stage 2)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
Divide the datas from Choose the corresponding
matrix. input value to barrel shifter
(Take clock cycle #2 for
example)
Serial architecture (Stage 3)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
Shift the input data
according to the shifter
value chosen form Mux
Serial architecture (Stage 4)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
In normal, this module In this module, there
accumulate the shifted are three works:
data to compute i . [Link] i
When the data is the [Link] Pi
last value in this row, [Link] P0
also compute Pi.
Serial architecture (Stage 4)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
When all Pi have been
computed, compute the
P0 by Xor Px and Px+1
which are the middle
prediction vector in the
matrix.
Serial architecture (Stage 5)
Barrel shifter#1 Accumulator
&
divider
Matrix
Predict Correct
memory
Barrel shifter#2
Input data Input
register control
Correct the other Pi.
Using the equation
Pi= Pi^ P0
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Better Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
Implementation Results
The proposed encoder based on IEEE 802.16e LDPC codes can
encode the code with code rate 1/2 2/3 3/4 5/6 and code length
ranging from 576 to 2304.
The hardware implementation was performed and verification on Xilinx
Virtex-4 and Altera Stratix Field Programmable Gate Array (FPGA)
device.
Implementation Results
Parallel architecture
Rate 1/2 Rate 2/3 Rate 3/4 Rate 5/6
Z N Slice FFs LUTs CLK (MHz) IT (Gbps) IT (Gbps) IT (Gbps) IT (Gbps)
24 576 2.262 2.468 2.545 2.61
40 960 3.77 4.113 4.241 4.35
60 1440 14,179 4,071 26,846 141.391 5.656 6.17 6.363 6.526
80 1920 7.541 8.226 8.483 8.701
96 2304 9.049 9.872 10.18 10.441
Information throughput ranging from 2.262 to 10.441 Gbps
The encoder area is constant in any code rate or code length.
For a given code rate, an increase in the code length will increase the throughput.
Implementation Results
Serial architecture
Information throughput ranging from 0.867 to 4.019 Gbps
For a given code rate, an increase in the code length will increase the
throughput.
Implementation Results
Parallel architecture using row by row
Area comparison
Implementation Results
IT comparison
IT/Area comparison
Table 4.5a The synthesis result of [22] at code rate 1/2
Compare to Related Work
We compare implementation with [3].
Code Length Area (LE) Clk (MHz) IT (Gbps) IT/Total Area Code Length Area (LE) Clk IT (Gbps) IT/Total Area
(Mb per Le) (MHz) Rate 1/2 (Mb per Le)
576 3391 192.23 2.129 0.0612 rate1/2
[2] 576 1.561 0.07447
960 5100 159.57 2.253 0.0648
Proposed 960 20960 97.58 2.602 0.12414
1440 7012 164.83 2.697 0.0776
1440 3.903 0.18621
1920 8924 148.72 2.644 0.0761
1920 5.204 0.24828
2304 10339 148.41 2.758 0.0793
2304 6.245 0.29794
34766
Better throughput for longer code length
Using less area to implement multiple code length and code rate
The clock cycle is shorter the [3].
[3] S. Kopparthi and D. M. Gruenbacher, "Implementation of a fiexible encoder for structured low-density parity-check codes," IEEE Pacic Rim
Conference on Communications, Computers and Signal Processing (PacRim 2007), pp.438-441, Aug. 2007.
Compare to Related Work
The comparison of throughput
The proposed encoder outperforms the work in [3] in terms of throughput
when the code length longer then 1200
The proposed encoder architecture provides better throughput for a longer
code length while the work in [3] does not have this kind of speed-up
Compare to Related Work
The comparison of throughput/area ratio
The proposed encoder outperforms the work in [3] in terms of
throughput/area ratio by 1.216 to 3.757 times
The proposed encoder utilizes hardware resources more efficiently
Compare to Related Work
We compare implementation with [2].
Compare to Related Work
The comparison of throughput
The throughput in our proposed encoder is higher then [2] in all code rate
and code length
The proposed encoder outperforms the work in [2] in terms of throughput
ratio by 1.237 to 1.963 times
Compare to Related Work
The comparison of throughput/area
The proposed encoder outperforms the work in [2] in terms of throughput
ratio by 2.427 to 5.256 times
The result shows that our proposed encoder utilizes hardware resources
efficiently
Compare to Related Work (Serial)
We compare implementation with [4].
Slices FFs LUTs Block rams CLK IT
[4] 4,724 1,807 8,335 81 186 3.34
Proposed 12,567 3,885 22,050 0 123.502 4.626
Our proposed encoder achieve higher IT in low clock.
In our proposed encoder, the matrix information are built in it without
additional blockrams.
The IT/Area of our serial encoder is 0.3681(Mbps) per slice and the
IT/Area of [4] is 0.1768.
[4] Jeong Ki KIM1, Hyunseuk YOO1 and Moon Ho LEE1, "Efficient Encoding Architecture for IEEE 802.16e LDPC Codes, " IEICE Transactions
on Fundamentals of Electronics, Communications and Computer Sciences 2008.
Outline
Introduction
Low-Density Parity-Check Codes
Related work
General encoding for LDPC codes
Efficient encoding for Dual-Diagonal matrix
Proposed Encoding scheme
LDPC Encoder Architecture
Parallel Encoder
Serial Encoder
Result
Conclusion
Conclusion
An efficient encoding architecture for IEEE 802.16e LDPC
codes with multiple code lengths and code rates are
implemented.
In our design, change between different code rate or code
length only to change the type in information data.
This architecture is also suitable the IEEE 802.11n standard.
Our encoder achieve higher throughput and better
throughput/area ratio than conventional encoding scheme
when code length longer than 1200.
Thank you!!