DesignCon 2019
Modeling System Signal
Integrity Dynamic to Achieve
Optimal Memory Performance
for DDR4 and Beyond
Hing “Thomas” To, Xilinx Inc.
Changyi Su, Xilinx Inc.
Juan Wang, Xilinx Inc.
Carl Gabrielson, Xilinx Inc.
David Smith, Xilinx Inc.
Terry Magee, Intel Inc.
Karthi Palanisamy, Xilinx Inc.
John Schmitz, Xilinx Inc.
Abstract
System memory performance is usually measured by memory bandwidth, which is the
speed of the memory IO interface (such as 3200 Mbps for top-speed DDR4) multiplied by
the number of DQ signals that the system supports (such as 64 DQ bits). In addition to
bandwidth, the capacity of the memory system is an important figure of merit. Memory
array density in each DRAM is limited; therefore, multiple ranks and multiple DIMMs are
often added to the system to increase system memory capacity. To quantify memory
bandwidth effectively, the actual data throughput must be taken into account.
When a system has multiple ranks, the dynamic between the ranks is more complicated.
The memory controller can issue a Read/Write from/to the same rank or from/to a different
rank. Traditionally, the optimal bus turnaround time is determined by empirical
measurement on a real system. This turnaround time allows the bus signal to settle. A
violation affects channel timing, such as the preamble timing, and causes data errors. The
turnaround time directly impacts the effective system bandwidth. The bandwidth
efficiency is measured by the data throughput versus the total transaction duration. The
efficiency will be estimated incorrectly if the actual bus channel settling time of the
system is not considered.
This paper will present a modeling approach to cover the dynamics of system DDR bus
turnaround. The DQ bus on the controller side and on the DRAM side is modeled with an IO
behavioral model that captures the On/Off timing. The On/Off states represent the driver
mode and the receive mode with on-die termination enabled. Details will be presented on how
to handle the features necessary for the modeling. This approach enables the prediction of
bus channel dynamics, particularly for turnaround signal integrity analysis. The
method applies equally to more complicated system configurations, such as multiple-rank
DIMM systems, which often need to cover rank-to-rank switching between DIMMs.
A validation system will be used to correlate this approach and will be presented in the
paper.
Key Terms: Memory IO Interface, System Power, Data Bus Turnaround
Author(s) Biography
Hing Y (Thomas) To (SM’11) is a Technical Director in the System Memory Signal
Integrity & Device Power Group at Xilinx, Inc. Prior to joining Xilinx, Thomas was with the
NVIDIA Advanced Technology Group, where he focused on high-speed (32 GT/s) circuits and
system channel designs and supported test chips for advanced process nodes
such as 20nm SOC and 16nm FinFET. Before NVIDIA, Thomas worked for Intel
for 17 years, where he covered and led many different types of system memory IO development.
Thomas received his PhD degree in Electrical Engineering from the Ohio State
University in 1995 and holds over 37 patents in the fields of mixed-signal IO circuits,
system memory configurations, and high-speed clocking for high-speed memory
designs.
Changyi Su is a Staff Design Engineer at Xilinx Inc. Her responsibilities include signal
integrity in high-speed circuits & system channels, jitter and timing impact from power
delivery network, and memory system characterization. She received a Ph.D. degree in
electrical engineering from Clemson University, Clemson, SC.
Juan Wang is a Staff Signal Integrity Engineer at Xilinx Inc. She focuses on
memory interface timing analysis for DDR4, DDR3, and RLDRAM3 and on the corresponding
lab verification. Prior to Xilinx, she worked for Juniper as a signal integrity engineer for
more than 5 years, supporting system design with
10GE/XFI/XLAUI/SFI/sGMII/rGMII/PCIE/DDR3 signal integrity modeling, simulation,
and measurements. Juan received her MSEE from University of Missouri-Rolla and
Tsinghua University.
Carl Gabrielson is a Sr. Staff Product Applications Engineer at Xilinx, Inc. focused on
DDR interface applications. Prior to Xilinx he worked on I/O circuit design at Synopsys,
NVIDIA, 3dfx, and other companies. He began his career designing DRAM at Vitelic
Semiconductor. Carl holds MSEE and BSEE degrees from Stanford University.
David Smith is the Managing Principal Engineer for Xilinx IO interfaces. David has
been leading the IO teams and defending the IO architecture since the UltraScale family.
David is a Ross Freeman Award winner and has been awarded several Circuit Design
Patents. David holds an EECS degree from the University of California at Los Angeles.
Karthi Palanisamy is a Director at Xilinx Inc. where he has worked in multiple roles on
the memory controller design.
Terry Magee has 23 years’ experience in Silicon Valley, having started his career at Cypress
Semiconductor. Since Cypress, Terry has worked in various design and engineering positions
from Ethernet MAC design in a 10-person startup to IP design for DDR memory at LSI and
Synopsys. Terry was lead architect and technical director for Xilinx’s external memory interface
solutions for 20, 16 and 7nm. More recently Terry has joined the Intel PSG group. Terry has
several technical papers and 15 patents to his name.
John Schmitz is a Principal Engineer at Xilinx Inc. where he has worked in multiple
roles on the memory interface team for the past 13 years. Prior to Xilinx he worked at
Pinnacle Systems and Keysight as a development engineer. John received his MSEE and
BSEE degrees from the Massachusetts Institute of Technology.
1. Introduction
Demand for computing system performance has been fueled by new application
requirements, such as hardware machine learning and ultra HD (high definition) real-time
processing. The relative trend comparison is illustrated in Figure 1(a) [1]. The
performance requirement is shown by the red line. To meet the system performance
requirement, the core processor clock frequency of the SOC was increased first; when
that improvement was no longer sufficient, the number of processors was also increased.
Figure 1(b) [2] shows the microprocessor trend data. The green scatter dots indicate that
processor frequency increased until around the mid-2000s. After that, the number of
processor cores was increased to meet the system performance requirements, as indicated
by the black scatter dots.
System memory performance is a key limiting factor, but scaling memory
performance is not the same as scaling ASIC SOC performance. The memory wall
refers to the memory limitation in terms of memory capacity, memory bandwidth, and
latency under certain power and implementation constraints. Memory device
performance has been a key focus in the computing industry for performance
improvement.
Figure 1 (a) Computing Performance Trends [1] (b) Relative System Performance Requirement
Figure 2 shows an example of the data movement required by a DNN (Deep Learning Neural
Network) when used in inference [3]. Based on a recent estimate [4], a 50-layer
ResNet requires about 8 GBytes of memory storage capacity. Another example,
which illustrates the memory bandwidth requirement, is Ultra HD real-time video
processing, shown in Figure 3 [5]. The required memory bandwidth is expected to be at
least 20 GByte/s.
Figure 2 Data Movement in Deep Learning Neural Network Example
Figure 3 Ultra High Definition Streaming Memory Bandwidth Requirement
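As a rough illustration of how these figures relate, the sketch below computes peak interface bandwidth as the per-pin data rate multiplied by the number of DQ bits, using the 3200 Mbps, 64-bit DDR4 example from the abstract, and then the fraction of that peak a workload would need. Pairing the 20 GByte/s Ultra HD requirement with this particular interface is an assumption made only for the example, and the function names are hypothetical.

```python
def peak_bandwidth_gbytes_per_s(data_rate_mbps: float, dq_bits: int) -> float:
    """Peak interface bandwidth in GByte/s: per-pin data rate times bus width."""
    bits_per_second = data_rate_mbps * 1e6 * dq_bits
    return bits_per_second / 8 / 1e9


def required_efficiency(target_gbytes_per_s: float, peak_gbytes_per_s: float) -> float:
    """Fraction of the peak bandwidth that must be sustained to meet the target."""
    return target_gbytes_per_s / peak_gbytes_per_s


# Example values: 3200 Mbps per pin and 64 DQ bits (from the abstract),
# 20 GByte/s target (Ultra HD example). The pairing is an assumption.
peak = peak_bandwidth_gbytes_per_s(3200, 64)
needed = required_efficiency(20.0, peak)
print(f"peak = {peak:.1f} GByte/s, required bus efficiency = {needed:.1%}")
```

Under these assumptions the interface has little headroom, which is why the turnaround gaps discussed later matter for the delivered throughput.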
To meet the capacity requirements, multiple ranks are added to the system platform.
Figure 4 is an example of a multi-rank (multiple-DIMM) DDR4 system and Figure 5 is a
multi-rank LPDDR4 system. The LPDDR4 multi-rank system is formed by dual-die package
configurations.
Figure 4 Dual Rank DDR4 Memory System for Memory Capacity Improvement
Figure 5 Dual Rank LPDDR4 Memory System for Memory Capacity Improvement
Traditionally, memory system performance is gauged by memory capacity and memory
bandwidth. As system memory capacity scales up through additional ranks, the
effective data throughput should also be considered as a figure of merit. The DDR IO
standard is a half-duplex architecture: the DRAM and the memory controller
(an FPGA in this case) can both send and receive data, but only one side can transmit at a
time. The impact of the bus turnaround for Read-to-Write and Write-to-Read in single-rank
and multi-rank systems should therefore be considered. DRAM industry specifications
indicate the minimum required duration from the DRAM device perspective, but the actual
sustainable turnaround time will vary with the particular system design.
This paper proposes a modeling concept to capture the IO turnaround effect and
incorporates it into a model of a selected system. Section 2 presents the concept and
philosophy of the IO model, which captures the dynamics of both the N-over-N driver,
used by the LPDDR4 and LPDDR5 IO standards, and the push-pull driver, used by the DDR4
and DDR5 standards. Section 3 illustrates how the models are incorporated to model system
bus turnaround dynamics and highlights the effective data throughput for different
configurations. Section 4 covers the system memory performance gradient versus the wait
cycle (gap) between transactions for different system configurations. Section 5 describes
the experimental setup and the measurement results. Section 6 summarizes the results,
their implications, and the key takeaways.
2. IO Driver Modeling Concepts
An IBIS model of IO behavior is commonly used to model channel signal integrity. However,
traditional IBIS models capture only transmit and receive behavior, without covering the
transition dynamic. Further, an IBIS model is a snapshot of the IO driver and usually needs
to be regenerated if a design iteration requires a change of IO driver specifications. To
enable early-stage channel analysis and design exploration, Su et al. [6] proposed an
element model approach that covers the structural behavior of the IO. This approach
allows the modeling to be extended to system bus dynamics. This section first covers the
N-over-N driver modeling concept, because it is the IO standard for LPDDR4 and LPDDR5.
The concept is then extended to the push-pull driver, which is used for DDR4 and DDR5.
The N-over-N driver can be divided into two states of operation. The upper
portion of the N-over-N driver is a non-linear driver and the lower portion is a linear
driver, as shown in Figure 6. The IV curves are also shown. When the signal is set to logic
Low, the pull-down section of the N-over-N driver is active. Its I-V characteristic can be
represented by a linearized resistor with a turn-on and turn-off switch, as shown in Figure 7.
The large signal gain can also be modeled in the switching.
Figure 6 LPDDR4 N-over-N Driver
For a logic One, the pull-up section of the driver operates. Even though the driver is
non-linear over its full range of operation, the LPDDR4 signal swing limits the range of
operation, so the pull-up can be modeled with a segmented approximation current source.
Figure 7 N-over-N Driver Pull Down and Pull Up IV Characteristic
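The two element behaviors described above can be sketched in a few lines. This is only a minimal illustration, not the element model of [6]: the pull-down is treated as an ideal switch in series with a linear resistor, and the pull-up as a piecewise-linear ("segmented") current table over the limited signal swing. The resistance and the (voltage, current) breakpoints are made-up placeholder values, and all names are hypothetical.

```python
from bisect import bisect_right


def pull_down_current(v_pad: float, r_on_ohms: float, switch_on: bool) -> float:
    """Linearized pull-down: an on/off switch in series with a resistor to ground."""
    return v_pad / r_on_ohms if switch_on else 0.0


def pull_up_current(v_pad: float, iv_points: list) -> float:
    """Segmented approximation of the pull-up source current.

    iv_points is a list of (voltage, current) breakpoints sorted by voltage.
    Values between breakpoints are linearly interpolated; values outside the
    table are clamped to the nearest end point.
    """
    volts = [v for v, _ in iv_points]
    if v_pad <= volts[0]:
        return iv_points[0][1]
    if v_pad >= volts[-1]:
        return iv_points[-1][1]
    idx = bisect_right(volts, v_pad)
    (v0, i0), (v1, i1) = iv_points[idx - 1], iv_points[idx]
    return i0 + (i1 - i0) * (v_pad - v0) / (v1 - v0)


# Placeholder values for illustration only (not taken from any datasheet).
R_ON_OHMS = 40.0
PULL_UP_IV = [(0.0, 0.010), (0.2, 0.008), (0.4, 0.004), (0.6, 0.0)]

print(pull_down_current(0.2, R_ON_OHMS, switch_on=True))
print(pull_up_current(0.3, PULL_UP_IV))
```

Because the parameters are plain numbers rather than a regenerated table set, a change in driver strength during design exploration only requires editing `R_ON_OHMS` or the breakpoint list.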
Figure 8 shows the simulation correlation between the element model and the IBIS model.
It also compares the signal eye when the element model is used in channel simulation
with the signal eye when the IBIS model is used. This shows that the modeling method can
represent the IO characteristics.
Figure 8 (a) VI & VT curves Comparison between IBIS and Element Model (b) Channel Jitter Comparison
Based on the same concept, the element model for the push-pull IO driver can be
established. The generalized push-pull driver is shown in Figure 9. The switching
characteristic can be set up in the signal control, similar to the N-over-N driver.
Figure 9 Push Pull IO Driver and IV characteristic vs IBIS comparison
Since the IO drivers are used as on-die termination in receive mode, the
termination leg and the assertion time of the transition can also be modeled.
3. Bus System Turnaround Modeling
With the models for the two IO driver standards set up, they can be applied to LPDDR4/5
and DDR4/5 system modeling.
Figure 10 shows a single-rank DIMM configuration. For this configuration, the push-pull
element models are used to model the bus dynamic. Because this is a single-rank
system, the bus turnaround cases to consider are the Read-to-Write and Write-to-Read
operations. The element models form the correct data byte with the correct switch-on and
switch-off timing.
Figure 10 One Rank System Turnaround Modeling
Figure 11 shows the Read-to-Write timing relationship. When the memory controller
completes the Read operation, it needs to switch from receive mode
(with on-die termination enabled) to transmit mode (driver mode enabled). This
switching effect is modeled in the channel simulation. Combining this
switching timing with the channel simulation enables a better understanding of the
turnaround effect on signal integrity.
Figure 11 System Read Write Turnaround
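The Read-to-Write switch at the controller pad can be pictured as two control signals, one that releases the on-die termination and one that enables the driver. The sketch below is a simplified illustration under stated assumptions: linear ramps stand in for the real turn-off and turn-on behavior, times are expressed in tCK, and the function names, ramp time, and example times are all hypothetical rather than taken from the paper.

```python
def ramp(t: float, t_start: float, t_ramp: float) -> float:
    """Control level rising linearly from 0 to 1 over t_ramp, starting at t_start."""
    if t <= t_start:
        return 0.0
    if t >= t_start + t_ramp:
        return 1.0
    return (t - t_start) / t_ramp


def controller_controls(t: float, t_read_end: float, t_write_start: float,
                        t_ramp: float) -> tuple:
    """Return (odt_enable, driver_enable), each in the range 0..1, at time t.

    The termination starts to release when the Read burst ends, and the driver
    is fully on by the time the Write burst starts.
    """
    odt_enable = 1.0 - ramp(t, t_read_end, t_ramp)
    driver_enable = ramp(t, t_write_start - t_ramp, t_ramp)
    return odt_enable, driver_enable


# Illustrative timeline: Read ends at 10 tCK, Write starts at 12 tCK (gap of 2 tCK),
# with an assumed 0.5 tCK switching ramp.
for t in (9.0, 10.25, 11.0, 11.75, 13.0):
    print(t, controller_controls(t, t_read_end=10.0, t_write_start=12.0, t_ramp=0.5))
```

In a channel simulation these two levels would scale the termination leg and the driver element respectively; shrinking the gap below twice the ramp time makes the two overlap, which is the kind of condition the turnaround analysis is meant to expose.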
A similar modeling approach can be applied to multi-rank systems for
DDR4/DDR5 and LPDDR4/LPDDR5.
4. Turnaround Gap versus Data Throughput Efficiency
The actual data throughput should be considered as an additional figure of merit. When
the wait cycle (or gap) between the read/write transactions widens, the effective data
transfer throughput drops, and with it the system efficiency.
Figure 12 shows the case in which the memory controller completes the Read transaction
and then switches its IO from on-die termination to a driver. If the IO signal has not
settled, the system may have to increase this gap before it can drive again.
Figure 13 illustrates the performance impact when the gap between Read/Write
transactions is widened. The numbers on the horizontal axis indicate the number of system
clock cycles (tCK) in this gap. The performance simulation assumes 50% Read and 50% Write
traffic, a Read CAS latency of 9 tCK, and a Write CAS latency of 9 tCK.
Figure 12 Read to Write Turnaround Gap (1 Rank System)
Figure 13 Data Throughput vs Gap in 1 Rank System
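The trend in Figure 13 can be approximated with a very simple first-order estimate. The sketch below is not the performance simulation used for the figure: it assumes strictly alternating Read and Write bursts (consistent with a 50%/50% mix), a burst length of 8, which occupies 4 tCK on a double-data-rate bus, and it ignores CAS latency and bank timing. The function name and the burst-length choice are assumptions for illustration.

```python
def bus_efficiency(gap_tck: int, burst_tck: int = 4) -> float:
    """Fraction of bus time that carries data when every burst is followed by a gap.

    burst_tck defaults to 4, the duration of a burst of 8 transfers on a
    double-data-rate bus (assumed here, not stated in the paper).
    """
    return burst_tck / (burst_tck + gap_tck)


# Efficiency versus turnaround gap, in tCK.
for gap in range(0, 9):
    print(f"gap = {gap} tCK -> efficiency = {bus_efficiency(gap):.2f}")
```

Even this crude estimate shows why the gap matters: under these assumptions a gap equal to the burst duration halves the delivered throughput.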
The bus turnaround effect was simulated for the 1 Rank system. The results are shown in
Figure 14. The signal settling behavior can be observed in the DQS signal line.
Figure 14 Bus Turnaround System Simulation with Element Model
The modeling of a multi-rank system can leverage the same concept. However, the
bus dynamic is more complicated because the turnaround transactions can involve
two different ranks. For instance, the operation can read data from Rank 0 and then write
to Rank 1, or vice versa. The channel jitter effect is then more challenging.
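To make the added complexity concrete, the sketch below classifies each pair of consecutive transactions by whether the direction changes, the rank changes, or both, and sums a per-category gap. The category names and the gap values in tCK are hypothetical placeholders, not values from the paper or from any DRAM specification; in practice they would come from the turnaround simulation or measurement.

```python
def classify(prev: tuple, curr: tuple) -> str:
    """Classify the transition between two (operation, rank) transactions."""
    (prev_op, prev_rank), (curr_op, curr_rank) = prev, curr
    direction_change = prev_op != curr_op
    rank_change = prev_rank != curr_rank
    if direction_change and rank_change:
        return "direction_and_rank"
    if direction_change:
        return "direction"
    if rank_change:
        return "rank_to_rank"
    return "none"


def total_gap_tck(sequence: list, gap_table: dict) -> int:
    """Sum the gap cycles needed between consecutive transactions."""
    return sum(gap_table[classify(a, b)] for a, b in zip(sequence, sequence[1:]))


# Placeholder gap values in tCK, for illustration only.
GAP_TABLE = {"none": 0, "rank_to_rank": 2, "direction": 3, "direction_and_rank": 4}

# Read from Rank 0, write to Rank 1 twice, then read from Rank 0 again.
traffic = [("R", 0), ("W", 1), ("W", 1), ("R", 0)]
print([classify(a, b) for a, b in zip(traffic, traffic[1:])])
print(total_gap_tck(traffic, GAP_TABLE))
```

A scheduler that groups transactions by direction and rank reduces the number of costly transitions, which is one way the predicted gaps could feed back into controller design.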
Figure 15 Relative Power Improvement Comparing across 11 different Usage Programs
Figure 16 Relative Power Improvement Comparing across 11 different Usage Programs
Figure 17 shows the relative performance gradient versus the gap between data
transactions in a 2 Rank system. As expected, the performance drops as the gap increases.
Figure 17 Relative Performance vs Gap in 2 Rank System
5. Experimental Data Validations & Results
Based on the proposed analysis approach, different system memory platforms were
designed. Figure 18 shows one example: a 32-bit-wide DQ LPDDR4
interface. This memory interface system served as a validation platform for the
proposed method.
Figure 18 A LPDDR4 System for Validation
The Write-to-Read transaction was monitored first, with high-speed probes placed on
the back side of the DRAM as shown in Figure 18 (b). A set of defined gaps between
these transactions was applied to the controller in the lab, and the signal integrity of
the bus turnaround dynamic was simulated. The measured and simulated results were then
compared.
Figure 19 (a) is the actual measurement for the Write-to-Read turnaround and (b) is the
simulation based on the proposed method. Figure 20 (a) is the measurement for the
Read-to-Write transaction and (b) is the simulation result. With the proposed modeling
approach, the turnaround timing dynamic can be predicted accurately. This gives system
designers a better understanding of the signal settling limitation of their selected system.
Figure 19 Write to Read Signal Turnaround Dynamic Comparison: (a) Measurement (b) Simulation
Figure 20 Read to Write Signal Turnaround Dynamic Comparison
6. Summary and Conclusions
The data bus turnaround dynamic can impact the actual data throughput. This paper
provides a new method, which uses the element IO model as the base unit, to capture the
IO turnaround dynamic. The element modeling of two key types of IO design, the N-over-N
driver and the push-pull driver, which cover LPDDR4/LPDDR5 and DDR4/DDR5, was
discussed. The incorporation of the element model to represent the system IO signal for
channel simulation was also described, as well as the performance impact of different
wait cycles (gaps) between transactions. The proposed method allows system designers
to predict the data throughput accurately and hence make better trade-offs before the
system is manufactured.
References
[1] J. Kim, “Memory Innovation for Embedded Vision Systems,” February 22nd 2017,
[Link]
vision-systems-a-presentation-from-samsung-electronics
[2] K. Rupp, [Link]
[3] A. Luo, “FPGA Machine Learning Applications,” Matlabexpo, June 20th, 2017,
[Link]
[4] J. Hanlon, “Why is so much memory needed for Deep Neural Networks?”
[Link]
networks
[5] T. To, “System Memory Impact to New Computing Platform,” UC Berkeley Tech
Talk, Oct. 25th, 2018.
[6] C. Su, T. To, Y. Wang, “LPDDR4 IO Modeling and System Correlation for Critical
Targeted Data Speed and Data Through-put,” Jan. 22nd, DesignCon 2018.