Downloaded from [Link]
Low Power Network-on-Chip Architecture Design Technique
Tejas Musale1, Arun Ganti2, Ankur Gogoi3 and Kanchan Manna4
1 Department of Electrical and Electronics Engineering, BITS Pilani, K. K. Birla Goa Campus, India
2,4 Department of Computer Science and Information Systems, BITS Pilani, K. K. Birla Goa Campus, India
3 Galgotias University, Uttar Pradesh, India
2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC) | 979-8-3315-3967-2/24/$31.00 ©2024 IEEE | DOI: 10.1109/VLSI-SOC62099.2024.10767805
Email: {1 f20190409, 2 f20190021, 4 kanchanm}@[Link], 3 [Link]@[Link]
Abstract: In the evolving landscape of many-core architectures, the Network-on-Chip (NoC) has emerged as a pivotal interconnection framework, accommodating the escalating number of cores on a single chip. Despite its widespread adoption, power consumption remains a critical challenge. It is significantly influenced by factors such as topology, bit toggling, and routing algorithms, with bit-switching power being a predominant concern. In this work, we propose a technique, applied in the cores, aimed at reducing the power consumption of the NoC. To estimate the area, timing, and power consumption of our method, we synthesized it on a Xilinx Virtex-7 FPGA board. We modified the open-source Noxim simulator to validate our approach using synthetic traffic. Empirical results demonstrate a 29.64% reduction in dynamic power consumption and a 29.09% improvement in switching activity for the average-case (Mode 1) traffic scenario.
Index Terms: Network-on-Chip (NoC), Field Programmable Gate Array (FPGA), switching power, Hamming distance sorting, switch-router architecture, Noxim.
This work was supported in part by the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India, under Grant SRG/2021/001239.

I. INTRODUCTION
The rapid advancement of artificial intelligence (AI) technologies has precipitated an unprecedented demand for more complex and integrated computing cores capable of seamless intercommunication. As Moore's law propels the miniaturization of chip sizes, the resulting intricate architectures and increased logic density pose significant challenges. System-on-chips (SoCs), which feature multiple cores, are particularly affected by the critical issue of core-to-core communication. On conventional processors, cores communicate over shared buses. These buses are prone to scalability issues such as congestion, degraded performance, and power dissipation. To address these issues, the Network-on-Chip (NoC) architecture provides a scalable, modular, and flexible interconnect fabric with high-bandwidth, low-latency data transfer capabilities [1]. The components of a NoC include routers, links, and Network Interfaces (NIs). Network topologies are formed by connecting routers via links. The cores are attached to the routers via NIs, and the router fabric carries the communication between cores. Due to their scalability, modularity, low power consumption, low latency, and high bandwidth, mesh-based NoCs are popular in modern electronic systems. Fig. 1 presents a 4 × 4 mesh-based NoC. Nevertheless, the NoC consumes significant power. For example, the NoC accounts for 30% and 40% of chip power in the Intel 80-core Terascale chip [2] and MIT RAW [3], respectively.

Figure 1: A 4 × 4 mesh-based NoC Architecture

In the quest to minimize the power consumption of the NoC, researchers have proposed several approaches, which are discussed in Section II. In this work, we focus on the switching factor, which is determined by the toggle rate of the bits in a packet. High toggle rates increase power consumption [4], necessitating strategies to reduce bit toggling at the packet level. The authors in [4] proposed the selective packet interleaving (SPI) technique, applied across the virtual channels of a router's output link, to minimize power consumption.
Authorized licensed use limited to: Julius-Maximilians-Universitaet Wuerzburg. Downloaded on June 11,2025 at 11:25:33 UTC from IEEE Xplore. Restrictions apply.
Based on the Hamming distance, SPI multiplexes flits onto the router's output channel so that the number of bit transitions between two consecutive flits is reduced. For instance, consider a scenario where a flit 11001100 is transmitted from router x to router y, and the next available flits are 00110011 and 11011101. With Hamming distance sorting, the next flit chosen for transmission is 11011101, which requires only 2 bit switches relative to the previous flit (00110011 would require 8), reducing the switching activity and conserving power. However, the flit selection happens in every router, and as a result the NoC's routers are slower to process packets. SPI also works well only when many virtual channels (VCs) are present in the routers [4]. These are the gaps in the state of the art (SOTA).

The contributions of our work are as follows:
• We have introduced a novel approach at the core or processor level, instead of at the router level as in [4], to minimize the bit transitions at the packet level, reducing the NoC's power consumption. Our approach does not affect the routers' frequency, whereas SPI [4] reduces the router's processing speed. The performance of our proposed technique does not depend on the number of VCs.
• We have presented an integrated analysis of our proposed approach using the open-source Noxim NoC simulator [5] and Orion [6] for power consumption assessment. We have evaluated the impact of switching factor variations on the NoC's power usage.
• We have synthesized our proposed technique on Field-Programmable Gate Arrays (FPGAs) to obtain the area, power, and latency information.

The rest of the paper is organized as follows: Section II describes the basics of NoC and reviews related works. Section III explains the proposed methodology. Section IV presents the experimental results and discussion, and conclusions are drawn in Section V.

II. BACKGROUND AND RELATED WORK
Bus-based communication infrastructures are used to meet the demand for aggressive communication among the ten or so cores of multi-core architectures. However, as the number of cores increased, the performance of bus-based architectures started to decrease, which led to the inception of the NoC. In a NoC, flexible communication is provided by routers, which are connected to each other via bidirectional communication links. Each processing element or core is connected to a router via the Network Interface (NI). Instead of fixed communication links, the NoC provides flexible communication for low latency and dynamic data transmission paths. In the literature, various NoC topologies have been proposed, such as mesh, torus, butterfly, ring, and tree. In our work, we have considered a 2D mesh topology. However, our proposed method is applicable to other NoC topologies as well. The processing elements of a NoC coordinate with each other for concurrent processing of the data of a particular application, and the application data is transmitted from one core to another using routing approaches such as XY, Odd-Even, and North-Last. In a NoC, for convenience of transmission, a packet is decomposed into smaller units called flits (flow control units). Three types of flits are available: header, body, and tail. The header flit determines the transmission path of all the flits of a packet based on the employed routing algorithm. The body flits contain the application data to be processed by the processing element, and the tail flit indicates the end of a particular packet.

The power consumption of interest in a NoC is dynamic: it is generated by the charging and discharging of the capacitive load on the data-bus wires inside channels and buffers when a transistor changes its state. The overall power is directly proportional to the frequency of the switching events. During flit transmission, bit-switching activity occurs at the ports of routers, leading to power dissipation. That is, the higher the dissimilarity between flits, the higher the power dissipation in routers. Only a few works in the literature address this issue; they are reviewed in this section. The authors in [7] propose a highly scalable and portable FPGA-based single-cycle, low-latency router design for NoC applications. Reducing the input port buffer depth allowed for increased network speed while using low power for data transfer. Communication among the several cores of a single chip has become the largest source of power dissipation. To tackle this problem, wireless NoCs have been proposed in [8] for multi-core SoCs. The whole architecture has been implemented and integrated in Noxim [5]. The dynamic power consumption of a state-of-the-art microprocessor is investigated in [9], which further proposes power-saving approaches to modify a router, resulting in an average reduction of 14% in dynamic power. In [10], a unified nanometer-scale bus energy dissipation and thermal model is reported that can be used by designers to track temperature increase and energy dissipation in individual wires during trace-driven or power/performance simulation. Apart from the self-capacitance, the proposed model incorporates the impact of lateral heat transfer between neighboring wires. The authors in [11] proposed an approach that uses routing flexibility to provide a solution for power- and performance-aware mapping, resulting in significant energy savings. According to [12], power consumption increases as the Hamming distance between adjacent flits increases. A higher Hamming distance implies more bit transitions when moving from one flit to the next, which in turn increases the power consumption due to the charging and discharging of the capacitive loads. Accordingly, an approach is presented that minimizes the Hamming distance between successive elements of the packet, thereby optimizing switching power. In [4], the authors introduced selective packet interleaving (SPI), a flit transmission scheme that reduces dynamic power consumption in NoC links by decreasing the number of bit transitions on the links. SPI minimizes the Hamming distance between two successive flits by choosing, out of all virtual channels, the flit that has the minimum Hamming distance to the previously transmitted flit. One drawback of this approach is that certain VCs might sometimes have to wait for a very long time before transmitting their flits. While selecting flits with fewer bit transitions, this approach only considers the flits available at
the front or head of each buffer and does not consider all the flits available in the link buffers of the NoC router. In [13], the authors conducted a study on reducing power consumption in NoCs by minimizing signal transition activity via data encoding methods. Experimental investigations have shown that the encoding effectiveness depends on the transition activity patterns. The works reviewed above modify the basic NoC router architecture to minimize the NoC's power consumption, which can increase the flit processing time. To deal with this problem, we propose a technique in the core that can reduce the NoC's power consumption while leaving the basic router architecture unchanged.

III. PROPOSED METHODOLOGY
In response to the escalating power consumption challenges in NoC architectures, this paper proposes a novel flit processing architecture that efficiently minimizes the bit-switching frequency in the packet. In our approach, the packet generated by the core is split into several flits based on the flit size, and a sequence number is appended to each flit. We call such an entity a SeqFlit. The flit size is taken from the NI and does not include the bits required to identify the flit types. The module then sorts the SeqFlits based on the Hamming distance, starting from the header flit. The header flit contains the source and destination routers' information and 0s in the sequence number field. First, the module finds the SeqFlit whose Hamming distance to the header flit is smallest. Next, it finds the SeqFlit whose Hamming distance to the previously selected SeqFlit is smallest. The rest of the SeqFlits of the packet are arranged in the same way. The sequence number facilitates the reordering of the SeqFlits into their original sequence at the destination core. The packet's maximum sequence number is fixed, and the number of bits required for the field is $\log_2(\text{No\_Flits})$.

Our proposed strategy is presented in Fig. 2. In the diagram, the Pkt unit stores the entire packet as flits, for example, F1, F2, F3, and F4, and sends it to the Sequence Number Adder (SNA) module, which adds the sequence numbers to the flits (the sequence bits are highlighted by the dotted red box) and generates the SeqFlits. The Hamming Distance Sorter (HDS) sorts the SeqFlits and sends the rearranged packet (the set of SeqFlits) to the NI. The NI takes the SeqFlits one after another and puts them in the flit structure available in the routers. Based on the destination address, the routing logic forwards the flits.

The destination router can receive the SeqFlits out of order. The SeqFlits are sent to the NI, and the NI stores them in a reorder buffer (ROB) according to their sequence numbers. Next, the sequence numbers are removed from the SeqFlits in the ROB by the sequence number remover (SNR) module, and the entire packet is stored in another Pkt unit.

In summary, the destination side receives the flits of a packet out of order. The sequence numbers in the flits are used to rearrange them and are stripped away at the end. Next, the packet is sent to the core.

Figure 2: Flow diagram of the proposed technique

The HDS module is presented in Algorithm 1. The hamming_distance() function calculates the Hamming distance between two SeqFlits. It performs a bitwise XOR operation between the two SeqFlits, identifying the bit positions in which they differ. The outcome of this operation dictates the arrangement strategy, which aims to position the flits in a sequence that minimizes these distances. Crucially, the header and tail SeqFlits maintain their positions, while the body SeqFlits are reordered based on their Hamming proximity to the preceding SeqFlit. The algorithm has an overall time complexity of $O(N^2)$, where $N$ is the number of SeqFlits in the packet. While there is potential to optimize this time complexity further, which we propose as part of future work, the current implementation suffices for the purposes of this study.

Algorithm 1 Rearrangement of SeqFlits
Result: Rearrange SeqFlits based on Hamming distance
/* HD: Hamming Distance */
for i ← 0 to N − 1 do
    min_dist ← HD(SeqFlits[i], SeqFlits[i + 1])
    min_idx ← i + 1
    for j ← i + 2 to N − 1 do
        dist ← HD(SeqFlits[i], SeqFlits[j])
        if dist < min_dist then
            min_idx ← j
            min_dist ← dist
        end
    end
    if min_idx ≠ i + 1 then
        swap(SeqFlits, i + 1, min_idx)
    end
end

IV. EXPERIMENTAL RESULTS AND DISCUSSION
We have implemented our proposed approach in the open-source Noxim simulator [5]. For the experiments, we used the simulation parameters listed in Table I.
In this experiment, we have generated random synthetic traffic. We have generated random numbers between minimum
and maximum values and used those values as body flits. The minimum value is 0, and the maximum is $2^k - 1$, where $k$ is the number of bits required for body flits; it does not include the bits required for the sequence number.

TABLE I: Noxim Settings
Parameters        Values
Topology          Mesh (4×4)
Buffer depth      4
VCs               1
Flit size         32 bits
Routing           Dimension-order (XY)
Selection logic   Random
Warmup time       10,000 clk cycles
Simulation time   200,000 clk cycles
Traffic pattern   Random

Figure 3: Switching factor variation

We have generated three modes of random traffic: Mode 1, 2, and 3.
1) Mode 1 (average case): Randomly generated body flits between the minimum and maximum values.
2) Mode 2 (not average case): Alternating transmission of two randomly selected numbers between the minimum and maximum values as the body flits.
3) Mode 3 (not average case): Sending the maximum and minimum numbers of the range alternately as the body flits.
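The three traffic modes and the effect of Hamming-distance reordering on them can be exercised with a short script. The following is a minimal sketch under stated assumptions, not the modified Noxim code: the function names (`toggles`, `reorder_seqflits`, `make_body_flits`), the all-zero `HEADER` and `TAIL` placeholder flits, the fixed random seed, and the omission of sequence-number bits are all hypothetical choices made for illustration. The reordering follows the greedy nearest-neighbour idea of Algorithm 1, keeping the first (header) and last (tail) flit in place as the text describes.

```python
import random

FLIT_BITS = 32   # flit size from Table I; sequence-number bits are left out (assumption)
HEADER = 0       # hypothetical placeholder header flit
TAIL = 0         # hypothetical placeholder tail flit


def hamming_distance(a, b):
    """Number of bit positions in which two flits differ (popcount of XOR)."""
    return bin(a ^ b).count("1")


def toggles(flits):
    """Total bit transitions when the flits are sent back to back on a link."""
    return sum(hamming_distance(x, y) for x, y in zip(flits, flits[1:]))


def reorder_seqflits(flits):
    """Greedy reordering: header (first) and tail (last) keep their positions.

    Each body slot receives the not-yet-placed body flit that is closest in
    Hamming distance to the flit placed just before it. Ties keep the
    earliest candidate, matching the strict '<' comparison in Algorithm 1.
    """
    out = list(flits)
    n = len(out)
    for i in range(n - 2):
        best = min(range(i + 1, n - 1),
                   key=lambda j: hamming_distance(out[i], out[j]))
        out[i + 1], out[best] = out[best], out[i + 1]
    return out


def make_body_flits(mode, count, rng):
    """Body flits for Mode 1 (random), Mode 2 (two values), Mode 3 (max/min)."""
    lo, hi = 0, (1 << FLIT_BITS) - 1
    if mode == 1:
        return [rng.randint(lo, hi) for _ in range(count)]
    if mode == 2:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        return [a if i % 2 == 0 else b for i in range(count)]
    if mode == 3:
        return [hi if i % 2 == 0 else lo for i in range(count)]
    raise ValueError("mode must be 1, 2, or 3")


if __name__ == "__main__":
    rng = random.Random(0)
    for mode in (1, 2, 3):
        packet = [HEADER] + make_body_flits(mode, 16, rng) + [TAIL]
        in_order = toggles(packet)
        out_of_order = toggles(reorder_seqflits(packet))
        print(f"Mode {mode}: in-order={in_order}, out-of-order={out_of_order}")
```

In Modes 2 and 3 the reordering groups identical body flits together, so only one transition between the two distinct values remains inside the body. The greedy choice is a heuristic: it does not guarantee the minimum possible number of toggles for arbitrary Mode 1 data.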
A. Switching Factor and Power Consumption Analysis
We have generated synthetic traffic for the three modes, sent it in both "In-Order" and "Out-of-Order" forms, and measured the switching activity at each router's input ports. The bit switching factor for each packet was calculated using the Noxim simulator, with packet sizes ranging from 8 to 132 flits and a flit size of 32 bits. Figure 3 illustrates the comparison of the switching factor for all three modes. The "In-Order" transmission refers to the scenario without our proposed technique, whereas the "Out-of-Order" transmission applies our proposed technique to optimize the Hamming distance between consecutive flits. Significant improvements have been observed across all modes, with average improvements in the switching factor of 29.09%, 92.95%, and 92.96% for Modes 1, 2, and 3, respectively. These results indicate a definitive improvement even in the random generation case and more drastic improvements in the other two modes. The reported power consumption does not include the power consumed by our proposed technique, as it is performed at the core or processor, not in the network or NoC.

The total power consumed by the NoC has been calculated using the ORION3.0 simulator [6], which employs 65 nm technology. The total power, $P_{\text{total}}$, is given by:

$$P_{\text{total}} = P_{\text{internal}} + P_{\text{switching}} + P_{\text{leakage}} + P_{\text{link}}$$

where $P_{\text{switching}}$ and $P_{\text{link}}$ are the dynamic power components determined by the switching factor, and $P_{\text{leakage}}$ and $P_{\text{internal}}$ represent the static power components of the NoC. The link power depends on the number of links available in the NoC topology. Figures 4 and 5 show the variations in total power and switching power for varying packet sizes across all modes. Our simulations demonstrate average improvements of 29.64%, 93.28%, and 93.28% in switching power for Modes 1, 2, and 3, respectively. Overall, the random generation case of the NoC saves an average of 17.8 mW in power across varying packet sizes. We observe significant improvements in both the switching factor and total power consumption, underscoring the efficacy of our approach. The experimental results also confirm that the benefits of our proposed technique increase with the packet size. This gain comes at the cost of the additional bits used for the sequence numbers.

Figure 4: Total Power variation

Our proposed technique does not impact other NoC parameters such as network latency and throughput, as the implementation is at the core or processor level rather than in the NoC itself. However, it can increase the processing time at the core, leading to a potential increase in overall latency when the core's processing time is included, which may subsequently reduce throughput. Despite this, the technique offers significant benefits in applications requiring low-power NoCs, such as high-performance computing environments, wireless communication systems (such as smartphones and
IoT devices), embedded systems in medical and industrial applications, and AI/ML on edge or wearable devices [14]. These applications would benefit significantly from our design due to their need for energy efficiency, extended operational life, improved performance, and reliable operation in power-constrained environments.

Figure 5: Switching Power variation

B. FPGA Implementation and Performance Evaluation of Hamming Distance Sorter IP
We have implemented the proposed method in Verilog HDL and targeted a Virtex-7 FPGA board, using the Xilinx Vivado Design Suite for synthesis, placement, and routing. This process aimed to assess the resource utilization and the maximum clock frequency achievable by the IP. As indicated in Table II, resource utilization was minimal, with Look-Up Tables (LUTs) and Flip-Flops (FFs) showing very low usage, which demonstrates an efficient use of the FPGA's logic resources. Specifically, the IP utilized only 673 LUTs and 632 FFs, corresponding to just 0.05% and 0.02% of the total available resources, respectively. The maximum clock frequency achieved by the IP was 447 MHz, underscoring the IP's ability to operate at high speeds, a critical feature for the cores in a multi-core system. Because the IP can be integrated alongside the core, it has potential for efficient incorporation into large NoCs that require rapid flit reordering capabilities.

TABLE II: Resource Utilization
Resource   Usage   Available   Utilization %
LUT        673     1303680     0.05
FF         632     2607360     0.02
IO         530     624         84.94
BUFG       2       1008        0.2

A qualitative comparison with [4] suggests that our approach consumes less power. The method in [4] is buffer-dependent, and its power efficiency improves with larger buffer sizes. In contrast, our approach computes the next flit selection within the core, eliminating the need for additional buffers and thereby reducing power consumption. Unlike [4], which can face starvation due to changes in packet transmission order, our method reorders flits within a packet without altering the order of packet transmission. This ensures that all packets are transmitted in their generation order, avoiding starvation and the need for counters in each virtual channel. Consequently, our approach is more area-efficient, as it eliminates the area overhead associated with the counters used in [4].

V. CONCLUSION AND FUTURE WORK
In this paper, we presented a technique that sorts flits before transmission to achieve energy-efficient NoCs. The flit sorting is performed by optimizing the Hamming distance between successive flits, which significantly reduces the switching activity. Our technique, which adds a simple and very low complexity circuit to the core, has demonstrated significant improvements in dynamic power consumption. Specifically, we observed a 29.09% improvement in switching activity and a 29.64% improvement in dynamic power consumption in the average-case traffic scenario (Mode 1). Furthermore, in scenarios where alternate body flits are the same, which is common in real workloads, our technique achieved remarkable average power consumption and switching activity improvements of 92.9% and 92.8%, respectively. The benefits of our technique grow with increasing packet size. For future work, we plan to conduct simulations using the Gem5 simulator with real workloads to further validate our approach. Additionally, we plan to explore other encoding and power optimization techniques in combination with our sorting approach to address the low-power challenges introduced by modern multi-core systems, particularly the dynamic power consumption due to the switching activity of interconnects. In addition, a more ambitious study could investigate the use of other sorting techniques for flits.

REFERENCES
[1] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A network on chip architecture and design methodology,” in Proceedings IEEE Computer Society Annual Symposium on VLSI. New Paradigms for VLSI Systems Design. ISVLSI 2002, 2002, pp. 117–124.
[2] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, “A 5-GHz mesh interconnect for a teraflops processor,” IEEE Micro, vol. 27, no. 5, pp. 51–61, 2007.
[3] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams,” in Proceedings of the 31st Annual International Symposium on Computer Architecture, ser. ISCA ’04. IEEE Computer Society, 2004, p. 2.
[4] A. Berman, R. Ginosar, and I. Keidar, “Order is power: Selective packet interleaving for energy efficient networks-on-chip,” in 2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip, 2010, pp. 37–42.
[5] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, “Noxim: An open, extensible and cycle-accurate network on chip simulator,” in 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015, pp. 162–163.
[6] A. B. Kahng, B. Lin, and S. Nath, “ORION3.0: A comprehensive NoC router estimation tool,” IEEE Embedded Systems Letters, vol. 7, no. 2, pp. 41–45, 2015.
[7] P. Gupta, A. Akoglu, K. Melde, and J. Roveda, “FPGA based single cycle, reconfigurable router for NoC applications,” in 2013 IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 2428–2431.
[8] S. Mnejja, Y. Aydi, and M. Abid, “Exploring hybrid NoC architecture for chip multiprocessor,” in 2018 30th International Conference on Microelectronics (ICM), 2018, pp. 307–310.
[9] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, “Interconnect-power dissipation in a microprocessor,” in Proceedings of the 2004 International Workshop on System Level Interconnect Prediction, ser. SLIP ’04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 7–13. [Online]. Available: [Link] 966747.966750
[10] K. Sundaresan and N. Mahapatra, “Accurate energy dissipation and thermal modeling for nanometer-scale buses,” in 11th International Symposium on High-Performance Computer Architecture, 2005, pp. 51–60.
[11] J. Hu and R. Marculescu, “Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures,” in 2003 Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 688–693.
[12] A. Agrawal, “An architectural power model for networks on chip,” 2023.
[13] J. C. Palma, L. S. Indrusiak, F. G. Moraes, A. G. Ortiz, M. Glesner, and R. A. Reis, “Inserting data encoding techniques into NoC-based systems,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI ’07), 2007, pp. 299–304.
[14] E. Ofori-Attah and M. O. Agyeman, “A survey of low power NoC design techniques,” in Proceedings of the 2nd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, ser. AISTECS ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 22–27. [Online]. Available: [Link]