0% found this document useful (0 votes)
18 views23 pages

Advanced Computer Architecture Insights

The document discusses advanced computer architecture, focusing on parallel processing architectures, multithreading, and the differences between multicore and multiprocessor systems. It outlines the challenges of instruction-level parallelism (ILP), the limitations of hardware multithreading, and the advantages and disadvantages of multicore and multiprocessor systems. Additionally, it describes various types of multithreading, including coarse-grained, fine-grained, and simultaneous multithreading, emphasizing their impact on performance and efficiency.

Uploaded by

joelnaz1979
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views23 pages

Advanced Computer Architecture Insights

The document discusses advanced computer architecture, focusing on parallel processing architectures, multithreading, and the differences between multicore and multiprocessor systems. It outlines the challenges of instruction-level parallelism (ILP), the limitations of hardware multithreading, and the advantages and disadvantages of multicore and multiprocessor systems. Additionally, it describes various types of multithreading, including coarse-grained, fine-grained, and simultaneous multithreading, emphasizing their impact on performance and efficiency.

Uploaded by

joelnaz1979
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT V ADVANCED COMPUTER ARCHITECTURE

Parallel processing architectures and challenges, Hardware multithreading, Multicore and shared
memory multiprocessors, Introduction to Graphics Processing Units, Clusters and Warehouse
scale computers, Introduction to Multiprocessor network topologies.

5.1 PARALLEL PROCESSING CHALLENGES


5.1.1 Limitations of ILP
The Hardware Model
An ideal processor is one where all constraints on ILP are removed. The only limits on ILP in such a
processor are those imposed by the actual data flows through either registers or memory.
The assumptions made for an ideal or perfect processor are as follows:
5.1.1. [Link] renaming
—There are an infinite number of virtual registers available, and hence all WAW and WAR hazards
are avoided and an unbounded number of instructions can begin execution simultaneously.
5.1.1. [Link] prediction
—Branch prediction is perfect. All conditional branches are predicted exactly.
5.1.1. [Link] prediction
—All jumps (including jump register used for return and computed jumps) are perfectly predicted.
When combined with perfect branch prediction, this is equivalent to having a processor with perfect
speculation and an unbounded buffer of instructions available for execution.
5.1.1. [Link] address alias analysis
—All memory addresses are known exactly, and a load can be moved before a store provided that
the addresses are not identical. Note that this implements perfect address alias analysis.

5.1.1. [Link] caches


—All memory accesses take 1 clock cycle. In practice, superscalar processors will typically consume
large amounts of ILP hiding cache misses, making these results highly optimistic.
To measure the available parallelism, a set of programs was compiled and optimized with the
standard MIPS optimizing compilers. The programs were instrumented and executed to produce a
trace of the instruction and data references. Every instruction in the trace is then scheduled as early
as possible, limited only by the data dependences. Since a trace is used, perfect branch prediction and
perfect alias analysis are easy to do. With these mechanisms, instructions may bescheduled much
earlier than they would otherwise, moving across large numbers of instructions on which they are not
data dependent, including branches, since branches are perfectly predicted.
The effects of various assumptions are given before looking at some ambitious but realizable
processors.
5.1.2 Limitations on the Window Size and Maximum Issue Count
To build a processor that even comes close to perfect branch prediction and perfect alias analysis
requires extensive dynamic analysis, since static compile time schemes cannot be perfect. Of course,
most realistic dynamic schemes will not be perfect, but the use of dynamic schemes will provide the
ability to uncover parallelism that cannot be analyzed by static compile time analysis. Thus, a
dynamic processor might be able to more closely match the amount of parallelism uncovered by our
ideal processor.
5.1.3. The Effects of Realistic Branch and Jump Prediction
Our ideal processor assumes that branches can be perfectly predicted: The outcome of any branch in
the program is known before the first instruction is executed! Of course, no real processor can ever
achieve this.
We assume a separate predictor is used for jumps. Jump predictors are important primarily with the
most accurate branch predictors, since the branch frequency is higher and the accuracy of the branch
predictors dominates.
[Link] —All branches and jumps are perfectly predicted at the start of execution.
[Link]-based branch predictor —The prediction scheme uses a correlating 2-bit predictor and
a noncorrelating 2-bit predictor together with a selector, which chooses the best predictor for each
branch.
5.1.4. The Effects of Finite Registers
Our ideal processor eliminates all name dependences among register references using an infinite set
of virtual registers. To date, the IBM Power5 has provided the largest numbers of virtual registers: 88
additional floating-point and 88 additional integer registers, in addition to the 64 registers available
in the base architecture. All 240 registers are shared by two threads when executing in multithreading
mode, and all are available to a single thread when in single-thread mode.
5.1.5 The Effects of Imperfect Alias Analysis
Our optimal model assumes that it can perfectly analyze all memory dependences, as well as
eliminate all register name dependences. Of course, perfect alias analysis is not possible in practice:
The analysis cannot be perfect at compile time, and it requires a potentially unbounded number of
comparisons at run time (since the number of simultaneous memory references is unconstrained).
5.1.6 The three models are
1. Global/stack perfect—This model does perfect predictions for global and stack references and
assumes all heap references conflict. This model represents an idealized version of the best compiler-
based analysis schemes currently in production. Recent and ongoing research on alias analysis for
pointers should improve the handling of pointers to the heap in the future.
2. Inspection—This model examines the accesses to see if they can be determined not to interfere at
compile time. For example, if an access uses R10 as a base register with an offset of 20, then another
access that uses R10 as a base register with an offset of 100 cannot interfere, assuming R10 could not
have changed. In addition, addresses based on registers that point to different allocation areas (such
as the global area and the stack area) are assumed never to alias. This analysis is similar to that
performed by many existing commercial compilers, though newer compilers can do better, at least
for looporiented programs.
3. None—All memory references are assumed to conflict. As you might expect, for the FORTRAN
programs (where no heap references exist), there is no difference between perfect and global/stack
perfect analysis
5.2 Multithreading
The multithreading paradigm has become more popular as efforts to further
exploit instruction-level parallelism have stalled since the late 1990s. This allowed the concept
of throughput computing to re-emerge from the more specialized field of transaction processing.
Even though it is very difficult to further speed up a single thread or single program, most computer
systems are actually multitasking among multiple threads or programs. Thus, techniques that
improve the throughput of all tasks result in overall performance gains.
Two major techniques for throughput computing are multithreading and multiprocessing.
Advantages
If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused
computing resources, which may lead to faster overall execution, as these resources would have been
idle if only a single thread were executed. Also, if a thread cannot use all the computing resources of
the CPU (because instructions depend on each other's result), running another thread may prevent
those resources from becoming idle.
Disadvantages
Multiple threads can interfere with each other when sharing hardware resources such as caches
or translation lookaside buffers (TLBs). As a result, execution times of a single thread are not
improved and can be degraded, even when only one thread is executing, due to lower frequencies or
additional pipeline stages that are necessary to accommodate thread-switching hardware.
Overall efficiency varies; Intel claims up to 30% improvement with its Hyper-Threading
Technology, while a synthetic program just performing a loop of non-optimized dependent floating-
point operations actually gains a 100% speed improvement when run in parallel. On the other hand,
hand-tuned assembly language programs using MMX or AltiVec extensions and performing data
prefetches (as a good video encoder might) do not suffer from cache misses or idle computing
resources. Such programs therefore do not benefit from hardware multithreading and can indeed see
degraded performance due to contention for shared resources.
From the software standpoint, hardware support for multithreading is more visible to software,
requiring more changes to both application programs and operating systems than multiprocessing.
Hardware techniques used to support multithreading often parallel the software techniques used
for computer multitasking. Thread scheduling is also a major problem in multithreading.
Merging data from two processes can often incur significantly higher costs compared to processing
the same data on a single thread, potentially by two or more orders of magnitude due to overheads
such as inter-process communication and synchronization
5.2.1 Types of multithreading

Fig 5.1 Types of multithreading


[Link] Coarse-grained multithreading
The simplest type of multithreading occurs when one thread runs until it is blocked by an event that
normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-
chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for
the stall to resolve, a threaded processor would switch execution to another thread that was ready to
run. Only when the data for the previous thread had arrived, would the previous thread be placed
back on the list of ready-to-run threads.
For example:
1. Cycle i: instruction j from thread A is issued.
2. Cycle i + 1: instruction j + 1 from thread A is issued.
3. Cycle i + 2: instruction j + 2 from thread A is issued, which is a load instruction that misses in
all caches.
4. Cycle i + 3: thread scheduler invoked, switches to thread B.
5. Cycle i + 4: instruction k from thread B is issued.
6. Cycle i + 5: instruction k + 1 from thread B is issued.
Conceptually, it is similar to cooperative multi-tasking used in real-time operating systems, in which
tasks voluntarily give up execution time when they need to wait upon some type of event. This type
of multithreading is known as block, cooperative or coarse-grained multithreading.
The goal of multithreading hardware support is to allow quick switching between a blocked thread
and another thread ready to run. Switching from one thread to another means the hardware switches
from using one register set to another. To achieve this goal, the hardware for the program visible
registers, as well as some processor control registers (such as the program counter), is replicated. For
example, to quickly switch between two threads, the processor is built with two sets of registers.
Additional hardware support for multithreading allows thread switching to be done in one CPU
cycle, bringing performance improvements. Also, additional hardware allows each thread to behave
as if it were executing alone and not sharing any hardware resources with other threads, minimizing
the amount of software changes needed within the application and the operating system to support
multithreading.
Many families of microcontrollers and embedded processors have multiple register banks to allow
quick context switching for interrupts. Such schemes can be considered a type of block
multithreading among the user program thread and the interrupt threads.[citation needed]
[Link] Fine-grained multithreading
The purpose of fine-grained multithreading is to remove all data dependency stalls from the
execution pipeline. Since one thread is relatively independent from other threads, there is less chance
of one instruction in one pipelining stage needing an output from an older instruction in the pipeline.
Conceptually, it is similar to preemptive multitasking used in operating systems; an analogy would
be that the time slice given to each active thread is one CPU cycle.
For example:
1. Cycle i + 1: an instruction from thread B is issued.
2. Cycle i + 2: an instruction from thread C is issued.
This type of multithreading was first called barrel processing, in which the staves of a barrel
represent the pipeline stages and their executing threads. Interleaved, preemptive, fine-grained or
time-sliced multithreading are more modern terminology.
In addition to the hardware costs discussed in the block type of multithreading, interleaved
multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction
it is processing. Also, since there are more threads being executed concurrently in the pipeline,
shared resources such as caches and TLBs need to be larger to avoid thrashing between the different
threads.
[Link] Simultaneous multithreading
The most advanced type of multithreading applies to superscalar processors. Whereas a normal
superscalar processor issues multiple instructions from a single thread every CPU cycle, in
simultaneous multithreading (SMT) a superscalar processor can issue instructions from multiple
threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction-
level parallelism, this type of multithreading tries to exploit parallelism available across multiple
threads to decrease the waste associated with unused issue slots.
For example:
1. Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are
simultaneously issued.
2. Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and
instruction m from thread C are all simultaneously issued.
3. Cycle i + 2: instruction j + 3 from thread A and instructions m + 1 and m + 2 from
thread C are all simultaneously issued.
To distinguish the other types of multithreading from SMT, the term "temporal multithreading" is
used to denote when instructions from only one thread can be issued at a time.
5.3 Difference between MultiCore and MultiProcessor System
In today’s tech world, multi-core and multi-processor systems have become essential for boosting
computing power and efficiency. A multicore system packs several processing units or cores into a
single chip allowing it to tackle multiple tasks at once. Multiprocessor system uses two or more
separate processors that share resources enabling them to work together seamlessly. This means that
if one processor runs into trouble the others can keep things running smoothly. Both systems are
designed to meet the increasing demands of modern applications with multicore systems focusing on
maximizing performance within a single chip and multiprocessor systems expanding capabilities
through multiple CPUs.

Fig 5.2 Multicore system


5.3.1 What is a Multicore System?
A processor that has more than one core is called a Multicore Processor while one with a single core
is called Unicore Processor or Uniprocessor. Nowadays, most systems have four cores (Quad-core)
or eight cores (Octa-core). These cores can individually read and execute program instructions,
giving feel like a computer system has several processors but in reality, they are cores and not
processors. Instructions can be calculation, data transferring instruction, branch instruction, etc.
Processors can run instructions on separate cores at the same time. This increases the overall speed
of program execution in the system. Thus heat generated by the processor gets reduced and increases
the overall speed of execution.
Multicore systems support MultiThreading and Parallel Computing. Multicore processors are widely
used across many application domains, including general-purpose, embedded, network, digital signal
processing (DSP), and graphics (GPU). Efficient software algorithms should be used for the
implementation of cores to achieve higher performance. Software that can run parallel is preferred
because we want to achieve parallel execution with the help of multiple cores.
5.3.2. Advantages of Multicore System
 These cores are usually integrated into single IC (integrated circuit) die, or onto multiple dies
but in single chip package. Thus allowing higher Cache Coherency.
 These systems are energy efficient since they allow higher performance at lower energy. A
challenge in this, however, is additional overhead of writing parallel code.
 It will have less traffic(cores integrated into single chip and will require less time).
5.3.3. Disadvantages of Multicore System
 Dual-core processor do not work at twice speed of single processor. They get only 60-80%
more speed.
 Some Operating systems are still using single core processor.
 OS compiled for multi-core processor will run slightly slower on single-core processor.
5.4. MultiProcessor System
Two or more processors or CPUs present in same computer, sharing system bus, memory and I/O is
called MultiProcessing System. It allows parallel execution of different processors. These systems
are reliable since failure of any single processor does not affect other processors. A quad-processor
system can execute four processes at a time while an octa-processor can execute eight processes at a
time. The memory and other resources may be shared or distributed among processes.
5.4.1 Advantages of MultiProcessor System
 Since more than one processor are working at the same time, throughput will get increased.
 More reliable since failure in one CPU does not affect other.
 It needs little complex configuration.
 Parallel processing (more than one process executing at same time) is achieved through
MultiProcessing.
5.4.2 Disadvantages of MultiProcessor System
 It will have more traffic (distances between two will require longer time).
 Throughput may get reduced in shared resources system where one processor using some I/O
then another processor has to wait for its turn.
 As more than processors are working at particular instant of time. So, coordination between
these is very complex.
Tabel 5.1 Difference Between MultiCore and MultiProcessor System
MultiCore MultiProcessor

A single CPU or processor with two or more


A system with two or more CPU’s that
independent processing units called cores that are
allows simultaneous processing of
capable of reading and executing program
programs.
instructions.

It executes single program faster. It executes multiple programs Faster.

More reliable since failure in one CPU will


Not as reliable as multiprocessor.
not affect other.

It has less traffic. It has more traffic.

It does not need to be configured. It needs little complex configuration.

It is Expensive (Multiple separate CPU’s


It’s very cheaper (single CPU that does not require
that require system that supports multiple
multiple CPU support system).
processors) as compared to MultiCore.

5.5. Graphics Processing Unit (GPU)


Graphics Processing Unit (GPU) is a specialized processor originally designed to render
images and graphics efficiently for computer displays. In recent years, GPUs have evolved into
powerful co-processors that excel at performing parallel computations, making them indispensable
for tasks beyond graphics, such as scientific simulations, artificial intelligence, and machine learning.
Graphics processing unit (GPU) are made to process the required images and speed up the rendering
of 3D computer graphics on consumer electronics including PCs, game consoles, and smartphones or
any systems and they are also known as video cards and graphics cards.
This article explores the role of GPUs in modern computing, their architecture, applications
across various industries, and their impact on accelerating complex computations and improving
overall system performance.
Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate the
creation and rendering of images, animations, and video for computer displays. Originally developed
for rendering graphics in video games The required arithmetic computations are completed quickly
by a GPU, freeing up the CPU to conduct other activities or tasks.
Fig 5.3 Graphics Processing Unit (GPU)
A CPU uses some cores primarily for some required sequential serial processing, whereas a
GPU has many smaller cores designed for multitasking purpose. Whereas every CPU core operates
independently on a distinct job, the GPU cores concurrently do the required iterative computations
that underpin machine learning (ML) or deep learning.
GPUs are characterized by their high parallel processing power, which allows them to
perform thousands of computations simultaneously, making them well-suited for tasks requiring
heavy computational workload and real-time processing.
5.5.1 History of GPU
In the field of technology, NVidia created the first GPU, which is known as the GeForce 256, in the
year of 1999. With more than 22 million transistors, this GPU model was much capable of
processing 10 million polygons per second as per requirement. Through the initial 3D gaming
performance optimization, the GeForce 256 outperformed other processors in terms of technology
and science. Although Nvidia continues to dominate the GPU industry properly, technology has
much advanced significantly. Nvidia also able to introduce the GeForce 8800 GTX in the 2000s, a
graphics processing unit with an astounding 36.8 billion texture fills per second as per requirement.
5.5.2 Features of GPU
 A chip or electronic circuit that can render the required images for display on an electronic
device is referred to as a graphics processing unit (GPU).
 Despite the fact that the terms are much distinct, "graphics card" and "GPU" are frequently
used synonymously.
 Polygon rendering in 2-D and 3-D graphics that tastes good and the digital output to monitors
with flat panel displays properly.
 The application support for graphically intensive programs like AutoCAD and YUV color
space support is another one.
 Across a variety of devices or systems, including tablets, smart TVs, and smartphones of
various stripes, Arm GPUs deliver the best possible visual experience for the user.
5.5.3. Uses of GPU
GPUs are typically utilized to power the top-notch gaming experiences by producing required
incredibly smooth and lifelike graphics and rendering to the user. Nevertheless, a large number of
corporate applications or configurations also require powerful graphics processors. The important
uses are mentioned below:
 For Machine Learning: The several intriguing GPU technology packages are commonly
available in the fields of artificial intelligence and machine learning. Due to their
extraordinary computational capacity in the system, GPUs may significantly speed up tasks
like the required image recognition that benefit from the highly parallel architecture of the
GPU.
 For Gaming: Basically, with their expansive, incredibly realistic, and intricate in-game
universes, video games have become exceptionally computationally demanding for the user.
The need for graphics processing is very much rising quickly due to advances in display
technology or user interface, such as 4K screens and fast refresh rates, as well as a surge in
virtual reality games become very much attractive for the user.
 For Content Creation and Video Editing: The initial lengthy process of video editing and
content creation has long plagued graphic designers, video editors, and other professionals as
per requirement. This has clogged system resources and very much impeded creative flow.
GPU parallel processing now facilitates the faster and easier rendering of graphics and video
in higher quality formats as per user requirement.
5.6 Warehouse-scale computers (WSCs)
Warehouse-scale computers (WSCs) are a collection of hardware and software that work together to
deliver internet services. They are the foundation of the cloud infrastructure for companies like
Amazon and Google, and are essential for processing cloud applications. WSCs are designed to be
highly efficient and scalable, and are optimized for metrics like power efficiency, reliability, and cost
of ownership.
Here are some key aspects of WSC architecture:
 Design
WSCs are designed holistically, with the entire data center treated as one computer. This includes the
hardware infrastructure, software components, and the building's electrical and cooling systems.

Fig 5.4 Warehouse-scale computers (WSCs)


 Components
WSCs are made up of tens of thousands of servers, networking equipment, and other hardware and
software components.
 Operation
WSCs are designed to handle a wide range of workloads, including interactive and batch
applications. They are capable of exploiting both request-level and data-level parallelism.
 Optimization
WSCs are optimized to improve metrics like power efficiency, reliability, cost of ownership,
sustainability, and manageability.
WSCs are the descendants of supercomputers, and are much more powerful than high performance
computing systems. They are the backbone of the internet, supporting services like search, social
networking, online shopping, and cloud computing.
5.7 MULTIPROCESSOR NETWORK TOPOLOGIES
Multiprocessor system consists of multiple processing units connected via some
interconnection network plus the software needed to make the processing units work
together. There are two major factors used to categorize such systems:
The processing units

The interconnectionnetwork
A number of communication styles exist for multiprocessing networks. These can be
broadly classified according to the communication model as shared memory (single
address space) versus message passing (multiple address spaces).
Design Issues of Interconnection Networks
The important issue in the design of multiprocessor systems is how to cope with the
problemofanadequate designofthe interconnection network in order to achieve the desired
performance at low cost. The choice of the interconnection network may affect several
characteristics of the system such as node complexity, scalability and cost etc. The following
are the issues which should be considered while designing an interconnection network.
Dimension and size of network: It should be decided how many processing element are
there in the network and what the dimensionality of the network is i.e. with how many
neighbors, each processor is connected.
Symmetry of the network: It is important to consider whether the network is symmetric or
not i.e., whether all processors are connected with same number of processing elements
or the processing elements of corners or edges have different number of adjacent
elements.
Message Size: Message size is dependent on the amount of data that can be
transferred in one unit time.
Data transfer Time: The time taken for a message to reach to another processor, Whether
this time is a function of link distance between two processors or it depends upon the
number of nodes coming in between are chief factors

Startup Time: It is the time of initiation of the process.


Performance parameters
Number of nodes (N): The number of nodes in a multiprocessor network plays a dynamic
role by virtue of which the performance of the system is evaluated. Higher number of
nodes means higher complexity but higher is the system performance. Therefore, number
of processors should be optimal.
Node degree (D): The node degree of the network is defined as the number of edges
connected with the nodes. It is the connectivity among different nodes in a network. The
connectivity of the nodes determines the complexity of the network. The greater number of
links in the network means greater is the complexity. If the edge carries data fromthe node,
it is called out degree and if this carries data into the node then it is called in degree.
Diameter (D): The network diameter is defined as the maximum shortest path between the
source and destination node. The path length is measured by the number of links traversed.
This virtue is important in determining the distance involved in communication and hence
the performance of parallel systems. The low
diameter is always better because the diameter puts a lower bound on the complexity of
parallel algorithms requiringcommunication between arbitrary pairs ofnodes.
Cost (C): It is defined as the product of the diameter and the degree of the node for a
symmetric network.
Cost (C) = Diameter * Degree = D * d
Greater number of nodes means greater the cost of the network. It is good creation to
measure the hardware cost and the performance of the multiprocessor network and gives
more insight to design a cost-effective parallel system.

Extensibility
It is virtue which facilitates large sized system out of small ones with minimum
changes in the configuration of the nodes. It is the smallest increment by which the system
can be expanded in a useful way. A network with large number of links or a large node
degree tends to increase the hardware cost. Expandability is an important

parameter to evaluate the performance of a multiprocessor system. The feasibility to


extend a system while retaining its topological characteristics enables to design large scale
parallelsystems.
Network Topologies
The multiprocessor networks are classified in two broad categories based on their
topological properties. These are given below:
Cube based network

Linearly Extensible Network


Cube Based Network
The cube based architectures are widely used networks in parallel systems. They have
good topological properties such as symmetry, scalability and possess a rich
interconnection topology. The types of cube based networks are:

Binary hypercube or n-cube:

This is a loosely coupled parallel multiprocessor based on the binary n-cube network.

An n-dimensional hypercube contains 2n nodes and has n edges per node.


In hypercube, the number of communication links for each node is a logarithmic function
of the total number of nodes.
The hypercube organization has low diameter and high bisection width at the expense of the
numberofedges pernode andthe lengthofthe longestedge.
The length of the longest edge in a hypercube network increases as the number of nodes
in the network increases.
The node degree increases exponentially with respect to the dimension, making it
difficult to consider the hypercube a scalable architecture.
The major drawback of the hypercube is the increase in the
number of communication links for each node with the increase in the total number
of nodes.
The bellow Fig 1 represents the Hypercube.

Fig 5.5: Hypercube


Source: Miles J. Murdocca and Vincent P. Heuring, ― “Computer Architecture and
Organization: An Integrated approach”

Cube Connected Cycle (CCC)


The CCC architecture is an attractive parallel computation network suitable for VLSI
implementation while preserving all the desired features of hypercube.
The CCC is constructed from the n- dimensional hypercube by replacing each node in
hypercube with a ring containing n node.
Each node in a ring then connects to a distinct node to one of the n dimensions.
The advantage of the cube- connected cycles is that node╆ s degree is always ぬ,
independent of the value of n. This architecture is modified from hypercube
i.e. a 3-cube is modified to form a 3-cube-connected cycles (CCC) restricted the node
degree to 3.
The idea is to replace the corner nodes (vertices) of the 3-cube with a ring of 3-nodes.

In general one can construct k-cube-connected cycles from a k-cube with n=2k rings nodes.
Folded Hyper Cube (FHC)
The FHC is the variation of the hypercube network and constructed by introducing some
extra links to the hypercube.
Halved diameter, better average distance, shorter delay in communication links, less
message traffic density, lower cost make it very promising.
The hardware overhead is almost 1/n, n being the dimensionality of the hypercube,
which is negligible for large n.
Optimal routing algorithms are developed are developed and proven to be remarkably more
efficient than those of the conventional n-cube.
A folded hypercube of dimension n is called FHC (n).

The FHC (n) is a regular network of node connectivity (n+1)and the hypercube of degree 3
is converted to FHC (n) network. The bellow Fig 2 represents Folded Hypercube.
Extended versions of FHC (n) is called Extended Folded Cube (EFC). The EFC has better
properties than the other variations of basic hypercube in terms of parameter.
It has constant node degree, smaller diameter, and lower cost and also it maintains several
numerous desirable characteristics including symmetry,
hierarchical, expansive, recursive.
Fig 5.6: Folded Hypercube
Source: Miles J. Murdocca and Vincent P. Heuring, ― “Computer Architecture and
Organization: An Integrated approach”

Crossed Cube
The Crossed Cube (CC) has the same node and link complexity as the hypercube and has
most of its desirable properties including regularity, recursive structure, partition
ability, strong connectivity and ability to simulate other architectures.

Its diameter is only half of the diameter of the hypercube.

Mean distance between vertices is smaller and it can simulate a hypercube through
dilation 2 embedding.
The basic properties of the CC, optimal routing and broadcasting algorithms are
developed. The below fig 3 shows crossed cube.
The CC is derived from a hypercube by changing the way of connection of some
hypercube links.
The diameter of CC is almost half of that of its corresponding hypercube.

Fig 5.7: Crossed Cube

Reduced Hypercube (RHC)


The RH (k, m) is obtained from the n- dimensional hypercube by reducing node edges in
hypercube by following rules where k+2m= n.
The lower VLS) complexity of R(╆ s permit the construction of systems with more
processing elements than are found in conventional hypercube.
There are clusters and each cluster is a conventional k- dimensional hypercube.

Of the higher n-k=2m dimensions, a node has only one direct connection is decided by the
leftmost mbits in the k-bits field, i.e., the (2i + k) dimension, where i is the value of them-bit
binary number.

Fig 5.8. Reduced Hypercube Hierarchal Cube Network (HCN)


Source: Miles J. Murdocca and Vincent P. Heuring, ― “Computer Architecture and
Organization: An Integrated approach”

The Hierarchical Cube Network (HCN) is interconnection network for large-scale


distributed memory multiprocessors. The above fig 4 represents the Reduced
Hypercube Hierarchal Cube Network(HCN).
HCN has about three-fourths the diameter of a comparable hypercube, although it uses
about half as many links per node-a fact that has positive ramifications on the
implementation of HCN-connected systems.
The HCN (n, n)has 2n clusters, where each cluster is an n-cube.
Each node in the HCN (n, n) has n+1 links connected to it. n links are used inside the
cluster. The additional links are used to connect nodes among clusters.
The advantage of HCN is that the number of links required is reduced approximately
to half as many links per node and the diameter is reduced to about three-fourth of a
corresponding hypercube. The below Fig 5 shows the Hierarchical Cube Network.
Fig 5.9. Hierarchical Cube Network
Source: Miles J. Murdocca and Vincent P. Heuring, ― “Computer Architecture and
Organization: An Integrated approach”

Dual Cube (DC)


The DC is a new interconnection topology for large-scale distributed memory
Multiprocessors that reduces the problem of increasing number of links in the large- scale
hypercube network.
This preserves most of the topological properties of the hypercube network.
The DC shares the desired properties of the hypercube, however increases tremendously the
total number of nodes in the systemwith limited links per node.
The key properties of hypercube are also true in the dual-cube: each node can be
represented by unique binary number such that two nodes are connected by an edge if and
only if the two binary numbers differ in one bit only.
However, the size of the dual-cube can be as large as eight thousands with up to eight links
pernode.
A dual-cube uses binary hypercube as basic components. Each such hypercube component
is referred to as a cluster.
Assume that the number of nodes in a cluster is 2m. In a dual cube, there are two

Classes with each class consisting of 2m clusters.

The total number of nodes is 2m or 2m+1. Therefore, the nodes address has 2m+1 bits
The leftmost bit is used to indicate the type of the class (class 0 and class 1).

For the class 0, the rightmost m bits are used as the node ID within the cluster.
Each node in cluster of class 0 has one and only one extra connection to a node in a cluster
of class 1.
Meta Cube (MC)
The MC is an interconnection topology for a very large parallel system. Meta cube

network has two level cube structures. An MC (k, m) network can connect 2k+m2k nodes
with (k+m) links per node where k is the dimension of the high-level cubes (classes)
and m is the dimension of the low-level cubes (clusters).
In this network, the number of nodes is much larger than the hypercube with a small number
of links per node

An MC network is a symmetric network with short diameter, easy and


efficient routing.
Similar to that of the hypercube.
The meta cube has tremendous potential to be used as an interconnection network for very
large scale parallel computers since the meta cube can connect hundreds of millions nodes
with up to six links per node and it keeps some desired properties of the hypercube that are
useful efficient communication among the nodes.
Folded Dual Cube (FDC)
The FDC is a new cube based Interconnection topology for parallel systems with
reduced diameter, cost and constructed from DC and FHC.
The FDC is a graph Fr (V, E), where V represents a set of vertices and E represent a set of
links.
The FDC is to be slightly greater than Dual cube but quite less than HC and FHC.
Diameter of FDC is found to be smaller than that of Dual cube and with the
comparison of Dual cube, HC and FHC.
FDC exhibits quite a good improvement in broadcast time over its parent networks
with millions of nodes.
The cost of the FDC topology is found to be less. The FDC will help to speed up the overall
operation of large scale parallel systems.
Fig 5.10. Folded Dual Cube
Folded Meta cube (FMC)
The FMC is an efficient large scale parallel interconnection topology with better features
such as reduced diameter, cost, improved broadcast time and constructed from MC. The
above fig 6 shows Folded Dual Cube.
The FMC is a graph G (V, E), where V represents a set of vertices and E represent a set of
links.
The FMC is to be slightly greater than Meta cube but quite less than HC and FHC.

Diameter of FMC is found to be smaller than that of Meta cube.


FMC exhibits quite a good improvement in broadcast time over its parent network while
connecting millions of nodes.
The cost of the FMC is found to be less and will help to speed the overall operation of large
scale parallel systems.

Necklace Hypercube (NH)


NH is an array of processors attached to each two adjacent nodes of the hypercube
network.
It is highly scalable architecture while preserving most of the desirable properties of
hypercube such as logarithmic diameter, fault tolerance etc.

It has also some other properties such as hardware scalability and efficient VLSI layout
that make it more attractive than an equivalent hypercube network.
The Necklace-Hypercube is an undirected graph which has a necklace of processors to each
edge of hypercube. The below Fig 7 shows the Necklace Hypercube.
The necklace length may be fixed or variable for different edge necklaces.

Fig 5.11 Necklace Hypercube

Linearly Extensible Network


The Linearly Extensible Networks is another class of multiprocessor architectures which
reduces some of the drawbacks of HC architectures. The complexity of these networks is
lesser as they do not have exponential expansion. Besides the scalability, other parameters
to evaluate the performance of such networks are degree, number of nodes, diameter,
bisection width and fault tolerance. Selection of a better interconnection network may have
several applications with lesser complexities and improved power-efficiency.
Linear Array (LA)
It is one dimensional network having the simplest topology with n-nodes having n-1
communication links.
The internal nodes have degree 2 and the termination nodes have degree1.
The diameter is n-1, which is long for large n and the bisection width is 1.
It is asymmetric network. Linear array are the simplest connection topology.

As the diameter increases linearly with respect to n, it should not be used for large n.
For every small n, it is rather economical to implement a linear array.
Binary Tree (BT)
A binary tree is either empty or consists of node called the root together with two
binary trees called left sub tree and the right sub tree.
When h is equal to height of a binary tree then maximum leaves are equal to 2h and

maximum nodes are 2h+1-1.


In a binary tree network there is only one path between any two nodes.
The binary tree is scalable architecture with a constant node degree and constant bisection
width. In general, an n-level, complexity balanced binary tree should have N=2n-1 nodes.
The maximum node degree is 3 and the diameter is 2(n-1). But has a poor bisection width
of 1.

Ring (R)
This is a simple linear array where the end nodes are connected. It is equivalent to mesh
with wrap around connections.
The data transfer in a ring is normal one direction. A ring is obtained by connecting the two
terminal nodes of a linear array with one extra link.
A ring network can be uni-or bidirectional and it is symmetric with a constant.
It has a constant node degree of d=2, the diameter is N/2for a bidirectional ring and N for
unidirectionalring.
A ring network has a constant width 2.
Linearly Extensible Tree (LET)
The Linearly Extensible Tree (LET) architecture exhibits better connectivity, lesser number
of nodes over cube based networks.
The LET network has low diameter, hence reduce the average path length traveled by all
message and contains a constant degree per node. The below Fig 8 shows Linearly
Extensible Tree.
The LET network grows linearly in a binary tree like shape.

In a binary tree the number of nodes at level n is 2n whereas in LET network the number is
(n+1).
Fig 5.12 Linearly Extensible Tree

Common questions

Powered by AI

Coarse-grained multithreading switches between threads when one is blocked by events like cache misses, allowing another thread to execute without prolonged delays, thereby improving throughput in the presence of long-latency stalls . Fine-grained multithreading, on the other hand, aims to remove all data dependency stalls by issuing instructions from different threads in consecutive cycles, minimizing pipeline delays associated with data dependencies . However, fine-grained multithreading necessitates larger shared resources like caches to avoid thrashing due to the concurrent execution of multiple threads .

Multithreading enhances the efficiency of computing resources by allowing multiple threads to utilize unused computing resources. If a thread is idling due to cache misses or instruction dependencies, other threads can continue executing. This results in faster overall execution because resources that would otherwise be idle can be used productively . This process can prevent idle resources and decrease latency when threads may switch quickly once blocked on long-latency stalls .

Cube-based networks like the hypercube offer low diameter and high communication capability at the expense of increased node degree and complexity as the size grows, leading to issues in scalability and increased costs . More complex structures, such as the meta cube, provide greater scalability by permitting larger node counts with fewer links per node, maintaining symmetry and efficient routing. However, they require more intricate design and are often harder to implement due to their multi-level construction . These trade-offs mean that the choice between these architectures depends on the specific needs and constraints of the parallel system, balancing performance gains against architectural complexity and cost .

Node degree, which determines the connectivity of a network, and diameter, representing the longest shortest path between nodes, directly impact both performance and cost. A higher node degree increases network complexity and connectivity, enhancing performance but also raising costs . A low diameter reduces the communication distance required in parallel systems, thus improving algorithmic efficiency and performance . Increased node degree can lead to higher hardware costs, while lower diameter can reduce complexity in communication .

The dual-cube topology enhances scalability and performance by maintaining key hypercube properties while allowing a large number of nodes, thus supporting extensive parallel processing environments . Each node connects minimally to others, enabling significant growth without exponential increases in node connections, which aids scalability. Moreover, the unique node addressing scheme utilizing clusters maintains essential connectivity features, ensuring efficient communication suitable for very large systems .

Using a folded hypercube (FHC) in large-scale parallel systems offers several advantages including reduced diameter, better average distance, and lower communication delay, which overall contributes to efficient routing and cost-effective network design . The FHC minimizes complexity through improved message traffic density and lower hardware overhead, making it suitable for extensive parallel computations . These features can improve scalability and operational speed in expansive networks .

Multicore systems integrate multiple processing units (cores) within a single chip, allowing concurrent execution of multiple tasks within a single unified architecture, effectively enhancing performance on a single chip . Conversely, multiprocessor systems use two or more separate processors to cooperate and share resources, spreading tasks across different physical CPUs to increase system capabilities . Both systems are designed to handle multiple tasks efficiently but differ in their physical architecture and resource distribution methods .

A hypercube network features node degree that increases exponentially with the number of dimensions, making it less scalable due to the high number of connections needed as node count increases . In contrast, a Cube Connected Cycle (CCC) limits the node degree to 3, regardless of the dimension, maintaining a constant degree beneficial for scalability and reducing complexity . CCC architecture modifies the hypercube by replacing each hypercube node with a ring, which preserves many properties desirable for parallel computation while simplifying node connection requirements .

Caches and pipeline stages create challenges for multithreading because multiple threads sharing these resources can lead to interference, potentially degrading the execution times of single threads due to resource contention . The necessity for additional pipeline stages to handle thread-switching hardware can further impact individual thread performance by lowering effective frequencies or increasing latency . Despite these challenges, overall performance can improve through optimized resource utilization as multithreading allows system throughput to increase when these resources are shared effectively .

Simultaneous multithreading (SMT) differs from temporal multithreading by allowing a superscalar processor to issue instructions from multiple threads every CPU cycle, leveraging instruction-level parallelism across threads. This method effectively fills unused issue slots that single-thread execution might leave unutilized . Temporal multithreading, in contrast, involves each cycle executing instructions from only one thread, resulting in potentially less efficient use of available execution slots .

You might also like