Distributed Machine Learning Exam Revision

The document outlines key concepts in distributed computing, including the Distributed Gradient Descent (DGD) algorithm, its advantages and challenges, and the contributions of frameworks like Spark, GraphLib, and TensorFlow to distributed machine learning. It also discusses Federated Learning, intelligent resource management strategies, and various architectures for distributed machine learning. Additionally, it provides concise notes on specific topics such as Hogwild and Elastic Averaging SGD.

Uploaded by

bigogaj933

UNIT - 4

Tomorrow is my engineering exam on distributed computing. It's the end-semester exam, worth 70 marks.
I will give you questions and the marks for each. Give me short answers to every question, since I
have to go through them 2-3 hours before the exam. Make them just for revision, but they should
still contain all the steps and entities in the given concept and explain the work or significance
of each step in 1 or 2 sentences (it should be understandable). In general, keep it short, but
include all the processes and steps required. Understood?
I will also share the marks weightage for each question, so give me answers according to that
weightage in pointwise format for quick revision and understanding.

Got it!
You’ll share each question with its marks weightage, and I’ll make concise, pointwise revision
notes covering all key steps, entities, and their significance.
The answers will be short but complete enough for quick review 2-3 hours before the exam.

Whenever you’re ready, send the first question and its marks!

Don't give any starting or ending comments in your responses, just the answers.

Understood. Send the first question with marks.

Explain the Distributed Gradient Descent (DGD) algorithm used in distributed machine learning.
Analyze the advantages and challenges of implementing DGD in large-scale distributed systems,
considering factors such as communication overhead, scalability, and convergence speed. [9 marks]

1. Distributed Gradient Descent (DGD) Algorithm:

Each node holds a local subset of the data and its own copy of the model parameters.

Nodes compute local gradients based on their data.

Nodes share gradients or parameter updates with neighbors or a central server.

Nodes update their model parameters by aggregating received gradients/updates.

Iteratively repeat until convergence.

2. Advantages:

Parallelism: Data and computations are distributed, speeding up training.

Scalability: Easily scales with more nodes, handling large datasets.

Fault Tolerance: Work is spread across nodes, so a single node failure affects only that node's share of the computation.

Privacy: Data remains local, minimizing the sharing of sensitive information.

3. Challenges:

Communication Overhead: Frequent exchange of gradients/parameters can cause
network congestion.

Synchronization: Synchronous updates can slow down due to waiting for slow
nodes; asynchronous updates can cause stale gradients.

Convergence Speed: Can be slower due to delays and inconsistent updates.

Load Balancing: Unequal data or computation load can cause inefficiency.

Network Reliability: Failures and delays affect algorithm stability.
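
The DGD loop from part 1 can be sketched in a few lines. This is a minimal single-process simulation, not a real distributed implementation: the "workers" are plain Python lists, and the model (y = w * x), the data, and the names local_gradient and distributed_gd are all assumptions made for illustration. Gradients are weighted by shard size so the aggregate equals the full-data gradient even when shards are unequal.

```python
# Synchronous distributed gradient descent, simulated in one process.
# Hypothetical setup: fit y = w * x with squared error on data split across workers.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def distributed_gd(shards, lr=0.02, rounds=200):
    w = 0.0                                    # every worker starts from the same value
    total = sum(len(s) for s in shards)
    for _ in range(rounds):
        # Each worker computes a gradient on its own data only.
        grads = [local_gradient(w, s) for s in shards]
        # Aggregation step: average the gradients, weighted by shard size.
        agg = sum(g * len(s) for g, s in zip(grads, shards)) / total
        # Every node applies the same update, so all copies stay in sync.
        w -= lr * agg
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # true slope is 3
shards = [data[0:3], data[3:6], data[6:8]]          # unequal shards, one per worker
w_hat = distributed_gd(shards)
print(w_hat)
```

In a real cluster the list comprehension over shards is where the network exchange happens, which is why communication overhead and stragglers dominate the challenges listed above.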

How do Spark, GraphLib, and TensorFlow contribute to the efficiency and scalability of
distributed machine learning algorithms? [8 marks]

1. Apache Spark:

Provides in-memory distributed data processing for faster computation.

Supports fault tolerance and easy scaling across clusters.

Enables iterative machine learning through MLlib with distributed algorithms.

2. GraphLib (presumably GraphLab, the graph-parallel ML framework):

Optimizes graph-based data structures and computations in distributed settings.

Efficiently handles dependencies and communication in graph algorithms.

Supports parallel processing of graph data to speed up learning on relational
datasets.

3. TensorFlow:

Enables distributed training of deep learning models across multiple GPUs and
nodes.

Uses data and model parallelism to improve scalability.

Provides automatic differentiation and optimized execution graphs for efficient
computation.

4. Overall Contribution:

They reduce computation time by parallelizing tasks.

Handle large datasets and complex models by distributing workloads.

Improve scalability by efficiently managing resources and communications.

Describe how Federated Learning enables model training across multiple decentralized devices
while preserving data privacy. Discuss the advantages and challenges of Federated Learning
compared to traditional centralized machine learning approaches. [9 marks]

1. Federated Learning Process:

Local devices train models on their private data independently.

Devices send only model updates (gradients/weights), not raw data, to a central
server.

The server aggregates updates (e.g., by averaging) to create a global model.

The updated global model is sent back to devices for the next training round.

Repeat until model converges.

2. Advantages:

Data Privacy: Raw data never leaves the device, enhancing privacy.

Reduced Communication: Only model updates are exchanged instead of raw datasets,
which can lower bandwidth use.

Personalization: Models can adapt to local device data characteristics.

Scalability: Leverages many decentralized devices without centralizing data.

3. Challenges:

Heterogeneous Data: Non-IID data across devices can slow convergence.

Communication Costs: Frequent model update exchanges still consume bandwidth.

System Heterogeneity: Devices differ in compute power and connectivity, causing
stragglers.

Security Risks: Vulnerable to poisoning and model inversion attacks.

Aggregation Complexity: Handling unreliable or malicious updates is difficult.
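
The training round described in part 1 can be sketched as federated averaging. This is a toy simulation under stated assumptions: clients are in-memory lists, the model is a single weight for y = w * x, and local_train, federated_averaging, and the client data are hypothetical names and values. Only the trained weight and the sample count leave a client; the raw (x, y) pairs never do.

```python
import random

def local_train(w, data, lr=0.05, epochs=5):
    """Client-side SGD on private data; returns only the updated weight."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * (w * x - y) * x
    return w

def federated_averaging(clients, rounds=30, per_round=2, seed=0):
    rng = random.Random(seed)
    w_global = 0.0
    for _ in range(rounds):
        # Only a subset of devices takes part in each round.
        chosen = rng.sample(clients, k=per_round)
        # Each chosen device trains locally, starting from the current global model.
        updates = [(local_train(w_global, d), len(d)) for d in chosen]
        # Server aggregates: average of client weights, weighted by sample count.
        total = sum(n for _, n in updates)
        w_global = sum(w * n for w, n in updates) / total
    return w_global

# Hypothetical private datasets, all drawn from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (1.5, 3.0), (0.5, 1.0)],
    [(2.5, 5.0)],
]
w_fed = federated_averaging(clients)
print(w_fed)
```

Here every client shares the same underlying relationship, so convergence is easy. With non-IID data the local models pull in different directions, which is the heterogeneity challenge listed above.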

Provide examples of how intelligent resource management strategies improve resource
utilization, reduce latency, and facilitate seamless scalability in distributed computing
environments for machine learning tasks. [8 marks]

1. Dynamic Resource Allocation:

Allocates CPU, GPU, and memory based on current workload demands, avoiding
under- or over-utilization.

Example: Kubernetes autoscaling pods in ML training clusters.

2. Load Balancing:

Distributes tasks evenly across nodes to prevent bottlenecks and reduce latency.

Example: Spark’s task scheduling balances data processing jobs.

3. Data Locality Optimization:

Schedules tasks near the data to minimize network transfer delays.

Example: Hadoop’s MapReduce prioritizes local data processing.

4. Priority Scheduling:

Prioritizes urgent or latency-sensitive ML tasks over batch jobs to improve
responsiveness.

Example: Real-time inference workloads given higher priority in resource queues.

5. Resource Prediction and Scaling:

Predicts workload patterns to proactively scale resources up or down.

Example: Auto-scaling clusters during peak ML training periods.

6. Containerization:

Uses lightweight containers for quick deployment and flexible resource sharing.

Example: Docker containers enabling seamless scaling in TensorFlow Serving.

7. Fault Tolerance Mechanisms:

Automatically detects failures and reallocates resources to maintain continuity.

Example: Spark’s task retries and checkpointing for resilient ML workflows.

8. Energy-Aware Scheduling:

Optimizes resource usage to reduce energy consumption while maintaining
performance.

Example: Scheduling ML tasks in data centers during low-energy cost periods.
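
Load balancing (point 2) and priority scheduling (point 4) can be combined in one small sketch. It is a greedy "least-loaded node first" scheduler, not how Kubernetes or Spark actually schedule work; the task names, costs, and the schedule function are assumptions for illustration.

```python
import heapq

def schedule(tasks, n_nodes):
    """Assign (name, cost, priority) tasks to nodes.

    Lower priority number means more urgent. Returns (assignment, loads).
    """
    heap = [(0.0, node) for node in range(n_nodes)]    # (current load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    # Urgent tasks first; within one priority level, larger tasks first.
    for name, cost, priority in sorted(tasks, key=lambda t: (t[2], -t[1])):
        load, node = heapq.heappop(heap)               # pick the least-loaded node
        assignment[node].append(name)
        heapq.heappush(heap, (load + cost, node))
    loads = {node: load for load, node in heap}
    return assignment, loads

# Hypothetical workload: two latency-sensitive inference jobs and four batch jobs.
tasks = [
    ("inference-a", 1.0, 0), ("train-1", 8.0, 1), ("train-2", 6.0, 1),
    ("etl", 3.0, 1), ("inference-b", 1.0, 0), ("eval", 4.0, 1),
]
assignment, loads = schedule(tasks, n_nodes=3)
print(assignment)
print(loads)
```

Placing the largest remaining task on the emptiest node keeps the final loads close together, so no single node becomes the bottleneck that sets the overall latency.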

Explain Systems and Architectures for Distributed Machine Learning. [9 marks]

1. Parameter Server Architecture:

Centralized servers store and update global model parameters.

Worker nodes compute local gradients and send updates to servers.

Servers aggregate updates and broadcast new parameters to workers.

Enables asynchronous or synchronous updates.

2. All-Reduce Architecture:

Workers collectively aggregate gradients using decentralized communication.

No central server; each node shares updates directly with peers.

Removes the central-server bottleneck and single point of failure, though a synchronous all-reduce still waits for every worker.

3. Federated Learning Architecture:

Decentralized devices train local models on private data.

A central aggregator combines model updates without accessing raw data.

Preserves privacy and supports large-scale distributed training.

4. Hybrid Architectures:

Combine parameter servers and peer-to-peer communication for flexibility.

Adapt based on workload and system requirements.

5. Hardware Architectures:

Use of GPUs, TPUs, and specialized accelerators for parallel computations.

Distributed clusters interconnected via high-speed networks.

6. Cloud-Based Architectures:

Scalable infrastructure supporting elastic resource allocation.

Supports big data storage and on-demand computation for ML tasks.

7. Edge Computing Architectures:

Perform ML training/inference near data sources to reduce latency.

Often integrated with federated learning for privacy and bandwidth savings.

8. Communication Patterns:

Synchronous: all nodes update together, ensuring consistency.

Asynchronous: nodes update independently, improving speed but risking stale
updates.

9. Scalability and Fault Tolerance:

Architectures designed to handle node failures and network delays without major
disruptions.

Use checkpointing, replication, and recovery mechanisms.
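
The all-reduce architecture (point 2) is often implemented as a ring. The sketch below simulates ring all-reduce in one process to show how every worker ends up with the full sum without any central server. It assumes each worker's vector has exactly one entry ("chunk") per worker to keep the code short; ring_all_reduce is a hypothetical name, not a library call.

```python
def ring_all_reduce(vectors):
    """Sum equal-length vectors across n simulated workers arranged in a ring.

    Assumes len(vectors[i]) == n, i.e. one chunk per worker.
    """
    n = len(vectors)
    bufs = [list(v) for v in vectors]
    # Phase 1, reduce-scatter: each step, worker i passes one chunk to worker i+1,
    # which adds it to its own copy. After n-1 steps worker i holds the complete
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [((i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            bufs[(i + 1) % n][chunk] += value
    # Phase 2, all-gather: the finished chunks travel around the ring and
    # overwrite the partial sums on every other worker.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            bufs[(i + 1) % n][chunk] = value
    return bufs

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]    # one gradient vector per worker
reduced = ring_all_reduce(grads)
print(reduced)
```

Each worker only ever talks to its ring neighbour, and the data sent per worker stays roughly constant as workers are added. Dividing each entry by the number of workers turns the summed gradients into the average used for the update.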

b) Write notes on: i) Federated Learning, ii) Hogwild, iii) Elastic Averaging SGD. [8 marks]

i) Federated Learning:

Decentralized training where devices train models locally on private data.

Only model updates are sent to a central server for aggregation, preserving data privacy.

Enables scalable, privacy-aware machine learning across many devices.

ii) Hogwild:

Lock-free, asynchronous stochastic gradient descent algorithm.

Multiple processors update shared model parameters simultaneously without locks.

Improves speed, but threads may overwrite each other's updates; works well for sparse data, where updates rarely touch the same parameters.

iii) Elastic Averaging SGD (EASGD):

Distributed SGD variant where workers maintain local models linked to a global center
model.

Workers update local models and periodically exchange information with the center
model using an elastic force.

Balances exploration (local updates) and exploitation (global consensus), improving
convergence.
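
The "elastic force" in EASGD can be made concrete with a small sketch. It follows the commonly cited synchronous update rule: each worker takes a gradient step plus a pull of strength rho toward the center, and the center moves toward the workers by the same amount. The scalar model, the per-worker objectives, and the name easgd are assumptions for illustration. For simplicity the workers talk to the center at every step rather than periodically.

```python
def easgd(grads, lr=0.1, rho=0.5, steps=500):
    """Synchronous elastic averaging SGD on a single scalar parameter.

    grads[i] is worker i's gradient function (hypothetical callbacks).
    """
    xs = [0.0] * len(grads)       # local models, one per worker
    center = 0.0                  # global center model
    for _ in range(steps):
        diffs = [x - center for x in xs]
        # Local step: own gradient plus the elastic pull toward the center.
        xs = [x - lr * (g(x) + rho * d) for x, g, d in zip(xs, grads, diffs)]
        # Center step: drawn toward the workers by the same elastic term.
        center += lr * rho * sum(diffs)
    return xs, center

# Hypothetical workers: worker i minimizes 0.5 * (x - t)^2 for its own target t.
targets = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, t=t: x - t for t in targets]
local_models, center = easgd(grads)
print(local_models, center)
```

The local models stay spread around their own targets (exploration) while the center settles at the consensus value, here the mean of the targets (exploitation). A larger rho ties the workers more tightly to the center.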

a) What is Apache Spark? Explain the working of Apache Spark. [9 marks]

1. Apache Spark:

Open-source distributed computing framework for big data processing.

Supports in-memory computation, speeding up iterative tasks like machine learning.

2. Working of Apache Spark:

Driver Program: Coordinates the execution, creates SparkContext.

Cluster Manager: Allocates resources across worker nodes (e.g., YARN, Mesos).

Worker Nodes: Run executors, which execute the tasks assigned by the driver.

RDD (Resilient Distributed Dataset): Immutable distributed data structure that
supports fault tolerance.

Job Execution:

Driver breaks application into stages and tasks.

Tasks are distributed to worker nodes for parallel processing.

Data transformations are lazy; actions trigger computation.

In-Memory Processing: Intermediate data cached in memory to avoid repeated disk
I/O.

Fault Tolerance: Achieved via lineage graphs and data recomputation.
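
Lazy transformations and lineage-based recovery are easier to remember with a toy model. The class below is a plain-Python stand-in written for these notes, not Spark's API: ToyRDD and its methods are hypothetical. Transformations only record what to do; the collect action runs the recorded lineage, and a lost partition would be rebuilt by running the same lineage again.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions     # source data, already split into partitions
        self._lineage = lineage           # recorded transformations, not yet executed

    def map(self, fn):                    # transformation: returns a new ToyRDD
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):               # transformation: returns a new ToyRDD
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """Run the lineage on one partition; recovery would call this again."""
        items = list(self._partitions[index])
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):                    # action: triggers the computation
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out

calls = []                                # records every time square actually runs

def square(x):
    calls.append(x)
    return x * x

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens = rdd.map(square).filter(lambda v: v % 2 == 0)
calls_before = len(calls)                 # still zero: nothing has executed yet
result = evens.collect()
print(calls_before, result)
```

In real Spark the partitions live on different worker nodes and each compute_partition call would be a task shipped to an executor.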

Explain how integration of AI algorithms in distributed systems can help in Intelligent
Resource Management and Anomaly Detection. [8 marks]

1. Intelligent Resource Management:

AI algorithms predict workload patterns to optimize resource allocation dynamically.

Enables proactive scaling of CPUs, GPUs, and memory based on demand forecasts.

Reduces resource wastage and improves utilization efficiency.

Automates task scheduling and load balancing for minimal latency.

2. Anomaly Detection:

AI models analyze system metrics (CPU usage, network traffic) to detect abnormal
patterns.

Enables early identification of faults, attacks, or performance degradation.

Supports real-time monitoring and automatic alerts for quick response.

Improves system reliability by preventing failures before they impact services.
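
Both ideas can be sketched with simple statistics standing in for a trained model. The first function forecasts load with a moving average and converts it to a replica count (proactive scaling); the second flags metric samples that fall far outside the recent window (anomaly detection). The function names, thresholds, and sample values are assumptions for illustration, and a production system would use a learned model rather than these rules.

```python
import math
from statistics import mean, stdev

def replicas_needed(recent_load, capacity_per_replica, headroom=1.2):
    """Forecast next load as the mean of the last 3 samples, then size the cluster."""
    forecast = mean(recent_load[-3:])
    return max(1, math.ceil(forecast * headroom / capacity_per_replica))

def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Hypothetical metrics: requests per second, and CPU usage in percent.
replicas = replicas_needed([100, 120, 140], capacity_per_replica=50)
cpu = [41, 43, 40, 42, 44, 43, 41, 95, 42, 43, 41, 40]
anomalies = detect_anomalies(cpu)
print(replicas, anomalies)
```

Note that right after a spike the window itself contains the outlier, which inflates the standard deviation and makes the detector less sensitive for a few samples. Robust statistics or a learned baseline reduce that effect.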
