
Understanding Big Data: 5V Model & Applications

Big Data notes

Uploaded by

pdell4527
Copyright
© All Rights Reserved

1. What is Big Data? Key Characteristics using 5V Model with real-world examples.
 Massive Collection of structured, unstructured and semi-structured data that is growing
exponentially over time.
 It is so large and complex that traditional data management tools can’t store or process it.

📊 Importance of Big Data (in short points)


 📈 Better Decision Making – Helps businesses make smarter choices using data trends.
 🚀 Faster Insights – Real-time data gives quick understanding of situations.
 🧠 Improves Customer Experience – Tracks user behavior to personalize services.
 💰 Cost Efficiency – Detects waste, fraud, and saves money.
 🏭 Boosts Innovation – Helps create new products and services.
 🔒 Improves Security – Detects suspicious activity or cyber threats.
 📦 Optimizes Operations – Enhances supply chains and resource use.
 🧠 Healthcare Advances – Supports disease prediction and patient care.

5V Model of Big Data (with Examples)


Big Data is often described using 5 main characteristics, called the 5Vs:

1️⃣ Volume – 📦 (Amount of Data)

Refers to the huge amount of data generated every second.

 📱 Example: Facebook generates over 4 petabytes of data daily from posts, likes,
messages, and uploads.

2️⃣ Velocity – ⚡ (Speed of Data)

The speed at which data is generated, processed, and analyzed.

 📉 Example: Stock trading systems analyze millions of transactions per second to make
real-time trading decisions.
 🚗 Example: Sensors in self-driving cars generate and process real-time data to avoid
collisions.
3️⃣ Variety – 🎭 (Different Types of Data)

Refers to the different formats of data: structured, semi-structured, and unstructured.

 🖼️ Example: YouTube handles videos (unstructured), comments (text), and views
(numbers), all at once.

4️⃣ Veracity – ❓ (Data Accuracy & Trustworthiness)

Deals with the quality, accuracy, and reliability of data.

 🏥 Example: In healthcare, data errors can lead to a wrong diagnosis, so veracity is
critical.

5️⃣ Value – 💰 (Usefulness of Data)

Refers to how much useful insight or benefit can be gained from the data.

 🛒 Example: Amazon uses customer data to recommend products and boost sales.
 🚌 Example: Google Maps uses traffic data to suggest fastest routes, saving time for
users.

2. Compare traditional relational database management systems (RDBMS) with Big Data frameworks in terms of scalability, performance, and data structure handling.
1️⃣ Scalability

 🔸 RDBMS can scale vertically, which means adding more power (CPU/RAM) to a
single server.
🔹 This is limited and expensive.
 🔸 Big Data frameworks like Hadoop or Spark scale horizontally, meaning you can add
more cheap servers (nodes) as data grows.
🔹 This is more flexible and cost-effective.
2️⃣ Performance

 🔸 RDBMS works well with small to medium data but becomes slow when data becomes
very large.
 🔸 Big Data frameworks are designed to process massive data quickly by splitting the
work across many machines (parallel processing).

3️⃣ Data Structure Handling

 🔸 RDBMS is designed for structured data (tables with a fixed schema of typed columns); its support for other formats is limited.
 🔸 Big Data frameworks can handle:
o Structured data (like tables)
o Semi-structured data (like XML, JSON)
o Unstructured data (like images, videos, social media text)

3. Describe how Big Data architecture supports scalability, high availability, and fault tolerance.
1. Scalability

Definition:
Scalability is the ability of a system to handle increasing amounts of data or traffic by expanding
its resources.

How Big Data architecture supports it:

 Horizontal Scaling (Scale-out): Big Data systems (like Hadoop, Spark, Cassandra)
allow adding more nodes (machines) to the cluster rather than upgrading a single server.
 Distributed Processing: Frameworks like Apache Hadoop and Apache Spark divide
tasks across multiple nodes for parallel processing.
 Elastic Cloud Resources: Cloud platforms (e.g., AWS, Azure) provide auto-scaling to
increase or decrease resources as needed.

Example:
An e-commerce site sees a spike in user activity during a sale. The architecture automatically
adds more servers to handle increased traffic.
2. High Availability (HA)

Definition:
High availability ensures that the system remains operational with minimal downtime, even
during failures or maintenance.

How Big Data architecture supports it:

 Replication of Data: Data is stored in multiple copies across different nodes (e.g., HDFS
replicates each data block 3 times by default).
 Load Balancers: Distribute user requests across healthy nodes to prevent overloading a
single machine.
 Cluster Management Tools: Tools like Apache ZooKeeper help manage node
coordination and failover.

Example:
If a node storing customer data goes down, the system fetches it from another replica without
interrupting service.

3. Fault Tolerance

Definition:
Fault tolerance is the system’s ability to continue functioning correctly even when parts of it fail.

How Big Data architecture supports it:

 Data Redundancy: Duplicate data storage helps recover lost data from failures.
 Task Re-execution: In Spark or Hadoop, if a task fails on one node, it can be retried on
another node.
 Decentralized Design: No single point of failure — multiple master and worker nodes
work together.

Example:
During a data processing job, if a node crashes, Spark automatically reassigns the task to another
available node.
4. Define and differentiate between structured, semi-
structured, and unstructured data. Provide relevant
examples to illustrate each category.
1. Structured Data

 Structured data is stored in an organized format, usually in relational databases.


 It follows a predefined schema (tables with rows and columns).
 Sources:
o Human-Generated: Manually entered data (e.g., names, addresses).
o Machine-Generated: Data from sensors, logs, financial systems, etc.
 Example:
Customer records in SQL database.

2. Semi-Structured Data

 Contains both structured and unstructured elements.


 Does not follow a strict table format but uses tags or markers to separate data.
 It appears organized but lacks a fixed schema like relational databases.
 Example:
XML, JSON files, or email with header + body.

3. Unstructured Data

 Data with no predefined format or structure.


 Difficult to store, process, and analyze using traditional tools.
 Comes from various sources and formats.
 Example:
Text files, images, audio, video, social media posts.
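
The three categories can be told apart by how much schema the data carries. The sketch below is only an illustration using Python's standard library; the sample records, names, and field layout are made up for this example and do not come from any real dataset.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row follows the same schema.
structured_csv = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured = '[{"id": 1, "name": "Asha", "tags": ["vip"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(semi_structured)

# Unstructured: free text with no schema; any structure must be extracted.
unstructured = "Loved the phone, but delivery to Pune took 9 days!"
numbers_found = re.findall(r"\d+", unstructured)

print(rows[0]["city"])             # column access relies on the schema
print(records[1].get("tags", []))  # an optional field needs a default
print(numbers_found)               # structure recovered by pattern matching
```

Note how the structured rows can be addressed by column name, the JSON records need a default for a field that may be absent, and the free text yields values only after pattern matching.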

5. Discuss the role and significance of Big Data in contemporary industries such as healthcare, banking, logistics, and e-commerce.
1. Healthcare

 Role: Analyzing patient records, medical images, and treatment histories.


 Significance:
o Improves diagnosis and personalized treatment.
o Tracks disease outbreaks and predicts health trends.
o Enhances hospital operations and reduces costs.

Example: Predictive analytics to identify patients at risk of chronic diseases.

2. Banking and Finance

 Role: Managing transactions, fraud detection, customer insights.


 Significance:
o Detects fraudulent activities in real-time.
o Improves credit scoring and risk assessment.
o Enables personalized financial services.

Example: Monitoring transaction patterns to detect credit card fraud.

3. Logistics and Supply Chain

 Role: Optimizing routes, inventory management, demand forecasting.


 Significance:
o Reduces delivery time and costs.
o Enhances warehouse efficiency.
o Predicts supply chain disruptions.

Example: Using GPS and traffic data to optimize delivery routes.

4. E-Commerce

 Role: Tracking customer behavior, sales data, product trends.


 Significance:
o Enables personalized recommendations.
o Optimizes pricing and marketing strategies.
o Improves user experience and sales.

Example: Amazon using data to recommend products based on browsing history.


6. Identify and explain the major challenges associated with
Big Data storage, processing, and analysis.
1. Storage Challenges

 Massive Volume: Storing huge amounts of data (terabytes to petabytes) requires scalable
infrastructure.
 Data Variety: Managing structured, semi-structured, and unstructured data types is
complex.
 Cost: High storage costs, especially for cloud-based or high-performance storage.

2. Processing Challenges

 Speed (Velocity): Real-time or near-real-time processing is difficult with huge,
fast-incoming data.
 Scalability: Processing frameworks must scale across many nodes efficiently.
 Data Quality: Incomplete, noisy, or duplicate data affects processing accuracy.

3. Analysis Challenges

 Complexity: Advanced analytics (e.g., machine learning) requires skilled expertise and
powerful tools.
 Integration: Combining data from different sources (e.g., web, sensors, databases) is
difficult.
 Security & Privacy: Ensuring safe handling of sensitive information (e.g., in healthcare
or finance) is critical.

7. Define data cleaning in the context of data preprocessing. Why is it considered foundational to data quality and analytics integrity?
Definition:
Data cleaning is the process of removing or correcting inaccurate, incomplete, or irrelevant
data from a dataset to ensure quality and reliability in analysis.
Importance of Data Cleaning

 Clean data is essential for making accurate and reliable inferences.


 It ensures that analysis results are not skewed or misleading due to flawed data.
 It is a foundational step in data preprocessing and critical for analytics integrity.

Common Data Issues:

 Missing Values: Data entries that are not recorded.


 Duplicates: Multiple identical rows or entries that can bias the analysis.
 Incorrect Data Types: Mismatched data types (e.g., numeric data stored as text).
 Inconsistent Formats: Variations in data representation (e.g., date formats).

Why It Is Foundational to Data Quality and Analytics Integrity:

1. Ensures Accuracy:
Removes errors like typos, duplicates, and incorrect values that can mislead analysis.
2. Improves Consistency:
Standardizes formats (e.g., date formats, units) to ensure uniformity across data.
3. Handles Missing Data:
Fills in or removes missing values to avoid biased or incomplete results.
4. Boosts Model Performance:
Clean data improves the reliability and accuracy of machine learning and analytics
models.
5. Builds Trust:
Reliable data builds trust in decisions made from analysis.
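
The four common issues listed above (missing values, duplicates, wrong data types, inconsistent formats) can be handled in one small pass. The following is a minimal sketch using only the standard library; the sample records, the field names, the list of accepted date formats, and the choice to drop rows with a missing amount (rather than fill them in) are all assumptions made for illustration.

```python
from datetime import datetime

raw = [
    {"id": "1", "amount": "250", "date": "2024-01-05"},
    {"id": "2", "amount": "", "date": "05/01/2024"},      # missing amount
    {"id": "1", "amount": "250", "date": "2024-01-05"},   # exact duplicate
    {"id": "3", "amount": "99.5", "date": "2024/01/07"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")  # assumed input formats


def parse_date(text):
    """Try each assumed format and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(records):
    """Drop duplicates and incomplete rows, then fix types and formats."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:            # duplicates: keep the first copy only
            continue
        seen.add(key)
        if rec["amount"] == "":    # missing values: drop the row here
            continue
        cleaned.append({
            "id": int(rec["id"]),              # incorrect types: text to number
            "amount": float(rec["amount"]),
            "date": parse_date(rec["date"]),   # inconsistent formats: one style
        })
    return cleaned


print(clean(raw))
```

Whether to drop or impute missing values depends on the analysis; dropping is used here only because it keeps the sketch short.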

8. What is Apache Hadoop? Describe its architectural components and ecosystem.
Definition:
Apache Hadoop is an open-source framework used for storing and processing large-scale data
in a distributed and fault-tolerant manner across clusters of computers using simple
programming models.

Key Features:

 Handles big data (structured, semi-structured, unstructured)


 Supports scalable and parallel processing
 Offers high fault tolerance and cost-effective storage
Hadoop Architecture Components

Apache Hadoop is mainly composed of three core components:

1. HDFS (Hadoop Distributed File System)

 Purpose: Distributed storage system that stores large data files across multiple machines.
 Components:
o NameNode: Manages file system metadata (directory structure, file locations).
o DataNodes: Store actual data blocks and respond to read/write requests.

2. MapReduce

 Purpose: Distributed data processing model.


 Components:
o JobTracker (in older versions): Manages job scheduling.
o TaskTracker: Executes tasks on DataNodes.

Newer versions use YARN for resource management.

3. YARN (Yet Another Resource Negotiator)

 Purpose: Manages system resources and job scheduling.


 Components:
o ResourceManager: Allocates resources across applications.
o NodeManager: Manages resources and task execution on individual nodes.

Hadoop Ecosystem

The Hadoop ecosystem consists of several tools and components that support data storage,
processing, analysis, and management:

Tool | Purpose
Hive | SQL-like querying on big data
Pig | High-level data flow language
HBase | NoSQL database for real-time access
Sqoop | Transfers data between Hadoop and RDBMS
Flume | Collects and transfers log data
Oozie | Workflow scheduling and coordination
ZooKeeper | Manages distributed coordination
Mahout | Machine learning on Hadoop
Spark | Fast in-memory data processing
9. Explain the architecture and working mechanism of the
Hadoop Distributed File System (HDFS).
HDFS (Hadoop Distributed File System) is the primary storage system of Hadoop. It is
designed to store very large files across multiple machines in a reliable, fault-tolerant, and
scalable way.

✅ HDFS Architecture

HDFS follows a master-slave architecture:

1. NameNode (Master)

 Manages metadata (file names, directory structure, block locations).


 Does not store the actual data.
 Keeps track of where each file is split and stored across DataNodes.

2. DataNodes (Slaves)

 Store the actual data blocks.


 Send periodic heartbeats to the NameNode to confirm they are active.
 Handle read/write requests from clients.

3. Secondary NameNode (Checkpoint Node)

 Does not replace the NameNode.


 Periodically merges the NameNode’s edit log into the fsimage to produce a checkpoint of the metadata.
 Helps in recovery if the NameNode fails.

✅ Working Mechanism

📥 1. Writing Data to HDFS

 A file is split into blocks (default size: 128MB).


 Each block is replicated (default: 3 copies) for fault tolerance.
 The NameNode decides which DataNodes will store the blocks.
 The client writes the blocks directly to the chosen DataNodes.
📤 2. Reading Data from HDFS

 The client requests the file from the NameNode.


 The NameNode returns the locations of blocks.
 The client reads data directly from DataNodes in parallel for faster performance.

✅ HDFS Features

 Fault Tolerance: If a DataNode fails, data is read from other replicas.


 High Throughput: Parallel processing and block-based architecture.
 Scalability: Easily adds more nodes to store growing data.
 Write Once, Read Many: Optimized for batch processing (not frequent updates).
10. Define data blocks in HDFS. Discuss how block size and
replication contribute to fault tolerance and reliability.
In HDFS, a data block is the smallest unit of storage. Large files are split into fixed-size blocks
(default size: 128 MB in Hadoop 2.x and later, 64 MB in Hadoop 1.x) and distributed across different
DataNodes in the Hadoop cluster.

✅ Block Size

 HDFS stores files by splitting them into blocks.


 Default block size is 128 MB, but it can be configured.
 Larger block size reduces the number of blocks and improves processing efficiency.

Example:
A 300 MB file is split into three blocks:

 Block 1 = 128 MB
 Block 2 = 128 MB
 Block 3 = 44 MB
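
The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch of the size calculation, not HDFS code; the function name `split_into_blocks` is made up for illustration.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block only uses what it needs
    return sizes


blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # → [128, 128, 44]
```

The last block is smaller than the configured block size because a block occupies only as much space as the data it holds.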

✅ Replication in HDFS

 Each block is replicated (default: 3 copies) and stored on different DataNodes.


 Replication ensures data availability and protection against node failures.

✅ How Block Size & Replication Contribute to:

🔹 Fault Tolerance

 If one DataNode fails, HDFS can read the block from another replica.
 This ensures uninterrupted data access even during hardware failures.

🔹 Reliability

 Replication keeps multiple copies of each block across the cluster.


 Ensures data durability and protects against data loss.
✅ Conclusion:

Data blocks, along with replication, form the backbone of HDFS’s fault-tolerant and reliable
architecture. The system is designed to continue working even if some parts fail, making
Hadoop ideal for storing big data safely.
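
To see why three replicas survive node failures, the toy simulation below places each block on three distinct nodes and then checks which blocks stay readable after some nodes fail. It is a simplified sketch: the round-robin placement, the node names, and the block IDs are assumptions for illustration, and real HDFS placement is rack-aware.

```python
import itertools


def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS placement considers racks; this does not.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    ring = itertools.cycle(nodes)
    return {b: [next(ring) for _ in range(replication)] for b in block_ids}


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    failed = set(failed_nodes)
    return {b for b, holders in placement.items() if set(holders) - failed}


placement = place_replicas(["blk_1", "blk_2", "blk_3"],
                           ["dn1", "dn2", "dn3", "dn4"])
print(placement)
print(sorted(readable_blocks(placement, ["dn1", "dn2"])))
```

With a replication factor of 3, any two node failures still leave at least one copy of every block; a block is lost only if all three nodes holding it fail before HDFS re-replicates it.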

11. Explain the limitations of Hadoop 1.x and describe how Hadoop 2.x addressed these issues. Highlight major architectural changes.
✅ Limitations of Hadoop 1.x

1. Single Point of Failure


o The NameNode was the only master; if it failed, the entire cluster stopped.
2. Limited Scalability
o Could handle only a few thousand nodes and jobs due to centralized job
management.
3. MapReduce-Only Processing Model
o Only supported MapReduce, limiting use cases like real-time, graph, or stream
processing.
4. Resource Management Bottleneck
o The JobTracker handled both job scheduling and resource management, creating
a bottleneck under heavy load.

✅ How Hadoop 2.x Addressed These Issues

🔹 1. Introduction of YARN (Yet Another Resource Negotiator)

 Decouples resource management from job processing.


 Allows multiple processing frameworks (e.g., MapReduce, Spark, Tez) to run on the
same cluster.

🔹 2. High Availability (HA) for NameNode

 Hadoop 2.x introduced Active and Standby NameNodes, eliminating single point of
failure.

🔹 3. Improved Scalability

 Can scale to much larger clusters (roughly 10,000 nodes) because YARN moves per-job scheduling out of the single central master.
🔹 4. Support for Non-MapReduce Applications

 With YARN, Hadoop 2.x supports diverse data processing models like:
o Apache Spark
o Apache Flink
o Graph processing (e.g., Giraph)

✅ Major Architectural Changes:

Hadoop 1.x | Hadoop 2.x
Single JobTracker | YARN splits the work between a ResourceManager and per-application ApplicationMasters
Only MapReduce supported | Multiple processing models supported
NameNode = single point of failure | High Availability for NameNode
Limited scalability | Supports larger clusters

12. Define a distributed system. Illustrate how it differs from a centralized system in terms of architecture and performance.
✅ What is a Distributed System?

A distributed system is a collection of independent computers (nodes) that work together as a
single system to achieve a common goal. These systems communicate over a network and
coordinate tasks to deliver results.

✅ What is a Centralized System?

A centralized system is one in which all the processing and data storage is done on a single
machine (server), and all clients are directly dependent on that central server.
✅ Key Differences:

Feature | Centralized System | Distributed System
Architecture | Single server handles all tasks | Multiple connected machines (nodes) share tasks
Data Storage | Stored in one central location | Spread across multiple machines
Fault Tolerance | Low – failure of central server breaks system | High – if one node fails, others can continue
Scalability | Limited – harder to scale vertically | High – can scale horizontally by adding more nodes
Performance | Slower under heavy load | Faster due to parallel processing
Cost | May be cheaper initially | Scales cost-effectively over time

✅ Example:

 Centralized System: A single server hosting a website with all files and databases.
 Distributed System: Google Search or Hadoop, where tasks and data are split across
thousands of machines.

13. Discuss the functionality of JobTracker and TaskTracker in the legacy Hadoop 1.x framework.
In the Hadoop 1.x framework, JobTracker and TaskTracker were the two main components
of the MapReduce processing engine.

✅ 1. JobTracker (Master)

Functionality:

 Runs on a master node (in small clusters, often the same machine as the NameNode).


 Manages MapReduce jobs submitted by clients.
 Schedules tasks on available TaskTrackers (workers).
 Tracks progress of tasks and handles task re-execution if a failure occurs.
 Keeps track of job status (success/failure).

Limitations:

 It was a single point of failure — if JobTracker failed, all jobs stopped.


 Managed both resource allocation and job scheduling, causing performance
bottlenecks at scale.
✅ 2. TaskTracker (Slave)

Functionality:

 Runs on each DataNode in the Hadoop cluster.


 Receives tasks (Map/Reduce) from the JobTracker.
 Executes tasks and sends heartbeat signals to JobTracker to report progress.
 If a task fails, JobTracker can assign it to another TaskTracker.

✅ Working Mechanism (Simplified Flow):

1. Client submits a job → JobTracker splits it into tasks.


2. JobTracker assigns tasks to TaskTrackers close to the data.
3. TaskTrackers execute the tasks and report status back.
4. On failure, JobTracker reassigns the task elsewhere.
5. JobTracker monitors and completes the job.

14. Describe the architectural components of YARN, including the ResourceManager, NodeManager, and ApplicationMaster.
✅ 1. ResourceManager (RM)

 The central authority that manages cluster resources.


 Two main components:
o Scheduler: Allocates resources (CPU, memory) based on constraints like
capacity or fairness.
o ApplicationsManager: Accepts submitted applications (e.g., MapReduce, Spark) and manages
their lifecycle, including starting each application's ApplicationMaster.
 Does not monitor or execute tasks directly.

Example: Decides how much memory and how many cores an application should get.

✅ 2. NodeManager (NM)

 Runs on each node (machine) in the cluster.


 Manages containers (which run tasks) on that node.
 Reports resource usage and node health to the ResourceManager via heartbeats.
 Launches and monitors tasks based on instructions from the ApplicationMaster.

Example: Starts a container to run a mapper task and monitors its progress.

✅ 3. ApplicationMaster (AM)

 One per application/job (e.g., one for each MapReduce job).


 Negotiates resources with the ResourceManager.
 Coordinates execution of tasks across multiple NodeManagers.
 Handles fault recovery if a task fails.

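Example: For a Spark job on YARN, the ApplicationMaster requests executor containers from the ResourceManager; in cluster mode it also hosts the Spark driver, which schedules and retries tasks.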
15. Describe the MapReduce programming model. Illustrate
its working with a relevant example.
Definition:
MapReduce is a programming model used for parallel and distributed processing of large datasets. It
simplifies big data processing by breaking it into two main tasks: Map and Reduce.

✅ 1. Map Function

 Takes input data and converts it into key-value pairs.


 Works independently on each data block in parallel.

✅ 2. Reduce Function

 Aggregates or processes the key-value pairs output from the Map step.
 Produces the final result.

✅ Functions of MapReduce

1. Splits Input Data into smaller chunks (Input Splits) for parallel processing.
2. Processes Data with Map Function to generate intermediate key-value pairs.
3. Groups Key-Value Pairs using the Shuffle and Sort phase (organizes by keys).
4. Aggregates Data with Reduce Function to compute the final output.
5. Distributes Tasks across multiple nodes in a cluster for parallel processing.
6. Handles Large-Scale Data efficiently by dividing and conquering.
7. Supports Multiple Programming Languages: Java natively, and others such as Python, C++, and Ruby through Hadoop Streaming or Hadoop Pipes.

Example of Map Reduce Process:


Stages of MapReduce

1. Input Splits
The input data is divided into smaller units called Input Splits. Each split is processed by
a single Map task.
2. Mapping
In this phase, each input split is passed to a Map function that processes the data and
produces key-value pairs (e.g., words and their counts).
3. Shuffling
The output from the Map phase is grouped by keys. All similar keys are brought
together along with their associated values.
4. Reducing
The Reduce function takes the grouped data from the shuffle phase and aggregates the
values to generate the final output (e.g., total word count).
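
The four stages can be sketched as a word count in plain Python. This is a single-process illustration of the model, not Hadoop code: the function names and the three sample splits are made up, and a real job would run the map and reduce functions on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffling: group values by key across the output of all mappers."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducing: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run map on every split, then shuffle and reduce the combined output."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_phase(mapped))


splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # input splits
print(word_count(splits))
```

Each call to `map_phase` depends only on its own split, which is what lets a cluster run the mappers in parallel.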

16. Define batch processing and real-time processing. Contrast their core principles and operational differences.
Batch Processing

Batch processing is the execution of a group of programs or jobs on a large volume of data
without user interaction. It is ideal for tasks that are not time-sensitive.

Key Characteristics:

 Data is collected, stored, and processed at once.


 Does not provide immediate results.
 Often scheduled to run during off-peak hours to reduce load on the system.
 Ideal for operations where latency is acceptable.

Examples:

 Generating monthly bank statements.


 Payroll processing at the end of the month.

Real-Time Processing

Real-time processing handles data immediately as it is generated or received. It is used where
immediate action or response is required.

 ✅ Example: ATM transactions, stock market systems, live traffic monitoring.


Key Characteristics:

 Data is processed instantly as it is generated or received.


 Immediate feedback or output is provided.
 Often used in mission-critical systems.
 Requires a continuous stream of incoming data and system availability.

Examples:

 ATM transactions and online banking.


 Live stock market updates.

🔄 Core Differences Between Batch and Real-Time Processing

Aspect | Batch Processing | Real-Time Processing
Processing Time | Processes data after collection (delayed) | Processes data immediately as it arrives
Data Handling | Handles large datasets at once | Handles individual or continuous streams of data
Latency | High latency (results come after processing is complete) | Low latency (results are instant or near-instant)
Use Case | Suitable for periodic tasks and historical data analysis | Suitable for time-sensitive applications
System Load | Usually runs during off-peak hours | Runs continuously and needs robust infrastructure
Examples | Salary generation, report creation | Fraud detection, live traffic updates
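
The difference in when results become available can be shown with a small sketch. Both functions below compute a total over the same transaction amounts; the function names and the sample values are illustrative assumptions only.

```python
def batch_total(stored_amounts):
    """Batch: process the full, already-collected dataset in one run."""
    return sum(stored_amounts)


def streaming_totals(events):
    """Real-time: update and emit a result as each event arrives."""
    total = 0
    for amount in events:
        total += amount
        yield total  # a result is available right after each event


amounts = [100, 250, 50]
print(batch_total(amounts))             # → 400
print(list(streaming_totals(amounts)))  # → [100, 350, 400]
```

The batch version returns one answer after all data is in, while the streaming version produces an up-to-date answer after every event, which is what low latency means in the table above.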

17. Explain Data Visualization in Big Data Analytics


🔹 Definition:

Data Visualization in Big Data Analytics is the process of displaying large and complex datasets
in visual formats like charts, graphs, maps, and dashboards to make data easy to understand
and analyze.
🔹 Purpose & Significance:

 Helps simplify complex data for quick insights.


 Supports faster decision-making.
 Reveals hidden patterns, trends, and anomalies.
 Makes it easier to communicate results with stakeholders.
 Useful for real-time monitoring through dashboards.

🔹 Common Tools:

 Tableau, Power BI, Google Data Studio, QlikView, [Link]

🔹 Types of Visuals:

Type | Use Case
Bar Chart | Compare values
Line Graph | Show trends over time
Pie Chart | Show parts of a whole
Heat Map | Show data density or range
Dashboard | Combine multiple visuals

🔹 Example:

An e-commerce site can use a dashboard to monitor:

 Sales trends (line graph)


 Top products (bar chart)
 Regional sales (map)
 Inventory (real-time indicators)
18. Discuss the shuffle and sort phase in MapReduce. Why is
it considered critical for data aggregation?
✅ What is Shuffle and Sort?

 The Shuffle phase occurs after the Map phase. It transfers the intermediate key-value
pairs generated by mappers to the reducers.
 The Sort phase organizes these key-value pairs so that all values corresponding to the
same key are grouped together.

Together, Shuffle and Sort prepare the data for the Reduce phase by grouping related data.

Functions of Shuffle and Sort

 Transfers intermediate key-value pairs from mappers to reducers.


 Groups all values with the same key together.
 Sorts the key-value pairs to organize data for efficient processing.
 Ensures that reducers receive complete and related data for each key.
 Enables accurate aggregation like counting, summing, or averaging.
 Improves reducer performance by minimizing processing overhead.

✅ Why is it Critical for Data Aggregation?

1. Grouping by Key:
Shuffle and Sort ensure that all occurrences of a particular key from different mappers are
collected together, which is essential for aggregation (e.g., counting, summing).
2. Data Organization:
Sorting data before reduction helps the reducer efficiently process the data in a sequential
manner.
3. Ensures Correctness:
Without this phase, reducers wouldn’t know which values belong to which key, making
aggregation impossible.
4. Performance Optimization:
Properly shuffled and sorted data minimizes reducer overhead and improves overall
MapReduce job efficiency.
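
The sketch below imitates the phase in plain Python: each pair is routed to a reducer by hashing its key (shuffle), and each reducer's input is then sorted and grouped by key (sort). It is an illustration under stated assumptions: `crc32` stands in for the framework's hash partitioner, and the function names and sample pairs are made up.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def partition(key, num_reducers):
    """Pick a reducer for a key; crc32 is a stand-in for the real hash."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route pairs to reducers, then sort and group each reducer's input."""
    buckets = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:           # shuffle: move pairs to reducers
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    reducer_inputs = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))      # sort: order pairs by key
        grouped = [(k, [v for _, v in pairs])
                   for k, pairs in groupby(bucket, key=itemgetter(0))]
        reducer_inputs.append(grouped)
    return reducer_inputs


mapper_outputs = [[("car", 1), ("river", 1)], [("car", 1), ("bear", 1)]]
for reducer_id, grouped in enumerate(shuffle_and_sort(mapper_outputs, 2)):
    print(reducer_id, [(k, sum(vs)) for k, vs in grouped])
```

Because the same key always hashes to the same reducer, both `("car", 1)` pairs end up together even though they came from different mappers, which is what makes the final sum correct.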
19. How does MapReduce ensure fault tolerance during job
execution and task failures?
✅ Fault Tolerance Mechanisms in MapReduce

1. Task Re-execution:
If a mapper or reducer task fails, the JobTracker (in Hadoop 1.x) or ApplicationMaster
(in YARN) detects the failure and restarts the task on another healthy node.
2. Speculative Execution:
To handle slow or “straggler” tasks, MapReduce may run duplicate tasks on multiple
nodes and use the result from the task that finishes first.
3. Data Replication:
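Input data and final job output are stored in HDFS, which replicates data blocks
across multiple nodes, so a retried task can read its input from another replica if a node
fails. Intermediate map output is kept on the mapper's local disk rather than in HDFS; if that
node is lost, the map task is re-run to regenerate it.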
4. Heartbeat Monitoring:
Nodes regularly send heartbeat signals to the master node. If a node stops sending
heartbeats, it is marked as failed and its tasks are rescheduled.
5. Task Checkpointing (Limited):
While classic MapReduce doesn’t checkpoint tasks, some frameworks built on
MapReduce provide task progress tracking to minimize re-computation.
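
Task re-execution, the first mechanism above, can be sketched as a retry loop that moves a failed task to the next node. This is a toy model, not Hadoop's scheduler: the function names, the node names, the simulated failure, and the `max_attempts` limit are assumptions for illustration.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node) on successive nodes until one attempt succeeds.

    max_attempts is an assumed, configurable limit on retries.
    """
    errors = []
    for attempt, node in enumerate(nodes[:max_attempts], start=1):
        try:
            return task(node), attempt
        except RuntimeError as exc:  # treat RuntimeError as a task failure
            errors.append(f"{node}: {exc}")
    raise RuntimeError("task failed on all attempts: " + "; ".join(errors))


def count_records(node):
    """Hypothetical task that fails on a node we pretend has crashed."""
    if node == "node-1":
        raise RuntimeError("node unreachable")
    return 42


result, attempts = run_with_retries(count_records,
                                    ["node-1", "node-2", "node-3"])
print(result, attempts)  # → 42 2
```

Re-running is safe because a map or reduce task is deterministic over its input split, so a second attempt on another node produces the same output.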

20. Elaborate on the process of input data splitting and task assignment to mapper instances.
✅ Input Data Splitting

 Large input data is divided into smaller chunks called input splits to enable parallel
processing.
 Each split usually matches the size of an HDFS block (128 MB by default in Hadoop 2.x and
later, 64 MB in Hadoop 1.x), but this can be configured.
 Input splits are logical divisions of data; a single file can be split into multiple parts.
 The system tries to assign splits to mappers running on the same node or nearby nodes to
take advantage of data locality and reduce network traffic.

✅ Task Assignment to Mapper Instances

 Each input split is processed by one mapper task.


 The system’s resource manager schedules mapper tasks based on resource availability
and data locality.
 Mappers run in parallel on different nodes, processing their assigned splits independently.
 This enables efficient and scalable processing of large datasets.

✅ Summary

 Input data is split into smaller parts called splits.


 Each split is assigned to a mapper task.
 Data locality is prioritized to optimize performance.
 This approach allows parallel and distributed processing across the cluster.
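
Splitting and locality-aware assignment can be sketched as follows. The code is a simplified model and not the actual scheduler: sizes are in MB, and the function names, node names, replica locations, and slot counts are hypothetical values chosen for illustration.

```python
def make_splits(file_size, split_size):
    """Return logical (offset, length) splits covering the whole file."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


def assign_mappers(split_locations, free_slots):
    """Assign each split to a node, preferring a node that holds its data.

    split_locations: {split_id: [nodes holding the split's block]}
    free_slots:      {node: number of free mapper slots}
    Returns {split_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)
    assignment = {}
    for split_id, holders in split_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free mapper slots")
        node = candidates[0]
        slots[node] -= 1
        assignment[split_id] = (node, node in holders)
    return assignment


print(make_splits(300, 128))  # → [(0, 128), (128, 128), (256, 44)]
locations = {0: ["dn1", "dn2"], 1: ["dn1", "dn3"], 2: ["dn1"]}
print(assign_mappers(locations, {"dn1": 1, "dn2": 1, "dn3": 1}))
```

In this run the third split cannot be placed on `dn1` because its only slot is taken, so it falls back to a non-local node and its data would have to travel over the network.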

21. What is Apache Spark? Compare its architecture and capabilities with Hadoop MapReduce.
Apache Spark is an open-source distributed computing system that processes big data quickly
using in-memory computation, making it faster than traditional disk-based systems.

Function:

 Processes large-scale data quickly using clusters


 Supports batch and real-time data processing
 Provides APIs for Java, Scala, Python, and R
 Enables tasks like data analysis, machine learning, and graph processing
 Efficiently handles iterative algorithms and interactive queries

The Apache Spark architecture has the following parts:

 The Driver Program contains the Spark Context, which coordinates the whole process.
 The Cluster Manager manages resources and schedules tasks across the cluster.
 The cluster has multiple Worker Nodes, each running Executors.
 Executors run Tasks and use Cache to store data in memory for faster processing.
 The Driver communicates with both the Cluster Manager and Worker Nodes to control
job execution.
Comparison: Apache Spark vs Hadoop MapReduce

Feature | Apache Spark | Hadoop MapReduce
Processing Model | In-memory processing for faster execution | Disk-based batch processing
Speed | Up to 100x faster for certain workloads due to in-memory computing | Slower due to frequent disk I/O
Ease of Use | Provides high-level APIs in Java, Scala, Python, and R | Primarily Java, more complex to program
Fault Tolerance | Uses RDD lineage to recompute lost data partitions | Relies on data replication in HDFS
Data Processing | Supports batch, streaming, interactive, and machine learning | Primarily batch processing
Flexibility | Integrates with MLlib, GraphX, Spark Streaming | Focused mainly on batch MapReduce jobs
Resource Management | Can run on YARN, Mesos, or standalone cluster | Runs on YARN or standalone
Use Cases | Real-time analytics, iterative algorithms, machine learning | Large-scale batch data processing

22. Discuss the concept of in-memory computation in Spark. How does it enhance performance?
Concept:

 Spark stores intermediate data in RAM (memory) instead of writing it to disk after every
step.
 This allows Spark to quickly access data during processing.
 It is especially useful for tasks that reuse the same data multiple times, like machine
learning or iterative algorithms.
 Avoiding disk read/write operations reduces the slow input/output delays.
 As a result, Spark performs much faster than traditional systems that depend on disk
storage.
 In-memory computation supports real-time and interactive data processing by
providing quick responses.

How it Enhances Performance:

 Faster Data Access: Reading and writing from memory is much faster than from disk,
reducing latency.
 Efficient Iterations: For algorithms that repeatedly access the same data (e.g., machine
learning), keeping data in memory avoids expensive disk I/O.
 Reduced Disk I/O: Minimizes the overhead of disk reads/writes, which are slower and
resource-intensive.
 Improved Throughput: Enables faster processing of large datasets by allowing Spark to
pipeline operations efficiently.
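
The benefit for iterative workloads can be shown with a small simulation. This is a plain-Python sketch under stated assumptions: SlowSource is a hypothetical stand-in for a dataset on disk, and the read counter stands in for disk I/O. In Spark itself the equivalent step would be marking a dataset with cache() or persist() before reusing it.

```python
class SlowSource:
    """Stands in for a dataset stored on disk; counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)


def iterate_without_cache(source, iterations):
    """Disk-style processing: reload the data on every iteration."""
    total = 0
    for _ in range(iterations):
        total += sum(source.load())
    return total


def iterate_with_cache(source, iterations):
    """In-memory style: load once, then reuse the copy held in memory."""
    cached = source.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_style = SlowSource([1, 2, 3, 4])
memory_style = SlowSource([1, 2, 3, 4])

result_disk = iterate_without_cache(disk_style, iterations=10)
result_memory = iterate_with_cache(memory_style, iterations=10)

print(disk_style.reads, memory_style.reads)
```

Both runs produce the same answer; the difference is that the cached version reads the source once instead of once per iteration, which is where the saving in disk I/O comes from.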

23. Compare batch processing and stream processing in the context of Spark and Hadoop
| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| Definition | Processing large volumes of data in chunks (batches) at scheduled times. | Processing data continuously in real-time as it arrives. |
| Example Frameworks | Hadoop MapReduce, Spark Batch API | Spark Streaming, Apache Flink, Apache Storm |
| Data Handling | Works on static, stored data | Works on live, continuous data streams |
| Latency | High latency (minutes to hours) | Low latency (seconds or milliseconds) |
| Use Case | Periodic reports, large-scale data analysis | Real-time monitoring, fraud detection, live analytics |
| Spark Support | Spark provides fast batch processing using in-memory computation | Spark Streaming handles real-time data streams using micro-batches |
| Hadoop Support | Hadoop MapReduce designed for batch processing, writing intermediate data to disk | Limited native streaming support; requires additional tools like Apache Storm or Flink |

Summary:

 Hadoop mainly focuses on batch processing of big data.
 Spark supports both batch and stream processing, with faster batch and near real-time
stream capabilities.
 Stream processing allows faster insights by handling data continuously, while batch
processing works well for large, offline data sets.
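
The micro-batch idea mentioned for Spark Streaming can be sketched as follows. This is a plain-Python illustration, not the Spark Streaming API: the function name micro_batches, the sample events, and the one-second interval are all assumptions chosen for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length time windows."""
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // batch_interval)  # which interval the event falls in
        batches.setdefault(window, []).append(value)
    # Windows that received no events are simply skipped in this sketch.
    return [batches[w] for w in sorted(batches)]


# Hypothetical stream: (arrival time in seconds, transaction amount)
events = [(0.2, 5), (0.9, 3), (1.1, 7), (2.5, 1), (2.7, 4)]

# Each small batch is then processed with ordinary batch logic.
per_batch_sums = [sum(batch) for batch in micro_batches(events, batch_interval=1.0)]
print(per_batch_sums)
```

Each small batch is handled like a tiny batch job, which is why this style gives near real-time results with latency of about one batch interval, as opposed to record-at-a-time streaming.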

24. Compare NoSQL systems with traditional relational databases in terms of scalability, schema design, and consistency.
| Point | NoSQL Systems | Relational Databases (RDBMS) |
| --- | --- | --- |
| 1. Scalability | Horizontally scalable (add servers) | Vertically scalable (upgrade server) |
| 2. Schema | Flexible or schema-less | Fixed schema (tables, columns) |
| 3. Data Model | Key-value, document, graph, column-family | Tables with rows and columns |
| 4. Consistency | Eventual consistency (some systems) | Strong consistency (ACID transactions) |
| 5. Query Language | Varies, often no standard (e.g., MongoDB uses JSON-like queries) | Standard SQL language |
| 6. Use Case | Big data, real-time apps, social networks | Banking, ERP, applications needing complex joins |
| 7. Example Systems | MongoDB, Cassandra, Redis | MySQL, PostgreSQL, Oracle |
| 8. Transaction Support | Limited or eventual consistency | Full ACID support |
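
The schema row can be demonstrated with a short experiment. This is a sketch under stated assumptions: SQLite (from Python's standard library) stands in for an RDBMS, a plain dictionary of dictionaries stands in for a document store such as MongoDB, and the table, field names, and sample records are made up for illustration.

```python
import sqlite3

# Relational side: the schema is declared up front and enforced on every insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Asha')")

try:
    # 'email' is not a column of the table, so the database rejects the row.
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (2, 'Ravi', 'ravi@example.com')"
    )
    relational_accepts_new_field = True
except sqlite3.OperationalError:
    relational_accepts_new_field = False

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Document-store side (simulated): each record carries its own fields.
documents = {}
documents[1] = {"name": "Asha"}
documents[2] = {"name": "Ravi", "email": "ravi@example.com"}  # new field, no migration

emails = [doc["email"] for doc in documents.values() if "email" in doc]
```

With the fixed schema, adding the new field first requires changing the table definition (for example with ALTER TABLE); with the flexible model the application must instead cope with records that lack the field.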

25. Describe the differences between vertical scaling in RDBMS and horizontal scaling in HDFS.
| Point | Vertical Scaling (RDBMS) | Horizontal Scaling (HDFS) |
| --- | --- | --- |
| 1. Scale Type | Add CPU, RAM, storage to one server | Add more servers (nodes) to the cluster |
| 2. Growth Limit | Limited by hardware max capacity | Can grow by adding many nodes |
| 3. Cost | Expensive high-end hardware | Uses many low-cost commodity machines |
| 4. Fault Tolerance | Single point of failure | Data replicated across nodes for safety |
| 5. Performance | Boosted by better hardware | Boosted by distributing workload |
| 6. Maintenance | Harder to upgrade (downtime risk) | Easier to add/remove nodes |
| 7. Data Handling | Centralized storage | Distributed storage |
| 8. Use Case | Traditional databases, complex queries | Big data, batch processing |
| 9. Complexity | Simpler architecture | More complex cluster management |
| 10. Example | Upgrading Oracle or MySQL server | Hadoop HDFS with multiple servers |
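
The fault-tolerance and growth rows can be illustrated with a toy block-placement model. This is a simplified plain-Python sketch, not HDFS code: the round-robin placement, function names, and node names are assumptions (real HDFS placement is rack-aware), although a replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    copies = min(replication, len(nodes))
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(copies)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


def max_replicas_per_node(placement):
    """Heaviest storage load carried by any single node."""
    counts = {}
    for holders in placement.values():
        for node in holders:
            counts[node] = counts.get(node, 0) + 1
    return max(counts.values())


# Vertical scaling: one large server holds everything.
single = place_blocks(8, ["big-server"], replication=1)
survive_single = readable_blocks(single, failed_nodes={"big-server"})

# Horizontal scaling: four commodity nodes, three replicas per block.
small_cluster = place_blocks(8, ["node1", "node2", "node3", "node4"])
survive_cluster = readable_blocks(small_cluster, failed_nodes={"node1"})

# Growing the cluster spreads the same data over more machines.
large_cluster = place_blocks(8, [f"node{i}" for i in range(1, 9)])
load_small = max_replicas_per_node(small_cluster)
load_large = max_replicas_per_node(large_cluster)
```

Losing the single server makes every block unavailable, while the cluster keeps serving all blocks after one node fails, and doubling the node count halves the per-node load in this model.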
