UNIT-1
WHAT IS BIG DATA?
Big Data refers to collections of datasets that are too large, too fast-moving, or too complex
for traditional data-processing tools to store and process.
Key Features (5 V’s of Big Data)
1. Volume – Huge amount of data (TB, PB).
2. Velocity – Data generated very fast (real-time streaming).
3. Variety – Many types: text, images, videos, logs, sensor data, etc.
4. Veracity – Data may be messy, incomplete, or inconsistent.
5. Value – Useful insights that can be extracted from the data.
Examples of Big Data
Social media data from Facebook/Instagram/Twitter
Data from IoT devices like sensors, cameras
Online shopping data from Amazon/Flipkart
Banking transaction logs
Health records
Why Is Big Data Important?
Helps companies make better decisions
Predict customer behavior
Detect fraud
Improve business performance
1.1 EVOLUTION OF BIG DATA
The concept of Big Data has evolved over several decades as data generation, storage, and
processing technologies have advanced. Its evolution can be divided into major phases:
1. Pre-Big Data Era (1960s – 1990s)
Main Features:
Data stored mainly in files, databases, and data warehouses.
Limited storage capacity and slow processing.
Structured data only (tables, numbers, records).
Key Technologies:
Relational Databases (RDBMS)
SQL
Data Warehousing (1980s)
2. Early Big Data Concepts (2000 – 2005)
Main Features:
Rapid growth of the internet, social media, and online systems.
More unstructured data (images, videos, logs).
Traditional databases became insufficient.
Major Milestone:
In 2001, analyst Doug Laney introduced the 3 Vs model of Big Data:
o Volume – huge amount of data
o Velocity – speed of data generation
o Variety – different data formats
3. Big Data Era (2006 – 2015)
What Changed?
Explosion of data due to smartphones, IoT, cloud storage, e-commerce, and social
networks.
Need for distributed processing.
Technological Breakthrough:
Google’s MapReduce (2004) → distributed data processing model.
Hadoop (2006) → open-source big data framework.
NoSQL databases → MongoDB, Cassandra for unstructured data.
Important Characteristics Added:
Veracity – trustworthiness of data
Value – turning data into insights
1.2 BEST PRACTICES FOR BIG DATA ANALYTICS:
1. Define Clear Business Objectives
Start with a specific problem or goal (e.g., fraud detection, customer segmentation).
Avoid collecting data without purpose.
2. Collect Only Relevant Data
Focus on quality, not just quantity.
Avoid “data overload” by filtering unnecessary data.
Ensure data comes from trustworthy sources.
3. Ensure Data Quality
Remove duplicate, incomplete, or inconsistent data.
Use data cleansing, validation, and preprocessing techniques.
High-quality data improves accuracy of analytics.
4. Use Scalable Storage Solutions
Choose storage that grows with your data:
o Cloud storage (AWS S3, Azure Blob)
o Hadoop HDFS
o Data lakes
Must support structured, semi-structured, and unstructured data.
5. Implement Strong Data Security & Privacy
Encrypt data at rest and in transit.
Use authentication + authorization controls.
Follow regulations like GDPR or local privacy laws.
6. Choose the Right Tools & Technologies
Use proper tools for your workload:
o Batch processing → Hadoop, Spark
o Real-time processing → Kafka, Flink
o Database type → NoSQL, columnar, graph DBs
Consider cost, performance, and ease of integration.
7. Use Distributed Processing
Break large tasks into smaller chunks to process in parallel.
Tools: Spark, MapReduce, Flink.
Ensures high performance and faster results.
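The split-process-combine idea above can be sketched on a single machine. This is only an illustration, assuming a thread pool stands in for the worker nodes of a cluster; the names chunked, partial_sum, and parallel_sum are hypothetical, and engines such as Spark or Flink add data distribution and failure handling on top of this pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def partial_sum(chunk):
    """Work done independently on one chunk (here: a simple sum)."""
    return sum(chunk)


def parallel_sum(items, chunk_size=1000, workers=4):
    """Split the input, process the chunks concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunked(items, chunk_size)))
    return sum(partials)


data = list(range(10_000))
total = parallel_sum(data)
print(total)
```

Because each chunk is processed without reference to the others, the same structure carries over when the chunks live on different machines.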
8. Build a Skilled Data Team
Data engineers (storage, pipelines)
Data scientists (models, insights)
Analysts (reports, decision-making)
IT/security team
A combination of skills ensures proper handling of big data.
9. Use Automation & Machine Learning
Automate:
o Data cleaning
o Feature extraction
o Model tuning
Improves efficiency and reduces human errors.
10. Monitor Performance Continuously
Track:
o Data processing speed
o System load
o Resource usage
o Model accuracy
Use dashboards and log monitoring.
11. Maintain Data Governance
Define rules for:
o Who owns the data
o Who has access
o How long data is stored
o Data lifecycle management
Ensures compliance and accountability.
1.3 VALIDATING
What is validating?
Validating means checking whether the data is correct, accurate, and useful before using it
for analysis.
It ensures that the data:
Follows the required format
Has no errors
Is complete
Is reliable
Matches expected rules or conditions
Examples of Data Validation
1. Checking missing values
o Ensuring no important fields are empty.
2. Checking data types
o Age should be numeric, name should be string, etc.
3. Range validation
o Marks must be between 0–100.
4. Format validation
o Email must contain “@”
o Date should be in “YYYY-MM-DD” format.
5. Removing duplicates
o No repeated records.
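The five checks above can be sketched in plain Python. This is a minimal illustration, not a complete validator: the field names (name, age, marks, email, joined) and the sample records are hypothetical, and the email rule is deliberately as loose as the one stated above (it only requires an "@" with text on both sides).

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+$")   # loose check: text, "@", text
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # shape of YYYY-MM-DD
REQUIRED = ("name", "age", "marks", "email", "joined")


def validate_record(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = [f"missing {f}" for f in REQUIRED if record.get(f) in (None, "")]

    age = record.get("age")
    if age not in (None, "") and not isinstance(age, int):
        problems.append("age must be an integer")          # data type check

    marks = record.get("marks")
    if isinstance(marks, (int, float)) and not 0 <= marks <= 100:
        problems.append("marks must be between 0 and 100")  # range check

    email = record.get("email")
    if isinstance(email, str) and email and not EMAIL_RE.match(email):
        problems.append("email must contain '@'")           # format check

    joined = record.get("joined")
    if isinstance(joined, str) and joined:
        try:
            if not DATE_RE.match(joined):
                raise ValueError("wrong shape")
            datetime.strptime(joined, "%Y-%m-%d")            # rejects e.g. month 13
        except ValueError:
            problems.append("joined must be YYYY-MM-DD")
    return problems


def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


records = [
    {"name": "Asha", "age": 21, "marks": 88, "email": "asha@example.com", "joined": "2024-01-15"},
    {"name": "Asha", "age": 21, "marks": 88, "email": "asha@example.com", "joined": "2024-01-15"},
    {"name": "", "age": "21", "marks": 130, "email": "ravi.example.com", "joined": "15/01/2024"},
]
unique = drop_duplicates(records)
report = {i: validate_record(r) for i, r in enumerate(unique)}
print(report)
```

Returning a list of problems, rather than stopping at the first one, makes it easier to report every issue in a record in one pass.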
Why Is Validation Important?
Ensures quality of data
Prevents wrong results in analysis
Helps maintain accuracy, reliability, and trustworthiness
1.4 THE PROMOTION OF THE VALUE OF BIG DATA
Big Data refers to extremely large and complex datasets generated from sources such as social
media, sensors, mobile devices, transactions, and IoT systems. Promoting the value of Big Data
means creating awareness of how these datasets can improve decision-making, efficiency,
innovation, and business growth across sectors. It highlights the need for organizations to
adopt data-driven strategies and use analytics tools to gain insights.
1. Enhances Decision-Making
Big Data analytics helps organizations make accurate and real-time decisions. By analyzing
customer behavior, market trends, and operational data, businesses reduce guesswork and rely on
evidence-based decisions. This leads to improved outcomes and increased efficiency.
2. Improves Customer Experience
Big Data enables companies to understand customer preferences, buying patterns, and feedback.
This helps in creating personalized products, targeted advertisements, and better customer
service. Companies like Amazon and Netflix use Big Data to provide personalized
recommendations.
3. Drives Innovation
Organizations can use Big Data to identify new market opportunities, improve existing products,
and design innovative solutions. It exposes hidden patterns and correlations that can help in
developing smarter products and services.
4. Increases Operational Efficiency
Big Data tools help automate processes, optimize resource usage, and reduce operational costs.
Real-time monitoring of systems allows organizations to detect problems early and improve
productivity.
5. Supports Predictive Analytics
One of the major values of Big Data is the ability to predict future trends using machine learning
and statistical models. Industries like finance, healthcare, and retail use predictive analytics for
fraud detection, demand forecasting, and disease prediction.
6. Strengthens Risk Management
Big Data helps organizations identify risks related to finance, cybersecurity, and operations. By
analyzing historical data and anomalies, companies can implement preventive measures to avoid
losses and ensure safety.
7. Competitive Advantage
Organizations that adopt Big Data can gain an edge over competitors: they can detect market
changes sooner, adapt more quickly, and offer better products and services. Data-driven
companies tend to grow faster than companies that rely on intuition alone.
8. Promotes Data-Driven Culture
Promoting the value of Big Data encourages companies to develop a culture where decisions are
based on data instead of intuition. Employees are trained in data literacy, analytics tools, and
modern technologies, making organizations more future-ready.
9. Encourages Investment in Technology
To unlock Big Data’s value, organizations are motivated to invest in cloud computing, data
warehouses, AI, and analytics platforms. These technologies help store, process, and analyze
massive data efficiently.
10. Benefits Government and Society
Big Data is useful not only in business but also in public services. Governments use Big Data for
smart cities, disaster management, public health monitoring, and transportation planning. It
improves governance and societal well-being.
1.5 BIG DATA USE CASES:
Big Data is used across many industries to solve complex problems, improve operations, and
make better decisions. Below are the most common and impactful use cases:
1. Healthcare
Big Data helps doctors analyze patient history, medical records, and real-time monitoring
data.
Used for:
o Disease prediction
o Early diagnosis
o Tracking pandemics
o Personalized treatment
Example: Predicting heart attacks using sensor data.
2. Banking & Finance
Big Data is widely used to detect fraudulent transactions.
Applications include:
o Credit risk analysis
o Algorithmic trading
o Customer segmentation
o Insurance claim analysis
Example: Banks analyze real-time transactions to stop fraud instantly.
3. Retail & E-commerce
Helps understand customer buying behavior.
Used for:
o Recommendation systems (Amazon, Flipkart)
o Dynamic pricing
o Inventory management
o Personalized marketing
Example: Netflix uses Big Data to recommend movies.
4. Social Media Analytics
Platforms like Facebook, Twitter, and Instagram generate massive data.
Big Data helps in:
o Sentiment analysis
o Trend prediction
o Advertisement targeting
Example: Analyzing tweets during elections.
5. Transportation & Logistics
Big Data improves route planning and reduces travel time.
Used for:
o Traffic prediction
o Fleet management
o Vehicle tracking
o Supply chain optimization
Example: Uber uses Big Data for surge pricing and route optimization.
6. Government & Public Sector
Enhances governance and public safety.
Applications:
o Smart city planning
o Crime prediction
o Disaster management
o Pollution monitoring
Example: CCTV data and sensors used for real-time crime detection.
7. Manufacturing & Industry 4.0
Used for:
o Predictive maintenance of machines
o Quality control
o Automation
o Production optimization
Example: Sensors detect machine failures before they happen.
8. Education
Big Data helps analyze student performance and learning patterns.
Applications:
o Personalized learning
o Attendance monitoring
o Online learning analytics
Example: Tracking student progress in digital platforms.
9. Telecommunications
Used for:
o Network optimization
o Customer churn prediction
o Fraud detection
Example: Telecom companies analyze call data to predict network failures.
10. Energy & Utilities
Used in:
o Smart meters
o Power consumption forecasting
o Renewable energy management
Example: Smart grids use Big Data to balance electricity supply.
1.6 CHARACTERISTICS OF BIG DATA APPLICATIONS
Big Data applications deal with extremely large, complex, and fast-growing datasets.
These applications are typically described using the 5 V’s of Big Data, expanded to 7 V’s
in modern systems.
1. Volume
Refers to the huge amount of data generated from various sources such as social media,
sensors, IoT devices, logs, and transactions.
Data ranges from terabytes to petabytes or even exabytes.
Big Data applications must store, process, and analyze massive datasets using distributed
systems (e.g., Hadoop, Spark).
2. Velocity
Represents the speed at which data is generated, collected, and processed.
Real-time or near–real-time processing is required in applications like stock markets,
fraud detection, and IoT monitoring.
Stream processing tools (Kafka, Flink, Spark Streaming) handle this high-speed data
flow.
3. Variety
Data comes in multiple forms, not just structured tables.
Includes:
o Structured: RDBMS tables
o Semi-structured: JSON, XML
o Unstructured: text, audio, video, images, logs
Big Data applications must integrate and analyze all types of formats.
4. Veracity
Refers to the quality, accuracy, and reliability of data.
Big Data often contains inconsistencies, missing values, noise, or duplication.
Data cleaning, preprocessing, and validation are crucial.
5. Value
Often considered the most important characteristic.
Big Data must produce meaningful insights, predictions, or decisions that add
business/organizational value.
Examples: customer insights, fraud detection, recommendation systems.
6. Variability
Data meaning and structure may change frequently.
For example, seasonal trends, unpredictable patterns in social media sentiment.
Big Data applications must adapt to changing data patterns.
7. Visualization
Extracted insights must be presented clearly using dashboards, graphs, and charts.
Tools: Tableau, Power BI.
Helps decision makers understand complex patterns easily.
1.7 PERCEPTION AND QUANTIFICATION OF VALUE
Big Data has become a strategic asset for modern organizations. However, the true significance
of Big Data lies not only in collecting large volumes of data but also in the value the data
creates. The value dimension of Big Data is understood through two important aspects:
Perception of Value and Quantification of Value. Together, they help organizations recognize
potential benefits and measure the actual impact of Big Data initiatives.
1. Perception of Value
Perception of value refers to the awareness and understanding of the potential benefits that
Big Data can provide within an organization. It answers the question: “What value can Big
Data bring to our business?” This is a subjective assessment and varies from one organization to
another.
a) Role of Business Goals
Perception depends heavily on the organization’s strategic goals.
Retail industries may perceive value in analyzing customer buying trends.
Banking sectors may perceive value in detecting fraud and improving risk management.
Healthcare may perceive value in predicting diseases and improving patient care.
b) Stakeholder Expectations
Different stakeholders view value differently:
Management expects higher profits.
Operations teams expect improved efficiency.
Marketing teams expect better customer targeting.
c) Data Sources and Patterns
Organizations must understand the nature and variety of data they collect.
Perceived value increases when data can uncover hidden patterns, trends, or insights.
d) Industry Context
Different industries have different perceptions of value based on competition, regulations, and
market demands.
For example, in telecommunications, value is seen in reducing customer churn, while in IoT
systems, value lies in real-time monitoring and predictive maintenance.
e) Tangible and Intangible Value
Perception includes both:
Tangible value: revenue growth, cost reduction, resource optimization.
Intangible value: customer satisfaction, improved decision-making, better reputation,
innovation.
Thus, perception of value forms the foundational understanding of why Big Data initiatives
should be undertaken.
2. Quantification of Value
Quantification of value refers to measuring the actual benefits achieved through Big Data
applications. It answers the question: “How much value have we gained from Big Data?” Unlike
perception, quantification relies on numerical metrics and verifiable data.
a) Financial Metrics
These involve direct monetary benefits:
Return on Investment (ROI) from analytics solutions
Increased revenue through improved targeting and personalization
Cost savings through process optimization
Reduction in operational expenses
b) Operational Metrics
Operational impact is measured through improvements in internal processes:
Faster data processing and reduced latency
Increased automation and efficiency
Reduction in system downtime
Improved productivity and resource utilization
c) Customer-Related Metrics
Big Data applications significantly improve customer experience:
Increased customer satisfaction (CSAT)
Higher Net Promoter Score (NPS)
Increased customer retention
Reduced churn rate
These measurements show how well organizations respond to customer needs using
insights from data.
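Several of the financial and customer metrics listed so far reduce to simple ratios. The sketch below uses the standard textbook definitions; the function names and all input figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost


def churn_rate(customers_at_start, customers_lost):
    """Share of customers lost during a period."""
    return customers_lost / customers_at_start


def net_promoter_score(ratings):
    """NPS from 0-10 survey ratings: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical figures for one analytics project and one survey.
project_roi = roi(gain=150_000, cost=100_000)                      # 0.5, i.e. 50%
monthly_churn = churn_rate(customers_at_start=2_000, customers_lost=100)  # 0.05, i.e. 5%
nps = net_promoter_score([10, 9, 8, 7, 6, 3, 10, 9, 5, 10])        # 5 promoters, 3 detractors -> 20.0
print(project_roi, monthly_churn, nps)
```

Comparing these numbers before and after a Big Data initiative is one way to turn a perceived benefit into a measured one.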
d) Risk and Security Metrics
Especially in finance, IoT, and cybersecurity sectors:
Reduction in fraud incidents
Improved threat detection and response time
Lower probability of security breaches
Better compliance with regulations
Quantifying risk reduction helps organizations evaluate the effectiveness of Big Data in
safeguarding systems.
e) Analytical and Predictive Accuracy
Big Data value can also be measured through:
Model accuracy (e.g., precision, recall, RMSE)
Prediction performance
Insight quality
High analytical accuracy reflects high value in data-driven decision-making.
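The accuracy measures named above can be computed directly from predictions and actual outcomes. This is a small sketch using the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), RMSE = square root of the mean squared error); the labels and forecast values are made-up sample data.

```python
import math


def precision_recall(actual, predicted):
    """Precision and recall for binary labels, where 1 is the positive class (e.g. fraud)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rmse(actual, predicted):
    """Root mean squared error for numeric predictions (e.g. a demand forecast)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fraud labels: 3 true positives, 1 false positive, 1 false negative.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(actual_labels, predicted_labels)  # 0.75, 0.75

# Made-up demand forecast: every prediction is off by exactly 10 units.
forecast_error = rmse([100, 150, 200], [110, 140, 190])                # 10.0
print(precision, recall, forecast_error)
```

Precision and recall suit classification tasks such as fraud detection, while RMSE suits numeric forecasts such as demand prediction.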
3. Importance of Linking Perception and Quantification
For Big Data projects to be successful, organizations must align what they expect (perception)
with what they achieve (quantification). When both are connected:
Resource allocation becomes effective
Business outcomes become measurable
Decision-making becomes more transparent
Investments in Big Data technologies become justified
1.8 UNDERSTANDING BIG DATA STORAGE
Big Data Storage refers to the systems, architectures, and technologies used to store extremely
large, diverse, and fast-growing datasets. Traditional storage systems (like RDBMS) are not
capable of handling the Volume, Variety, and Velocity of Big Data. Therefore, specialized
distributed storage solutions are required to ensure scalability, reliability, and high-speed access.
1. Need for Big Data Storage
Big Data storage solutions must handle:
Massive data size (TB, PB, EB)
Unstructured formats (text, images, logs, IoT data)
High throughput for real-time applications
Scalable expansion as data grows
Fault tolerance in case of hardware failures
A single server cannot meet all of these needs, which is why Big Data systems use distributed storage.
2. Key Characteristics of Big Data Storage
a) Distributed Architecture
Data is stored across multiple machines (nodes).
Even if one node fails, data remains accessible because copies are stored elsewhere.
b) Scalability
Storage systems can grow by simply adding more nodes.
Horizontal scaling supports huge data growth.
c) Fault Tolerance
Big Data storage automatically replicates data across multiple nodes.
This protects against data loss when hardware fails.
d) High Throughput
Systems are optimized for rapid data ingestion and retrieval
(e.g., sensor streams, logs, real-time analytics).
e) Flexible Data Models
Can store:
Structured data
Semi-structured data (JSON, XML)
Unstructured data (videos, text, images)
3. Types of Big Data Storage Technologies
a) Distributed File Systems (DFS)
The most important example is:
HDFS (Hadoop Distributed File System)
o Stores very large files across multiple nodes
o Provides replication and fault tolerance
o Used in Hadoop ecosystem (MapReduce, Hive, Pig)
b) NoSQL Databases
Designed for high scalability and flexible schemas.
Types include:
Key-Value Stores: Redis, DynamoDB
Column-Oriented Stores: HBase, Cassandra
Document Stores: MongoDB, CouchDB
Graph Databases: Neo4j
These databases efficiently store unstructured and semi-structured data.
c) Object Storage
Used for large file storage in cloud and enterprise systems.
Examples:
Amazon S3
Google Cloud Storage
Azure Blob Storage
Object storage uses buckets and objects instead of traditional file/folder structure.
d) Data Lakes
A central repository that stores:
Raw data
Processed data
Structured and unstructured data
Useful for machine learning, IoT, and analytics workloads.
e) Cloud Big Data Storage
Cloud providers offer fully managed storage with:
High availability
Automatic scaling
Global access
Examples: AWS, Azure, Google Cloud.
4. How Big Data Storage Works (Simplified Flow)
+-----------------------------+
| Data Sources (IoT, Web, |
| Social Media, Logs) |
+-------------+--------------+
|
V
+---------------------+
| Data Ingestion |
| (Kafka, Flume) |
+---------------------+
|
V
+-----------------------------------------+
| Distributed Storage (HDFS / NoSQL / |
| Object Storage / Data Lake) |
+-----------------------------------------+
|
V
+-----------------------+
| Processing Layer |
| (Spark, MapReduce) |
+-----------------------+
|
V
+-----------------------+
| Analytics / ML |
+-----------------------+
5. Benefits of Big Data Storage
Handles massive data efficiently
Supports real-time and batch processing
Offers high scalability
Ensures high availability and durability
Reduces cost using commodity hardware
Works well for diverse data sources
1.9 A GENERAL OVERVIEW OF HIGH PERFORMANCE ARCHITECTURE
High Performance Architecture (HPA) refers to the design and organization of computing
systems that deliver extremely high processing speed, large-scale data handling capability, and
maximum system efficiency. These architectures are used in applications requiring rapid
computation such as scientific simulations, big data analytics, machine learning, weather
forecasting, genomic analysis, and enterprise-level transaction processing.
High Performance Architectures focus on parallelism, scalability, throughput, and optimized
resource utilization to perform billions of operations per second.
1. Key Goals of High Performance Architecture
a) High Computational Speed
The primary goal is to execute large and complex computations in minimal time.
b) Scalability
Systems must support growth in data size and computation by adding more processors, memory,
and nodes.
c) Efficient Resource Utilization
Efficient use of CPU, memory, storage, and network resources ensures maximum throughput.
d) High Availability
HPA systems are designed to stay reliable and keep running even when hardware fails.
2. Components of High Performance Architecture
a) Multi-core and Many-core Processors
Modern HPA uses CPUs and GPUs with multiple cores.
CPUs: Best for general-purpose tasks
GPUs: Best for parallel computations (AI, ML, simulations)
b) Parallel Processing Units
High performance systems depend on:
Shared-memory multiprocessing
Distributed-memory multiprocessing
Hybrid parallel systems
This allows them to break a large task into many smaller tasks executed simultaneously.
c) High-Speed Memory Subsystems
HPA uses:
DDR4/DDR5 RAM
High-bandwidth memory (HBM)
Large cache hierarchy
NUMA-based memory organization
These reduce latency and speed up data access.
d) High-Speed Networking
Nodes are interconnected using high-performance networks such as:
InfiniBand
NVLink
High-speed Ethernet
These support fast data transfer between distributed nodes.
e) High Performance Storage
Storage is optimized using:
NVMe SSDs
Parallel File Systems (e.g., Lustre, GPFS)
Distributed Storage Systems
This boosts I/O throughput for big data applications.
3. Types of High Performance Architecture
a) High Performance Computing (HPC) Systems
Supercomputers and clusters used for scientific simulations, weather models, physics research,
etc.
b) High Throughput Computing (HTC) Systems
Designed to process large volumes of independent tasks (e.g., grid computing, batch processing).
c) High Performance Data Analytics (HPDA)
Big Data and analytics systems like Hadoop, Spark, Flink.
d) Cloud-based HPA
Cloud providers offer scalable, on-demand, high performance architectures using:
AWS HPC
Azure CycleCloud
Google Cloud HPC
4. Architectural Models in High Performance Systems
a) Shared Memory Architecture
All processors share a common memory.
Simple to program but limited scalability.
b) Distributed Memory Architecture
Each processor has its own private memory.
Highly scalable and used in HPC clusters.
c) Hybrid Architecture
Combines both shared and distributed memory.
Used in modern supercomputers.
d) Cluster Architecture
Multiple interconnected computers work as a single system.
Provides cost-efficient high performance.
e) GPU & Accelerator-based Architectures
GPUs, TPUs, and FPGAs accelerate parallel workloads.
Widely used in AI/ML applications.
5. Key Features of High Performance Architecture
Parallelism: Exploiting concurrency at hardware, process, and data levels.
Low Latency and High Bandwidth in communication.
Fault Tolerance: Ability to recover from failures without data loss.
Load Balancing: Distributing tasks evenly across processors.
Energy Efficiency: Optimizing performance per watt.
6. Applications of High Performance Architecture
Scientific research and simulations
Artificial Intelligence & Machine Learning
Big Data analytics
Internet of Things (IoT) monitoring
Weather forecasting
Financial modeling (stock markets)
Engineering design (CAD, automotive, aerospace)
Bioinformatics (gene sequencing, protein folding)
1.10 HDFS – HADOOP DISTRIBUTED FILE SYSTEM
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
to store and manage Big Data in a distributed, scalable, and fault-tolerant manner. It is
designed to run on clusters of commodity hardware and handle very large datasets that
cannot be stored on a single machine.
HDFS follows a master–slave architecture and allows high-throughput, reliable access
to data.
1. Key Features of HDFS
a) Distributed Storage
Data is split into blocks and stored across many machines (nodes) in the cluster.
b) Fault Tolerance
Each block is stored in multiple copies (replication).
If one node fails, data is still available from the other nodes.
c) Scalability
HDFS allows horizontal scaling by adding more nodes to the cluster.
d) High Throughput
Optimized for write-once, read-many workloads, which suits Big Data analytics.
e) Supports Large Files
HDFS can store files of hundreds of GB or TB in size.
f) Cost-Effective
Runs on low-cost commodity hardware instead of expensive enterprise servers.
2. HDFS Architecture
HDFS uses a Master–Slave architecture, consisting of:
a) NameNode (Master)
The main controller of HDFS.
Manages the file system metadata:
o File names
o Directory structure
o Block locations
o Replication information
Does not store actual data, only metadata.
Must be highly available.
b) DataNodes (Slaves)
Store actual data blocks.
Perform read/write operations requested by clients.
Send periodic heartbeat messages to the NameNode.
Store multiple replicas for data reliability.
c) Secondary NameNode
Despite its name, it is not a backup or standby for the NameNode.
Periodically merges the NameNode’s edit log into the file system image (checkpointing).
Keeps the edit log small, which reduces the burden on the NameNode and shortens recovery time.
3. How HDFS Stores a File
File → Split into Blocks → Stored on Multiple DataNodes
Process:
1. A client uploads a file.
2. The file is divided into fixed-size blocks (default: 128 MB).
3. Each block is replicated across nodes (default replication: 3).
4. NameNode stores metadata (block locations).
5. DataNodes store the actual data.
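The storage steps above can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS is implemented: the function plan_blocks and the node names are hypothetical, and replicas are placed round-robin for simplicity, whereas real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # default block size stated above
REPLICATION = 3       # default replication factor stated above


def plan_blocks(file_size_mb, datanodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]}, the kind of mapping the NameNode keeps as metadata.

    Assumption: simple round-robin placement, each replica on a different node.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(file_size_mb=500, datanodes=nodes)
for block, replicas in plan.items():
    print(f"block {block}: {replicas}")
```

A 500 MB file needs four blocks (three full 128 MB blocks plus one 116 MB block), and with three replicas the cluster stores about 1,500 MB in total. If any single node fails, every block still has two copies on other nodes.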
4. How HDFS Reads a File
Process:
1. Client requests file from NameNode.
2. NameNode returns locations of all blocks.
3. Client reads blocks directly from DataNodes.
4. HDFS ensures efficient and parallel reading.
5. HDFS Block Concept
Default block size: 128 MB (can be configured).
Large block size reduces overhead and improves throughput.
Each block is replicated for reliability.
6. HDFS Replication
Default replication factor: 3.
Ensures high availability.
HDFS automatically re-replicates blocks if failures occur.
7. Advantages of HDFS
Highly fault-tolerant
Scalable and flexible
Supports parallel processing (with MapReduce, Spark)
Optimized for large-scale data
Cost-effective storage
High data throughput
8. Limitations of HDFS
Not suitable for real-time data processing
Inefficient for storing many small files
Not designed for random updates (only append allowed)
Requires large cluster setup
9. Applications of HDFS
Big Data analytics
Machine learning workloads
Log processing
IoT data storage
Scientific research data
Enterprise data lakes
1.11 MAP REDUCE AND YARN
MapReduce and YARN are two core components of the Hadoop ecosystem.
MapReduce is a programming model used for processing large datasets in parallel.
YARN is the resource management layer that allocates cluster resources and manages
applications.
Together, they enable scalable, distributed processing of Big Data.
1. MAPREDUCE
Definition
MapReduce is a distributed data processing framework in Hadoop that processes large
datasets in parallel across multiple nodes. It breaks tasks into smaller subtasks: Map and
Reduce.
A) Components of MapReduce
1. Map Function
Takes input data and converts it into key–value pairs.
Performs filtering, transformation, and preprocessing (sorting is done later by the framework).
Output is intermediate data.
2. Shuffle & Sort
Automatically handled by the framework.
Groups intermediate key–value pairs based on keys.
Prepares data for reduction.
3. Reduce Function
Processes grouped data from the map phase.
Produces the final aggregated output.
Example: sum, count, average, join, etc.
B) How MapReduce Works (Simplified Flow)
Input Data
|
V
Map Phase ---> (Key-Value Pairs)
|
V
Shuffle & Sort
|
V
Reduce Phase
|
V
Final Output
C) Features of MapReduce
Highly scalable
Fault-tolerant
Parallel processing
Suitable for batch processing
Works seamlessly with HDFS
D) Limitations of MapReduce
Slow for real-time processing
Disk-based operations increase latency
Not ideal for iterative ML algorithms
2. YARN (Yet Another Resource Negotiator)
Definition
YARN is the cluster resource management layer of Hadoop.
It manages:
CPU and memory resources
Job scheduling
Job monitoring
YARN separates resource management from data processing, making Hadoop more flexible
and efficient.
A) Architecture of YARN
YARN has three key components:
1. ResourceManager (Master)
Allocates resources across the cluster
Schedules applications
Monitors node health
Works with NodeManagers
2. NodeManager (Slave)
Runs on each node
Manages containers
Reports resource availability to the ResourceManager
Executes tasks assigned by the ApplicationMaster
3. ApplicationMaster
One per application
Negotiates resources from ResourceManager
Coordinates tasks running on NodeManagers
Handles task failures
B) YARN Processing Flow
User Submits Application
|
V
ResourceManager
|
V
ApplicationMaster Launched
|
V
NodeManagers Launch Containers
|
V
Tasks Execute in Parallel
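The resource-granting step in this flow can be illustrated with a toy allocator. This is a sketch only: it assumes a simple first-fit policy on memory, whereas YARN's actual schedulers (such as the Capacity Scheduler and Fair Scheduler) are considerably more elaborate. The function allocate_containers and the node names are hypothetical.

```python
def allocate_containers(node_free_mb, requests_mb):
    """Place each container request on the first node with enough free memory.

    node_free_mb: {node_name: free memory in MB}, as reported by each NodeManager.
    requests_mb:  list of memory requests, as negotiated by an ApplicationMaster.
    Returns (assignments, pending, remaining_free).
    """
    free = dict(node_free_mb)   # copy so the caller's view is not modified
    assignments, pending = [], []
    for request_id, need in enumerate(requests_mb):
        for node, available in free.items():
            if available >= need:
                free[node] = available - need
                assignments.append((request_id, node))
                break
        else:
            pending.append(request_id)   # no node can host it right now
    return assignments, pending, free


assignments, pending, remaining = allocate_containers(
    {"node1": 4096, "node2": 2048},
    [2048, 2048, 1024, 2048],
)
print(assignments)  # [(0, 'node1'), (1, 'node1'), (2, 'node2')]
print(pending)      # [3]
print(remaining)    # {'node1': 0, 'node2': 1024}
```

The last request stays pending because no single node has 2048 MB free, which is why an application may have to wait for running containers to finish before all of its tasks can start.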
C) Advantages of YARN
Supports multiple processing engines (MapReduce, Spark, Tez, etc.)
Better resource utilization
Scalable and highly efficient
Handles diverse workloads
Improves cluster throughput
D) YARN vs MapReduce 1.0
Feature                  | MapReduce v1          | YARN
Resource management      | Done by JobTracker    | Done by ResourceManager
Scalability              | Limited               | Very high
Supports only MapReduce? | Yes                   | No (Spark, Tez, ML frameworks)
Job failures             | Handled by JobTracker | Handled by ApplicationMaster
3. Relationship Between MapReduce and YARN
From Hadoop 2.x onward, MapReduce runs on top of YARN.
YARN provides the resources, while MapReduce performs the actual data processing.
This separation increases flexibility and efficiency.
4. Applications of MapReduce and YARN
Large-scale log analysis
Batch processing
Machine learning (large datasets)
Search indexing
Fraud detection
ETL processing
Social media analytics
1.12 MAP REDUCE PROGRAMMING MODEL
The MapReduce programming model is a core component of the Hadoop ecosystem, designed
to process and analyze large-scale data in a distributed and parallel manner. It was originally
developed by Google and later implemented in open-source through Hadoop. MapReduce
simplifies large data processing by dividing tasks into smaller sub-tasks that run in parallel
across a cluster of machines. It provides a scalable, fault-tolerant, and high-performance
framework for batch processing massive datasets.
1. Introduction to MapReduce
MapReduce follows a divide-and-conquer strategy. Instead of processing huge datasets on a
single machine, the data is split and distributed across multiple nodes in a Hadoop cluster. The
program logic is expressed in terms of two fundamental functions: Map() and Reduce().
The Map function processes input data and transforms it into key/value pairs.
The Reduce function aggregates or summarizes the mapped data.
This model abstracts the complexities of parallelization, data distribution, synchronization, and
fault tolerance, allowing developers to focus only on business logic.
2. Phases of the MapReduce Programming Model
The MapReduce workflow consists of several well-defined phases:
(a) Input Splitting
Input data stored in HDFS is divided into input splits, usually equal to HDFS block size
(128 MB or 256 MB).
Each split is processed by a separate mapper.
This ensures parallelism.
(b) Mapping Phase
The Map function takes input in the form of key-value pairs.
It applies user-defined logic to generate intermediate key-value pairs.
Example: counting words → (word, 1)
(c) Shuffling and Sorting
The output of the map phase is automatically shuffled, meaning keys are grouped
together.
Hadoop ensures that:
o All values related to a specific key are sent to the same reducer.
o Data is sorted by keys before reaching the reducers.
(d) Reducing Phase
The Reduce function receives a key and a list of values.
It combines these values (e.g., sum, average, max, min).
Final output key-value pairs are generated.
Example: (word, total_count)
(e) Output Phase
The reducer output is written to HDFS.
Output is stored in multiple files (one per reducer).
3. Key Components
1. Mapper
User-defined function.
Converts input into intermediate key-value pairs.
Runs in parallel across multiple nodes.
2. Reducer
Aggregates values for each key.
Produces final output.
3. Combiner (Optional)
Mini-reducer running on mapper nodes.
Reduces data transfer during shuffle.
Used for optimization (e.g., summation).
4. Partitioner
Decides which reducer processes which key.
Ensures balanced workload distribution.
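The roles of the Combiner and the Partitioner can be sketched in a few lines. This is an illustration rather than Hadoop's implementation: Hadoop's default partitioner hashes the key and takes the result modulo the number of reducers, and the sketch mimics that idea with CRC32 as a stand-in hash (Python's built-in hash() for strings changes between runs). The function names are hypothetical.

```python
import zlib
from collections import Counter


def combine(mapper_output):
    """Combiner: pre-aggregate (word, count) pairs on the mapper side before the shuffle."""
    counts = Counter()
    for word, n in mapper_output:
        counts[word] += n
    return sorted(counts.items())


def partition(key, num_reducers):
    """Partitioner: map a key to a reducer index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("big", 1), ("data", 1), ("is", 1), ("big", 1)]
combined = combine(mapper_output)   # 4 pairs shrink to 3 before leaving the mapper
print(combined)

for word, count in combined:
    print(word, count, "-> reducer", partition(word, num_reducers=3))
```

A combiner is only safe when the reduce operation can be applied to partial results, as with sums or counts. Because the partitioner is deterministic, every occurrence of a key goes to the same reducer no matter which mapper produced it.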
4. Advantages of MapReduce
1. Scalability
Can scale to thousands of nodes.
Suitable for petabyte-level data.
2. Fault Tolerance
If a node fails:
o Tasks automatically rerun on another node.
Data replication in HDFS ensures reliability.
3. Parallel Processing
Splits work across many machines.
Achieves high-speed processing.
4. Simplified Programming Model
Developers only write Map and Reduce functions.
Hadoop manages cluster complexities.
5. Cost-Effective
Works on commodity hardware.
Supports distributed storage and computation.
5. Use Cases of MapReduce
Log analysis
Word count and text processing
Clickstream analysis
Machine learning preprocessing
Large-scale indexing (an early use at Google for building its search index)
Recommendation systems
ETL operations in big data pipelines
6. Example: Word Count Using MapReduce
Map Phase Output
Input line: "Big data is big"
Mapper outputs:
(big, 1)
(data, 1)
(is, 1)
(big, 1)
Shuffle and Sort
(big, [1, 1])
(data, [1])
(is, [1])
Reduce Phase
(big, 2)
(data, 1)
(is, 1)
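The same word count can be traced in plain Python. This is a single-process sketch of the three phases, not Hadoop code: the function names are hypothetical, and the mapper is assumed to lowercase each word, which is what lets "Big" and "big" share a key in the example above.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in the line (lowercased)."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: sum the list of values for each key."""
    return [(key, sum(values)) for key, values in grouped]


lines = ["Big data is big"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
result = reduce_phase(grouped)

print(mapped)   # [('big', 1), ('data', 1), ('is', 1), ('big', 1)]
print(grouped)  # [('big', [1, 1]), ('data', [1]), ('is', [1])]
print(result)   # [('big', 2), ('data', 1), ('is', 1)]
```

In a real cluster, map_phase would run on many input splits in parallel and each reducer would receive only the keys assigned to it, but the data flowing between the phases has the same shape.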