Understanding Big Data: Key Concepts & Practices

R programming unit 1

Uploaded by

gvsena89
Copyright
© All Rights Reserved

UNIT-1

WHAT IS BIG DATA?

Big Data refers to collections of datasets that are too large, too fast-moving, or too complex
for traditional data-processing tools to store and process efficiently.

Key Features (5 V’s of Big Data)

1. Volume – Huge amount of data (TB, PB).
2. Velocity – Data generated very fast (real-time streaming).
3. Variety – Many types: text, images, videos, logs, sensor data, etc.
4. Veracity – Data may be messy, incomplete, or inconsistent.
5. Value – Useful insights that can be extracted from the data.

Examples of Big Data

 Social media data from Facebook/Instagram/Twitter
 Data from IoT devices like sensors, cameras
 Online shopping data from Amazon/Flipkart
 Banking transaction logs
 Health records

Why Is Big Data Important?

 Helps companies make better decisions
 Predicts customer behavior
 Detects fraud
 Improves business performance

1.1 EVOLUTION OF BIG DATA

The concept of Big Data has evolved over several decades as data generation, storage, and
processing technologies have advanced. Its evolution can be divided into major phases:

1. Pre-Big Data Era (1960s – 1990s)

Main Features:

 Data stored mainly in files, databases, and data warehouses.
 Limited storage capacity and slow processing.
 Structured data only (tables, numbers, records).

Key Technologies:

 Relational Databases (RDBMS)
 SQL
 Data Warehousing (1980s)

2. Early Big Data Concepts (2000 – 2005)

Main Features:

 Rapid growth of the internet, social media, and online systems.
 More unstructured data (images, videos, logs).
 Traditional databases became insufficient.

Major Milestone:

 In 2001, analyst Doug Laney introduced the 3 Vs model of Big Data:
o Volume – huge amount of data
o Velocity – speed of data generation
o Variety – different data formats

3. Big Data Era (2006 – 2015)

What Changed?

 Explosion of data due to smartphones, IoT, cloud storage, e-commerce, and social
networks.
 Need for distributed processing.

Technological Breakthrough:

 Google’s MapReduce (paper published in 2004) → distributed data processing model.
 Hadoop (2006) → open-source big data framework.
 NoSQL databases → MongoDB, Cassandra for unstructured data.

Important Characteristics Added:

 Veracity – trustworthiness of data
 Value – turning data into insights

1.2 BEST PRACTICES FOR BIG DATA ANALYTICS:

1. Define Clear Business Objectives

 Start with a specific problem or goal (e.g., fraud detection, customer segmentation).
 Avoid collecting data without purpose.

2. Collect Only Relevant Data

 Focus on quality, not just quantity.
 Avoid “data overload” by filtering unnecessary data.
 Ensure data comes from trustworthy sources.

3. Ensure Data Quality

 Remove duplicate, incomplete, or inconsistent data.
 Use data cleansing, validation, and preprocessing techniques.
 High-quality data improves accuracy of analytics.

4. Use Scalable Storage Solutions

 Choose storage that grows with your data:
o Cloud storage (AWS S3, Azure Blob)
o Hadoop HDFS
o Data lakes
 Must support structured, semi-structured, and unstructured data.

5. Implement Strong Data Security & Privacy

 Encrypt data at rest and in transit.
 Use authentication + authorization controls.
 Follow regulations like GDPR or local privacy laws.

6. Choose the Right Tools & Technologies

 Use proper tools for your workload:
o Batch processing → Hadoop, Spark
o Real-time processing → Kafka, Flink
o Database type → NoSQL, columnar, graph DBs
 Consider cost, performance, and ease of integration.

7. Use Distributed Processing

 Break large tasks into smaller chunks to process in parallel.
 Tools: Spark, MapReduce, Flink.
 Ensures high performance and faster results.
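
The split, process, and combine idea behind distributed processing can be sketched on a single machine. The example below is only an illustration: the function names are hypothetical, and a thread pool stands in for the cluster of machines that Spark, MapReduce, or Flink would actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def chunk_total(chunk):
    """Work done independently on one chunk (here: a simple sum)."""
    return sum(chunk)


def parallel_total(items, chunk_size=1000, workers=4):
    """Sum `items` by processing chunks concurrently, then combining the partial results."""
    chunks = list(split_into_chunks(items, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(chunk_total, chunks))
    return sum(partial_totals)


if __name__ == "__main__":
    data = list(range(10_000))
    print(parallel_total(data) == sum(data))
```

The important design point is that each chunk is processed without needing any other chunk, so the work can be spread over as many workers as are available.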

8. Build a Skilled Data Team

 Data engineers (storage, pipelines)
 Data scientists (models, insights)
 Analysts (reports, decision-making)
 IT/security team

A combination of skills ensures proper handling of big data.

9. Use Automation & Machine Learning

 Automate:
o Data cleaning
o Feature extraction
o Model tuning
 Improves efficiency and reduces human errors.

10. Monitor Performance Continuously

 Track:
o Data processing speed
o System load
o Resource usage
o Model accuracy
 Use dashboards and log monitoring.

11. Maintain Data Governance

 Define rules for:
o Who owns the data
o Who has access
o How long data is stored
o Data lifecycle management
 Ensures compliance and accountability.

1.3 VALIDATING

What is validating?

Validating means checking whether the data is correct, accurate, and useful before using it
for analysis.

It ensures that the data:

 Follows the required format
 Has no errors
 Is complete
 Is reliable
 Matches expected rules or conditions

Examples of Data Validation

1. Checking missing values
o Ensuring no important fields are empty.
2. Checking data types
o Age should be numeric, name should be string, etc.
3. Range validation
o Marks must be between 0–100.
4. Format validation
o Email must contain “@”
o Date should be in “YYYY-MM-DD” format.
5. Removing duplicates
o No repeated records.
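
The five checks above can be sketched in a few lines of Python. This is a minimal illustration, not a complete validator: the record layout (name, age, marks, email, joined) and the function names are assumptions made for the example, and the email rule is deliberately as loose as the one stated above.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = ("name", "age", "marks", "email", "joined")  # hypothetical schema
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+$")  # exactly one "@", no spaces


def validate_record(record):
    """Return a list of problems found in one record (an empty list means valid)."""
    # 1. Missing values
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if record.get(field) in (None, "")]
    if problems:
        return problems  # skip the other checks on incomplete records
    # 2. Data types
    if not isinstance(record["age"], int):
        problems.append("age must be an integer")
    # 3. Range validation
    marks = record["marks"]
    if not isinstance(marks, (int, float)) or not 0 <= marks <= 100:
        problems.append("marks must be a number between 0 and 100")
    # 4. Format validation
    if not EMAIL_PATTERN.match(str(record["email"])):
        problems.append("email must contain a single '@'")
    try:
        datetime.strptime(str(record["joined"]), "%Y-%m-%d")
    except ValueError:
        problems.append("joined must be in YYYY-MM-DD format")
    return problems


def drop_duplicates(records):
    """5. Removing duplicates: keep the first copy of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

A typical use is to run `drop_duplicates` first and then keep only the records for which `validate_record` returns an empty list.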

Why Is Validation Important?

 Ensures quality of data
 Prevents wrong results in analysis
 Helps maintain accuracy, reliability, and trustworthiness

1.4 THE PROMOTION OF THE VALUE OF BIG DATA

Big Data refers to extremely large and complex datasets generated from sources such as social
media, sensors, mobile devices, transactions, and IoT systems. Promoting the value of Big Data
means creating awareness of how these datasets can improve decision-making, efficiency,
innovation, and business growth in different sectors. It highlights the need for organizations to
adopt data-driven strategies and use analytics tools to gain insights.

1. Enhances Decision-Making

Big Data analytics helps organizations make accurate and real-time decisions. By analyzing
customer behavior, market trends, and operational data, businesses reduce guesswork and rely on
evidence-based decisions. This leads to improved outcomes and increased efficiency.

2. Improves Customer Experience

Big Data enables companies to understand customer preferences, buying patterns, and feedback.
This helps in creating personalized products, targeted advertisements, and better customer
service. Companies like Amazon and Netflix use Big Data to provide personalized
recommendations.

3. Drives Innovation

Organizations can use Big Data to identify new market opportunities, improve existing products,
and design innovative solutions. It exposes hidden patterns and correlations that can help in
developing smarter products and services.

4. Increases Operational Efficiency

Big Data tools help automate processes, optimize resource usage, and reduce operational costs.
Real-time monitoring of systems allows organizations to detect problems early and improve
productivity.

5. Supports Predictive Analytics

One of the major values of Big Data is the ability to predict future trends using machine learning
and statistical models. Industries like finance, healthcare, and retail use predictive analytics for
fraud detection, demand forecasting, and disease prediction.

6. Strengthens Risk Management

Big Data helps organizations identify risks related to finance, cybersecurity, and operations. By
analyzing historical data and anomalies, companies can implement preventive measures to avoid
losses and ensure safety.

7. Competitive Advantage

Organizations that adopt Big Data can gain an edge over competitors: they understand market
changes sooner, adapt quickly, and offer better products and services. Data-driven companies
therefore tend to grow faster than companies that rely on intuition alone.

8. Promotes Data-Driven Culture

Promoting the value of Big Data encourages companies to develop a culture where decisions are
based on data instead of intuition. Employees are trained in data literacy, analytics tools, and
modern technologies, making organizations more future-ready.

9. Encourages Investment in Technology

To unlock Big Data’s value, organizations are motivated to invest in cloud computing, data
warehouses, AI, and analytics platforms. These technologies help store, process, and analyze
massive data efficiently.

10. Benefits Government and Society

Big Data is useful not only in business but also in public services. Governments use Big Data for
smart cities, disaster management, public health monitoring, and transportation planning. It
improves governance and societal well-being.

1.5 BIG DATA USE CASES:

Big Data is used across many industries to solve complex problems, improve operations, and
make better decisions. Below are the most common and impactful use cases:

1. Healthcare

 Big Data helps doctors analyze patient history, medical records, and real-time monitoring
data.
 Used for:
o Disease prediction
o Early diagnosis
o Tracking pandemics
o Personalized treatment
 Example: Predicting heart attacks using sensor data.

2. Banking & Finance

 Big Data is widely used to detect fraudulent transactions.
 Applications include:
o Credit risk analysis
o Algorithmic trading
o Customer segmentation
o Insurance claim analysis
 Example: Banks analyze real-time transactions to stop fraud instantly.

3. Retail & E-commerce

 Helps understand customer buying behavior.
 Used for:
o Recommendation systems (Amazon, Flipkart)
o Dynamic pricing
o Inventory management
o Personalized marketing
 Example: Netflix uses Big Data to recommend movies.

4. Social Media Analytics

 Platforms like Facebook, Twitter, and Instagram generate massive data.
 Big Data helps in:
o Sentiment analysis
o Trend prediction
o Advertisement targeting
 Example: Analysing tweets during elections.

5. Transportation & Logistics

 Big Data improves route planning and reduces travel time.
 Used for:
o Traffic prediction
o Fleet management
o Vehicle tracking
o Supply chain optimization
 Example: Uber uses Big Data for surge pricing and route optimization.

6. Government & Public Sector

 Enhances governance and public safety.
 Applications:
o Smart city planning
o Crime prediction
o Disaster management
o Pollution monitoring
 Example: CCTV data and sensors used for real-time crime detection.

7. Manufacturing & Industry 4.0

 Used for:
o Predictive maintenance of machines
o Quality control
o Automation
o Production optimization
 Example: Sensors detect machine failures before they happen.

8. Education

 Big Data helps analyze student performance and learning patterns.
 Applications:
o Personalized learning
o Attendance monitoring
o Online learning analytics
 Example: Tracking student progress in digital platforms.

9. Telecommunications

 Used for:
o Network optimization
o Customer churn prediction
o Fraud detection
 Example: Telecom companies analyze call data to predict network failures.

10. Energy & Utilities

 Used in:
o Smart meters
o Power consumption forecasting
o Renewable energy management
 Example: Smart grids use Big Data to balance electricity supply.

1.6 CHARACTERISTICS OF BIG DATA APPLICATIONS

 Big Data applications deal with extremely large, complex, and fast-growing datasets.
These applications are typically described using the 5 V’s of Big Data, expanded to 7 V’s
in modern systems.

1. Volume

 Refers to the huge amount of data generated from various sources such as social media,
sensors, IoT devices, logs, and transactions.
 Data ranges from terabytes to petabytes or even exabytes.
 Big Data applications must store, process, and analyze massive datasets using distributed
systems (e.g., Hadoop, Spark).

2. Velocity

 Represents the speed at which data is generated, collected, and processed.
 Real-time or near–real-time processing is required in applications like stock markets,
fraud detection, and IoT monitoring.
 Stream processing tools (Kafka, Flink, Spark Streaming) handle this high-speed data
flow.

3. Variety

 Data comes in multiple forms, not just structured tables.
 Includes:
o Structured: RDBMS tables
o Semi-structured: JSON, XML
o Unstructured: text, audio, video, images, logs
 Big Data applications must integrate and analyze all types of formats.

4. Veracity

 Refers to the quality, accuracy, and reliability of data.
 Big Data often contains inconsistencies, missing values, noise, or duplication.
 Data cleaning, preprocessing, and validation are crucial.

5. Value

 The most important characteristic.
 Big Data must produce meaningful insights, predictions, or decisions that add
business/organizational value.
 Examples: customer insights, fraud detection, recommendation systems.

6. Variability

 Data meaning and structure may change frequently.
 For example, seasonal trends, unpredictable patterns in social media sentiment.
 Big Data applications must adapt to changing data patterns.

7. Visualization

 Extracted insights must be presented clearly using dashboards, graphs, and charts.
 Tools: Tableau, Power BI, [Link].
 Helps decision makers understand complex patterns easily.

1.7 PERCEPTION AND QUANTIFICATION OF VALUE

Big Data has become a strategic asset for modern organizations. However, the true significance
of Big Data lies not only in collecting large volumes of data but also in the value the data
creates. The value dimension of Big Data is understood through two important aspects:
Perception of Value and Quantification of Value. Together, they help organizations recognize
potential benefits and measure the actual impact of Big Data initiatives.

1. Perception of Value

Perception of value refers to the awareness and understanding of the potential benefits that
Big Data can provide within an organization. It answers the question: “What value can Big
Data bring to our business?” This is a subjective assessment and varies from one organization to
another.

a) Role of Business Goals

Perception depends heavily on the organization’s strategic goals.

 Retail industries may perceive value in analyzing customer buying trends.
 Banking sectors may perceive value in detecting fraud and improving risk management.
 Healthcare may perceive value in predicting diseases and improving patient care.

b) Stakeholder Expectations

Different stakeholders view value differently:

 Management expects higher profits.
 Operations teams expect improved efficiency.
 Marketing teams expect better customer targeting.

c) Data Sources and Patterns

Organizations must understand the nature and variety of data they collect.
Perceived value increases when data can uncover hidden patterns, trends, or insights.

d) Industry Context

Different industries have different perceptions of value based on competition, regulations, and
market demands.
For example, in telecommunications, value is seen in reducing customer churn, while in IoT
systems, value lies in real-time monitoring and predictive maintenance.

e) Tangible and Intangible Value

Perception includes both:

 Tangible value: revenue growth, cost reduction, resource optimization.
 Intangible value: customer satisfaction, improved decision-making, better reputation,
innovation.

Thus, perception of value forms the foundational understanding of why Big Data initiatives
should be undertaken.

2. Quantification of Value

Quantification of value refers to measuring the actual benefits achieved through Big Data
applications. It answers the question: “How much value have we gained from Big Data?” Unlike
perception, quantification relies on numerical metrics and verifiable data.

a) Financial Metrics

These involve direct monetary benefits:

 Return on Investment (ROI) from analytics solutions
 Increased revenue through improved targeting and personalization
 Cost savings through process optimization
 Reduction in operational expenses
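
As a small worked example of a financial metric, ROI is commonly computed as (benefit − cost) / cost. The sketch below uses that standard formula; the function name and the figures are made up for illustration and do not come from any real project.

```python
def roi_percent(total_benefit, total_cost):
    """Return on investment as a percentage: (benefit - cost) / cost * 100."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost * 100


if __name__ == "__main__":
    # Hypothetical figures: an analytics project costing 100,000 that yields
    # 150,000 in added revenue and cost savings.
    print(roi_percent(150_000, 100_000))  # 50.0
```

A negative result means the initiative has so far cost more than it has returned.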

b) Operational Metrics

Operational impact is measured through improvements in internal processes:

 Faster data processing and reduced latency
 Increased automation and efficiency
 Reduction in system downtime
 Improved productivity and resource utilization

c) Customer-Related Metrics

Big Data applications significantly improve customer experience:

 Increased customer satisfaction (CSAT)
 Higher Net Promoter Score (NPS)
 Increased customer retention
 Reduced churn rate

These measurements show how well organizations respond to customer needs using
insights from data.

d) Risk and Security Metrics

Especially in finance, IoT, and cybersecurity sectors:

 Reduction in fraud incidents
 Improved threat detection and response time
 Lower probability of security breaches
 Better compliance with regulations

Quantifying risk reduction helps organizations evaluate the effectiveness of Big Data in
safeguarding systems.

e) Analytical and Predictive Accuracy

Big Data value can also be measured through:

 Model accuracy (e.g., precision, recall, RMSE)
 Prediction performance
 Insight quality

High analytical accuracy reflects high value in data-driven decision-making.
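
The accuracy measures named above have simple definitions, sketched here in plain Python. The function names are illustrative; labels are assumed to be 1 for the positive class (for example, "fraud") and 0 otherwise.

```python
import math


def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Precision: of everything flagged positive, how much was really positive?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything really positive, how much was flagged?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Precision and recall suit classification tasks such as fraud detection, while RMSE suits numeric predictions such as demand forecasts.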

3. Importance of Linking Perception and Quantification

For Big Data projects to be successful, organizations must align what they expect (perception)
with what they achieve (quantification). When both are connected:

 Resource allocation becomes effective
 Business outcomes become measurable
 Decision-making becomes more transparent
 Investments in Big Data technologies become justified

1.8 UNDERSTANDING BIG DATA STORAGE

Big Data Storage refers to the systems, architectures, and technologies used to store extremely
large, diverse, and fast-growing datasets. Traditional storage systems (such as a single-server
RDBMS) are not designed to handle the Volume, Variety, and Velocity of Big Data. Therefore,
specialized distributed storage solutions are required to ensure scalability, reliability, and
high-speed access.

1. Need for Big Data Storage

Big Data storage solutions must handle:

 Massive data size (TB, PB, EB)
 Unstructured formats (text, images, logs, IoT data)
 High throughput for real-time applications
 Scalable expansion as data grows
 Fault tolerance in case of hardware failures

Because of these needs, traditional single-server storage fails, leading to distributed storage.

2. Key Characteristics of Big Data Storage

a) Distributed Architecture

Data is stored across multiple machines (nodes). Even if one node fails, data remains accessible
because copies are stored elsewhere.

b) Scalability

Storage systems can grow by simply adding more nodes. This horizontal scaling supports huge
data growth.

c) Fault Tolerance

Big Data storage automatically replicates data across multiple nodes. This protects against data
loss when individual hardware components fail.

d) High Throughput

Systems are optimized for rapid data ingestion and retrieval (e.g., sensor streams, logs,
real-time analytics).

e) Flexible Data Models

Can store:

 Structured data
 Semi-structured data (JSON, XML)
 Unstructured data (videos, text, images)

3. Types of Big Data Storage Technologies

a) Distributed File Systems (DFS)

The most important example is:

 HDFS (Hadoop Distributed File System)
o Stores very large files across multiple nodes
o Provides replication and fault tolerance
o Used in Hadoop ecosystem (MapReduce, Hive, Pig)

b) NoSQL Databases

Designed for high scalability and flexible schemas. Types include:

 Key-Value Stores: Redis, DynamoDB
 Wide-Column (Column-Family) Stores: HBase, Cassandra
 Document Stores: MongoDB, CouchDB
 Graph Databases: Neo4j

These databases efficiently store unstructured and semi-structured data.

c) Object Storage

Used for large file storage in cloud and enterprise systems. Examples:

 Amazon S3
 Google Cloud Storage
 Azure Blob Storage

Object storage uses buckets and objects instead of a traditional file/folder structure.

d) Data Lakes

A central repository that stores:

 Raw data
 Processed data
 Structured and unstructured data
Useful for machine learning, IoT, and analytics workloads.

e) Cloud Big Data Storage

Cloud providers offer fully managed storage with:

 High availability
 Automatic scaling
 Global access
Examples: AWS, Azure, Google Cloud.

4. How Big Data Storage Works (Simplified Flow)

+-----------------------------------------+
| Data Sources (IoT, Web,                 |
| Social Media, Logs)                     |
+-----------------------------------------+
                     |
                     V
+-----------------------------------------+
| Data Ingestion                          |
| (Kafka, Flume)                          |
+-----------------------------------------+
                     |
                     V
+-----------------------------------------+
| Distributed Storage (HDFS / NoSQL /     |
| Object Storage / Data Lake)             |
+-----------------------------------------+
                     |
                     V
+-----------------------------------------+
| Processing Layer                        |
| (Spark, MapReduce)                      |
+-----------------------------------------+
                     |
                     V
+-----------------------------------------+
| Analytics / ML                          |
+-----------------------------------------+

5. Benefits of Big Data Storage

 Handles massive data efficiently
 Supports real-time and batch processing
 Offers high scalability
 Ensures high availability and durability
 Reduces cost using commodity hardware
 Works well for diverse data sources

1.9 A GENERAL OVERVIEW OF HIGH PERFORMANCE ARCHITECTURE

High Performance Architecture (HPA) refers to the design and organization of computing
systems that deliver extremely high processing speed, large-scale data handling capability, and
maximum system efficiency. These architectures are used in applications requiring rapid
computation such as scientific simulations, big data analytics, machine learning, weather
forecasting, genomic analysis, and enterprise-level transaction processing.
High Performance Architectures focus on parallelism, scalability, throughput, and optimized
resource utilization to perform billions of operations per second.

1. Key Goals of High Performance Architecture

a) High Computational Speed

The primary goal is to execute large and complex computations in minimal time.

b) Scalability

Systems must support growth in data size and computation by adding more processors, memory,
and nodes.

c) Efficient Resource Utilization

Efficient use of CPU, memory, storage, and network resources ensures maximum throughput.

d) High Availability

HPA systems are designed to stay reliable and keep performing even when hardware components fail.

2. Components of High Performance Architecture

a) Multi-core and Many-core Processors

Modern HPA uses CPUs and GPUs with multiple cores.

 CPUs: Best for general-purpose tasks
 GPUs: Best for parallel computations (AI, ML, simulations)

b) Parallel Processing Units

High performance systems depend on:

 Shared-memory multiprocessing
 Distributed-memory multiprocessing
 Hybrid parallel systems
This allows them to break a large task into many smaller tasks executed simultaneously.

c) High-Speed Memory Subsystems

HPA uses:

 DDR5 RAM (and GDDR6 graphics memory on GPUs)
 High-bandwidth memory (HBM)
 Large cache hierarchy
 NUMA-based memory organization
These reduce latency and speed up data access.

d) High-Speed Networking

Nodes are interconnected using high-performance networks such as:

 InfiniBand
 NVLink (mainly for GPU-to-GPU links)
 High-speed Ethernet
These support fast data transfer between distributed nodes.

e) High Performance Storage

Storage is optimized using:

 NVMe SSDs
 Parallel File Systems (e.g., Lustre, GPFS)
 Distributed Storage Systems
This boosts I/O throughput for big data applications.

3. Types of High Performance Architecture

a) High Performance Computing (HPC) Systems

Supercomputers and clusters used for scientific simulations, weather models, physics research,
etc.

b) High Throughput Computing (HTC) Systems

Designed to process large volumes of independent tasks (e.g., grid computing, batch processing).

c) High Performance Data Analytics (HPDA)

Big Data and analytics systems like Hadoop, Spark, Flink.

d) Cloud-based HPA

Cloud providers offer scalable, on-demand, high performance architectures using:

 AWS HPC
 Azure CycleCloud
 Google Cloud HPC

4. Architectural Models in High Performance Systems

a) Shared Memory Architecture

 All processors share a common memory.
 Simple to program but limited scalability.

b) Distributed Memory Architecture

 Each processor has its own private memory.
 Highly scalable and used in HPC clusters.

c) Hybrid Architecture

 Combines both shared and distributed memory.
 Used in modern supercomputers.

d) Cluster Architecture

 Multiple interconnected computers work as a single system.
 Provides cost-efficient high performance.

e) GPU & Accelerator-based Architectures

 GPUs, TPUs, and FPGAs accelerate parallel workloads.
 Widely used in AI/ML applications.

5. Key Features of High Performance Architecture

 Parallelism: Exploiting concurrency at hardware, process, and data levels.
 Low Latency and High Bandwidth in communication.
 Fault Tolerance: Ability to recover from failures without data loss.
 Load Balancing: Distributing tasks evenly across processors.
 Energy Efficiency: Optimizing performance per watt.

6. Applications of High Performance Architecture

 Scientific research and simulations
 Artificial Intelligence & Machine Learning
 Big Data analytics
 Internet of Things (IoT) monitoring
 Weather forecasting
 Financial modeling (stock markets)
 Engineering design (CAD, automotive, aerospace)
 Bioinformatics (gene sequencing, protein folding)

1.10 HDFS – HADOOP DISTRIBUTED FILE SYSTEM

 HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
to store and manage Big Data in a distributed, scalable, and fault-tolerant manner. It is
designed to run on clusters of commodity hardware and handle very large datasets that
cannot be stored on a single machine.
 HDFS follows a master–slave architecture and allows high-throughput, reliable access
to data.

1. Key Features of HDFS

a) Distributed Storage

Data is split into blocks and stored across many machines (nodes) in the cluster.

b) Fault Tolerance

Each block is stored in multiple copies (replication). If one node fails, the data is still
available from the other nodes.

c) Scalability

HDFS allows horizontal scaling by adding more nodes to the cluster.

d) High Throughput

Optimized for write once, read many workloads — ideal for Big Data analytics.

e) Supports Large Files

HDFS can store files of hundreds of GB or TB in size.

f) Cost-Effective

Runs on low-cost commodity hardware instead of expensive enterprise servers.

2. HDFS Architecture

HDFS uses a Master–Slave architecture, consisting of:

a) NameNode (Master)

 The main controller of HDFS.
 Manages the file system metadata:
o File names
o Directory structure
o Block locations
o Replication information
 Does not store actual data, only metadata.
 Must be highly available.

b) DataNodes (Slaves)

 Store actual data blocks.
 Perform read/write operations requested by clients.
 Send periodic heartbeat messages to the NameNode.
 Store multiple replicas for data reliability.

c) Secondary NameNode

 Misleading name: it is not a backup or standby for the NameNode.
 Periodically merges the NameNode's edit log into the file system image (checkpointing).
 Keeps the edit log small, which reduces the burden on the NameNode and improves recovery time.

3. How HDFS Stores a File

File → Split into Blocks → Stored on Multiple DataNodes

Process:

1. A client uploads a file.
2. The file is divided into fixed-size blocks (default: 128 MB).
3. Each block is replicated across nodes (default replication: 3).
4. NameNode stores metadata (block locations).
5. DataNodes store the actual data.
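
The arithmetic behind steps 2 and 3 can be checked with a short calculation. The sketch below only uses the defaults quoted above (128 MB blocks, replication factor 3); the function name is hypothetical and it does not talk to a real cluster.

```python
import math

BLOCK_SIZE_MB = 128      # default block size quoted above
REPLICATION_FACTOR = 3   # default replication factor quoted above


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Estimate how one file is split into blocks and how much raw storage it uses."""
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only holds the leftover data, not a full block.
    last_block_mb = file_size_mb - (blocks - 1) * block_size_mb
    return {
        "blocks": blocks,
        "last_block_mb": last_block_mb,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    # A 1000 MB file: 7 full blocks of 128 MB plus one block of 104 MB.
    print(hdfs_footprint(1000))
```

For the 1000 MB example this gives 8 blocks, 24 block replicas, and about 3000 MB of raw storage across the cluster.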

4. How HDFS Reads a File

Process:

1. Client requests file from NameNode.
2. NameNode returns locations of all blocks.
3. Client reads blocks directly from DataNodes.
4. HDFS ensures efficient and parallel reading.

5. HDFS Block Concept

 Default block size: 128 MB (can be configured).
 Large block size reduces overhead and improves throughput.
 Each block is replicated for reliability.

6. HDFS Replication

 Default replication factor: 3.
 Ensures high availability.
 HDFS automatically re-replicates blocks if failures occur.
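
Re-replication can be pictured with a toy model of the NameNode's block map. This is a simplified sketch under stated assumptions: the node and block names are invented, and new replicas simply go to the first available node, whereas real HDFS also weighs factors such as rack placement.

```python
def handle_node_failure(block_locations, live_nodes, failed_node, replication=3):
    """Return a new block map after `failed_node` is lost and blocks are re-replicated.

    block_locations: dict mapping block id -> set of node names holding a replica.
    live_nodes: list of node names that were alive before the failure.
    """
    survivors = [node for node in live_nodes if node != failed_node]
    repaired = {}
    for block_id, holders in block_locations.items():
        current = set(holders) - {failed_node}
        # Candidate targets: surviving nodes that do not already hold this block.
        candidates = [node for node in survivors if node not in current]
        while len(current) < replication and candidates:
            current.add(candidates.pop(0))
        repaired[block_id] = current
    return repaired


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
    print(handle_node_failure(blocks, nodes, failed_node="dn2"))
```

After `dn2` fails, both blocks are copied to a node that did not yet hold them, so each block is back at three replicas.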

7. Advantages of HDFS

 Highly fault-tolerant
 Scalable and flexible
 Supports parallel processing (with MapReduce, Spark)
 Optimized for large-scale data
 Cost-effective storage
 High data throughput

8. Limitations of HDFS

 Not suitable for real-time data processing
 Inefficient for storing many small files
 Not designed for random updates (only append allowed)
 Requires large cluster setup

9. Applications of HDFS

 Big Data analytics
 Machine learning workloads
 Log processing
 IoT data storage
 Scientific research data
 Enterprise data lakes

1.11 MAP REDUCE AND YARN

MapReduce and YARN are two core components of the Hadoop ecosystem.

 MapReduce is a programming model used for processing large datasets in parallel.
 YARN is the resource management layer that allocates cluster resources and manages
applications.

Together, they enable scalable, distributed processing of Big Data.

1. MAPREDUCE

Definition

MapReduce is a distributed data processing framework in Hadoop that processes large
datasets in parallel across multiple nodes. It breaks tasks into smaller subtasks: Map and
Reduce.

A) Components of MapReduce

1. Map Function

 Takes input data and converts it into key–value pairs.
 Performs filtering, transformation, and preprocessing (sorting is done later by the framework).
 Output is intermediate data.

2. Shuffle & Sort

 Automatically handled by the framework.
 Groups intermediate key–value pairs based on keys.
 Prepares data for reduction.

3. Reduce Function

 Processes grouped data from the map phase.
 Produces the final aggregated output.
 Example: sum, count, average, join, etc.

B) How MapReduce Works (Simplified Flow)

Input Data
    |
    V
Map Phase ---> (Key-Value Pairs)
    |
    V
Shuffle & Sort
    |
    V
Reduce Phase
    |
    V
Final Output
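
The flow above can be imitated in a few lines of plain Python using the classic word-count example. This is a single-machine sketch with illustrative function names, not Hadoop's actual API; in a real job the map and reduce calls would run on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & sort: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values of one key (here: add up the counts)."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle & sort, and reduce over a list of text lines."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle_and_sort(intermediate))


if __name__ == "__main__":
    print(word_count(["big data is big", "data is value"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines.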

C) Features of MapReduce

 Highly scalable
 Fault-tolerant
 Parallel processing
 Suitable for batch processing
 Works seamlessly with HDFS

D) Limitations of MapReduce

 Slow for real-time processing
 Disk-based operations increase latency
 Not ideal for iterative ML algorithms

2. YARN (Yet Another Resource Negotiator)

Definition

YARN is the cluster resource management layer of Hadoop. It manages:

 CPU and memory resources
 Job scheduling
 Job monitoring
YARN separates resource management from data processing, making Hadoop more flexible
and efficient.

A) Architecture of YARN

YARN has three key components:

1. ResourceManager (Master)

 Allocates resources across the cluster
 Schedules applications
 Monitors node health
 Works with NodeManagers

2. NodeManager (Slave)

 Runs on each node
 Manages containers
 Reports resource availability to the ResourceManager
 Executes tasks assigned by the ApplicationMaster

3. ApplicationMaster

 One per application
 Negotiates resources from ResourceManager
 Coordinates tasks running on NodeManagers
 Handles task failures

B) YARN Processing Flow

User Submits Application
|
V
ResourceManager
|
V
ApplicationMaster Launched
|
V
ResourceManager Allocates Containers, NodeManagers Launch Them
|
V
Tasks Execute in Parallel
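
The division of roles in this flow can be illustrated with a toy model. It is not the YARN API: the three classes only borrow the component names, memory is the only resource tracked, and the "most free memory first" placement rule is an assumption made for the sketch (real YARN schedulers are considerably more elaborate).

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Toy worker node: tracks free memory and the containers it runs."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, task, mb):
        self.free_mb -= mb
        self.containers.append(task)

class ResourceManager:
    """Toy master: decides which node can host a container."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mb):
        # Assumed rule: choose the node with the most free memory that fits.
        candidates = [n for n in self.nodes if n.free_mb >= mb]
        return max(candidates, key=lambda n: n.free_mb) if candidates else None

class ApplicationMaster:
    """Toy per-application coordinator: requests one container per task."""
    def __init__(self, rm, tasks, mb_per_task):
        self.rm, self.tasks, self.mb_per_task = rm, tasks, mb_per_task

    def run(self):
        placement, pending = {}, []
        for task in self.tasks:
            node = self.rm.allocate(self.mb_per_task)
            if node is None:
                pending.append(task)  # no capacity yet, task must wait
            else:
                node.launch(task, self.mb_per_task)
                placement[task] = node.name
        return placement, pending

nodes = [NodeManager("node1", 2048), NodeManager("node2", 1024)]
am = ApplicationMaster(ResourceManager(nodes), ["map-0", "map-1", "map-2", "map-3"], 1024)
placement, pending = am.run()
print(placement)  # {'map-0': 'node1', 'map-1': 'node1', 'map-2': 'node2'}
print(pending)    # ['map-3']
```

The fourth task stays pending because the cluster has only 3072 MB in total, which shows why the ApplicationMaster has to negotiate for resources rather than assume them.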

C) Advantages of YARN

 Supports multiple processing models (MapReduce, Spark, Hive, etc.)
 Better resource utilization
 Scalable and highly efficient
 Handles diverse workloads
 Improves cluster throughput

D) YARN vs MapReduce 1.0

Feature                     MapReduce v1             YARN

Resource Management         Done by JobTracker       Done by ResourceManager

Scalability                 Limited                  Very high

Supports only MapReduce?    Yes                      No (Spark, Tez, ML frameworks)

Job Failures                Handled by JobTracker    Handled by ApplicationMaster

3. Relationship Between MapReduce and YARN

 In Hadoop 2.x and later, MapReduce runs on top of YARN.
 YARN provides the resources, while MapReduce performs the actual data processing.
 This separation increases flexibility and efficiency.

4. Applications of MapReduce and YARN

 Large-scale log analysis
 Batch processing
 Machine learning (large datasets)
 Search indexing
 Fraud detection
 ETL processing
 Social media analytics

1.12 MAP REDUCE PROGRAMMING MODEL

The MapReduce programming model is a core component of the Hadoop ecosystem, designed
to process and analyze large-scale data in a distributed and parallel manner. It was originally
developed by Google and later implemented in open-source through Hadoop. MapReduce
simplifies large data processing by dividing tasks into smaller sub-tasks that run in parallel
across a cluster of machines. It provides a scalable, fault-tolerant, and high-performance
framework for batch processing massive datasets.

1. Introduction to MapReduce

MapReduce follows a divide-and-conquer strategy. Instead of processing huge datasets on a
single machine, the data is split and distributed across multiple nodes in a Hadoop cluster. The
program logic is expressed in terms of two fundamental functions: Map() and Reduce().

 The Map function processes input data and transforms it into key/value pairs.
 The Reduce function aggregates or summarizes the mapped data.

This model abstracts the complexities of parallelization, data distribution, synchronization, and
fault tolerance, allowing developers to focus only on business logic.

2. Phases of the MapReduce Programming Model

The MapReduce workflow consists of several well-defined phases:

(a) Input Splitting

 Input data stored in HDFS is divided into input splits, usually equal to the HDFS block size
(128 MB by default in Hadoop 2.x and later, often configured to 256 MB).
 Each split is processed by a separate mapper.
 This ensures parallelism.
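
As a rough arithmetic check, the number of mappers follows from the file size and the split size. The helper below is hypothetical and assumes exactly one split per block-sized chunk; Hadoop's actual split calculation applies further rules that are ignored here.

```python
import math

def num_input_splits(file_size_mb, split_size_mb=128):
    """Approximate split count: one split (and one mapper) per block-sized chunk."""
    return math.ceil(file_size_mb / split_size_mb)

print(num_input_splits(1024))       # 8 splits of 128 MB
print(num_input_splits(1000))       # 8 splits: 7 full ones plus a 104 MB remainder
print(num_input_splits(1024, 256))  # 4 splits of 256 MB
```

A larger split size means fewer mappers, each doing more work, which is one of the trade-offs behind choosing 128 MB or 256 MB blocks.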

(b) Mapping Phase

 The Map function takes input in the form of key-value pairs.
 It applies user-defined logic to generate intermediate key-value pairs.
 Example: counting words → (word, 1)

(c) Shuffling and Sorting

 The output of the map phase is automatically shuffled, meaning keys are grouped
together.
 Hadoop ensures that:
o All values related to a specific key are sent to the same reducer.
o Data is sorted by keys before reaching the reducers.
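
A minimal sketch of what shuffling and sorting achieves, assuming two mappers whose outputs are merged in one process. `shuffle_and_sort`, `mapper_a`, and `mapper_b` are hypothetical names; in Hadoop this step is performed by the framework across the network.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Sort intermediate pairs by key, then gather each key's values in a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Intermediate output from two hypothetical mappers.
mapper_a = [("big", 1), ("data", 1)]
mapper_b = [("is", 1), ("big", 1)]

grouped = shuffle_and_sort(mapper_a + mapper_b)
print(grouped)  # [('big', [1, 1]), ('data', [1]), ('is', [1])]
```

Both occurrences of "big" end up in one list even though they came from different mappers, which is exactly the guarantee a reducer relies on.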

(d) Reducing Phase

 The Reduce function receives a key and a list of values.
 It combines these values (e.g., sum, average, max, min).
 Final output key-value pairs are generated.
 Example: (word, total_count)

(e) Output Phase

 The reducer output is written to HDFS.
 Output is stored in multiple files (one per reducer).

3. Key Components

1. Mapper

 User-defined function.
 Converts input into intermediate key-value pairs.
 Runs in parallel across multiple nodes.

2. Reducer

 Aggregates values for each key.
 Produces final output.

3. Combiner (Optional)

 Mini-reducer running on mapper nodes.
 Reduces data transfer during shuffle.
 Used for optimization (e.g., summation).
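
The saving a combiner brings can be seen by counting the pairs one mapper would send to the shuffle with and without local summation. This is an illustrative sketch with hypothetical function names and input, not Hadoop's Combiner class.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts per key locally, before anything is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

line = "big data big ideas big wins"
raw = map_words(line)
combined = combine(raw)

print(len(raw), len(combined))  # 6 4
print(combined)  # [('big', 3), ('data', 1), ('ideas', 1), ('wins', 1)]
```

This works for summation because partial sums can be added again by the reducer. An operation such as average cannot be combined this way directly, since an average of averages is generally not the overall average.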

4. Partitioner

 Decides which reducer processes which key.
 Aims for balanced workload distribution (the default partitioner hashes the key, so heavily
skewed keys can still overload one reducer).
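
Hash partitioning can be sketched as "hash of the key modulo the number of reducers". The version below is an assumption-laden illustration: `partition` is a hypothetical helper, and `zlib.crc32` stands in for the key's hash function because Python's built-in `hash()` for strings changes between runs.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in the range 0 .. num_reducers - 1."""
    # crc32 is a stable stand-in hash chosen for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 3
for word in ["big", "data", "is", "big"]:
    print(word, "-> reducer", partition(word, num_reducers))
```

Both occurrences of "big" are printed with the same reducer index, because the index depends only on the key. That property is what sends all values for a key to a single reducer.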

4. Advantages of MapReduce

1. Scalability

 Can scale to thousands of nodes.
 Suitable for petabyte-level data.

2. Fault Tolerance

 If a node fails:
o Tasks automatically rerun on another node.
 Data replication in HDFS ensures reliability.
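
The rerun behaviour can be pictured as a simple retry loop. Everything here is hypothetical (the helper, the node names, the failing node, and the attempt limit); it only shows the idea that a failed task is retried elsewhere, which is possible because HDFS keeps copies of the input on other nodes.

```python
def run_with_retry(task, nodes, max_attempts=4):
    """Try the task on successive nodes until one attempt succeeds."""
    attempts = []
    for node in nodes[:max_attempts]:
        attempts.append(node)
        try:
            return task(node), attempts
        except RuntimeError:
            continue  # this node failed, move on to the next one
    raise RuntimeError("task failed on every node tried")

failed_nodes = {"node1"}  # pretend this node has crashed

def count_lines(node):
    if node in failed_nodes:
        raise RuntimeError(f"{node} is down")
    return 42  # placeholder result of the task

result, tried = run_with_retry(count_lines, ["node1", "node2", "node3"])
print(result, tried)  # 42 ['node1', 'node2']
```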

3. Parallel Processing

 Splits work across many machines.
 Achieves high-speed processing.

4. Simplified Programming Model

 Developers only write Map and Reduce functions.
 Hadoop manages cluster complexities.

5. Cost-Effective

 Works on commodity hardware.
 Supports distributed storage and computation.

5. Use Cases of MapReduce

 Log analysis
 Word count and text processing
 Clickstream analysis
 Machine learning preprocessing
 Large-scale indexing (for example, building Google's early search index)
 Recommendation systems
 ETL operations in big data pipelines

6. Example: Word Count Using MapReduce

Input line: "Big data is big"

Map Phase Output (words converted to lowercase)

(big, 1)
(data, 1)
(is, 1)
(big, 1)

Shuffle and Sort

(big, [1, 1])
(data, [1])
(is, [1])

Reduce Phase

(big, 2)
(data, 1)
(is, 1)
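
The same word count can be written as a small program in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated text lines. This is a sketch under assumptions: the functions take Python iterables instead of reading standard input, and `sorted()` stands in for the framework's shuffle and sort.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word, lowercased."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

mapped = list(mapper(["Big data is big"]))
shuffled = sorted(mapped)          # stand-in for Shuffle and Sort
result = list(reducer(shuffled))

print(mapped)  # ['big\t1', 'data\t1', 'is\t1', 'big\t1']
print(result)  # ['big\t2', 'data\t1', 'is\t1']
```

The reducer depends on its input being sorted by key, which is why the sort step between the two functions cannot be skipped.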
