BIG DATA ANALYTICS
BCS714D
Suggested Learning Resources:
Books:
1. Seema Acharya and Subhashini Chellappan “Big data and Analytics”, Wiley India
Publishers, 2nd Edition, 2019.
2. Rajkamal and Preeti Saxena, “Big Data Analytics, Introduction to Hadoop, Spark and
Machine Learning”, McGraw Hill Publication, 2019.
Reference Books:
3. Adam Shook and Donald Miner, “MapReduce Design Patterns: Building Effective
Algorithms and Analytics for Hadoop and Other Systems”, O'Reilly, 2012.
• Course objectives:
1. To implement MapReduce programs for processing big data.
2. To realize storage and processing of big data using MongoDB, Pig, Hive and
Spark.
3. To analyze big data using machine learning techniques.
• Course outcomes (Course Skill Set): At the end of the course, the
student will be able to:
• Illustrate Big Data concepts, tools and applications.
• Develop programs using HADOOP framework.
• Use a Hadoop cluster to deploy MapReduce jobs and Pig, Hive and Spark
programs.
• Analyze the given data set to identify deep insights.
• MODULE-1: Classification of data, Big Data Analytics
• MODULE-2: Introduction to Hadoop, Introduction to MapReduce Programming
• MODULE-3: Introduction to MongoDB
• MODULE-4: Introduction to Hive, Introduction to Pig
• MODULE-5: Spark and Big Data Analytics; Text, Web Content and Link Analytics
MODULE-1
• Classification of data, Characteristics, Evolution and definition of Big Data, What
is Big Data, Why Big Data.
• Traditional Business Intelligence vs Big Data, Typical data warehouse and
Hadoop environment.
• Big Data Analytics: What is Big Data Analytics, Classification of Analytics,
Importance of Big Data Analytics, Technologies used in Big Data environments,
Few Top Analytical Tools, NoSQL, Hadoop.
• TB1: Ch 1: 1.1, Ch2: 2.1-2.5,2.7,2.9-2.11, Ch3: 3.2,3.5,3.8,3.12, Ch4:
4.1,4.2
• What is Big Data?
• According to Gartner, the definition of Big Data is: “Big data is high-volume, high-velocity, and
high-variety information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.”
OR
• Big Data refers to complex and large data sets that have to be processed and analyzed to
uncover valuable information that can benefit businesses and organizations.
OR
• A simpler way to answer what Big Data is:
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
• It includes data mining, data storage, data analysis, data sharing, and data visualization.
• The term is an all-comprehensive one including data, data frameworks, along with the tools
and techniques used to process and analyze the data.
The History of Big Data
• The 1960s and '70s: the world of data was just getting started, with the first data centers and the
development of the relational database.
• Around 2005: people began to realize how much data was being generated through Facebook, YouTube,
and other online services. Hadoop (an open-source framework created specifically to store and
analyze big data sets) was developed, and NoSQL also began to gain popularity.
• With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance.
• The emergence of machine learning has produced still more data. While big data has come far, its
usefulness is only just beginning.
• Cloud computing has expanded big data possibilities even further. The cloud offers truly elastic
scalability, where developers can simply spin up ad hoc clusters to test a subset of data.
Classification of Digital Data
Data is classified into 3 types: structured, semi-structured, and unstructured.
Characteristics of Data
1. Composition:
• Deals with the structure of data, that is, the sources of data, the granularity,
the types, and the nature of data (whether it is static or real-time streaming).
2. Condition:
• Deals with the state of data, that is:
• Can one use this data as is for analysis? Or
• Does it require cleansing for further enhancement and enrichment?
3. Context:
• Deals with questions such as:
• Where has this data been generated?
• Why was this data generated?
• How sensitive is this data?
• What are the events associated with this data? And so on.
• Big data is
• Anything beyond the human and technical infrastructure needed to
support storage, processing and analysis
• Today’s BIG may be tomorrow’s NORMAL
• Terabytes or petabytes or zettabytes of data
• About the 3Vs (volume, velocity and variety), which were proposed by Gartner
• Gartner's definition of big data can be read in three parts:
• Part-1: is about voluminous data that may have a great
variety (structured, semi-structured and unstructured) and will
require good speed/pace for storage, preparation, processing and
analysis
• Part-2: talks about embracing new techniques and technologies
to capture, store, process, persist, integrate and visualize the
high-volume, high-velocity and high-variety data
• Part-3: talks about deriving deeper, richer, and meaningful
insights and then using these insights to make faster and better
decisions to gain business value
Challenges with big data
• 1. Batch Processing
Processing large volumes of data at once, after it's collected and stored over a period.
• 2. Periodic Processing (Mini-batch)
Data is processed in small batches at regular intervals (e.g., every 10
minutes, hourly).
• 3. Near Real-Time Processing
Data is processed almost immediately, with slight delays (typically seconds to a
minute).
• 4. Real-Time Processing
Data is processed immediately as it arrives, with only milliseconds of delay.
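The difference between these processing modes can be sketched in a few lines of Python. This is only an illustration running in a single process: the `events` list, the batch size of 3, and the three function names are hypothetical and do not come from the course text.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch processing: wait until all events are collected, then process once."""
    return sum(events)


def mini_batch_totals(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Mini-batch processing: emit one result per fixed-size group of events."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield sum(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield sum(batch)


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Real-time style: update the result as each event arrives."""
    total = 0
    for event in events:
        total += event
        yield total


events = [5, 3, 8, 1, 9, 2, 7]  # hypothetical event values
print(batch_total(events))                 # one answer at the end
print(list(mini_batch_totals(events, 3)))  # one answer per batch
print(list(running_totals(events)))        # one answer per event
```

The trade-off is latency against overhead: the batch version gives a single answer late, while the per-event version gives an up-to-date answer after every event at the cost of doing work on each arrival.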
• Other characteristics of data which are not definitional traits of big
data
• Veracity and validity:
• Veracity refers to biases, noise and abnormality in data.
• The key question here is: “Is all the data that is being stored, mined and analyzed
meaningful and pertinent to the problem under consideration?”
• Validity refers to the accuracy and correctness of the data. Any data that is
picked up for analysis needs to be accurate.
• Volatility: deals with how long the data is valid and how long it should be stored.
• Some data is required for long-term decisions and remains valid for
longer periods of time.
• However, there are also pieces of data that quickly become obsolete
minutes after generation.
Why Big data
Traditional Business Intelligence (BI) versus Big Data
Aspect | Traditional Business Intelligence (BI) | Big Data
Data Type | Structured data (e.g., from relational databases) | Structured, semi-structured, and unstructured data
Data Volume | GBs to TBs | TBs to PBs and beyond
Data Source | Limited (internal databases, ERP, CRM) | Diverse (social media, sensors, web logs, IoT, etc.)
Processing Type | Batch processing | Batch, near real-time, and real-time processing
Storage System | Relational Databases (RDBMS) | Distributed systems (Hadoop HDFS, NoSQL, etc.)
Scalability | Limited vertical scaling | High horizontal scalability
Tools Used | SQL, Excel, OLAP tools, traditional dashboards | Hadoop, Spark, Hive, Pig, NoSQL databases
Analysis Approach | Descriptive analytics (what happened?) | Predictive, prescriptive, real-time analytics
User | Analysts and decision-makers | Data scientists, engineers, analysts, businesses
Speed | Slower for large data sets | Faster with distributed parallel processing
Cost | Often high (licensing, hardware) | Open-source tools reduce cost but need expertise
Summary:
• Traditional BI focuses on historical data and structured reporting for strategic decisions.
• Big Data deals with massive, varied, and fast-moving data to extract insights in real time or
near real time, often using modern distributed technologies.
A typical warehouse environment
A Data Warehouse is a centralized repository that stores structured
data collected from different sources to support reporting, querying,
and data analysis.
It is specially designed for decision-making, business intelligence (BI),
and analytical processing, rather than for day-to-day operations.
• Consolidates diverse data into one location
• Supports accurate and timely decision-making
• Enables advanced analytics and business intelligence
Data Sources (Input)
These are the systems where raw data originates from:
1. ERP (Enterprise Resource Planning)
• Contains business operations data like finance, HR, inventory, etc.
2. CRM (Customer Relationship Management)
• Stores customer interactions, sales leads, and service data.
3. Legacy Systems
• Older systems still in use that store historical or transactional data.
4. Third-party Apps
• External software or services that provide relevant data (e.g., market data, web
analytics, cloud platforms).
These sources send data into the data warehouse using ETL or ELT
processes.
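The extract, transform and load steps can be sketched as three small functions. This is a minimal sketch under stated assumptions, not a real warehouse pipeline: the `raw_rows` records, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a CRM export.
raw_rows = [
    {"customer": " alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "80"},
    {"customer": "", "amount": "15"},  # missing customer name
]


def extract():
    """Extract: read records from the source system (here, an in-memory list)."""
    return list(raw_rows)


def transform(rows):
    """Transform: drop incomplete rows, normalise names, convert amounts to numbers."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue
        cleaned.append((name, float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

In an ELT process the order of the last two steps is swapped: the raw rows are loaded first and the transformation runs inside the warehouse.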
Data Usage (Output)
These represent the analytics and business intelligence tools used to
extract value from the data warehouse:
1. Reporting / Dashboarding
• Tools like Power BI, Tableau, etc., generate visual reports for business insights.
2. OLAP (Online Analytical Processing)
• Supports multidimensional analysis such as drill-down, roll-up, slice & dice.
3. Ad Hoc Querying
• Analysts can run custom queries as needed, without predefined reports.
4. Modeling
• Data scientists and analysts create predictive models using historical data.
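The OLAP operations named above (roll-up, drill-down, slice) can be illustrated on a tiny fact table. This is a sketch only: the `facts` rows, the dimension order, and the helper names `roll_up` and `slice_` are hypothetical, and a real OLAP tool would work on a pre-built cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
facts = [
    ("North", "Laptop", "Q1", 100),
    ("North", "Phone", "Q1", 150),
    ("North", "Laptop", "Q2", 120),
    ("South", "Laptop", "Q1", 90),
    ("South", "Phone", "Q2", 200),
]


def roll_up(rows, dims):
    """Total the sales over the chosen dimension indexes (0=region, 1=product, 2=quarter)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in dims)
        totals[key] += row[3]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dim] == value]


by_region = roll_up(facts, dims=(0,))                 # roll-up to region level
by_region_product = roll_up(facts, dims=(0, 1))       # drill-down to region + product
q1_by_region = roll_up(slice_(facts, 2, "Q1"), (0,))  # slice on quarter = Q1
print(by_region, by_region_product, q1_by_region)
```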
A typical Hadoop environment.
This diagram represents a typical Hadoop environment, showcasing the data flow from raw
sources to storage and analytics systems using Hadoop and its MapReduce processing model.
1. Input Sources (on the left):
These are various types of raw, unstructured or semi-structured data being ingested into the Hadoop
system:
• Web logs – Server logs, clickstream data from websites.
• Images and videos – Multimedia content, often large in size.
• Social media – Data from platforms like Twitter, Facebook, etc.
• Docs & PDFs – Document-based data from reports, articles, etc.
2. Hadoop Core (Center Block):
Hadoop is the central processing unit of this ecosystem.
• It includes:
• HDFS (Hadoop Distributed File System) – For scalable and fault-tolerant storage.
• MapReduce – For parallel processing and analyzing large data sets.
In this block:
• The input data is stored in HDFS.
• Then it is processed using MapReduce jobs, which distribute the computation across multiple nodes.
3. Output Targets (on the right):
Once data is processed, results are sent to different systems for further
use:
• HDFS – For storing the processed output if needed for long-term use.
• Operational systems – Business applications that may use the cleaned/analyzed data.
• Data warehouse – Structured storage system for analytics and reporting.
• Data marts – Subsets of data warehouses for specific departments (e.g., sales, finance).
• ODS (Operational Data Store) – For real-time or near-real-time reporting and integration.
Summary:
• This diagram shows how Hadoop acts as a bridge between raw data sources and
analytical systems.
• It takes in various forms of data, processes them using MapReduce, and
distributes the results to different data storage and business intelligence systems.
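The MapReduce model mentioned above can be sketched with the classic word-count example. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's actual API (Hadoop jobs are typically written in Java and run across many nodes); the function names and the `log_lines` input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


log_lines = ["big data is big", "data is everywhere"]  # hypothetical input
counts = word_count(log_lines)
print(counts)
```

In a cluster, each map call would run on the node that holds its block of input, and the shuffle would move the grouped pairs over the network to the reducers.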
What is Big Data Analytics
Big Data Analytics involves six key points:
1. Technology-enabled analytics:
• Various data analytics and visualization tools are available from vendors like IBM,
Tableau, SAS, R Analytics, Statistica, and World Programming Systems (WPS) that help process and
analyze big data.
2. Gaining deeper business insights:
• It is about understanding customer demographics and behaviors to improve business strategies like
cross-selling and up-selling by leveraging vendor services.
• Author’s experience: the author received clothing recommendations from
an online retailer that matched their past shopping preferences (brand and color). This demonstrated how
data from previous purchases was stored and used effectively to personalize suggestions.
3. Competitive edge:
• Big data analytics offers an advantage over competitors by enabling faster and better decision-making.
4. Collaboration across communities:
• It emphasizes the integration of IT, business users, and data scientists, referred to as a
"tight handshake" among these three groups.
5. Large volume and variety of datasets:
• It involves handling data that exceeds the current storage and processing
capabilities of traditional infrastructure.
6. Moving code to data:
• It is more efficient to move small programs (a few KBs) to the data instead of
moving massive data (in terabytes, petabytes, or even exabytes/zettabytes in the future) to
the programs.
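A rough calculation shows why moving code to data pays off. The numbers below are assumptions chosen for illustration (a 1 Gbit/s link, a 10 KB program, a 1 TB dataset), and the estimate ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Time to move size_bytes over a link of the given bandwidth, ignoring latency."""
    return size_bytes * 8 / bandwidth_bits_per_s


LINK = 1e9          # assumed 1 Gbit/s network link
PROGRAM = 10 * 1e3  # assumed 10 KB program
DATASET = 1e12      # assumed 1 TB dataset

program_time = transfer_seconds(PROGRAM, LINK)
dataset_time = transfer_seconds(DATASET, LINK)

print(f"program: {program_time * 1000:.2f} ms")
print(f"dataset: {dataset_time / 3600:.2f} hours")
```

Under these assumptions the program arrives in a fraction of a millisecond, while shipping the dataset takes over two hours before any processing can start.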
The section provides both technical details and real-world context to explain how
big data analytics works and why it’s important.
"WHAT BIG DATA ANALYTICS ISN’T?"
• This section clarifies some common misconceptions about Big Data Analytics. Here are the key points:
1. Big Data is not just about volume: Many people associate "Big Data" solely with
large amounts of data. However, volume is only one aspect. Two other
critical dimensions are:
• Variety: The different types of data (structured, unstructured, etc.).
• Velocity: The speed at which data is generated and processed.
2. Big Data Analytics is not exclusive to tech giants: It is a myth that only large online
companies like Google or Amazon use Big Data Analytics.
• In reality, any business or industry that seeks to derive actionable insights from its data,
whether internal or external, can and should use Big Data Analytics.
• In summary, Big Data Analytics isn't just about handling huge volumes of data, nor is it reserved
for big tech companies; it's about making meaningful use of various types of data, processed
at speed, for any organization seeking insights.
3.4 WHY THIS SUDDEN HYPE AROUND BIG DATA ANALYTICS?
• It addresses the growing attention and excitement around Big Data Analytics, and presents three main reasons for this hype:
1. Explosive Growth of Data
• Data growth rate: 40% compound annual growth rate.
• Projected size: Nearly 45 Zettabytes (ZB) by 2020.
• In 2010: ~1.2 trillion gigabytes of data was generated.
• By 2012: This doubled to 2.4 trillion gigabytes.
• By 2014: Reached 5 trillion gigabytes.
• Business data is expected to double every 1.2 years.
• Example: Wal-Mart processes 1 million customer transactions per hour.
• Social media examples:
• 500 million tweets posted daily on Twitter.
• 2.7 billion likes and comments posted on Facebook daily.
• Daily data generation: 2.5 quintillion bytes.
• 90% of the world’s data was created in the past 2 years alone.
• Sources:(a) Intel infographic:
• [Link]
• IBM: [Link]
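The growth figures above can be cross-checked with compound-growth arithmetic. The sketch below uses only the numbers quoted in the list (40% annual growth, 1.2 trillion GB in 2010, doubling every 1.2 years for business data); the variable names are illustrative.

```python
import math

cagr = 0.40  # 40% compound annual growth rate, from the figures above

# Doubling time t solves (1 + cagr) ** t == 2.
doubling_years = math.log(2) / math.log(1 + cagr)

# Project the 2010 volume forward (in trillions of gigabytes).
volume_2010 = 1.2
volume_2012 = volume_2010 * (1 + cagr) ** 2
volume_2014 = volume_2010 * (1 + cagr) ** 4

# Annual growth rate implied by "doubles every 1.2 years".
implied_business_rate = 2 ** (1 / 1.2) - 1

print(f"doubling time at 40% growth: {doubling_years:.2f} years")
print(f"projected 2012: {volume_2012:.2f}, 2014: {volume_2014:.2f} trillion GB")
print(f"growth implied by doubling every 1.2 years: {implied_business_rate:.0%}")
```

At 40% a year the volume doubles roughly every two years, which is in line with the quoted 2.4 trillion GB for 2012 and about 5 trillion GB for 2014. Doubling every 1.2 years is a much faster rate, close to 78% a year.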
2. Lower Storage Costs
The cost per gigabyte of storage has dropped significantly, making it cheaper and more feasible
to store large amounts of data.
3. Availability of Analytics Tools
There is now a large number of user-friendly analytics tools available in the market,
which makes it easier for businesses and individuals to analyze data without needing deep
technical expertise.
This section explains why Big Data Analytics has become a major trend in
recent years by highlighting data volume growth, reduced costs, and improved
accessibility of tools.
Classification of Analytics
• Two main classifications of analytics exist:
• By type: basic, operational, advanced, and monetized analytics
• By generation: Analytics 1.0, Analytics 2.0, and Analytics 3.0
Importance of Big Data Analytics
1. Reactive – Business Intelligence (BI)
• Purpose: To help businesses make better and faster decisions by
delivering the right information to the right people in the right format.
• Nature: Works on historical or past data only.
• Output: Dashboards, alerts, notifications, static reports, and ad hoc
queries.
• Scope: Small to medium datasets; relies on traditional BI tools.
2. Reactive – Big Data Analytics
• Purpose: Similar to BI, but works on huge datasets (big data).
• Nature: Still reactive because it analyzes static historical data, not
predicting future events.
• Difference from #1: Focus is on the scale of data rather than new
analytical techniques.
3. Proactive – Analytics
• Purpose: To support futuristic decision-making using advanced
techniques like:
• Data mining
• Predictive modeling
• Text mining
• Statistical analysis
• Nature: Proactive because it anticipates future trends and scenarios.
• Limitation: Works with traditional databases, so it struggles with
huge datasets due to storage and processing constraints.
4. Proactive – Big Data Analytics
• Purpose: Combines predictive, proactive analysis with the power of
big data.
• Nature: Can sift through massive data volumes (terabytes, petabytes,
exabytes) to:
• Identify relevant patterns quickly
• Solve complex problems
• Deliver high-performance insights
• Strength: Most powerful approach — uses large-scale data and
advanced analytics to make forward-looking, data-driven decisions.
Key Idea
• Reactive approaches → Respond to what has already happened.
• Proactive approaches → Prepare for what’s likely to happen next.
• Big data → Increases scale and capability, whether reactive or
proactive.
3.12. Terminologies used in Big Data Analytics
1. In-Memory Analytics
2. In-Database Processing
3. Symmetric Multiprocessor System (SMP)
4. Massively Parallel Processing (MPP)
5. Difference between Parallel and Distributed Systems
6. Shared Nothing Architecture
7. CAP Theorem
Terminologies used in Big Data Analytics
1. In-Memory Analytics (3.12.1)
• Problem: accessing data from non-volatile storage (like hard disks) is slow. The more data that needs to be
fetched from disk, the slower the processing.
• Traditional workaround: pre-compute and store aggregated data (cubes, summary tables, query sets, etc.) so that the CPU
fetches only a small subset.
• Drawback: this requires knowing in advance what data will be needed. If new data is required, the whole process must be
repeated.
• In-memory solution:
• In-memory analytics addresses this problem by storing all the relevant data
in RAM (primary storage) instead of on disk.
• Advantages:
• Much faster data access
• Rapid deployment
• Better insights
• Minimal IT involvement
2. In-Database Processing (3.12.2)
• Also known as: In-database analytics.
• Combines data warehouses with analytical systems so that the analytics can run
inside the database instead of exporting data.
• Data typically comes from OLTP (Online Transaction Processing) systems →
goes through ETL (Extract, Transform, Load) → stored in EDW (Enterprise Data
Warehouse) or data marts.
• Traditionally, the huge datasets are then exported to analytical programs. With in-database
processing, the database itself runs the computations, saving time and resources.
• Benefit:
• Avoids exporting large datasets to separate analytics tools.
• Useful for large businesses with massive datasets.
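The contrast between exporting data and computing inside the database can be sketched with SQLite. This is a toy illustration only: the in-memory database stands in for an enterprise data warehouse, and the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for an enterprise data warehouse
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("South", 250.0), ("North", 50.0), ("South", 25.0)],
)

# Export-then-analyse: every row leaves the database before any computation.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
exported_totals = {}
for region, amount in rows:
    exported_totals[region] = exported_totals.get(region, 0.0) + amount

# In-database: the database computes the aggregate and returns only the result.
in_db_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)

print(len(rows), "rows exported;", len(in_db_totals), "result rows returned in-database")
```

Both approaches give the same totals, but the second one moves only one row per region out of the database, which matters when the table holds billions of rows.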
3. Symmetric Multiprocessor System (SMP)
In an SMP system, there is one main memory that is shared by two or more identical processors.
Key Features
1. Shared Main Memory
• All processors access the same main memory.
• Each processor also has its own high-speed cache memory to store frequently used data.
2. Equal Access
• All processors have equal access to I/O devices.
• Controlled by a single operating system instance.
3. Tightly Coupled
• Processors are connected closely and communicate through a system bus.
4. Bus Arbiter
• Manages access to the system bus to avoid conflicts when multiple processors try to use it at the same
time.
Diagram Explanation (Figure 3.9)
• Bus Arbiter: Controls and coordinates data traffic on the
system bus.
• Processors (1 to n): Perform computations; each has its own
cache memory.
• System Bus: Common communication pathway linking
processors, main memory, and I/O devices.
• Main Memory: Central storage accessible by all processors.
• I/O: Input/Output devices for external communication.
Advantages of SMP
• High performance for parallel tasks.
• Shared resources simplify programming.
• Better load balancing among processors.
Limitations
• System bus can become a bottleneck if too many processors
try to access it at once
Massively Parallel Processing (MPP)
• It refers to the coordinated processing of programs by a number of
processors working in parallel.
• Each processor has its own operating system and dedicated
memory. The processors work on different parts of the same program.
• They communicate using some form of messaging.
Difference between Parallel and Distributed Systems
• A parallel database system is a tightly coupled system.
• The processors cooperate for query processing.
• The user is unaware of the parallelism, since he/she has no access to a specific
processor of the system.
• The processors either have access to a common memory, as shown in the figure, or
make use of message passing for communication.
• Distributed systems are known to be loosely coupled and are
composed of individual machines, as shown in Fig. 3.12.
• Each machine can run its own application and serve its own users.
• The data is distributed across the machines, which are accessed to
answer a user query.
3.12.6 Shared Nothing Architecture
• There are 3 common types of architecture for multiprocessor, high-transaction-rate
systems:
• Shared Memory (SM): a common central memory is shared by multiple
processors.
• Shared Disk (SD): multiple processors share a common collection of disks while having
their own private memory.
• Shared Nothing (SN): neither memory nor disk is shared among multiple processors.
• Advantages of shared nothing
• Fault isolation: a fault in a single node is contained and confined to that node exclusively and
exposed only through messages.
• Scalability: when a disk is shared, different nodes have to take turns to access the critical data. This imposes a limit on
how many nodes can be added to a distributed shared-disk system, thus compromising
scalability. A shared-nothing system has no such shared resource, so nodes can be added more freely.
3.12.7. CAP theorem
Few Top Analytical Tools: NoSQL, Hadoop
• Introduction to NoSQL
• Term coined by Carlo Strozzi in 1998 for lightweight, open-source, relational DB without standard
SQL interface.
• Reintroduced by Johan Oskarsson in 2009 for open-source distributed networks.
• Hashtag #NoSQL coined by Eric Evans to describe non-relational databases.
• Features of NoSQL Databases
• Open source
• Non-relational
• Distributed
• Schema-less
• Cluster friendly
• Born out of 21st-century web applications
Where is NoSQL Used?
• Big data applications
• Real-time web applications
• Log data storage and analysis
• Social media data storage
• Handling unstructured and semi-structured data
What is NoSQL?
• Non-relational, open-source, distributed database
• Scales out horizontally
• Handles structured, semi-structured, and unstructured
data
• Supports key-value, document, column-oriented, or
graph databases
• Below fig shows additional features of NoSQL.
• NoSQL databases
• Are non-relational:
• They do not adhere to the relational data model.
• They are either key-value, document-oriented, column-oriented, or
graph-based databases.
• Are distributed:
• Data is distributed across several nodes in a cluster made up of low-cost
commodity hardware.
• Offer no support for ACID (atomicity, consistency, isolation,
durability) properties:
• Instead of ACID properties, they adhere to
Brewer’s CAP theorem and are often seen compromising consistency in favor
of availability and partition tolerance.
• Provide no fixed schema:
• NoSQL databases are becoming popular owing to their flexible
schema.
• They do not require data to adhere to any schema structure at the
time of storage.
Types of NoSQL Databases
• NoSQL databases are broadly classified into
• Key-value or the big hash table
• Schema-less
• Key-value:
• Maintains a big hash table of keys and values
• Ex: Dynamo, Redis, etc.
• Sample key-value pair in a key-value database
• Document: maintains data in collections made up of documents.
• Ex: MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
• Sample document in a document database
• Column:
• Each storage block has data from only one column
• Ex: Cassandra, HBase, etc.
• Graph: also called network databases.
• A graph database stores data in nodes
• Ex: Neo4j, HyperGraphDB, etc.
• Sample graph in a graph database
• The table below shows popular schema-less databases
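The four data models can be pictured with plain Python structures. This is a sketch for intuition only: every key, name and value below is made up for illustration, and real products (Redis, MongoDB, Cassandra, Neo4j) each have their own storage formats and query languages.

```python
import json

# Key-value: an opaque value looked up by a unique key.
key_value_store = {
    "user:1001:first_name": "Asha",
    "user:1001:last_name": "Rao",
}

# Document: a self-describing, nested record; fields may differ per document.
document = {
    "_id": 1001,
    "name": {"first": "Asha", "last": "Rao"},
    "courses": ["Big Data Analytics", "Machine Learning"],
}

# Column-oriented: the values of each column are stored together.
column_store = {
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Mysuru", "Bengaluru", "Hubballi"],
}

# Graph: entities are nodes, relationships are edges between node ids.
graph = {
    "nodes": {1: {"name": "Asha"}, 2: {"name": "Ravi"}},
    "edges": [(1, "FRIEND_OF", 2)],
}

print(key_value_store["user:1001:first_name"])
print(json.dumps(document, indent=2))
print(column_store["city"])
friends_of_1 = [dst for src, rel, dst in graph["edges"] if src == 1]
print(friends_of_1)
```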
Why NoSQL:
• It has a scale-out architecture instead of the monolithic architecture of relational
databases.
• It can house large volumes of structured, semi-structured and unstructured data.
• Dynamic schema:
• A NoSQL database allows insertion of data without a pre-defined schema.
• It facilitates application changes in real time, which supports faster development and easy
code integration, and requires less database administration.
• Auto-sharding:
• It automatically spreads data across multiple servers.
• The application doesn’t need to know the exact server layout. Load is balanced across
servers, and if one fails, it is replaced quickly without major disruption.
• Replication: strong replication support ensures high availability, fault
tolerance, and disaster recovery.
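Sharding and replication can be sketched together in a few lines. This is a simplified illustration under stated assumptions: the four server names, the replica count of 2, and the helpers `shard_for`, `placement` and `read_from` are all hypothetical, and simple modulo hashing is used where real systems often prefer consistent hashing or range-based sharding.

```python
import hashlib
from typing import Dict, List, Set

SERVERS = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical cluster
REPLICAS = 2  # assumed number of copies kept for each key


def shard_for(key: str, n_servers: int) -> int:
    """Map a key to a shard index with a stable hash (built-in hash() is salted per run)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_servers


def placement(key: str) -> List[str]:
    """Primary server plus the next REPLICAS - 1 servers in ring order."""
    first = shard_for(key, len(SERVERS))
    return [SERVERS[(first + i) % len(SERVERS)] for i in range(REPLICAS)]


def read_from(key: str, failed: Set[str]) -> str:
    """Return the first live server that holds a copy of the key."""
    for server in placement(key):
        if server not in failed:
            return server
    raise RuntimeError("all replicas for this key are down")


layout: Dict[str, List[str]] = {k: placement(k) for k in ["user:1", "user:2", "user:3"]}
print(layout)
```

The application only calls `placement` or `read_from`; it never hard-codes which server holds which key. A weakness of modulo hashing is that changing the number of servers remaps most keys, which is one reason production systems use more elaborate schemes.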
Advantages of NoSQL
1. Can easily scale up and down
a. Cluster scale: it allows distribution of the database
across 100+ nodes, often in multiple data centers
b. Performance scale: it sustains 100,000+
database reads and writes per second
c. Data scale: it supports housing of 1 billion
documents in the database
2. Does not require a pre-defined schema
3. Cheap and easy to implement
4. Relaxes the data consistency requirements
5. Data can be replicated to multiple nodes
and can be partitioned
a. Sharding
b. Replication
What we miss with NoSQL
Use of NoSQL in Industry
NewSQL
• Problem:
• NoSQL offers scalability and flexibility but lacks some traditional RDBMS features (like ACID properties,
joins).
• Traditional RDBMS offers ACID guarantees and SQL support but doesn’t scale like NoSQL for large OLTP
workloads.
• Solution: NewSQL
• Supports relational data model and SQL as the primary query interface.
• Offers the scalable performance of NoSQL for OLTP (Online Transaction Processing) systems.
• Maintains ACID guarantees of traditional databases.
Characteristics of NewSQL
Hadoop
Key Idea
• Hadoop is designed to run on clusters of commodity hardware.
• When you need more performance or storage, you add more nodes rather than buying a
bigger, more powerful server (which would be scale-up).
• This approach makes the system more cost-effective, fault-tolerant, and flexible.
How It Works in Hadoop
• Hadoop stores data across HDFS (Hadoop Distributed File System), splitting files into
blocks and distributing them to multiple nodes.
• When you add new nodes:
• Storage capacity increases (more HDFS space).
• Processing capacity increases (more CPU/memory for running MapReduce, Spark, etc.).
• The NameNode manages metadata, while DataNodes handle storage.
• Hadoop automatically balances and replicates data to the new nodes.
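The block splitting and replication described above can be sketched as follows. This is an illustration, not HDFS code: the 128 MB block size and replication factor of 3 are common HDFS defaults but are configurable, the DataNode names are hypothetical, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128  # common HDFS default block size; configurable per cluster
REPLICATION = 3      # common HDFS default replication factor; also configurable


def split_into_blocks(file_size_mb: int) -> int:
    """Number of blocks needed to hold the file."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def place_blocks(n_blocks: int, datanodes: List[str]) -> Dict[int, List[str]]:
    """Assign each block to REPLICATION distinct DataNodes, round-robin.

    Assumes len(datanodes) >= REPLICATION so the copies land on different nodes.
    """
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return placement


datanodes = [f"datanode-{i}" for i in range(4)]  # hypothetical 4-node cluster
n_blocks = split_into_blocks(1000)               # a 1000 MB file
block_map = place_blocks(n_blocks, datanodes)    # the kind of metadata a NameNode tracks
raw_storage_mb = 1000 * REPLICATION              # space consumed across the cluster

print(n_blocks, "blocks;", raw_storage_mb, "MB of raw storage")
print(block_map[0])
```

The block map plays the role of NameNode metadata, while the block contents themselves would live on the DataNodes. Adding nodes to `datanodes` spreads the same blocks over more machines.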
• Example
• You have a 4-node Hadoop cluster.
• It’s running slower due to heavy workloads.
• Instead of buying a super-powerful single server, you add 4 more
nodes.
• Now, the workload is split among 8 nodes, giving you:
• Faster parallel processing.
• More storage.
• Better fault tolerance.
Hadoop
• Open-source project: developed under the Apache Software Foundation.
• A framework written in Java: created by Doug Cutting in 2005 and named after
his son’s toy elephant.
• Origins: initially developed while Doug Cutting was working with Yahoo to
support distribution for “Nutch” (a text search engine).
• Hadoop uses Google’s MapReduce and Google File System (GFS) technologies
as its foundation.
• Hadoop is now a core part of the computing infrastructure for companies like Yahoo,
Facebook, LinkedIn, Twitter, etc.
Features of Hadoop
1. It is optimized to handle Massive quantities of structured, semi-structured, and unstructured data on
low-cost hardware.
2. Has a Shared-Nothing Architecture.
3. Data Replication: Data is replicated across multiple machines, ensuring fault tolerance.
4. High Throughput rather than Low Latency: Designed for batch processing large volumes of data
(response time is not immediate).
5. OLTP/OLAP Complement: Complements online transaction processing and online analytical
processing systems but does not replace relational databases.
6. Not Suitable for Non-Parallelizable Work: Inefficient when tasks have dependencies that prevent
parallel execution.
7. Not Good for Small Files: Works best with large datasets rather than many small files.
Key Advantages of Hadoop
1. Stores Data in Native Format
• Hadoop Distributed File System (HDFS) stores data without imposing a fixed schema.
• It is schema-less when storing data; structure is applied only when processing is required.
2. Scalable
• Can store and distribute very large datasets (terabytes or more) on hundreds of low-cost
servers working in parallel.
• Proven scalability in real-world companies like Facebook & Yahoo.
3. Cost-Effective
• Due to its scale-out architecture, Hadoop provides a low cost per terabyte for storage
and processing.
4. Resilient to Failure
• Fault-tolerant: Data is replicated across multiple nodes.
• If a node fails, another copy of the data is available for use, ensuring high availability.
5. Flexibility
• Can process structured, semi-structured, and unstructured data.
• Supports various use cases like log analysis, data mining, recommendation systems, clickstream
analysis, social media insights, market campaign analytics, etc.
6. Fast
• High-speed processing due to Hadoop’s move code to data approach (processing occurs where
the data is stored rather than moving the data to the computation).
• Significantly faster compared to many traditional systems.
Overview of Hadoop Ecosystem
• Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing(A programming model for processing large
datasets in parallel across the Hadoop cluster.)
• FLUME:A distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of log data.
• Sqoop:A tool for transferring data between Hadoop and relational databases.
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
• Oozie
• Oozie performs the task of a scheduler: it schedules jobs and
binds them together as a single unit.
• There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs.
• Oozie workflow jobs need to be executed in a sequentially ordered
manner, whereas Oozie coordinator jobs are triggered when
some data or external stimulus is given to them.
Zookeeper
• Managing coordination and synchronization among the resources or
components of Hadoop was a major issue, and it often resulted in
inconsistency.
• Zookeeper overcame these problems by performing synchronization,
inter-component communication, grouping, and maintenance.
• Apache Mahout
• Mahout brings machine learning capability to a system or application.
• Machine learning, as the name suggests, helps the system improve itself
based on patterns, user/environmental interaction, or
algorithms.
• It provides various libraries or functionalities such as collaborative filtering,
clustering, and classification which are nothing but concepts of Machine learning.
• It allows invoking algorithms as per our need with the help of its own libraries.