JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR
[Link] (CSE)– III-II Sem L T P C
3 0 0 3
(19A05602T)
BIG DATA ANALYTICS
(Common to CSE & IT)
The course is designed to
Understand the basic concepts and importance of Big Data
Familiarize with the installation of Hadoop and how to analyze the Big Data
Understand the design concepts of HDFS
Provide good insight for developing a MapReduce applications
Understand Hadoop environment.
Explore the concepts of Pig, Hive, Spark and HBase
UNIT-I
Introduction to Big Data:What is Big Data? Why Big Data is Important? Meet Hadoop, Data,
Data Storage and Analysis, Comparison with other systems, History of Apache Hadoop, Hadoop
Ecosystem, VMWare Installation of Hadoop. Analyzing the Data with Hadoop, Scaling Out.
Learning Outcomes:
At the end of the unit, students will be able to:
Identify the characteristics of datasets. (L3)
Compare trivial data and big data for various applications. (L4)
Choose and implement various ways of selecting suitable model parameters.(L1)
UNIT- II
HDFS: The Design of HDFS, HDFS Concepts, The Command-Line Interface, Hadoop File
systems, The Java Interface, Data flow.
MapReduce: Developing a MapReduce application, The Configuration API, Setting up the
Development Environment, Running Locally on Test Data, Running on a Cluster
Learning Outcomes:
At the end of the unit, students will be able to:
● Understand and apply scaling up Hadoop techniques and associated technologies.(L2)
● Estimate suitable test data. (L5)
● Apply the MapReduce application on a cluster.(L3)
UNIT-III
How MapReduce Works: Anatomy of a MapReduce, Job Run, Failures, Shuffle and Sort, Task
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Execution.
MapReduce Types and Formats:MapReduce Types, Input formats, output formats.
Learning Outcomes:
At the end of the unit, students will be able to:
● Explore the Anatomy of MapReduce. (L5)
● Illustrate various input and output formats of MapReduce. (L2)
● List various MapReduce types.(L1)
UNIT-IV
Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification, Cluster Setup and
Installation, Hadoop Configuration, Security.
Pig: Installing and Running Pig, an Example, Comparison with Databases, Pig Latin, User-
Defined Functions, Data Processing Operators.
Learning Outcomes:
At the end of the unit, students will be able to:
● Show the cluster setup and installation.(L2)
● Demonstrate the Configure the Hadoop.(L2)
● Compare Hadoop with various Databases.(L5)
UNIT-V
Hive: Installing Hive, Running Hive, Comparison with traditional Databases, HiveQL, Tables,
Querying Data.
Spark: Installing Spark, Resilient Distributed Datasets, Shared Variables, Anatomy of a Spark
Job Run.
HBase: HBasics, Installation, clients, Building an Online Query Application.
Learning Outcomes:
At the end of the unit, students will be able to:
● Explain various frameworks of Big Data. (L2)
● Compare Hive with traditional Databases.(L4)
● Learn how to build an online query application.(L1)
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Course Outcomes:
Upon completion of the course, the students should be able to:
Explain the concepts and challenges of big data (L2)
Determine why existing technologies are inadequate to analyze the large data. (L5)
Outline the operations viz. Collect, manage, store, query, and analyze various forms of
big data. (L2)
Apply large-scale analytic tools to solve some of the open big data problems. (L3)
Analyze the impact of big data for business decisions and strategies.(L4)
Design different big data applications. (L6)
Text Books:
1. Tom White, “Hadoop: The Definitive Guide”Fourth Edition, O’reilly Media, 2015.
2. Big Data, Big Analytics: Emerging business intelligence and analytic trends for today’s
businesses, Michael Minnelli, Michelle Chambers, and Ambiga Dhiraj, Wiley Cio Series
Reference Books:
1. Glenn J. Myatt, Making Sense of Data , John Wiley & Sons, 2007 Pete Warden,Big Data
Glossary, O’Reilly, 2011.
2. Michael Berthold, David [Link], Intelligent Data Analysis, Spingers, 2007.
3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos,Uderstanding Big
Data : Analytics for Enterprise Class Hadoop and Streaming Data, McGraw Hill Publishing,
2012.
4. Anand Rajaraman and Jeffrey David UIIman, Mining of Massive Datasets Cambridge
University Press, 2012.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
UNIT-I
Introduction to Big Data:What is Big Data? Why Big Data is Important? Meet Hadoop, Data,
Data Storage and Analysis, Comparison with other systems, History of Apache Hadoop, Hadoop
Ecosystem, VMWare Installation of Hadoop. Analyzing the Data with Hadoop, Scaling Out.
What is Big Data?
Data which are very large in size is called Big Data. Normally we work on data of size
MB(WordDoc ,Excel) or maximum GB(Movies, Codes) but data in Peta bytes i.e. 10^15
byte size is called Big Data. It is stated that almost 90% of today's data has been
generated in the past 3 years.
Examples of Big Data
o Social networking sites: Facebook, Google, LinkedIn all these sites generates huge
amount of data on a day to day basis as they have billions of users worldwide.
o E-commerce site: Sites like Amazon, Flipkart, Alibaba generates huge amount of logs
from which users buying trends can be traced.
o Weather Station: All the weather station and satellite gives very huge data which are
stored and manipulated to forecast weather.
o Telecom Company: Telecom giants like Airtel, Vodafone study the user trends and
accordingly publish their plans and for this they store the data of its million users.
o Share Market: Stock exchange across the world generates huge amount of data through
its daily transaction.
Types of Bigdata
Characteristics of Bigdata
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
What is Big Data Analytics?
The process of analysis of large volumes of diverse data sets, using advanced analytic techniques is
referred to as Big Data Analytics.
Types of Big Data Analytics
Why Big Data is Important?
Big Data initiatives were rated as “extremely important” to 93% of companies. Big Data analytics
solution helps organizations to unlock the strategic values and take full advantage of their assets.
It helps organizations:
To understand Where, When and Why their customers buy
Protect the company’s client base with improved loyalty programs
Seizing cross-selling and upselling opportunities
Provide targeted promotional information
Optimize Workforce planning and operations
Improve inefficiencies in the company’s supply chain
Predict market trends
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Predict future needs
Make companies more innovative and competitive
It helps companies to discover new sources of revenue
Meet Hadoop
What is Hadoop
Hadoop is an open source framework from Apache and is used to store process and analyze data
which are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover it can be scaled up just by adding nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis
of that HDFS was developed. It states that the files will be broken into blocks and stored in
nodes over the distributed architecture.
2. Yarn: Yet another Resource Negotiator is used for job scheduling and manage the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel
computation on data using key value pair. The Map task takes input data and converts it
into a data set which can be computed in Key value pair. The output of Map task is
consumed by reduce task and then the out of reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Data
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
We live in the data age.
Estimates 0.18 ZB in 2006 and forecasting a tenfold growth by 2011 to 1.8
ZB
21
• 1 ZB = 10 bytes = 1,000 EB = 1,000,000 PB =
1,000,000,000 TB
The flood of data is coming from many sources
New York Stock Exchange generates 1 TB of new trade data per day
Facebook hosts about 10 billion photos taking up 1 PB (=1,000 TB) of
storage
Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per
month
‘Big Data’ can affects smaller organizations or individuals
Digital photos, individual’s interactions – phone calls, emails, documents
– are captured and stored for later access
The amount of data generated by machines will be even greater than that
generated by people
Machine logs, RFID readers, sensor networks, vehicle GPS traces,
retail transactions
Data can be shared for anyone to download and analyze
Public Data Sets on Amazon Web Services, [Link], [Link]
[Link] project
• Watches the astrometry group on Flickr for new photos of
the night sky
• Analyzes each image and identifies the sky.
Data Storage and Analysis
The storage capacities have increased but access speeds haven’t kept up
• Writing is even slower!
Solution : Read and write data in parallel to/from multiple disks
Problem
To solve hardware failure replication
• RAID : Redundant copies of the data are kept in case of
failure
To combine the data in a disk with the others
What Hadoop provides
A reliable shared storage (HDFS)
Efficient analysis (MapReduce)
Comparison with Other Systems
– RDBMS
RDBMS
B-Tree index
Optimized for accessing and updating a small proportion of records
MapReduce
Efficient for updating the large data, uses Sort/Merge to rebuild the DB
Good for the needs to analyze the whole dataset in a batch fashion
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Structured vs. Semi- or Unstructured Data
Structured data : particular predefined schema RDBMS
Semi- or Unstructured data : looser or no particular internal structure
MapReduce
Normalization
To retain the integrity and remove redundancy, relational data is often
normalized
MapReduce performs high-speed streaming reads and writes, and records
that is not normalized are well-suited to analysis with MapReduce.
RDBMS vs. MapReduce
Co-evolution of RDBMS and MapReduce systems
RDBs start incorporating some of the ideas from Map
– Grid Computing
Grid Computing
High Performance Computing(HPC) and Grid Computing communities
have been doing large-scale data processing
• Using APIs as Message Passing Interface(MPI)
HPC
• Distribute the work across a cluster of machines, which
access a shared filesystem, hosted by a SAN
• Works well for compute-intensive jobs
• Meets a problem when nodes need to access larger data
volumes – hundreds of GB, since the network bandwidth is
the bottleneck and compute nodes become idle
Data locality, the heart of MapReduce
MapReduce collocates the data with the compute node, so data access is
fast since it is local
MPI vs. MapReduce
MPI programmers need to handle the mechanics of the data flow
MapReduce programmers think in terms of functions of key and value
pairs, and the data flow is implicit
Partial failure
MapReduce is a shared-nothing architecture tasks have no dependence on one
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
other. the order in which the tasks run doesn’t matter.
MPI programs have to manage the check-pointing and recovery
– Volunteer Computing
Volunteer computing projects
Breaking the problem into chunks called work units
Sending to computers around the world to be analyzed
The Results are sent back to the server when the analysis is completed
The client gets another work unit
SETI@home
to analyze radio telescope data for signs of intelligent life outside earth
SETI@home vs. MapReduce
SETI@home
• Very CPU-intensive, which makes it suitable for running on
hundreds of thousands of computers across the world.
Volunteers are donating CPU cycles, not bandwidth
• Runs a perpetual computation on untrusted machines on the
Internet with highly variable connection speeds and no data
locality
MapReduce
• Designed to run jobs that last minutes or hours on HW
running in a single data center with very high aggregate
bandwidth interconnects
History of Apache Hadoop
The Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google
File System paper, published by Google.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Year Event
2003 Google released the paper, Google File System (GFS).
2004 Google released a white paper on Map Reduce.
2006 o Hadoop introduced.
o Hadoop 0.1.0 released.
o Yahoo deploys 300 machines and within this year reaches 600 machines.
2007 o Yahoo runs 2 clusters of 1000 machines.
o Hadoop includes HBase.
2008 o YARN JIRA opened
o Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
o Yahoo clusters loaded with 10 terabytes per day.
o Cloudera was founded as a Hadoop distributor.
2009 o Yahoo runs 17 clusters of 24,000 machines.
o Hadoop becomes capable enough to sort a petabyte.
o MapReduce and HDFS become separate subproject.
2010 o Hadoop added the support for Kerberos.
o Hadoop operates 4,000 nodes with 40 petabytes.
o Apache Hive and Pig released.
2011 o Apache Zookeeper released.
o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
2012 Apache Hadoop 1.0 version released.
2013 Apache Hadoop 2.2 version released.
2014 Apache Hadoop 2.6 version released.
2015 Apache Hadoop 2.7 version released.
2017 Apache Hadoop 3.0 version released.
2018 Apache Hadoop 3.1 version released.
HadoopEcosystem
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data. These data
nodes are commodity hardware in the distributed environment. Undoubtedly, making Hadoop
cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
Resource manager has the privilege of allocating resources for the applications in a system
whereas Node managers work on the allocation of resources such as CPU, memory,
bandwidth per machine and later on acknowledges the resource manager. Application manager
works as an interface between the resource manager and node manager and performs
negotiations as per the requirement of the two.
MapReduce:
By making the use of distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing’s logic and helps to write applications which transform big data sets
into a manageable one.
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of data and thereby organizing them in the form of
group. Map generates a key-value pair based result which is later on processed by the
Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating the mapped data. In
simple, Reduce() takes the output generated by Map() as input and combines those tuples
into smaller set of tuples.
PIG:
Pig was basically developed by Yahoo which works on a pig Latin language, which is Query
based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just
the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of
the Hadoop Ecosystem.
HIVE:
With the help of SQL methodology and interface, HIVE performs reading and writing of large
data sets. However, its query language is called as HQL (Hive Query Language).
It is highly scalable as it allows real-time processing and batch processing both. Also, all the
SQL datatypes are supported by Hive thus, making the query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
Mahout:
Mahout, allows Machine Learnability to a system or application. Machine Learning, as the
name suggests helps the system to develop itself based on some patterns, user/environmental
interaction or on the basis of algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
ApacheSpark:
It’s a platform that handles all the process consumptive tasks like batch processing, interactive
or iterative real-time processing, graph conversions, and visualization, etc.
It consumes in memory resources hence, thus being faster than the prior in terms of
optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
ApacheHBase:
It’s a NoSQL database which supports all kinds of data and thus capable of handling anything
of Hadoop Database. It provides capabilities of Google’s BigTable, thus able to work on Big
Data sets effectively.
At times where we need to search or retrieve the occurrences of something small in a huge
database, the request must be processed within a short quick span of time. At such times,
HBase comes handy as it gives us a tolerant way of storing limited data
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These are the two services that perform the task of searching and indexing with
the help of some java libraries, especially Lucene is based on Java which allows spell check
mechanism, as well. However, Lucene is driven by Solr.
Zookeeper: There was a huge issue of management of coordination and synchronization
among the resources or the components of Hadoop which resulted in inconsistency, often.
Zookeeper overcame all the problems by performing synchronization, inter-component based
communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them
together as a single unit. There is two kinds of jobs .i.e Oozie workflow and Oozie coordinator
jobs. Oozie workflow is the jobs that need to be executed in a sequentially ordered manner
whereas Oozie Coordinator jobs are those that are triggered when some data or external
stimulus is given to it.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
VMWare Installation of Hadoop
How to Install Hadoop on a Linux Virtual Machine on Windows 10
Refer Lab Experiment 1: Click Here
Analyzing the Data with Hadoop
MapReduce is a programming model for data processing.
Hadoop can run MapReduce programs written in various languages like Java, Ruby,
Python, and C++.
Example: A Weather Dataset
Program that mines weather data
Weather sensors collect data every hour at many locations across the
globe
They gather a large volume of log data, which is good candidate for
analysis with MapReduce
Data Format
Data from the National Climate Data Center(NCDC)
Stored using a line-oriented ASCII format, in which each line is a
record
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Data files are organized by date and weather station.
There is a directory for each year from 1901 to 2001, each containing a
gzipped file for each weather station with its readings for that year.
The whole dataset is made up of a large number of relatively small files
since there are tens of thousands of weather station.
The data was preprocessed so that each year’s readings were concatenated
into a single file.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Map and Reduce
MapReduce works by breaking the processing into 2 phases: the map and
the reduce.
Both map and reduce phases have key-value pairs as input and output.
Programmers have to specify two functions: map and reduce function.
The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line and the
value is each line of the data set.
The map function pulls out the year and the air temperature from each
input value.
The reduce function takes <year, temperature> pairs as input and produces
the maximum temperature for each year as the result.
The whole data flow
Having run through how the MapReduce program works, express it in code
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
A map function, a reduce function, and some code to run the job are
needed.
Map function
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Reduce function
Main function for running the MapReduce job or Driver
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
A test run
The output is written to the output directory, which contains one output file per
reducer
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Scaling Out
To scale out, we need to store the data in a distributed filesystem, HDFS.
Hadoop moves the MapReduce computation to each machine hosting a part of the
data.
Data Flow
A MapReduce job consists of the input data, the MapReduce program, and
configuration information.
Hadoop runs the job by dividing it into 2 types of tasks, map and reduce
tasks.
Two types of nodes, 1 jobtracker and several tasktrackers
• Jobtracker : coordinates and schedules tasks to run on
tasktrakers.
• Tasktrackers : run tasks and send progr ess report to the
jobtracker.
Hadoop divides the input into fixed-size pieces, called input splits, or just
splits.
Hadoop creates one map task for each split, which runs the user-defined
map function for each record in the split.
The quality of the load balancing increases as the splits become more
fine-grained.
• Default size : 1 HDFS block, 64MB
Map tasks write their output to the local disk, not to HDFS.
If the node running a map task fails, Hadoop will automatically rerun the
map task on another node to re-create the map output.
Data Flow – single reduce task
Reduce tasks don’t have the advantage of data locality – the input to a single
reduce task is normally the output from all mappers.
All map outputs are merged across the network and passed to the user-defined
reduce function.
The output of the reduce is normally stored in HDFS.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Data Flow – multiple reduce tasks
The number of reduce tasks is specified independently not governed by the input
size.
The map tasks partition their output by keys, each creating one partition for each
reduce task.
There can be many keys and their associated values in each partition, but the
records for any key are all in a single partition.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Data Flow – zero reduce task
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
Combiner Functions
The function calls on the temperature values can be expressed as follows:
• Max(0, 20, 10, 25, 15) = max( max(0, 20, 10), max(25, 15) )
= max(20, 25) = 25
Calculating ‘mean’ temperatures couldn’t use the mean as the combiner
function
• mean(0, 20, 10, 25, 15) = 14
• mean( mean(0, 20, 10),
mean(25, 15) ) = mean(10, 20) = 15.
The combiner function doesn’t replace the reduce function.
It can help cut down the amount of data shuffled between the maps and the
reduces
Combiner Functions
Specifying a combiner function
• The combiner function is defined using the Reducer interface
• It is the same implementation as the reducer function in
MaxTemperatureReducer.
• The only change is to set the combiner class on the JobConf.
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN
[Link] SAGAR, ASSISTANT PROFESSOR CSE, SVCN