Big Data Analytics Ecosystem
Dr. Ashish Kumar
Big Data Analytics Ecosystem
• Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data?
– Analytic / semantic processing
There is no standard combination of answers to the above questions.
It varies from project to project.
Browse this website for big data tools and assemble
your own ecosystem according to your project's needs:
[Link]
Hadoop Distributions / Ecosystem
A distributed framework for running
applications that produce and process huge
volumes of data on large clusters of commodity
hardware.
Apache Software Foundation Project
Java-based implementation of MapReduce
Open source
“Core” Hadoop Components :
Hadoop Common (formerly Hadoop Core)
Hadoop MapReduce
Hadoop YARN (MapReduce 2.0)
Hadoop Distributed File System (HDFS)
Concept: Moving computation is more efficient than moving data
Why Hadoop?
Need to process Multi Petabyte Datasets
Expensive to build reliability in each
application.
Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not
constant.
Need common infrastructure
– Efficient, reliable, Open Source Apache License
The above goals are the same as Condor's, but
the workloads are I/O-bound rather than CPU-bound
The Hadoop Ecosystem
Following are the components that collectively form
a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-model-based data processing
Spark: In-memory data processing
Pig, Hive: Query-based data processing services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm
libraries
Solr, Lucene: Searching and indexing
ZooKeeper: Cluster management and coordination
Oozie: Job scheduling
Typical Hadoop Cluster
[Figure: racks of nodes, each with a rack switch, connected through an aggregation switch]
Typically a two-level architecture
40 nodes per rack, 1000-4000 nodes per cluster
1 Gbps bandwidth within a rack, 8 Gbps out of
the rack
Node specs (Yahoo terasort): 8 x 2 GHz cores, 8
GB RAM, 4 disks (= 4 TB?)
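The cluster figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: it assumes 1 TB per disk (the slide itself marks the 4 TB figure with a question mark) and a replication factor of 3, and the function names are made up for this example.

```python
# Figures quoted on the slide; values marked "assumption" are not confirmed by it.
NODES_PER_RACK = 40
CLUSTER_SIZES = (1000, 4000)
DISKS_PER_NODE = 4
TB_PER_DISK = 1      # assumption: the slide only guesses "= 4 TB?" per node
REPLICATION = 3      # assumption: each block is stored three times


def racks_needed(nodes, nodes_per_rack=NODES_PER_RACK):
    """Number of racks required for the given node count, rounding up."""
    return -(-nodes // nodes_per_rack)


def raw_tb(nodes):
    """Total raw disk capacity of the cluster in TB."""
    return nodes * DISKS_PER_NODE * TB_PER_DISK


def usable_tb(nodes):
    """Capacity left for distinct data once every block is replicated."""
    return raw_tb(nodes) / REPLICATION


for n in CLUSTER_SIZES:
    print(f"{n} nodes: {racks_needed(n)} racks, "
          f"{raw_tb(n)} TB raw, {usable_tb(n):.0f} TB usable")
```

Under these assumptions a 4000-node cluster spans 100 racks and holds roughly 5 PB of distinct data, which is consistent with the multi-petabyte workloads mentioned earlier.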
Hadoop Distributed File System
Files split into 128MB blocks
Blocks replicated across several datanodes
(usually 3)
Single namenode stores metadata (file names,
block locations, etc)
Optimized for large files, sequential reads
Files are append-only
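To make the block and replica ideas concrete, here is a minimal sketch of how a file could be cut into 128 MB blocks and how each block could be assigned to three datanodes. The round-robin placement and the names `split_into_blocks` and `place_replicas` are assumptions made for illustration; real HDFS placement is rack-aware and is decided by the namenode.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as stated on the slide
REPLICATION = 3                 # "usually 3"


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block of a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Toy round-robin placement: block index -> distinct datanodes.

    Illustration only; it ignores racks, free space and node failures.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


sizes = split_into_blocks(300 * 1024 * 1024)           # a 300 MB file
placement = place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
print([s // (1024 * 1024) for s in sizes])             # block sizes in MB
print(placement)
```

The dictionary returned by `place_replicas` plays the role of the block-location metadata that the single namenode keeps.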
MapReduce
MapReduce is a software framework and programming model
for processing huge amounts of data.
A MapReduce program works in two phases:
Map: tasks deal with splitting and mapping the data.
Reduce: tasks shuffle and reduce the data.
MapReduce programs are inherently parallel, which makes them well suited to
large-scale data analysis across multiple machines in
the cluster.
Within the Hadoop framework, MapReduce is the programming model
used to process big data stored in the
Hadoop Distributed File System (HDFS).
It is a core component, integral to the functioning of the Hadoop
framework.
MapReduce
Goal: count the number of books in the
library.
Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this
part goes)
Reduce:
We all get together and add up our
individual counts.
MapReduce In A Nutshell
A Word Count Example of MapReduce
Let us see how MapReduce works
by taking an example: a text file
called [Link] with the following
contents:
Deer, Bear, River, Car, Car, River, Deer, Car
and Bear
Now suppose we must perform a word count
on [Link] using MapReduce. We
will find the unique words and the
number of occurrences of each unique word.
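The word count on this sample text can be simulated on a single machine to show what each phase contributes. This is a sketch, not Hadoop code: the phase functions are plain Python, and treating "and" as a separator rather than a word is an assumption about the sample.

```python
from collections import defaultdict

TEXT = "Deer, Bear, River, Car, Car, River, Deer, Car and Bear"


def tokenize(text):
    """Split on commas and whitespace; 'and' is treated as a separator (assumption)."""
    return [w for w in text.replace(",", " ").split() if w.lower() != "and"]


def map_phase(words):
    """Emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in words]


def shuffle_phase(pairs):
    """Group the intermediate values by key."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_phase(grouped):
    """Sum the list of ones for each word."""
    return {word: sum(ones) for word, ones in grouped.items()}


word_counts = reduce_phase(shuffle_phase(map_phase(tokenize(TEXT))))
print(word_counts)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```

On a real cluster the map and reduce functions would run on many machines at once, with the framework performing the shuffle between them.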
Another Example
MapReduce Jobs
Input Splits: The input to a MapReduce job is
divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is
consumed by a single map task.
Mapping: This is the first phase in the
execution of a MapReduce program. In this phase,
the data in each split is passed to a mapping function
to produce output values.
Shuffling: This phase consumes the output of
the Mapping phase. Its task is to consolidate the
relevant records from the Mapping phase output.
Reducing: In this phase, output values from the
Shuffling phase are aggregated. This phase
combines values from the Shuffling phase and
returns a single output value.
MapReduce Tasks:
Hadoop divides these jobs into tasks.
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
The MapReduce Paradigm
Platform for reliable, scalable parallel
computing (programming)
Abstracts the issues of a distributed and parallel
environment away from the programmer.
Runs over distributed file systems:
– Google File System
– Hadoop Distributed File System (HDFS)
MapReduce Programming Model
Inspired by the map and reduce operations
commonly used in functional programming
languages such as Lisp.
Input: a set of key/value pairs
User supplies two functions:
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
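The two signatures above can be captured by a small single-process driver. The sketch below is an illustration of the model only, not the Hadoop API: `map_reduce` is a hypothetical helper, and the inverted-index functions are one example of a user-supplied map and reduce pair.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (k, v), group by k1, then reduce each group.

    records:   iterable of (k, v) pairs
    map_fn:    (k, v) -> iterable of (k1, v1) pairs
    reduce_fn: (k1, list of v1) -> v2
    Returns a dict {k1: v2}.
    """
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)          # the "shuffle": group values by key
    return {k1: reduce_fn(k1, values) for k1, values in groups.items()}


def index_map(doc_name, contents):
    """Emit (word, document name) for every word in the document."""
    return [(word, doc_name) for word in contents.split()]


def index_reduce(word, doc_names):
    """Collapse the documents containing a word into a sorted, duplicate-free list."""
    return sorted(set(doc_names))


documents = [("doc1", "deer bear deer"), ("doc2", "bear car")]
inverted_index = map_reduce(documents, index_map, index_reduce)
print(inverted_index)
```

Swapping in a different pair of functions changes the job without touching the driver, which is the point of the model: the user writes only map and reduce.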
MapReduce: The Map Step
MapReduce: The Reduce Step
Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The system groups the intermediate pairs by key, then calls
// reduce on the list of values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
MapReduce: Execution Overview
MapReduce Characteristics
Very large-scale data: petabytes to exabytes
Write-once, read-many data: allows parallelism
without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and
partition (out of the scope of this talk).
All map operations must complete before the reduce operation
starts.
Map and reduce operations are typically performed by the
same physical processor.
The numbers of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
The runtime takes care of splitting and moving data for
operations.
Special distributed file system. Example: Hadoop Distributed File System (HDFS)
Classes of problems “MapReducable”
Benchmark for comparing: Jim Gray’s challenge
on data-intensive computing. Ex: “Sort”
Google uses it (we think) for wordcount,
adwords, pagerank, indexing data.
Simple algorithms such as grep, text-indexing,
reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations:
demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating
extraterrestrial objects.
Expected to play a critical role in the semantic web
Hadoop Cluster Architecture
MapReduce vs YARN
When to use Hadoop?
Generally, whenever “standard tools”
no longer work because of sheer data
size
(rule of thumb: if your data fits on a
regular hard drive, you're better off sticking
to Python/SQL/Bash/etc.!)
Aggregation across large data sets: use
the power of Reducers!
Large-scale ETL operations (extract,
transform, load)
What is Apache Pig?
Making Hadoop easy.
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a
Hadoop cluster.
Pig is a Hadoop subproject.
Apache Incubator: October ’07 to October ’08
Higher-level languages:
– Increase programmer productivity
– Decrease duplication of effort
– Open the system to more users
Pig insulates you against Hadoop complexity:
– Hadoop version upgrades
– JobConf configuration tuning
– Job chains
An Example Problem
• Data
– User records
– Pages served
• Question: the 5 pages most visited by users aged 18-25.
In Map Reduce
import [Link];
import [Link];
import [Link];
import [Link];

import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import org.ap[Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link]ontrol;
import [Link];

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = [Link]();
            int firstComma = [Link](',');
            String key = [Link]string(0, firstComma);
            String value = [Link](firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            [Link](outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = [Link]();
            int firstComma = [Link](',');
            String value = [Link](firstComma + 1);
            int age = [Link](value);
            if (age < 18 || age > 25) return;
            String key = [Link](0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            [Link](outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while ([Link]()) {
                Text t = [Link]();
                String value = [Link]String();
                if ([Link](0) == '1')
                    [Link]([Link](1));
                else [Link]([Link](1));
                [Link]("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    [Link](null, new Text(outval));
                    [Link]("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = [Link]();
            int firstComma = [Link](',');
            int secondComma = [Link](',', firstComma);
            String key = [Link](firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            [Link](outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while ([Link]()) {
                sum += [Link]().get();
                [Link]("OK");
            }

            [Link](key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            [Link]((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && [Link]()) {
                [Link](key, [Link]());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf([Link]);
        [Link]tJobName("Load Pages");
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](lp, new Path("/user/gates/pages"));
        [Link](lp,
            new Path("/user/gates/tmp/indexed_pages"));
        [Link](0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf([Link]);
        lfu.setJobName("Load and Filter Users");
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link]InputPath(lfu, new Path("/user/gates/users"));
        [Link](lfu,
            new Path("/user/gates/tmp/filtered_users"));
        [Link](0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf([Link]);
        [Link]("Join Users and Pages");
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](IdentityMap[Link]);
        [Link]([Link]);
        [Link](join, new Path("/user/gates/tmp/indexed_pages"));
        [Link](join, new Path("/user/gates/tmp/filtered_users"));
        [Link]tOutputPath(join, new Path("/user/gates/tmp/joined"));
        [Link](50);
        Job joinJob = new Job(join);
        [Link](loadPages);
        [Link](loadUsers);

        JobConf group = new JobConf(MRE[Link]);
        [Link]("Group URLs");
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](SequenceFi[Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](group, new Path("/user/gates/tmp/joined"));
        [Link](group, new Path("/user/gates/tmp/grouped"));
        [Link](50);
        Job groupJob = new Job(group);
        [Link](joinJob);

        JobConf top100 = new JobConf([Link]);
        [Link]("Top 100 sites");
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](SequenceFileOutputF[Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link]([Link]);
        [Link](top100, new Path("/user/gates/tmp/grouped"));
        [Link](top100, new Path("/user/gates/top100sitesforusers18to25"));
        [Link](1);
        Job limit = new Job(top100);
        [Link](groupJob);

        JobControl jc = new JobControl("Find 100 top sites for users 18 to 25");
        [Link](loadPages);
        [Link](loadUsers);
        [Link](joinJob);
        [Link](groupJob);
        [Link](limit);
        [Link]();
    }
}
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by
        age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
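For readers more familiar with Python than Pig Latin, the same data flow can be sketched on small in-memory lists. The sample records and the function `top_pages` are invented for illustration, and the sketch assumes each user name appears once in the user records; it shows the logic only and says nothing about how Pig executes it.

```python
from collections import Counter

# Invented sample data: (name, age) and (user, url) records.
users = [("amy", 22), ("bob", 31), ("cat", 18), ("dan", 25)]
pages = [("amy", "a.example"), ("amy", "b.example"), ("bob", "a.example"),
         ("cat", "a.example"), ("dan", "c.example"), ("cat", "b.example")]


def top_pages(users, pages, n=5, min_age=18, max_age=25):
    """Return the n most visited urls among users aged min_age..max_age."""
    # filter Users by age >= 18 and age <= 25
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # join on name/user, group by url, COUNT
    clicks = Counter(url for user, url in pages if user in eligible)
    # order by clicks desc (ties broken by url), then limit n
    return sorted(clicks.items(), key=lambda item: (-item[1], item[0]))[:n]


print(top_pages(users, pages))
```

Each Python line corresponds to one or two Pig Latin statements, which is roughly the level of abstraction Pig offers over hand-written MapReduce jobs.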
How it Works
A Pig Latin script is translated into a set of
operators, which are placed in one or more
MapReduce jobs and executed.
A = load 'myfile';
B = filter A by $1 > 0;
C = group B by $0;
D = foreach C generate
    group, COUNT(B) as cnt;
E = filter D by cnt > 5;
dump E;
Operators in the resulting job:
Filter $1 > 0 (before the group)
Combiner: COUNT(B)
Reducer: SUM(COUNT(B)), then Filter cnt > 5
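The split between combiner and reducer can be illustrated with a small simulation: each input split is filtered and pre-counted locally, and the reducer only sums the partial counts before applying the final filter. The function names and sample rows below are assumptions for illustration, not Pig internals.

```python
from collections import Counter


def map_and_combine(split):
    """Map side: keep rows with $1 > 0, then pre-aggregate COUNT per key ($0)."""
    return Counter(row[0] for row in split if row[1] > 0)


def reduce_and_filter(partial_counts, min_cnt=5):
    """Reduce side: SUM the partial counts per key, then keep cnt > min_cnt."""
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)
    return {key: cnt for key, cnt in totals.items() if cnt > min_cnt}


# Two invented input splits of ($0, $1) rows.
split_1 = [("x", 1)] * 4 + [("y", 2)]
split_2 = [("x", 3)] * 3 + [("y", 0)] * 10

partials = [map_and_combine(s) for s in (split_1, split_2)]
result = reduce_and_filter(partials)
print(partials)
print(result)  # {'x': 7}
```

Because counting is associative, summing per-split counts gives the same answer as counting every row at the reducer, while far fewer records cross the network.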
Ease of Translation
Notice how naturally the components of
the job translate into Pig Latin.
Job step           Pig Latin
Load Users         Users = load …
Load Pages         Pages = load …
Filter by age      Fltrd = filter …
Join on name       Jnd = join …
Group on url       Grpd = group …
Count clicks       Smmd = … COUNT()…
Order by clicks    Srtd = order …
Take top 5         Top5 = limit …
Apache Pig Architecture
1. Parser: When a Pig Latin script is sent to
Pig, it is first handled by the parser.
The parser checks the
syntax of the script and performs other
miscellaneous checks. Its
output is a Directed Acyclic
Graph (DAG) that contains the Pig Latin
statements, together with the logical
operators, represented as nodes.
2. Optimizer: The logical plan (the DAG) produced by the
parser is
passed to a logical optimizer, which
carries out the logical
optimizations.
3. Compiler: The compiler takes the optimized logical
plan from the optimizer and
converts it into a series of
MapReduce jobs.
4. Execution Engine: After the logical plan is
converted to MapReduce jobs, these jobs
are submitted to Hadoop for execution.
Pig Vs Map Reduce
Faster development time
Data flow versus programming logic
Many standard data operations (e.g. join)
included
Manages all the details of connecting jobs
and data flow
Copes with Hadoop version change issues
Pig Commands
Command         Description
load            Read data from the file system.
store           Write data to the file system.
foreach         Apply an expression to each record and output one or more records.
filter          Apply a predicate and remove records that do not return true.
group/cogroup   Collect records with the same key from one or more inputs.
join            Join two or more inputs based on a key. Various join algorithms are available.
order           Sort records based on a key.
distinct        Remove duplicate records.
union           Merge two data sets.
split           Split data into 2 or more sets, based on filter conditions.
stream          Send all records through a user-provided executable.
sample          Read a random sample of the data.
limit           Limit the number of records.
Users Extending Pig: PigPy
Created by Marshall Weir at Zattoo
Uses Python to create Pig Latin scripts on
the fly
Enables looping
Enables branching based on job results
Submits Pig jobs from Python scripts
Caches intermediate calculations
Avoids variable name collisions in large
scripts
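The general idea of generating Pig Latin from Python can be sketched with plain string formatting. This is not the PigPy API: `top_pages_script` is a hypothetical helper, the script text mirrors the earlier top-5 example, and actually submitting the generated scripts to Pig is left out.

```python
def top_pages_script(min_age: int, max_age: int, n: int) -> str:
    """Build a Pig Latin script for the n most visited pages in an age range.

    Hypothetical helper for illustration; it only returns text.
    """
    return "\n".join([
        "Users = load 'users' as (name, age);",
        f"Fltrd = filter Users by age >= {min_age} and age <= {max_age};",
        "Pages = load 'pages' as (user, url);",
        "Jnd = join Fltrd by name, Pages by user;",
        "Grpd = group Jnd by url;",
        "Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;",
        "Srtd = order Smmd by clicks desc;",
        f"Top = limit Srtd {n};",
        f"store Top into 'top{n}sites_{min_age}_{max_age}';",
    ])


# Looping, which Pig Latin itself lacks: one script per age bracket.
scripts = [top_pages_script(lo, hi, 5) for lo, hi in [(18, 25), (26, 35)]]
print(scripts[0])
```

Generating the script text in a host language is what makes loops and branching possible, since each iteration can emit a differently parameterised job.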
Thank You