0% found this document useful (0 votes)
7 views41 pages

Big Data Analytics and Hadoop Ecosystem

The document outlines the Big Data Analytics Ecosystem, detailing components such as distributed servers, storage, programming models, and processing operations. It emphasizes the Hadoop ecosystem, including its core components like HDFS, YARN, and MapReduce, and explains the MapReduce programming model for processing large datasets. Additionally, it discusses when to use Hadoop and introduces Apache Pig as a higher-level data processing language to simplify Hadoop's complexity.

Uploaded by

MOHIT SOLANKI
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views41 pages

Big Data Analytics and Hadoop Ecosystem

The document outlines the Big Data Analytics Ecosystem, detailing components such as distributed servers, storage, programming models, and processing operations. It emphasizes the Hadoop ecosystem, including its core components like HDFS, YARN, and MapReduce, and explains the MapReduce programming model for processing large datasets. Additionally, it discusses when to use Hadoop and introduces Apache Pig as a higher-level data processing language to simplify Hadoop's complexity.

Uploaded by

MOHIT SOLANKI
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Big Data Analytics Ecosystem

Dr. Ashish Kumar

1
Big Data Analytics Ecosystem
• Where processing is hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)

• Where data is stored?


– Distributed Storage (e.g. Amazon S3)

• What is the programming model?


– Distributed Processing (e.g. MapReduce)

• How data is stored & indexed?


– High-performance schema-free databases (e.g. MongoDB)

• What operations are performed on data?


– Analytic / Semantic Processing
2
Big Data Analytics Ecosystem
 These is not any standard combination for the above question.
 It varies from project to project.
 Whereas go check this website for Big data tools and create
your own ecosystem depends upon your project:
[Link]

3
Hadoop Distributions / Ecosystem
It's a distributed framework for running
applications on large clusters of commodity
hardware which produces huge data and to
process it.
Apache Software Foundation Project
Java-based implementation of MapReduce
Open source
“Core” Hadoop Components :
 Hadoop Common (formerly Hadoop Core)
 Hadoop MapReduce
 Hadoop YARN (MapReduce 2.0)
 Hadoop Distributed File System (HDFS)

Concept : Moving computation is more


4
Why Hadoop?
Need to process Multi Petabyte Datasets
Expensive to build reliability in each
application.
Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not
constant.
Need common infrastructure
– Efficient, reliable, Open Source Apache License
The above goals are same as Condor, but
Workloads are IO bound and not CPU bound

5
The Hadoop Ecosystem

6
The Hadoop Ecosystem - New

7
The Hadoop Ecosystem
Following are the components that collectively form
a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm
libraries
 Solar, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
8
Typical Hadoop Cluster
Aggregation
switch

Rack
switch

 Typically in 2 level architecture


 40 nodes/rack, 1000-4000 nodes in cluster
 1 Gbps bandwidth within rack, 8 Gbps out of
rack
 Node specs (Yahoo terasort): 8 x 2GHz cores, 8
GB RAM, 4 disks (= 4 TB?)
9
Typical Hadoop Cluster

10
Hadoop Distributed File System
Files split into 128MB blocks
Blocks replicated across several datanodes
(usually 3)
Single namenode stores metadata (file names,
block locations, etc)
Optimized for large files, sequential reads
Files are append-only

HDFS

11
MapReduce
MapReduce is a software framework and programming model
used for processing huge amounts of data.
MapReduce program work in two phases:
 Map- tasks deal with splitting and mapping of data.
Reduce- tasks shuffle and reduce the data.
MapReduce programs are parallel in nature. Very useful for
performing large-scale data analysis using multiple machines in
the cluster.
MapReduce is a programming model or pattern within the
Hadoop framework that is used to access big data stored in the
Hadoop File System (HDFS).
It is a core component, integral to the functioning of the Hadoop
framework.
12
MapReduce
Goal: count the number of books in the
library.

Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this
part goes)

Reduce:
We all get together and add up our
individual counts.
13
MapReduce In A Nutshell

14
A Word Count Example of MapReduce
Let us understand, how a MapReduce works
by taking an example where I have a text file
called [Link] whose contents are as
follows:

Deer, Bear, River, Car, Car, River, Deer, Car


and Bear

Now, suppose, we must perform a word count


on the [Link] using MapReduce. So, we
will be finding the unique words and the
number of occurrences of those unique 15
A Word Count Example of MapReduce

16
Another Example

17
MapReduce Jobs
Input Splits: An input to a MapReduce job is
divided into fixed-size pieces called input
splits Input split is a chunk of the input that is
consumed by a single map
Mapping: This is the very first phase in the
execution of map-reduce program. In this phase
data in each split is passed to a mapping function
to produce output values.
Shuffling: This phase consumes the output of
Mapping phase. Its task is to consolidate the
relevant records from Mapping phase output.
Reducing: In this phase, output values from the
Shuffling phase are aggregated. This phase
combines values from Shuffling phase and18
MapReduce Tasks:
Hadoop divides these jobs into tasks.

1. Map tasks (Splits & Mapping)

2. Reduce tasks (Shuffling, Reducing)

19
The MapReduce Paradigm
Platform for reliable, scalable parallel
computing (Programing)
Abstracts issues of distributed and parallel
environment from programmer.
Runs over distributed file systems
 Google File System
 Hadoop Distributed File System (HDFS)

20
MapReduce Programming Model
Inspired from map and reduce operations
commonly used in functional programming
languages like Lisp.
Input: a set of key/value pairs
User supplies two functions:
map(k,v)  list(k1,v1)
reduce(k1, list(v1))  v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs

21
MapReduce: The Map Step

22
MapReduce: The Reduce Step

23
Pseudo-code
map(String input_key, String input_value):
// input_key: document name
// input_value: document contents
for each word w in input_value:
EmitIntermediate(w, "1");
// Group by step done by system on key of intermediate Emit
above, and // reduce called on list of values in each group.
reduce(String output_key, Iterator
intermediate_values):
// output_key: a word
// output_values: a list of counts
int result = 0;
for each v in intermediate_values:
result += ParseInt(v);
Emit(AsString(result));

24
MapReduce: Execution Overview

25
MapReduce Characteristics
 Very large scale data: peta, exa bytes
 Write once and read many data: allows for parallelism
without mutexes
 Map and Reduce are the main operations: simple code
 There are other supporting operations such as combine and
partition (out of the scope of this talk).
 All the map should be completed before reduce operation
starts.
 Map and reduce operations are typically performed by the
same physical processor.
 Number of map tasks and reduce tasks are configurable.
 Operations are provisioned near the data.
 Commodity hardware and storage.
 Runtime takes care of splitting and moving data for
operations.
 Special distributed file system. Example: Hadoop26
Classes of problems “MapReducable”
Benchmark for comparing: Jim Gray’s challenge
on data-intensive computing. Ex: “Sort”
Google uses it (we think) for wordcount,
adwords, pagerank, indexing data.
Simple algorithms such as grep, text-indexing,
reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations:
demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extra-
terrestrial objects.
Expected to play a critical role in semantic web
27
Hadoop Cluster Architecture

28
MapReduce vs YARN

29
When to use Hadoop?
Generally, always when “standard tools”
don’t work anymore because of sheer data
size
(rule of thumb: if your data fits on a
regular hard drive, your better off sticking
to Python/SQL/Bash/etc.!)

Aggregation across large data sets: use


the power of Reducers!

Large-scale ETL operations (extract,


transform, load) 30
What is Apache Pig?
Making Hadoop Easy.
Pig Latin, a high level data processing language.
An engine that executes Pig Latin locally or on a
Hadoop cluster.
Pig is Hadoop Subproject.
Apache Incubator: October’07-October’08
Higher level languages:
 Increase programmer productivity
 Decrease duplication of effort
 Open the system to more users
Pig insulates you against hadoop complexity
 Hadoop version upgrades
 JobConf configuration tuning
 Job chains
31
An Example Problem
• Data
– User records
– Pages served
• Question: the 5
pages most
visited by users
aged 18 - 25.

32
In Map Reduce
import
import
[Link];
[Link]; }
[Link]("OK"); [Link]([Link]);
[Link]([Link]);
import [Link]; [Link]([Link]);
import [Link]; // Do the cross product and collect the values [Link](lp, new
for (String s1 : first) { Path("/user/gates/pages"));
import [Link]; for (String s2 : second) { [Link](lp,
import [Link]; String outval = key + "," + s1 + "," + s2; new Path("/user/gates/tmp/indexed_pages"));
import [Link]; [Link](null, new Text(outval)); [Link](0);
import [Link]; [Link]("OK"); Job loadPages = new Job(lp);
import [Link]; }
import [Link]; } JobConf lfu = new JobConf([Link]);
import [Link]; } lfu.s
e tJobName("Load and Filter Users");
import [Link]; } [Link]([Link]);
import [Link]; public static class LoadJoined extends MapReduceBase [Link]([Link]);
import org.a
p [Link]; implements Mapper<Text, Text, Text, LongWritable> { [Link]([Link]);
import [Link]; [Link]([Link]);
import [Link]; public void map( [Link]
InputPath(lfu, new
import [Link]; Text k, Path("/user/gates/users"));
import [Link]; Text val, [Link](lfu,
import [Link]; OutputColle
ctor<Text, LongWritable> oc, new Path("/user/gates/tmp/filtered_users"));
import [Link]; Reporter reporter) throws IOException { [Link](0);
import [Link]; // Find the url Job loadUsers = new Job(lfu);
import [Link]; String line = [Link]();
import [Link]; int firstComma = [Link](','); JobConf join = new JobConf(
[Link]);
import [Link]
ontrol; int secondComma = [Link](',', first
Comma); [Link]("Join Users and Pages");
import [Link]; String key = [Link](firstComma, secondComma); [Link]([Link]);
// drop the rest of the record, I don't need it anymore, [Link]([Link]);
public class MRExample { // just pass a 1 for the combiner/reducer to sum instead. [Link]([Link]);
public static class LoadPages extends MapReduceBase Text outKey = new Text(key); [Link](IdentityMap
[Link]);
implements Mapper<LongWritable, Text, Text, Text> { [Link](outKey, new LongWritable(1L)); [Link]([Link]);
} [Link](join, new
public void map(LongWritable k, Text val, } Path("/user/gates/tmp/indexed_pages"));
OutputCollector<Text, Text> oc, public static class ReduceUrls extends MapReduceBase [Link](join, new
Reporter reporter) throws IOException { implements Reducer<Text, LongWritable, WritableComparable, Path("/user/gates/tmp/filtered_users"));
// Pull the key out Writable> { [Link]
tOutputPath(join, new
String line = [Link](); Path("/user/gates/tmp/joined"));
int firstComma = [Link](','); public void reduce( [Link](50);
String key = [Link]
string(0, firstComma); Text ke
y, Job joinJob = new Job(join);
String value = [Link](firstComma + 1); Iterator<LongWritable> iter, [Link](loadPages);
Text outKey = new Text(key); OutputCollector<WritableComparable, Writable> oc, [Link](loadUsers);
// Prepend an index to the value so we know which file Reporter reporter) throws IOException {
// it came from. // Add up all the values we see JobConf group = new JobConf(MRE
[Link]);
Text outVal = new Text("1
" + value); [Link]("Group URLs");
[Link](outKey, outVal); long sum = 0; [Link]([Link]);
} wh
ile ([Link]()) { [Link]([Link]);
} sum += [Link]().get(); [Link]([Link]);
public static class LoadAndFilterUsers extends MapReduceBase [Link]("OK"); [Link](SequenceFi
[Link]);
implements Mapper<LongWritable, Text, Text, Text> { } [Link]([Link]);
[Link]([Link]);
public void map(LongWritable k, Text val, [Link](key, new LongWritable(sum)); [Link]([Link]);
OutputCollector<Text, Text> oc, } [Link](group, new
Reporter reporter) throws IOException { } Path("/user/gates/tmp/joined"));
// Pull the key out public static class LoadClicks extends MapReduceBase [Link](group, new
String line = [Link](); i
mplements Mapper<WritableComparable, Writable, LongWritable, Path("/user/gates/tmp/grouped"));
int firstComma = [Link](','); Text> { [Link](50);
String value = [Link](
firstComma + 1); Job groupJob = new Job(group);
int age = [Link](value); public void map( [Link](joinJob);
if (age < 18 || age > 25) return; WritableComparable key,
String key = [Link](0, firstComma); Writable val, JobConf top100 = new JobConf([Link]);
Text outKey = new Text(key); OutputCollector<LongWritable, Text> oc, [Link]("Top 100 sites");
// Prepend an index to the value esoknow
w which file Reporter reporter)
throws IOException { [Link]([Link]);
// it came from. [Link]((LongWritable)val, (Text)key); [Link]([Link]);
Text outVal = new Text("2" + value); } [Link]([Link]);
[Link](outKey, outVal); } [Link](SequenceFileOutputF
[Link]);
} public static class LimitClicks extends MapReduceBase [Link]([Link]);
} implements Reducer<LongWritable, Text, LongWritable, Text> { [Link]([Link]);
public static class Join extends MapReduceBase [Link]([Link]);
implements Reducer<Text, Text, Text, Text> { int count = 0; [Link](top100, new
publicvoid reduce( Path("/user/gates/tmp/grouped"));
public void reduce(Text key, LongWritable key, [Link](top100, new
Iterator<Text> iter, Iterator<Text> iter, Path("/user/gates/top100sitesforusers18to25"));
OutputCollector<Text, Text> oc, OutputCollector<LongWritable, Text> oc, [Link](1);
Reporter reporter) throws IOException { Reporter reporter) throws IOException { Job limit = new Job(top100);
// For each value, figure out which file it's from and [Link](groupJob);
store it // Only output the first 100 records
// accordingly. while (count
< 100 && [Link]()) { JobControl jc = new JobControl("Find 100
top sites for users
List<String> first = new ArrayList<String>(); [Link](key, [Link]()); 18 to 25");
List<String> second = new ArrayList<String>(); count++; [Link](loadPages);
} [Link](loadUsers);
while ([Link]()) { } [Link](joinJob);
Text t = [Link](); } [Link](groupJob);
String value = [Link]
String(); public static void main(String[] args) throws IOException { [Link](limit);
if ([Link](0) == '1') JobConf lp = new JobConf([Link]); [Link]();
[Link]([Link](1)); [Link]
t JobName("Load Pages"); }
else [Link]([Link](1)); [Link]([Link]); }

33
In Pig Latin
Users = load ‘users’ as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into ‘top5sites’;

34
How it Works
Pig Latin script is translated to a set of
operators which are placed in one or more
MR jobs and executed.
Filter $1 > 0
A =
load ‘myfile’;
B =
filter A by $1 > 0;
C =
group B by $0;
D =
foreach C generate Combiner
group, COUNT(B) as cnt; COUNT(B)
E = filter D by cnt > 5;
dump E;
Reducer
SUM(COUNT(B))
Filter cnt > 5

35
Ease of Translation
Notice how naturally the components of
the job translate into Pig Latin.
Load Load
Users Pages
Users = load …
Filter by
age
Fltrd = filter …
Pages = load …
Join on Jnd = join …
name
Group on
Grpd = group …
url Smmd = … COUNT()…
Count Srtd = order …
clicks Top100 = limit …
Order by
clicks
Take top 5 36
Apache Pig Architecture
[Link]: When a Pig Latin script is sent to
Hadoop Pig, it is first handled by the parser.
The parser is responsible for checking the
syntax of the script, along with other
miscellaneous checks. Parser gives an
output in the form of a Directed Acyclic
Graph (DAG) that contains Pig Latin
statements, together with other logical
operators represented as nodes.
[Link]: After the output from the
parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer
is responsible for carrying out the logical
optimizations.
[Link]: The role of the compiler comes
in when the output from the optimizer is
received. The compiler compiles the logical
plan sent by the optimizing The logical plan
is then converted into a series of
MapReduce tasks or jobs.
[Link] Engine: After the logical plan is
37
converted to MapReduce jobs, these jobs
Pig Vs Map Reduce
Faster development time
Data flow versus programming logic
Many standard data operations (e.g. join)
included
Manages all the details of connecting jobs
and data flow
Copes with Hadoop version change issues

38
Pig Commands
load Read data from file system.
store Write data to file system.
foreach Apply expression to each record and output one or more
records.
filter Apply predicate and remove records that do not return true.
group/cogroup Collect records with the same key from one or more inputs.
join Join two or more inputs based on a key. Various join
algorithms available.
order Sort records based on a key.
distinct Remove duplicate records.
union Merge two data sets.
split Split data into 2 or more sets, based on filter conditions.
stream Send all records through a user provided executable.
sample Read a random sample of the data.
limit Limit the number of records.
39
Users Extending Pig: PigPy
Created by Mashall Weir at Zattoo
Uses Python to create Pig Latin scripts on
the fly
Enables looping
Branching based on job results
Submits Pig jobs from Python scripts
Cache intermediate calculations
Avoid variable name collisions in large
scripts

40
Thank You

41

You might also like