Apache Spark and Big Data Overview

Apache Spark is an open-source system for fast processing of large data in a distributed manner, supporting various tasks like batch processing, real-time streaming, and machine learning. It outperforms the older MapReduce method, which has limitations such as difficulty in writing and maintaining code. The document also covers data warehousing, NoSQL databases, NewSQL, big data management, machine learning, and graph analytics, highlighting their applications and advantages.

Uploaded by

lagharimujahid07


Apache Spark

 Apache Spark is an open-source system used for processing large data fast.
 It works in a distributed way, meaning it can run on many computers at once.
 Spark keeps data in memory (RAM) where possible, so it is much faster than disk-based processing.
 It supports many types of tasks like:
o Batch processing
o Real-time streaming
o Machine Learning
o Graph-based processing
Limitations of MapReduce
MapReduce is an older method used in Hadoop, but it has some problems:
 Not good for interactive tasks.
 Poorly suited to iterative (repeated) tasks, because each step writes its results to disk and reads them back.
 Writing code is hard and long.
 Maintaining the code is also difficult.
Key Features of Spark
 Speed: Up to 100x faster than MapReduce for some in-memory workloads.
 Easy to use: Works with Java, Python, Scala, R.
 Advanced tools: Supports SQL, streaming, ML, and graph processing.
Different Components of Spark
1. Spark Streaming
o Used for real-time data.
2. Spark SQL
o Lets you run SQL queries on big data.
3. MLlib
o Machine learning library; includes algorithms such as clustering and classification.
4. GraphX
o Used for graph data and graph-related calculations
Languages used in Spark
• Scala (the language Spark itself is written in)
• Python (PySpark)
• Java (Spark Java API)
• R (SparkR)
• SQL (Spark SQL)

Data Warehouse
 Central place to store clean, structured data.
 Used for reports and business analysis.
 Fast but expensive.
Data Warehouse Applications
• Business intelligence (BI) and reporting.
• Historical data analysis.
• Structured data analysis with predefined queries.

What is NoSQL?
 It is a type of database that does not store data in fixed rows and columns like
traditional relational databases (RDBMS).
 NoSQL stands for “Not Only SQL”.
 These databases are designed to handle a large amount of unstructured or semi-
structured data (like documents, videos, images, logs, etc.).
 Examples: MongoDB, Cassandra, HBase
 It's cheaper, scalable, and good for fault-tolerance (it keeps working even if some parts
fail).
Types of NoSQL Databases
1. Column-Oriented Databases
o Store data in column families (wide-column model) rather than fixed relational rows.
o Example: Apache Cassandra, HBase.
2. Document Databases
o Store data as documents (usually in JSON or XML format).
o Example: MongoDB, CouchDB.
3. Key-Value Databases
o Simple storage using a key (name) and value (data).
o Example: Redis, DynamoDB.
4. Graph Databases
o Store data as nodes and relationships (edges).
o Example: Neo4j, Amazon Neptune.
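
To make the four types concrete, here is a minimal sketch in plain Python (not a real database client) of how one customer record might be shaped in each model. The record, the field names, the keys, and the `city_of` helper are all invented for illustration.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:42": json.dumps({"name": "Ada", "city": "Lahore"})}

# Document: a self-describing, nested record.
document = {
    "_id": 42,
    "name": "Ada",
    "city": "Lahore",
    "orders": [{"item": "laptop", "qty": 1}],
}

# Wide-column: a row key maps to column families, each with its own columns.
wide_column_row = {
    "row_key": "user:42",
    "profile": {"name": "Ada", "city": "Lahore"},
    "orders": {"order:1001": "laptop"},
}

# Graph: nodes plus labeled edges between them.
nodes = {
    42: {"label": "Person", "name": "Ada"},
    7: {"label": "City", "name": "Lahore"},
}
edges = [(42, "LIVES_IN", 7)]


def city_of(person_id):
    """Follow the LIVES_IN edge from a person node to the city's name."""
    for src, relation, dst in edges:
        if src == person_id and relation == "LIVES_IN":
            return nodes[dst]["name"]
    return None


print(city_of(42))
```

The same fact (where the customer lives) is a field inside a value, a field inside a document, a column inside a family, or an edge between two nodes, depending on the model.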

NewSQL
 NoSQL has drawbacks: many systems offer limited indexing, can be slow for complex queries, and don’t always guarantee ACID transactions.
 Solution: NewSQL combines the benefits of RDBMS and NoSQL.
 Distributed like NoSQL.
 Supports SQL queries and ACID transactions like RDBMS.
 Good for online transactions with high availability and performance.
Searching and Indexing in Big Data
 This is the process of quickly finding data in huge datasets.
 Normal search methods don’t work well for massive, distributed data.
 Big data systems use special tools like Lucene or Splunk for real-time, large-scale searching and indexing.
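
Search libraries such as Lucene are built around an inverted index, which maps each word to the documents that contain it. The sketch below is a toy version in plain Python, not Lucene's API; the sample documents and the `build_index` and `search` helper names are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "spark processes big data in memory",
    2: "hadoop stores big data on disk",
    3: "graph databases store nodes and edges",
}
index = build_index(docs)
print(search(index, "big data"))
```

A lookup touches only the entries for the query words instead of scanning every document, which is why this structure scales to large collections.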
Data Storage in Big Data:
1. RDBMS: Works well for structured data but not for very large or fast-changing datasets.
2. MapReduce + Distributed File Systems (e.g., HDFS, GFS):
 Can process data in parallel across multiple machines.
 Works well for semi-structured/unstructured data.
3. NoSQL/NewSQL Storage:
 Flexible data structure (no fixed format).
 Can grow easily by adding more machines.
Big Data Management and Storage
Managing and storing very large and fast-growing data.
Challenges include:
 Storing heterogeneous data.
 Handling real-time data efficiently.

Big Data Analysis and Management

 Big Data Analysis and Management is the process of handling very large, fast-growing,
and varied types of data.
 Traditional databases and analytics tools cannot process such huge datasets.
 New technologies like MapReduce and Hadoop were created to store, manage, and process big data in a distributed way.
Big Data with Data Mining
 Data mining is the process of extracting useful patterns from data.
 It involves steps like cleaning data, combining data from different sources, selecting
what is needed, transforming it, mining it for patterns, and showing results.

Machine Learning (ML)

Machine learning (ML) is a field of artificial intelligence (AI) that focuses on enabling computers to learn from data and experience instead of being explicitly programmed for every case.
In simple words, ML lets a system improve at a task by learning from data, loosely as humans learn from experience.
It finds patterns and makes predictions or decisions.
Why Learn ML
We should learn ML because:
 Used in healthcare, finance, e-commerce, etc.
 Helps in predictions and decision-making.
 Handles huge data that humans can’t process.
 Useful for fraud detection and recommendations.
When to Use ML
 No human expertise exists.
 Need custom models.
 Data is too large to handle manually.
Applications of ML
• Facial Recognition
• Self-driving cars
• Virtual assistants
• Traffic Predictions
• Speech Recognition
• Online Fraud Detection
• Email Spam Filtering
• Product Recommendations
Common Terms
 Model: A mathematical representation of a real-world process, learned from data.
 Feature: A measurable property of the data (e.g., house size).
 Feature Vector: The list of feature values that describes one example.
 Training: Teaching the model using known data.
 Prediction: Model’s guess for new data.
 Target: The output value the model should predict.
Seven steps of Machine Learning
 Gathering Data
 Preparing that data
 Choosing a model
 Training
 Evaluation
 Hyperparameter tuning
 Prediction

Types of ML
Supervised Learning
A learning method where the model is trained using labeled data (input + correct output).
Goal: Learn the relationship between inputs (X) and outputs (Y) so it can predict Y for new X.
Example: Predicting house prices from data about house size and location.
Steps:
1. Prepare data.
2. Choose a suitable algorithm.
3. Fit the model.
4. Validate the model for accuracy.
Advantages:
 Works well with large datasets.
 Can make accurate predictions.
Types:
 Classification: Predict categories (e.g., spam/not spam).
 Regression: Predict continuous values (e.g., predicting sales).
Common Algorithms:
 Linear Regression: Predicts numeric values.
 Support Vector Machine (SVM): Separates data into classes.
 Naive Bayes: Uses probability to classify data.
 Decision Tree: Uses rules in a tree structure to make decisions.
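
As a rough illustration of the supervised steps above (prepare data, choose an algorithm, fit, validate), here is a minimal sketch of linear regression with one feature, using the house-price example. The data values are made up, and `fit_line` and `predict` are illustrative helper names, not a library API.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up labeled data: house size (X) and price (Y), where price = 3 * size.
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

model = fit_line(sizes, prices)   # training
print(predict(model, 90))         # prediction for an unseen house size
```

Here the feature is the house size, the target is the price, and the fitted slope and intercept together are the model.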
Unsupervised Learning
A learning method where the model is trained using unlabeled data (only inputs, no correct
outputs).
The system finds hidden patterns or groups in the data.
Common Algorithms in Unsupervised Learning
 Clustering – Groups similar data points together.
o K-Means: Divides data into k groups.
o Hierarchical: Builds clusters step by step (agglomerative or divisive).
o Bayesian: Uses probability distributions for grouping.
 Dimensionality Reduction – Reduces the number of features while keeping important information.
o Example methods: PCA
Advantages:
 No need for labeled data.
 Finds hidden patterns in big datasets.
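
The K-Means idea listed under clustering can be sketched in a few lines. This is a simplified one-dimensional version with hand-picked starting centroids so the run is reproducible; real implementations work on feature vectors and usually pick starting points randomly. The data and the `k_means` function name are assumptions for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

No labels are given; the two groups emerge only from how close the points are to each other.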

Reinforcement Learning
A trial-and-error learning method where an agent interacts with an environment and learns
from rewards or penalties.
Used in:
 Game playing.
 Robotics.
 Customer behavior prediction.
Process:
1. Agent takes an action.
2. Receives feedback (reward/penalty).
3. Updates strategy for better future results.
Advantages:
 Learns automatically from past experiences.
 Handles large and complex datasets.
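
The three-step process above (act, receive feedback, update) can be sketched as a tiny agent choosing between two actions. This is a simplification: the environment is a made-up table of fixed rewards, and exploration follows a fixed schedule rather than the random exploration used in practice, so that the run is reproducible. `run_agent` and its parameters are illustrative names.

```python
def run_agent(rewards, steps=100, explore_every=10, learning_rate=0.5):
    """Learn a value estimate for each action from reward feedback.

    rewards: dict mapping each action to the reward the (hypothetical)
    environment returns for it.
    """
    actions = list(rewards)
    values = {a: 0.0 for a in actions}
    for step in range(steps):
        # 1. Agent takes an action.
        if step % explore_every == 0:
            # Explore: cycle through the actions on a fixed schedule.
            action = actions[(step // explore_every) % len(actions)]
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(actions, key=lambda a: values[a])
        # 2. Receives feedback (reward or penalty).
        reward = rewards[action]
        # 3. Updates its estimate toward the observed reward.
        values[action] += learning_rate * (reward - values[action])
    return values


values = run_agent({"left": -1.0, "right": 1.0})
best_action = max(values, key=values.get)
print(best_action, values)
```

After a few penalties for "left" and rewards for "right", the agent's estimates steer it toward the rewarded action.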

Graph Analytics for Big Data

It is a way to study and understand the connections between different things (data) quickly.
It focuses on connections (edges) between data points (nodes) to discover patterns faster than
traditional databases.

Graph Database
 Stores data as nodes (entities) and edges (relationships).
 Similar to ER diagrams, but directly uses connected structures instead of tables and
joins.
 Example: In a company graph, “Person” and “Department” are nodes, and “works in” is
the edge.
Nodes and Edges
 Nodes: Entities (e.g., person, city).
 Edges: Connections between entities (e.g., friend of, located in).

Traditional vs Graph Database

 Traditional DB: Stores data in tables; joins are used for relationships.
 Graph DB: Stores data as nodes and edges, which makes finding relationships much
faster and easier.

What is Graph Analytics?

 Analyzing the data stored in a graph database.
 Uses edges and nodes for fast relationship-based queries.
Types of Graph Analytics
1. Node Strength Analysis – Measures importance of a node in the network.
2. Edge Strength Analysis – Checks how strong or weak the connection between two
nodes is.
3. Clustering – Groups nodes with similar characteristics.
4. Path Analysis – Finds shortest/widest path between nodes.
5. Predictive Graph Analysis – Predicts future connections or nodes using past graph data.
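
Two of these analyses are easy to sketch on a small graph. The example below uses a node's degree (its number of connections) as a very simple stand-in for node strength, and breadth-first search for the shortest path in an unweighted graph. The friendship graph and the function names are made up for illustration.

```python
from collections import deque

# Hypothetical undirected friendship graph stored as an adjacency list.
graph = {
    "Ali": ["Sara", "Omar"],
    "Sara": ["Ali", "Omar", "Zara"],
    "Omar": ["Ali", "Sara"],
    "Zara": ["Sara", "Bilal"],
    "Bilal": ["Zara"],
}


def degree(graph):
    """Node strength (simplest form): how many edges touch each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search for a path with the fewest edges."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no route between the two nodes


print(degree(graph))
print(shortest_path(graph, "Ali", "Bilal"))
```

Both functions only follow edges from node to node, with no table joins involved.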
Real-world Applications
 Social Network Analysis – Find influencers, friends, or talent on LinkedIn/Facebook.
 Recommendation Engines – “You may know/like” suggestions in social media or
streaming apps.
 Compliance – Detect unauthorized transactions or banned entities.
 Fraud Detection – Identify fake accounts, hacked transactions.
 Operations Optimization – Find shortest delivery or travel routes.
 National Security & Defense – Find and stop criminal or terrorist networks.

Graphical Models & Bayesian Networks

 Graphical Models: Combine probability theory and graph theory to handle uncertainty
and complexity in systems.
 Bayesian Networks (Directed graphs):
o Show cause-and-effect relationships between variables.
o Use Conditional Probability Tables (CPT) to give probabilities.
o Example: "Grass is wet" can be caused by "Rain" or "Sprinkler On".
 Markov Random Fields (undirected graphs): Show links without a direction.
Key Ideas in Bayesian Networks
 Conditional Independence: Given its parent nodes, a node is independent of its non-descendants.
 Explaining Away: If two causes can explain an observed effect, learning that one cause is true makes the other less likely.
 Reasoning Types:
o Bottom-up: From effect to cause (diagnosis).
o Top-down: From cause to effect (prediction).
 Causality: Sometimes the network can show whether one thing causes another, rather than the two merely being related.
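
Explaining away and bottom-up reasoning can be checked numerically on the grass example. The sketch below builds the small network Rain → Grass wet ← Sprinkler and answers queries by summing over all possible worlds. All probabilities in it are assumed values chosen for illustration, not figures from these notes, and `prob_rain` is a hypothetical helper.

```python
from itertools import product

# Assumed prior probabilities for the two independent causes.
P_RAIN = 0.2
P_SPRINKLER = 0.3

# Assumed Conditional Probability Table: P(wet | rain, sprinkler).
P_WET = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete assignment of the three variables."""
    p = (P_RAIN if rain else 1 - P_RAIN)
    p *= (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def prob_rain(**evidence):
    """P(rain is true | evidence), summing the joint over consistent worlds."""
    numerator = denominator = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(rain, sprinkler, wet)
            denominator += p
            if rain:
                numerator += p
    return numerator / denominator


print(prob_rain())                          # prior belief in rain
print(prob_rain(wet=True))                  # diagnosis: effect to cause
print(prob_rain(wet=True, sprinkler=True))  # explaining away
```

Seeing wet grass raises the belief in rain; additionally learning that the sprinkler was on lowers it again, because the sprinkler already explains the wet grass.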

Extra Topics (Optional)

CAP Theorem
Definition:
The CAP Theorem says that in any distributed database system, it is impossible to guarantee all
three properties at the same time; when a network partition occurs, the system must give up either consistency or availability:
1. Consistency (C) – All nodes see the same data at the same time.
2. Availability (A) – Every request gets a response, even if some nodes fail.
3. Partition Tolerance (P) – The system works even if there is a network failure that splits
communication between nodes.
Fault Tolerance:
It is the ability of a system to keep working even when some parts fail.
In simple words: If something goes wrong (like a server crashes), the system doesn’t stop.

What is Hadoop?
 It is an open-source platform by Apache that stores and processes big data.
 It splits large data into parts and stores them across many computers.
 It uses MapReduce to process data in parallel (faster).
 Best for: Handling large volumes of structured and unstructured data.
Used for:
o Data mining
o Machine learning
o Analytics

Hadoop Architecture
HDFS (Hadoop Distributed File System)
 It's used to store large amounts of data.
 It breaks files into blocks and stores them across different machines (nodes).
 There are two main components in HDFS:
o NameNode: Manages metadata (file names, locations).
o DataNode: Stores actual data blocks.
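
The split between metadata and data blocks can be illustrated with a small sketch. It assumes the usual HDFS defaults of 128 MB blocks and 3 replicas per block (neither is stated in these notes), and it places replicas round-robin, which is a simplification of the rack-aware placement real HDFS uses. `place_blocks` and the node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed: HDFS default block size in Hadoop 2 and later
REPLICATION = 3      # assumed: HDFS default replication factor


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return NameNode-style metadata: block number -> DataNodes holding it."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [datanodes[(block + r) % len(datanodes)]
                for r in range(replication)]
        for block in range(num_blocks)
    }


datanodes = ["dn1", "dn2", "dn3", "dn4"]
metadata = place_blocks(300, datanodes)  # a 300 MB file needs 3 blocks
for block, holders in metadata.items():
    print(block, holders)
```

The dictionary plays the role of the NameNode's metadata, while the listed nodes stand for the DataNodes that would hold the actual bytes. Because every block lives on several nodes, losing one node does not lose the file.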
MapReduce Engine:
 A programming model used to process data in parallel.
 In classic Hadoop (MapReduce 1), it has two main components:
o JobTracker: Accepts a job, splits it into Map and Reduce tasks, and assigns them to TaskTrackers.
o TaskTracker: Runs the assigned Map and Reduce tasks on a worker node and reports progress back to the JobTracker.
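
The Map and Reduce phases can be sketched with the classic word-count example. This is a single-process imitation in plain Python, not the Hadoop API: in a real cluster the map and reduce calls would run as separate tasks on different machines. The input lines and function names are made up.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a final count."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data needs big tools", "spark and hadoop process big data"]
counts = word_count(lines)
print(counts)
```

Each line can be mapped independently and each word can be reduced independently, which is what lets the work be spread across many machines.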
Real-Life Use of Hadoop
 Companies using Hadoop:
o Yahoo
o Facebook
o Amazon
o Netflix
Why Hadoop?
Because it is:
 Distributed
 Fault-tolerant
 Open format
 Flexible schema
 Easy to query data
