Apache Spark and Big Data Overview

Apache Spark is an open-source system for fast processing of large data in a distributed manner, supporting various tasks like batch processing, real-time streaming, and machine learning. It outperforms the older MapReduce method, which has limitations such as difficulty in writing and maintaining code. The document also covers data warehousing, NoSQL databases, NewSQL, big data management, machine learning, and graph analytics, highlighting their applications and advantages.

Uploaded by

lagharimujahid07
Copyright
© All Rights Reserved

Apache Spark

 Apache Spark is an open-source system for fast processing of large datasets.
 It works in a distributed way, meaning it can run on many computers at once.
 Spark keeps data in memory (RAM) while processing, so it avoids much of the disk I/O that slows down MapReduce.
 It supports many types of tasks like:
o Batch processing
o Real-time streaming
o Machine Learning
o Graph-based processing
Limitations of MapReduce
MapReduce is an older method used in Hadoop, but it has some problems:
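 Not suited to interactive tasks, because every job has high start-up latency.
 Inefficient for iterative (repeating) tasks, because each pass writes its intermediate results to disk.
 Writing code is long and hard.
 Maintaining the code is also difficult.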
Key Features of Spark
 Speed: Often cited as up to 100x faster than MapReduce for in-memory workloads; the actual gain depends on the job.
 Easy to use: Works with Java, Python, Scala, R.
 Advanced tools: Supports SQL, streaming, ML, and graph processing.
Different Components of Spark
1. Spark Streaming
o Used for real-time data.
2. Spark SQL
o Lets you run SQL queries on big data.
3. MLlib
o Spark's machine learning library; includes algorithms for clustering, classification, etc.
4. GraphX
o Used for graph data and graph-related calculations.
Languages used in Spark
• Scala (Spark's native API)
• Python (PySpark)
• Java
• R (SparkR)
• SQL (Spark SQL)

Data Warehouse
 Central place to store clean, structured data.
 Used for reports and business analysis.
 Fast to query, but expensive to build and maintain.
Data Warehouse Applications
• Business intelligence (BI) and reporting.
• Historical data analysis.
• Structured data analysis with predefined queries.
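The idea of structured data analysis with predefined queries can be sketched with Python's built-in sqlite3 module standing in for a warehouse. The `sales` table, its columns, and the figures are hypothetical, chosen only for illustration; a real warehouse would run the same kind of query on a dedicated engine.

```python
import sqlite3

# Hypothetical sales table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", 2023, 120.0),
        ("North", 2024, 150.0),
        ("South", 2023, 80.0),
        ("South", 2024, 95.0),
    ],
)

# A predefined reporting query: total sales per region, largest first.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY SUM(amount) DESC"
).fetchall()
conn.close()

print(report)
```

Because the data is clean and the schema is fixed, the report is a single aggregate query, which is the typical warehouse workload.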

What is NoSQL?
 It is a type of database that does not store data in fixed rows and columns like
traditional relational databases (RDBMS).
 NoSQL stands for “Not Only SQL”.
 These databases are designed to handle a large amount of unstructured or semi-
structured data (like documents, videos, images, logs, etc.).
 Examples: MongoDB, Cassandra, HBase
 It's cheaper, scalable, and good for fault-tolerance (it keeps working even if some parts
fail).
Types of NoSQL Databases
1. Column-Oriented Databases
o Store data grouped by column (or column family) instead of by row.
o Example: Apache Cassandra, HBase.
2. Document Databases
o Store data as documents (usually in JSON or XML format).
o Example: MongoDB, CouchDB.
3. Key-Value Databases
o Simple storage using a key (name) and value (data).
o Example: Redis, DynamoDB.
4. Graph Databases
o Store data as nodes and relationships (edges).
o Example: Neo4j, Amazon Neptune.
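To make the four types concrete, here is a rough sketch of how one record might look in each model, using plain Python dictionaries. The user "Ada", the city "Lahore", and all key names are hypothetical; real databases add their own storage formats and query languages on top of these shapes.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"user:1": '{"name": "Ada", "city": "Lahore"}'}

# Document: a nested, self-describing record (JSON-like).
document = {"_id": 1, "name": "Ada", "address": {"city": "Lahore"}}

# Column-oriented (wide-column): row key -> column family -> columns.
wide_column = {"user:1": {"profile": {"name": "Ada"}, "location": {"city": "Lahore"}}}

# Graph: nodes plus labelled edges between them.
graph = {
    "nodes": {1: {"label": "Person", "name": "Ada"}, 2: {"label": "City", "name": "Lahore"}},
    "edges": [(1, "LIVES_IN", 2)],
}

# The same question ("which city does user 1 live in?") in each model.
city_from_kv = json.loads(key_value["user:1"])["city"]
city_from_doc = document["address"]["city"]
city_from_columns = wide_column["user:1"]["location"]["city"]
city_from_graph = [
    graph["nodes"][dst]["name"]
    for src, relation, dst in graph["edges"]
    if src == 1 and relation == "LIVES_IN"
][0]

print(city_from_kv, city_from_doc, city_from_columns, city_from_graph)
```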

NewSQL
 NoSQL has drawbacks: many systems offer limited indexing and query support, can be slow for complex queries, and do not always follow ACID rules.
 Solution: NewSQL combines the benefits of RDBMS and NoSQL.
 Distributed like NoSQL.
 Supports SQL queries and ACID transactions like RDBMS.
 Good for online transactions with high availability and performance.
Searching and Indexing in Big Data
 This is the process of quickly finding data in huge datasets.
 Normal search methods don’t work well for massive, distributed data.
 Big data uses special tools like Lucene or Splunk for real-time, large-scale searching and indexing.
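Search tools such as Lucene are built around an inverted index, which maps each word to the documents that contain it. The following is a minimal single-machine sketch of that idea; the function names `build_index` and `search` and the sample documents are made up for illustration and are not any tool's real API.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "Spark processes big data",
    2: "Hadoop stores big data",
    3: "Spark supports streaming",
}
index = build_index(docs)
print(sorted(search(index, "big data")))
print(sorted(search(index, "spark")))
```

A lookup touches only the entries for the query words instead of scanning every document, which is why indexing matters at large scale.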
Data Storage in Big Data:
1. RDBMS: Works well for structured data but not for very large or fast-changing datasets.
2. MapReduce + Distributed File Systems (e.g., HDFS, GFS):
 Can process data in parallel across multiple machines.
 Works well for semi-structured/unstructured data.
3. NoSQL/NewSQL Storage:
 Flexible data structure (no fixed format).
 Can grow easily by adding more machines.
Big Data Management and Storage
Managing and storing very large and fast-growing data.
Challenges include:
 Storing heterogeneous data.
 Handling real-time data efficiently.

Big Data Analysis and Management


 Big Data Analysis and Management is the process of handling very large, fast-growing,
and varied types of data.
 Traditional databases and analytics tools cannot process such huge datasets.
 New technologies like MapReduce and Hadoop were created to store, manage, and process big data in a distributed way.
Big Data with Data Mining
 Data mining is the process of extracting useful patterns from data.
 It involves steps like cleaning data, combining data from different sources, selecting
what is needed, transforming it, mining it for patterns, and showing results.
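The steps listed above can be sketched on a toy dataset. The two "stores" and their purchase records are hypothetical, and counting the most frequent item is only a stand-in for a real mining algorithm.

```python
from collections import Counter

# Two hypothetical sources of purchase records: (customer, item).
store_a = [("ali", "milk"), ("sara", "bread"), (None, "milk")]
store_b = [("ali", "Bread "), ("sara", "bread"), ("omar", "")]


def clean(records):
    """Cleaning: drop records with a missing customer or item."""
    return [(customer, item) for customer, item in records if customer and item]


# Integration: combine the cleaned sources.
combined = clean(store_a) + clean(store_b)

# Selection: keep only the field needed for this question.
items = [item for _, item in combined]

# Transformation: normalize every item to a common form.
items = [item.strip().lower() for item in items]

# Mining: find the most frequently bought item.
top_item, top_count = Counter(items).most_common(1)[0]

# Presentation: show the result.
print(f"Most purchased item: {top_item} ({top_count} purchases)")
```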

Machine Learning (ML)


Machine learning (ML) is a field of artificial intelligence (AI) that focuses on enabling computers
to learn from experience using data instead of fixed programming.
In simple words, an ML system improves at a task by learning from data rather than by following hand-written rules.
It finds patterns and makes predictions or decisions.
Why Learn ML
We should learn ML because:
 Used in healthcare, finance, e-commerce, etc.
 Helps in predictions and decision-making.
 Handles huge data that humans can’t process.
 Useful for fraud detection and recommendations.
When to Use ML
 No human expertise exists.
 Need custom models.
 Data is too large to handle manually.
Applications of ML
• Facial Recognition
• Self-driving cars
• Virtual assistants
• Traffic Predictions
• Speech Recognition
• Online Fraud Detection
• Email Spam Filtering
• Product Recommendations
Common Terms
 Model: A mathematical representation of a real-world process.
 Feature: A measurable property of the data.
 Feature Vector: The ordered set of feature values that describes one example.
 Training: Teaching the model using known data.
 Prediction: Model’s guess for new data.
 Target: The output value the model should predict.
Seven steps of Machine Learning
 Gathering Data
 Preparing the data
 Choosing a model
 Training
 Evaluation
 Hyperparameter tuning
 Prediction

Types of ML
Supervised Learning
A learning method where the model is trained using labeled data (input + correct output).
Goal: Learn the relationship between inputs (X) and outputs (Y) so it can predict Y for new X.
Example: Predicting house prices from data about house size and location.
Steps:
1. Prepare data.
2. Choose a suitable algorithm.
3. Fit the model.
4. Validate the model for accuracy.
Advantages:
 Works well with large datasets.
 Can make accurate predictions.
Types:
 Classification: Predict categories (e.g., spam/not spam).
 Regression: Predict continuous values (e.g., predicting sales).
Common Algorithms:
 Linear Regression: Predicts numeric values.
 Support Vector Machine (SVM): Separates data into classes.
 Naive Bayes: Uses probability to classify data.
 Decision Tree: Uses rules in a tree structure to make decisions.
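As an illustration of supervised learning, the sketch below fits a one-feature linear regression with the ordinary least-squares formulas. The house sizes and prices are hypothetical, and `fit_line` is a name chosen for this example; practical work would normally use a library instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled data: house size (input X) and price (output Y).
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

# Training: learn the relationship between X and Y.
slope, intercept = fit_line(sizes, prices)

# Prediction: estimate Y for an X the model has not seen.
predicted = slope * 90 + intercept
print(slope, intercept, predicted)
```

The sizes are the features, the prices are the targets, and the fitted slope and intercept together form the model.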
Unsupervised Learning
A learning method where the model is trained using unlabeled data (only inputs, no correct
outputs).
The system finds hidden patterns or groups in the data.
Common Algorithms in Unsupervised Learning
 Clustering – Groups similar data points together.
o K-Means: Divides data into k groups.
o Hierarchical: Builds clusters step by step (agglomerative or divisive).
o Bayesian: Uses probability distributions for grouping.
 Dimensionality Reduction – Reduces the number of features while keeping important information.
o Example methods: PCA

Advantages:
 No need for labeled data.
 Finds hidden patterns in big datasets.
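K-Means can be sketched in a few lines for one-dimensional data: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The points and starting centroids below are hypothetical, and the fixed iteration count is a simplification (real implementations usually stop when the centroids no longer move).

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied; the two groups emerge only from the distances between the points.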

Reinforcement Learning
A trial-and-error learning method where an agent interacts with an environment and learns
from rewards or penalties.
Used in:
 Game playing.
 Robotics.
 Customer behavior prediction.
Process:
1. Agent takes an action.
2. Receives feedback (reward/penalty).
3. Updates strategy for better future results.
Advantages:
 Learns automatically from past experiences.
 Handles large and complex datasets.
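The action, feedback, and update loop can be illustrated with a very small reinforcement learning problem, a multi-armed bandit solved with an epsilon-greedy strategy. This is only one simple technique chosen for the sketch; the reward probabilities, `epsilon`, and the name `run_bandit` are assumptions made for illustration.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing among actions with unknown reward odds."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # learned value of each action
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        # 1. Agent takes an action: explore sometimes, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))
        else:
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # 2. Environment gives feedback: reward 1.0 or 0.0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # 3. Agent updates its strategy: running average of rewards per action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.8])
print(estimates)
print(counts)
```

Over many steps the agent should try the better-paying action far more often, because its estimated value rises with the rewards it receives.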

Graph Analytics for Big Data


It is a way to study and understand the connections between different things (data) quickly.
It focuses on connections (edges) between data points (nodes) to discover patterns faster than
traditional databases.

Graph Database
 Stores data as nodes (entities) and edges (relationships).
 Similar to ER diagrams, but directly uses connected structures instead of tables and
joins.
 Example: In a company graph, “Person” and “Department” are nodes, and “works in” is
the edge.
Nodes and Edges
 Nodes: Entities (e.g., person, city).
 Edges: Connections between entities (e.g., friend of, located in).

Traditional vs Graph Database


 Traditional DB: Stores data in tables; joins are used for relationships.
 Graph DB: Stores data as nodes and edges, which makes finding relationships much
faster and easier.

What is Graph Analytics?


 Analyzing the data stored in a graph database.
 Uses edges and nodes for fast relationship-based queries.
Types of Graph Analytics
1. Node Strength Analysis – Measures importance of a node in the network.
2. Edge Strength Analysis – Checks how strong or weak the connection between two
nodes is.
3. Clustering – Groups nodes with similar characteristics.
4. Path Analysis – Finds shortest/widest path between nodes.
5. Predictive Graph Analysis – Predicts future connections or nodes using past graph data.
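Two of these analyses are easy to sketch on a small graph held as an adjacency list: node strength (here simply the number of direct connections) and path analysis (shortest path by breadth-first search). The people in the graph are hypothetical, and a graph database would run such queries with its own engine rather than hand-written code.

```python
from collections import deque

# Hypothetical undirected social graph stored as an adjacency list.
graph = {
    "Ali": ["Sara", "Omar"],
    "Sara": ["Ali", "Omar", "Zain"],
    "Omar": ["Ali", "Sara"],
    "Zain": ["Sara", "Hina"],
    "Hina": ["Zain"],
}


def degree(graph, node):
    """Node strength in its simplest form: number of direct connections."""
    return len(graph[node])


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


most_connected = max(graph, key=lambda node: degree(graph, node))
path = shortest_path(graph, "Ali", "Hina")
print(most_connected)
print(path)
```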
Real-world Applications
 Social Network Analysis – Find influencers, friends, or talent on LinkedIn/Facebook.
 Recommendation Engines – “You may know/like” suggestions in social media or
streaming apps.
 Compliance – Detect unauthorized transactions or banned entities.
 Fraud Detection – Identify fake accounts, hacked transactions.
 Operations Optimization – Find shortest delivery or travel routes.
 National Security & Defense – Find and stop criminal or terrorist networks.

Graphical Models & Bayesian Networks


 Graphical Models: Combine probability theory and graph theory to handle uncertainty
and complexity in systems.
 Bayesian Networks (Directed graphs):
o Show cause-and-effect relationships between variables.
o Use Conditional Probability Tables (CPT) to give probabilities.
o Example: "Grass is wet" can be caused by "Rain" or "Sprinkler On."
 Markov Random Fields (undirected graphs): Show links without a direction.
Key Ideas in Bayesian Networks
 Conditional Independence: Once the values of its parent nodes are known, a node is independent of its non-descendants.
 Explaining Away: If two causes can explain an observed effect, learning that one cause occurred makes the other less likely.
 Reasoning Types:
o Bottom-up: From effect to cause (diagnosis).
o Top-down: From cause to effect (prediction).
 Causality: In some cases the network can indicate whether one variable causes another instead of just being related to it.
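Explaining away can be checked numerically on the wet-grass example. The sketch assumes a three-node network in which Rain and Sprinkler are independent causes of Wet grass; every probability in the tables is made up for illustration. Inference is done by brute-force enumeration of the joint distribution, which only works for tiny networks.

```python
from itertools import product

# Hypothetical probabilities (not taken from any real dataset).
P_RAIN = 0.2
P_SPRINKLER = 0.3
P_WET = {  # conditional probability table, keyed by (rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as a product of each node given its parents."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def prob_sprinkler(evidence):
    """P(sprinkler=True | evidence), summing the joint over all assignments."""
    matching = on = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(rain, sprinkler, wet)
            matching += p
            if sprinkler:
                on += p
    return on / matching


p_given_wet = prob_sprinkler({"wet": True})
p_given_wet_and_rain = prob_sprinkler({"wet": True, "rain": True})
print(round(p_given_wet, 3), round(p_given_wet_and_rain, 3))
```

With these numbers, seeing wet grass raises the belief that the sprinkler was on (bottom-up reasoning), and then learning that it rained lowers that belief again, because rain already explains the wet grass.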

Extra Topics (Optional)

CAP Theorem
Definition:
The CAP Theorem says that a distributed database system cannot guarantee all three of the following properties at the same time; when a network partition occurs, it must give up either consistency or availability:
1. Consistency (C) – All nodes see the same data at the same time.
2. Availability (A) – Every request gets a response, even if some nodes fail.
3. Partition Tolerance (P) – The system works even if there is a network failure that splits
communication between nodes.
Fault Tolerance:
It is the ability of a system to keep working even when some parts fail.
In simple words: If something goes wrong (like a server crashes), the system doesn’t stop.

What is Hadoop?
 It is an open-source platform by Apache that stores and processes big data.
 It splits large data into parts and stores them across many computers.
 It uses MapReduce to process data in parallel (faster).
 Best for: Handling large volumes of structured and unstructured data.
Used for:
o Data mining
o Machine learning
o Analytics

Hadoop Architecture
HDFS (Hadoop Distributed File System)
 It's used to store large amounts of data.
 It breaks files into blocks and stores them across different machines (nodes).
 There are two main components in HDFS:
o NameNode: Manages metadata (file names, locations).
o DataNode: Stores actual data blocks.
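The split between metadata and data can be sketched with two dictionaries: one playing the NameNode and one playing a set of DataNodes. The function names, node names, round-robin placement, and tiny block size are all assumptions for illustration. Real HDFS uses much larger blocks and also stores several copies of each block for fault tolerance, which this sketch leaves out.

```python
def store_file(name, data, block_size, datanodes, namenode):
    """Split data into fixed-size blocks and place them round-robin on DataNodes."""
    node_names = list(datanodes)
    locations = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{name}#blk{index}"
        node = node_names[index % len(node_names)]
        # The DataNode keeps the actual bytes of the block.
        datanodes[node][block_id] = data[offset:offset + block_size]
        locations.append((block_id, node))
    # The NameNode keeps only metadata: which blocks exist and where they live.
    namenode[name] = locations


def read_file(name, datanodes, namenode):
    """Ask the NameNode for block locations, then fetch blocks from DataNodes."""
    return b"".join(datanodes[node][block_id] for block_id, node in namenode[name])


namenode = {}
datanodes = {"node1": {}, "node2": {}, "node3": {}}
store_file("log.txt", b"abcdefghij", block_size=4, datanodes=datanodes, namenode=namenode)

print(namenode["log.txt"])
print(read_file("log.txt", datanodes, namenode))
```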
MapReduce Engine:
 A programming model used to process data in parallel.
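 It has two main components (in classic MapReduce, Hadoop 1.x):
o JobTracker: The master. It splits a job into smaller map and reduce tasks and assigns them to TaskTrackers.
o TaskTracker: Runs on each worker node. It executes the assigned map and reduce tasks and reports progress back to the JobTracker.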
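The programming model itself can be sketched as a word count running in a single Python process. The three functions mirror the map, shuffle, and reduce phases; their names and the sample lines are chosen for this example. In Hadoop the same phases run in parallel as tasks on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a final result."""
    return key, sum(values)


lines = ["big data big ideas", "data moves fast"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each line can be mapped independently and each key can be reduced independently, which is what allows the work to be spread across a cluster.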
Real-Life Use of Hadoop
 Companies using Hadoop:
o Yahoo
o Facebook
o Amazon
o Netflix
Why Hadoop?
Because it is:
 Distributed
 Fault-tolerant
 Open format
 Flexible schema
 Easy to query data
