Hadoop File Management Tasks: Separate Blocks Instead of A Flowing Answer Continuity Marks Smooth, Connected Narrative
Hadoop File Management Tasks: Separate Blocks Instead of A Flowing Answer Continuity Marks Smooth, Connected Narrative
Let me fix it into a smooth, connected narrative (while still keeping commands and
structure clear).
1. Aim
The aim of this experiment is to understand and implement file management operations in
Hadoop Distributed File System (HDFS), including creating directories, uploading files,
retrieving files, and deleting files using Hadoop commands.
2. Software Requirements
To perform this experiment, the following software setup is required. A system with Apache
Hadoop (version 2.x or 3.x) installed and configured is essential. Java (JDK 8 or above) must
be available since Hadoop is Java-based. A Linux/Ubuntu environment is preferred for
smooth execution, along with access to a terminal or command line interface.
3. Introduction to Hadoop
Apache Hadoop is an open-source framework designed to store and process large datasets in
a distributed environment. Instead of relying on a single system, Hadoop distributes data
across multiple machines, allowing efficient storage and parallel processing.
The Hadoop ecosystem mainly consists of HDFS for storage, YARN for resource
management, and MapReduce for data processing. When a file is stored in Hadoop, it is
divided into smaller blocks and distributed across different machines, ensuring both
scalability and fault tolerance.
This distributed nature makes Hadoop highly reliable, as even if one node fails, data can still
be retrieved from other nodes.
4. HDFS Architecture
HDFS follows a master-slave architecture. At the center of this system is the NameNode,
which acts as the master and manages all metadata such as file names, directory structure,
and block locations.
The actual data is stored in DataNodes, which are worker nodes responsible for storing and
retrieving data blocks. These DataNodes continuously communicate with the NameNode
through heartbeat signals to indicate that they are functioning properly.
When a file is stored, it is split into blocks and replicated across multiple DataNodes. This
replication ensures fault tolerance and data reliability.
These operations can be understood step by step in a logical flow, starting from creating
directories, then adding files, retrieving them, and finally deleting them.
Once the directory is created, files can be uploaded from the local system into HDFS using:
At this stage, Hadoop splits the file into blocks and distributes them across multiple
DataNodes. These blocks are also replicated to ensure data safety.
If the user wants to directly view the content of a file without downloading it, the command
is:
During retrieval, the NameNode identifies where the file blocks are stored, and the
DataNodes send those blocks to the client, where they are combined to reconstruct the
original file.
To delete a file:
When a file is deleted, its blocks are removed from DataNodes, and the NameNode updates
the metadata accordingly.
6. Result
Thus, the file management operations in Hadoop HDFS were successfully performed,
including creating directories, uploading files, retrieving files, listing contents, and deleting
files.
7. Conclusion
From this experiment, it can be concluded that Hadoop provides an efficient and reliable way
to manage large-scale data using HDFS. The system’s distributed architecture ensures high
availability, fault tolerance, and scalability. Using simple commands, users can perform
complex file operations seamlessly in a distributed environment.
Good catch — calling it “Terminologies” sounds like theory class, not a lab record. We’ll
convert that section into something more practical and exam-friendly like “Components
Involved in Execution” and keep the flow intact.
AIM:
To run a basic Word Count MapReduce program and understand the MapReduce paradigm in
Hadoop.
Software Requirements:
Windows / Linux Operating System
Cloudera (Hadoop Environment)
Eclipse IDE
Java (JDK 8 or above)
Definition of MapReduce:
MapReduce is the processing framework of Apache Hadoop used for handling large-scale
data in a distributed environment. It is based on two main operations: Map and Reduce.
The Map phase converts input data into key-value pairs, and the Reduce phase processes
these pairs to generate the final output. This approach enables parallel processing, making it
efficient for handling big data.
Stages of MapReduce:
The MapReduce process consists of multiple stages that ensure efficient distributed
processing.
1. Input Splitting
The input file stored in HDFS is divided into smaller chunks called input splits. Each split is
processed independently, enabling parallel execution.
2. Mapping Phase
In this phase, each split is processed by a Mapper. The Mapper reads the data and converts it
into key-value pairs.
4. Reducing Phase
In this phase, grouped data is processed to produce final results. The Reducer sums up values
for each key and generates the total count.
5. Output Phase
The final output is written back into HDFS as result files.
SOURCE CODE:
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
while ([Link]()) {
[Link]([Link]());
[Link](word, one);
sum += [Link]();
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link]([Link](true) ? 0 : 1);
Result:
The Word Count MapReduce program was successfully executed, and the frequency of each
word was generated and stored in HDFS.
Conclusion:
Thus, the MapReduce paradigm was successfully implemented using the Word Count
program. It demonstrates how Hadoop processes large datasets in a distributed and parallel
manner, ensuring efficiency and scalability.
year,temperature
Example:
2020,35
2020,40
2021,38
The program processes this data and outputs the maximum temperature for each year.
👉 Output format:
(year, temperature)
Example:
(2020, 35)
(2020, 40)
(2021, 38)
Step 4: Shuffle and Sort
Example:
👉 Output:
(2020, 40)
(2021, 38)
Algorithm
1. Start the program
2. Load dataset into HDFS
3. Split input into smaller chunks
4. Mapper emits (year, temperature) pairs
5. Shuffle groups data by year
6. Reducer finds maximum temperature
7. Store results in HDFS
8. End program
Output:
2020 40
2021 38
Applications
Weather data analysis
Climate monitoring systems
Time-series data processing
Data aggregation tasks
import [Link];
import [Link];
import [Link].*;
import [Link].*;
import [Link];
import [Link];
// Mapper
public static class MaxTempMapper
extends Mapper<Object, Text, Text, IntWritable> {
// Reducer
public static class MaxTempReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
// Driver
public static void main(String[] args) throws Exception {
[Link]([Link]);
// Set classes
[Link]([Link]);
[Link]([Link]);
// Combiner (optional)
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link](true) ? 0 : 1);
}
}
Exercise -4
import [Link];
import [Link];
import [Link];
import [Link].*;
import [Link].*;
import [Link];
import [Link];
// Mapper Class
public static class MyMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
int sum = 0;
int count = 0;
for (IntWritable val : values) {
sum += [Link]();
count++;
}
[Link](key,
new Text("Total Salary = " + sum +
", Average Salary = " + avg));
}
}
// Driver/Main Method
public static void main(String[] args) throws Exception {
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link](true) ? 0 : 1);
}
}
Input ([Link])
1 John Male 50000
2 Anita Female 60000
3 Ravi Male 45000
4 Priya Female 55000
5 Arun Male 70000
Compile
hadoop [Link] [Link]
jar cf [Link] SalaryGender*.class
Run
hadoop jar [Link] SalaryGender input output
Output
Female Total Salary = 115000, Average Salary = 57500.0
Male Total Salary = 165000, Average Salary = 55000.0
FINAL full Experiment 5 (Apache Pig) answer with everything included except errors,
comparison, and limitations, cleanly structured and continuous so you can directly write it
in exam/lab record.
AIM:
To implement Word Count using Apache Pig and understand data processing using Pig Latin
scripts.
Software Requirements:
Hadoop (Cloudera Environment)
Apache Pig
Linux Terminal
HDFS
2. Parser
Checks syntax of the script
Validates structure and correctness
3. Logical Plan
Converts script into logical operations
Example: LOAD → TOKENIZE → GROUP → COUNT
4. Optimizer
Optimizes execution plan
Reduces unnecessary data movement
Improves performance
5. Physical Plan
Converts logical plan into executable steps
6. Execution Engine
Converts Pig script into MapReduce jobs
Executes on Hadoop cluster using HDFS
HDFS Commands
1. Upload Input File
hdfs dfs -put [Link] /user/cloudera/
2. Verify File
hdfs dfs -ls /user/cloudera
Execution Steps
1. Start Pig using terminal:
pig
2. Enter Pig script commands
3. Execute script
4. Output is generated in HDFS
Output:
hello,2
world,1
hadoop,2
pig,1
Applications
Log file analysis
Text processing
Data transformation
ETL operations
Big data analytics
Result
The Word Count operation was successfully implemented using Apache Pig, and the
frequency of words was generated and stored in HDFS.
Conclusion
Apache Pig provides a simple and efficient way to process large datasets using high-level
scripts. It reduces the complexity of writing MapReduce programs and enables faster
development of big data applications.
🧪 Experiment No: 7
Title:
Write Pig Latin scripts to perform filter, group, join, aggregation, and sorting operations
on datasets.
🎯 Aim
To design and execute Pig Latin scripts that perform data selection, transformation,
aggregation, integration, and ordering using large datasets stored in HDFS, thereby
understanding how high-level data flow operations are executed in Hadoop using Apache
Pig.
Filtering is a data reduction operation used to extract only relevant records from a dataset
based on a condition.
It minimizes data size before further processing
Improves performance by avoiding unnecessary computation
Equivalent to WHERE clause in SQL
👉 In this experiment:
We filter employees where age > 30, meaning only senior employees are selected for further
analysis.
DUMP employees;
Output:
(1,John,30,101)
(2,Jane,28,102)
(3,Mark,35,101)
(4,Susan,25,103)
(5,Paul,40,102)
👉 Confirms successful loading and schema correctness.
DUMP counts;
Output:
(101,2)
(102,2)
(103,1)
👉 Shows how many employees exist in each department.
👉 Demonstrates Pig’s capability for analytical queries.
DUMP joined;
Output:
(1,John,30,101,101,HR)
(3,Mark,35,101,101,HR)
(2,Jane,28,102,102,Finance)
(5,Paul,40,102,102,Finance)
(4,Susan,25,103,103,Engineering)
👉 Combines employee details with department names.
👉 Transforms raw data into meaningful business information.
🔹 Step 7: Sorting Employees by Age (Descending)
Aim
To create internal and external Hive tables and load data into them using CSV and Sqoop.
Theory
🔹 Introduction to Hive
Hive is a data warehouse tool built on Hadoop used to process structured data using SQL-
like language called HiveQL.
🔹 Hive Architecture
1. Hive Clients:
Apache Hive supports different types of client applications for performing queries on
the Hive. These clients can be categorized into three types:
Thrift Clients: As Hive server is based on Apache Thrift, it can serve the request from
all those programming language that supports Thrift.
JDBC Clients: Hive allows Java applications to connect to it using the JDBC driver
which is defined in the class [Link].
ODBC Clients: The Hive ODBC Driver allows applications that support the ODBC
protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to
communicate with the Hive server.)
2. Hive Services:
Hive provides many services as shown in the image above. Let us have a look at each of
them:
Hive CLI (Command Line Interface): This is the default shell provided by the Hive
where you can execute your Hive queries and commands directly.
Apache Hive Web Interfaces: Apart from the command line interface, Hive also
provides a web based GUI for executing Hive queries and commands.
Hive Server: Hive server is built on Apache Thrift and therefore, is also referred as
Thrift Server that allows different clients to submit requests to Hive and retrieve the
final result.
Apache Hive Driver: It is responsible for receiving the queries submitted through the
CLI, the web UI, Thrift, ODBC or JDBC interfaces by a client. Then, the driver
passes the query to the compiler where parsing, type checking and semantic analysis
takes place with the help of schema present in the metastore. In the next step, an
optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of
map-reduce tasks and HDFS tasks. Finally, the execution engine executes these tasks
in the order of their dependencies, using Hadoop.
Metastore: You can think metastore as a central repository for storing all the Hive
metadata information. Hive metadata includes various types of information like
structure of tables and the partitions along with the column, column type, serializer
and deserializer which is required for Read/Write operation on the data present in
HDFS. The metastore comprises of two fundamental units:
o A service that provides metastore access to other Hive services.
o Disk storage for the metadata which is separate from HDFS storage.
🔹 Types of Tables
Internal Table
Managed by Hive
Data deleted when table is dropped
External Table
Only metadata managed
Data remains even if table is dropped
Procedure
🔹 Step 1: Create Internal Table
Query
CREATE TABLE employee (
id INT,
name STRING,
salary FLOAT,
dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Output
OK
Time taken: 0.532 seconds
Query
SHOW TABLES;
Output
employee
Query
DESCRIBE employee;
Output
id int
name string
salary float
dept string
Query
CREATE EXTERNAL TABLE employee_ext (
id INT,
name STRING,
salary FLOAT,
dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/employee';
Output
OK
Time taken: 0.421 seconds
Query
LOAD DATA LOCAL INPATH '/home/hive/[Link]'
INTO TABLE employee;
Output
Loading data to table [Link]
OK
Time taken: 1.24 seconds
Query
SELECT * FROM employee;
Output
1 Alice 50000 HR
2 Bob 60000 IT
3 John 55000 Finance
4 Sam 62000 IT
5 Meera 70000 HR
Aim
To perform filtering, updating, and join operations in Hive.
Theory
🔹 Hive Operations
Filtering
Uses WHERE clause to retrieve specific data
Updating
Hive does not support direct UPDATE
Uses INSERT OVERWRITE
🔹 Joins in Hive
Inner Join
Returns matching records
Left Outer Join
Returns all records from left table
Right Outer Join
Returns all records from right table
Full Outer Join
Returns all records from both tables
Procedure
Query
SELECT * FROM employee
WHERE salary > 55000;
Output
2 Bob 60000 IT
4 Sam 62000 IT
5 Meera 70000 HR
Query
SELECT * FROM employee
WHERE dept='IT' AND salary > 60000;
Output
4 Sam 62000 IT
Query
SELECT * FROM employee
WHERE name LIKE 'A%';
Output
1 Alice 50000 HR
🔹 Step 4: Update Data
Query
INSERT OVERWRITE TABLE employee
SELECT id, name, salary+1000, dept
FROM employee;
Output
Query executed successfully
Query
SELECT * FROM employee;
Output
1 Alice 51000 HR
2 Bob 61000 IT
3 John 56000 Finance
4 Sam 63000 IT
5 Meera 71000 HR
Query
CREATE TABLE department (
dept_id INT,
dept_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Output
OK
🔹 Step 7: Insert Data
Query
INSERT INTO department VALUES
(1,'HR'),
(2,'IT'),
(3,'Finance');
Output
3 rows inserted
Query
SELECT [Link], d.dept_name
FROM employee e
JOIN department d
ON [Link] = d.dept_id;
Output
Alice HR
Bob IT
John Finance
Query
SELECT [Link], d.dept_name
FROM employee e
LEFT OUTER JOIN department d
ON [Link] = d.dept_id;
Output
Alice HR
Bob IT
John Finance
Sam NULL
Meera NULL
Query
SELECT [Link], d.dept_name
FROM employee e
RIGHT OUTER JOIN department d
ON [Link] = d.dept_id;
Output
Alice HR
Bob IT
John Finance
Query
SELECT [Link], d.dept_name
FROM employee e
FULL OUTER JOIN department d
ON [Link] = d.dept_id;
Output
Alice HR
Bob IT
John Finance
Sam NULL
Meera NULL
NULL Marketing
Result
Thus, Hive operations like filtering, updating, and different join operations were successfully
performed.
**WEEK-10**
1. Driver
The driver is the master node that controls the execution of parallel operations on the cluster.
It is responsible for:
Converting user programs into tasks.
Scheduling tasks on executors.
Coordinating between executors.
Tracking the progress of tasks and handling failures.
The driver runs the main function of the application and maintains the SparkContext, which
is the primary entry point for Spark functionality.
2. SparkContext
SparkContext is the gateway to all Spark functionalities. It is responsible for:
Connecting to the cluster manager.
Acquiring resources on the cluster.
Creating RDDs and performing transformations and actions on them.
When you write a Spark application, the first thing you do is create a SparkContext object.
3. Cluster Manager
The cluster manager allocates resources across the cluster. Spark supports several cluster
managers:
Standalone Cluster Manager: A simple cluster manager is included with Spark that
makes it easy to set up a cluster.
Apache Mesos: A general cluster manager that can run Hadoop MapReduce and other
applications.
Hadoop YARN: The resource manager in Hadoop 2.0 that manages the resources of a
Hadoop cluster.
4. Executors
Executors are worker nodes in the cluster that run individual tasks. Each application has its
own executors that:
Execute code assigned to them by the driver.
Report the status of computation and storage to the driver.
Executors store data for RDDs that are cached by user programs
through [Link]() or [Link]().
5. Tasks
Tasks are the smallest unit of work in Spark. They are a part of a job, which is divided into
stages. Each stage is further divided into tasks that run on individual executors. Tasks are
assigned to executors based on data locality and resource availability.
## **AIM**
To study and implement the basic operations of **PySpark**, namely **SparkSession**,
**SparkContext**, and different **ways of creating RDDs (Resilient Distributed Datasets)**
for distributed data processing.
---
## **SOFTWARE REQUIREMENTS**
- **Operating System:** Windows / Linux / macOS
- **Tool:** Google Colab / Jupyter Notebook
- **Language:** Python
- **Technology:** Apache Spark using PySpark
- **Libraries Required:**
- pyspark
---
## **DESCRIPTION**
PySpark is widely used in **data engineering, machine learning, data analysis, and real-time
data processing**. It provides powerful components such as **SparkSession**,
**SparkContext**, **RDDs**, **DataFrames**, and **Spark SQL**.
In this experiment, the focus is on understanding the **basic building blocks of PySpark**,
which are essential for working with Spark applications.
---
**Purpose of SparkSession:**
- To initialize the Spark environment
- To provide access to DataFrame and SQL operations
- To simplify Spark application development
---
**Purpose of SparkContext:**
- To connect the application to Spark
- To control distributed execution
- To create and manipulate RDDs
---
---
## **Ways to Create RDD**
There are multiple ways to create an RDD in PySpark. Some of the common methods are:
**Example:**
A Python list like `[10, 20, 30, 40]` can be distributed as an RDD.
**Use:**
This method is useful for small datasets and learning purposes.
---
**Example:**
A text file stored in local storage or HDFS can be read line by line into an RDD.
**Use:**
This method is used when data is available in files and needs distributed processing.
---
**Use:**
This is useful when switching from structured processing to RDD-based transformations.
---
**Use:**
This is useful in special cases where data will be added later or for testing purposes.
---
**WEEK-11**
---
## **AIM**
To perform and understand various **PySpark transformations and DataFrame operations**
such as converting an **RDD into a DataFrame**, reading a **CSV file into a DataFrame**,
using **drop()** and **dropDuplicates()**, applying **orderBy()** and **sort()**, and
performing **map**, **flatMap**, **groupBy**, and **join** operations for distributed
data processing.
---
## **SOFTWARE REQUIREMENTS**
- **Operating System:** Windows / Linux / macOS
- **Tool:** Google Colab / Jupyter Notebook
- **Programming Language:** Python
- **Technology:** Apache Spark using PySpark
- **Library Required:** pyspark
---
## **DESCRIPTION**
Transformations in PySpark are operations that create a **new dataset from an existing
dataset**. These transformations are executed lazily, meaning they are not performed
immediately until an action is called.
RDDs can be converted into DataFrames to make data processing easier and more efficient.
DataFrames provide better optimization and are easier to use for structured data analysis.
**Purpose:**
- To convert unstructured or semi-structured distributed data into tabular form
- To perform SQL-like operations on distributed data
- To improve readability and performance
**Example Concept:**
A list of employee records stored as an RDD can be converted into a DataFrame with column
names such as **ID**, **Name**, and **Department**.
---
A CSV file is a common format used for storing structured data. By reading a CSV file into a
DataFrame, the data can be processed, filtered, sorted, and analyzed easily.
**Purpose:**
- To load structured data from external storage
- To perform distributed processing on file-based data
- To simplify analysis using DataFrame operations
**Example Concept:**
A student details CSV file containing columns like **Roll No**, **Name**, and **Marks**
can be loaded into a PySpark DataFrame.
---
#### **drop()**
The **drop()** function is used to remove one or more columns from a DataFrame.
**Purpose:**
- To eliminate unwanted columns
- To simplify the dataset
- To keep only relevant information
#### **dropDuplicates()**
The **dropDuplicates()** function is used to remove duplicate rows from a DataFrame.
**Purpose:**
- To clean repeated data
- To improve data quality
- To ensure unique records in analysis
**Example Concept:**
If the same student record appears multiple times, **dropDuplicates()** removes the
repeated entries.
---
#### **orderBy()**
The **orderBy()** function is used to sort a DataFrame based on one or more columns.
#### **sort()**
The **sort()** function performs sorting similar to **orderBy()**.
Both are commonly used to organize data for reporting and analysis.
**Purpose:**
- To arrange records systematically
- To identify highest or lowest values
- To improve clarity in output presentation
**Example Concept:**
Student records can be sorted by **marks** in ascending or descending order.
---
#### **map()**
The **map()** function applies a given function to each element of the RDD and returns a
new RDD.
**Purpose:**
- To transform each element individually
- To modify data values
- To create a new dataset with changed output
**Example Concept:**
If an RDD contains numbers, **map()** can square each number.
#### **flatMap()**
The **flatMap()** function is similar to **map()**, but it can return multiple values for each
input element. The result is flattened into a single RDD.
**Purpose:**
- To split data into multiple parts
- To extract words from lines of text
- To flatten nested output into a simple list
**Example Concept:**
If an RDD contains sentences, **flatMap()** can split each sentence into words.
---
#### **groupBy()**
The **groupBy()** function groups rows having the same values in one or more columns.
**Purpose:**
- To categorize data into groups
- To perform aggregate operations such as count, sum, and average
- To summarize data effectively
**Example Concept:**
Students can be grouped by **department** to count how many students belong to each
department.
#### **sort()**
After grouping, the results can be sorted for better presentation.
**Purpose:**
- To arrange grouped results in an ordered way
- To make analysis easier
#### **join()**
The **join()** function combines two DataFrames based on a common column.
**Purpose:**
- To merge related datasets
- To combine information from multiple tables
- To perform relational-style operations
**Example Concept:**
A student DataFrame and a marks DataFrame can be joined using **Student ID**.
---
**WEEK-12**
**Title:** Build a Machine Learning Model using PySpark ML Library
---
## **AIM**
To build and implement a **Machine Learning model using the PySpark ML library** by
creating a dataset, converting input columns into feature vectors, training the model, and
generating predictions in a distributed environment.
---
## **SOFTWARE REQUIREMENTS**
- **Operating System:** Windows / Linux / macOS
- **Tool:** Google Colab / Jupyter Notebook
- **Programming Language:** Python
- **Technology:** Apache Spark using PySpark
- **Libraries Required:** pyspark, [Link]
---
## **DESCRIPTION**
Unlike traditional machine learning tools that work on a single system, PySpark performs
processing in a **distributed environment**, making it suitable for handling large-scale data
efficiently.
In this experiment, a machine learning model is built using the PySpark ML library. The
given program creates a dataset, prepares the feature columns, trains a classification model,
and predicts the output.
---
The labels in the dataset are represented as **0 and 1**, which indicates that the problem
belongs to **binary classification**. For this purpose, the program uses the **Logistic
Regression algorithm** provided by PySpark.
---
SparkSession is the entry point for working with PySpark. It initializes the Spark application
and allows the user to perform DataFrame operations and machine learning tasks.
**Purpose:**
- To start the PySpark environment
- To enable DataFrame and ML operations
- To act as the main interface for Spark programming
---
**Purpose:**
- To provide structured input data for the model
- To represent the features and corresponding class labels
- To make the data ready for machine learning processing
---
**Purpose:**
- To transform multiple feature columns into one vector column
- To prepare the dataset in the format required by PySpark ML
- To simplify model training
---
**Purpose:**
- To train the machine learning model using the input data
- To identify the relationship between features and output
- To prepare the model for prediction
---
This indicates the class predicted by the model for each input record.
**Purpose:**
- To test the working of the trained model
- To generate the predicted class labels
- To observe how the machine learning model behaves on the data
---
- Spark initialization
- Data creation
- Feature preparation
- Model building
- Prediction generation
Thus, although the title is general, the content is correctly mapped to the actual implemented
program.
---
``` id="71024"
+----------------------+
| Input Dataset |
| label, x1, x2 |
+----------+-----------+
|
v
+----------------------+
| SparkSession |
| Initialize Spark App |
+----------+-----------+
|
v
+----------------------+
| Create DataFrame |
| Structured Dataset |
+----------+-----------+
|
v
+----------------------+
| VectorAssembler |
| x1, x2 -> features |
+----------+-----------+
|
v
+----------------------+
| ML Model Training |
| Logistic Regression |
+----------+-----------+
|
v
+----------------------+
| Prediction |
| Output Classes |
+----------------------+