Data Analytics & Visualization Lab Manual

Q: What is ARIMA in the context of time series forecasting, and what are its essential components?

ARIMA stands for AutoRegressive Integrated Moving Average, a model used in time series forecasting to describe a series and predict its future values. Its essential components are: the AR (AutoRegressive) part, which models the dependence of an observation on its own previous values; the I (Integrated) part, which differences the raw observations to make the series stationary; and the MA (Moving Average) part, which models the dependence of an observation on the residual errors of a moving average model applied to lagged observations.

Q: How can data be visually interpreted using Matplotlib in Python, and what types of plots are available in this library?

Matplotlib supports visual interpretation of data by creating 2D plots of arrays, with plot types such as line, bar, scatter, and histogram. The library was introduced by John Hunter in 2002 and is part of the broader SciPy stack. Its value lies in turning complex data sets into easily digestible visuals.

Q: How does the Pandas function df.describe() assist in providing a statistical summary of data, and what key metrics does it return?

The df.describe() function in Pandas returns a statistical summary of the numerical columns: the count of non-null values, the mean (central tendency), the standard deviation (dispersion), the minimum and maximum, and the 25th, 50th, and 75th percentiles. It gives a quick view of the statistical properties of the data and helps spot anomalies or patterns without manual calculation.

Q: Explain the concept of seasonality in time series analysis, and describe why it is significant in industries like finance or pharmaceuticals?

Seasonality in time series analysis refers to periodic fluctuations that occur at regular intervals, such as weekly, monthly, or quarterly. It is significant in industries like finance or pharmaceuticals because it impacts decision-making and strategic planning. For instance, recognizing seasonal patterns allows financial analysts to forecast market trends and pharmaceutical companies to anticipate demand cycles, enhancing their capacity for making informed business decisions. The presence of seasonality can indicate essential insights into data trends and irregularities pertinent to specific times, affecting operational efficiency and strategic planning .

Q: In the context of using the Pandas library for data visualization, what role does the distribution of data types play, and how do differencing and shifting aid in data analysis?

The data types of the columns determine which analytical and plotting techniques are suitable in Pandas; differencing and shifting, for example, apply to numeric data ordered in time. Differencing removes trend by computing the difference between data points a fixed interval apart, which makes seasonal or cyclic behaviour easier to identify. Shifting moves the data forward or backward in time so that current values can be compared with those of a previous period, which helps reveal patterns and sudden anomalies in how the data changes over time.
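A minimal sketch of shifting and differencing, assuming pandas is installed. The series values and dates are made up for illustration and are not taken from the manual.

```python
import pandas as pd

# Hypothetical daily sales figures (illustrative values only).
sales = pd.Series(
    [100, 110, 125, 120, 140],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
    name="sales",
)

# shift(1) moves every value one period later, so each row holds the previous day's value.
previous = sales.shift(1)

# diff(1) is the current value minus the previous value (first-order differencing).
change = sales.diff(1)

print(pd.DataFrame({"sales": sales, "previous": previous, "change": change}))
```

The first row of both derived columns is NaN because there is no earlier observation to compare against.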

Q: What are the steps involved in implementing a Multiple Linear Regression model in Python?

The steps in implementing a Multiple Linear Regression model in Python include: Step #1 - Data Pre-Processing, which involves importing the libraries, importing the dataset, encoding categorical data, avoiding the dummy variable trap, and splitting the dataset into training and test sets. Step #2 - Fitting the Multiple Linear Regression to the training set. Step #3 - Predicting the test set results .

Q: When handling a dataset with missing values, how does Pandas' df.fillna() function contribute, and what strategic choices can be made in filling these gaps?

Pandas' df.fillna() function contributes to handling datasets with missing values by allowing these gaps to be filled with user-defined values. Strategic choices in filling these gaps can involve setting a constant replacement value, using statistical measures like mean or median of the column, or implementing more complex strategies like interpolation to estimate the missing points based on surrounding data. The choice of strategy depends on the dataset's context and the analysis's objective, whether maintaining dataset integrity or avoiding bias from imputation .

Q: What distinguishes the usage of Pandas' functions df.loc and df.iloc, and in what scenarios would one be preferable over the other?

The distinction between Pandas' df.loc and df.iloc lies in their indexing methods: df.loc selects rows and columns by their labels or names, while df.iloc selects them by integer position. df.loc is preferable when the index labels are meaningful, for example dates or record IDs, whereas df.iloc is preferable when the position is known regardless of the labels, for example when taking the first ten rows or iterating over rows and columns by number.

Q: What is the importance of the CO-PSO mapping matrix in the educational syllabus, particularly in courses like AIML Engineering?

The CO-PSO (Course Outcome - Program Specific Outcome) mapping matrix is important in educational syllabuses, particularly for courses like AIML (Artificial Intelligence and Machine Learning) Engineering, as it aligns the course objectives with program-specific outcomes, ensuring that each element of the curriculum is strategically targeted to develop competencies required for the field. It allows educators to systematically evaluate whether the specific learning outcomes of each course contribute towards achieving the overarching goals of the program, such as practical application skills and theoretical understanding .

Q: What are some of the challenges faced in the visualization of large datasets using tools like Plotly and ggplot2, and how can they be mitigated?

Challenges in visualizing large datasets using tools like Plotly and ggplot2 include rendering performance issues, overplotting where data points overlap excessively, and the complexity of interpreting high-dimensional data. These challenges can be mitigated by employing techniques such as data aggregation to reduce the rendering load, using interactive visualization features to filter and zoom into datasets, and utilizing dimensionality reduction methods such as PCA to simplify high-dimensional data for clearer interpretation .

The document is a laboratory manual for the Data Analytics and Visualization Lab (CSL601) at Shivajirao S Jondhale College of Engineering, detailing the course objectives, outcomes, and experiments. It outlines the vision and mission of the department, program educational objectives, and specific outcomes related to Artificial Intelligence and Machine Learning. The manual includes a list of suggested experiments, prerequisites, and resources for students to effectively learn data analytics using Python and R.

Uploaded by

Shivraj Chavan


Shivajirao S. Jondhale College of Engineering, Dombivli (E)

Department of AIML Engineering

Laboratory Manual

DATA ANALYTICS AND VISUALIZATION LAB

Subject Code: CSL601

Semester – VI

Prepared by

Prof. Rashmi K Mahajan

Department of Artificial Intelligence and Machine Learning

Shivajirao S. Jondhale College of Engineering, Dombivli (E)

Affiliated to University of Mumbai

COURSE: DATA ANALYTICS AND VISUALIZATION LAB

COURSE CODE: CSL601

Semester – VI

INDEX

Sr. No.  Topic
1        Vision
2        Mission
3        Program Educational Objectives (PEOs)
4        Program Outcomes (POs)
5        Program Specific Outcomes (PSOs)
6        Syllabus
7        Course Objectives and Course Outcomes
8        List of Experiments
9        CO-PO Mapping Matrix and CO-PSO Mapping Matrix

TE-AIML-SEM-VI [DAV-Lab Mannual] Prof. Rashmi Mahajan


VISION

• To impart quality technical education in the department of Artificial Intelligence and Machine Learning for creating competent and ethically strong engineers with capabilities of accepting new challenges.

MISSION

• To provide learners with the technical knowledge to build a lifelong learning career in the Artificial Intelligence and Machine Learning domain.
• To develop the ability among learners to analyze, design, and implement engineering problems and real-world applications by providing novel Artificial Intelligence and Machine Learning solutions.
• To promote close interaction among industry, faculty and learners to enrich the learning process and enhance career opportunities for learners.

Program Educational Objectives (PEOs)

• Impel learners to acquire an in-depth understanding of Artificial Intelligence & Machine Learning that will enable them to pursue higher education or professional positions in the field of engineering.
• Prepare learners to demonstrate technical skills and competency in the Artificial Intelligence & Machine Learning field.
• Inculcate in learners a professional and ethical attitude, good leadership qualities and commitment to social responsibilities.

Program Outcomes (POs)

Program Specific Outcomes (PSOs)

• PSO1: Ability to understand the concepts and key issues in artificial intelligence and its associated fields to achieve adequate perspectives in real-time applications.
• PSO2: Ability to design and implement solutions for various domains using Machine Learning and Deep Learning techniques.

University Syllabus for the lab

Lab Code   Lab Name                                Credit
CSL601     DATA ANALYTICS AND VISUALIZATION LAB    1

Prerequisite: Basic Python

Lab Objectives:
1 To effectively use libraries for data analytics.
2 To understand the use of regression techniques in data analytics applications.
3 To use time series models for prediction.
4 To introduce the concept of text analytics and its applications.
5 To apply suitable visualization techniques using R and Python.

Lab Outcomes:
At the end of the course, students will be able to:
1 Explore various data analytics libraries in R and Python.
2 Implement various regression techniques for prediction.
3 Build various time series models on a given data set.
4 Design a text analytics application on a given data set.
5 Implement visualization techniques on given data sets using R.
6 Implement visualization techniques on given data sets using Python.

Suggested Experiments: Students are required to complete at least 08 experiments, preferably using the R programming language or Python.

1 Getting introduced to data analytics libraries in Python and R.
2 Simple Linear Regression in Python/R.
3 Multiple Linear Regression in Python/R.
4 Time Series Analysis in Python/R.
5 Implementation of ARIMA model in Python/R.
6 Text analytics: Implementation of Spam filter/Sentiment analysis in Python/R.
7,8 Two visualization experiments in R using different libraries.
9,10 Two visualization experiments in Python using different libraries.


Useful Links:
1 [Link]
2 [Link]
3 [Link]
4 [Link]

References:
1 Data Analytics using R, Bharati Motwani, Wiley Publications.
2 Python for Data Analysis, 3rd Edition, Wes McKinney, O'Reilly Media, Inc.
3 Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks, Jonathan Schwabish, Columbia University Press.

Term Work:
1 Term work should consist of 08 experiments.
2 Journal must include at least 2 assignments based on Theory and Practical.
3 The final certification and acceptance of term work ensures satisfactory performance of laboratory work and minimum passing marks in term work.
4 Total 25 Marks (Experiments: 15 marks, Attendance (Theory & Practical): 05 marks, Assignments: 05 marks)

Oral & Practical exam: Based on the entire syllabus

Course Objectives
1 To effectively use libraries for data analytics.
2 To understand the use of regression techniques in data analytics applications.
3 To use time series models for prediction.
4 To introduce the concept of text analytics and its applications.
5 To apply suitable visualization techniques using R.
6 To apply suitable visualization techniques using Python.

Course Outcomes
At the end of the course, the learner will be able to:
1. Explore various data analytics libraries in R and Python.
2. Implement various regression techniques for prediction.
3. Build various time series models on a given data set.
4. Design a text analytics application on a given data set.
5. Implement visualization techniques on given data sets using R.
6. Implement visualization techniques on given data sets using Python.

LIST OF EXPERIMENTS

Expt.
No.  Name of the Experiment                                           COs
1.   Getting introduced to data analytics libraries in Python and R.  CO1
2.   Simple Linear Regression in Python.                              CO2
3.   Multiple Linear Regression in Python.                            CO2
4.   Time Series Analysis in Python.                                  CO3
5.   Implementation of ARIMA model in Python.                         CO3
6.   Visualization experiments in Python using matplotlib library.    CO6
7.   Visualization experiments in Python using plotly library.        CO6
8.   Visualization experiments in R using ggplot2 library.            CO5
9.   Contents beyond syllabus                                         CO4

CO-PO Mapping Matrix

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1 2 2 2

CO2 2 2 2 2

CO3 2 2 2


CO4 2 2 2

CO5 2 2 2

CO6 2 2 2
CO-PSO Mapping Matrix

PSO1 PSO2

CO1 2 2

CO2 2

CO3 2

CO4 2

CO5 2

CO6 2

EXPERIMENT NO- 1
AIM: Getting introduced to data analytics libraries in Python and R.

RESOURCES REQUIRED: H/W :- P4 machine

S/W :- Jupyter Notebook

THEORY:
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both
"Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.
Pandas can clean messy data sets and make them readable and relevant. Relevant data is
very important in data science.


Steps to use Library:

1. Installation of Pandas
If you have Python and pip already installed on a system, then installation of Pandas is
very easy. Install it from the command line using this command:
pip install pandas
2. Import Pandas
Once Pandas is installed, import it in your applications with the import keyword:
import pandas
3. Creating Alias
Pandas is usually imported under the pd alias:
import pandas as pd
4. Checking the Pandas version
The version string is stored under the __version__ attribute.
print(pd.__version__)

What is a DataFrame?
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a
table with rows and columns.

import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

Named Indexes: With the index argument, you can name your own indexes.
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

Load Files Into a DataFrame:
If your data sets are stored in a file, Pandas can load them into a DataFrame.
df = pd.read_csv('[Link]')
print(df)


Read CSV Files

A simple way to store big data sets is to use CSV files (comma-separated values files). CSV files
contain plain text and are a well-known format that can be read by everyone, including
Pandas. In our examples we will be using a CSV file called '[Link]'.
df = pd.read_csv('[Link]')
print(df.to_string())

Pandas Functions:
Let's first import the data into a pandas DataFrame df:
import pandas as pd
df = pd.read_csv("Dummy_Sales_Data_v1.csv")

df.head(): This function helps you to get the first few rows of the dataset. By default, it
returns the first 5 rows. However, you can change this number by simply mentioning the
desired number of rows in df.head().

df.tail(): This function helps you to get the last few rows of the dataset. By default, it
returns the last 5 rows, and similar to .head(), you can simply mention the desired
number of rows in df.tail().

df.sample(): This function is used to get randomly selected rows (or, with axis=1, columns) from a
dataset. All of its parameters are optional, which means this function can be run
without using any argument as below.
df.sample()

df.info(): This function returns a quick summary of the DataFrame. This includes
information about column names and their respective data types, non-null counts (from which
missing values can be inferred), and memory consumption by the DataFrame.

Pandas Function to get the Statistical Summary of the Dataset
df.describe(): This function returns descriptive statistics about the data. This includes
minimum, maximum, mean (central tendency), standard deviation (dispersion), and quartiles of the
values in numerical columns, and the count of all non-null values in the data.
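The inspection functions above can be tried on a small in-memory DataFrame. This is only a sketch: the data and the manager names are hypothetical stand-ins for the sales CSV, which is not reproduced in this manual.

```python
import pandas as pd

# Hypothetical stand-in for the sales dataset (values invented for illustration).
df = pd.DataFrame({
    "Quantity": [10, 96, 45, 99, 30, 72],
    "Sales_Manager": ["Asha", "Ravi", "Asha", "Meera", "Ravi", "Asha"],
})

first_two = df.head(2)                      # first 2 rows instead of the default 5
last_two = df.tail(2)                       # last 2 rows
one_row = df.sample(n=1, random_state=0)    # one random row; random_state makes it repeatable
df.info()                                   # prints column names, dtypes, non-null counts, memory use
summary = df.describe()                     # count, mean, std, min, quartiles, max of numeric columns

print(first_two)
print(summary)
```

Only the numeric column (Quantity) appears in the describe() output by default; the text column is skipped.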
Pandas Functions to Select a Subset of the Dataset
df.query(): This function is used to query the DataFrame based on an expression. An expression
can be as simple as a single condition and as complex as a combination of multiple conditions.
df.query("Quantity > 95")

df.loc: This is a property of the DataFrame that returns the group of rows and
columns identified by their labels or names.
df.loc[100, 'Sales_Manager']

df.iloc: This is again a DataFrame property, which returns the same kind of output as
df.loc but uses row and column positions (integers) instead of their labels.
df.iloc[[100, 200], [6, 3]]

Series.unique(): This function returns the unique values in a column or Series. Instead of
applying to the complete DataFrame, it works only on a selected single column.
df["Sales_Manager"].unique()

df.nunique(): This method returns the number of unique values in each column. Similar
to the previous function, nunique() can be used on a single column as
df["Sales_Manager"].nunique()

df.isnull(): This function shows in which row and which column your data has missing values.
From df.info() you already know which columns have missing values. df.isnull() returns
output in Boolean form (True or False) for all the rows in all columns.
df.isnull()

df.fillna(): This function is used to replace missing values (NaN) in the DataFrame with
user-defined values. It takes the replacement value and several optional parameters.
df.fillna("MissingInfo")

df.sort_values(): This function helps to arrange the entire DataFrame in ascending or
descending order based on a specified column. It takes one required parameter (the column
or columns to sort by) and several optional parameters.
df.sort_values("Quantity")

df.value_counts(): This function returns how many times each value appears in a column,
so you need to pass the specific column name to this function.
df.value_counts("Sales_Manager")

df.nlargest(): This function is useful for quickly getting the several largest values from a specific
column of the DataFrame and all the rows containing them.
df.nlargest(10, "Delivery_Time(Days)")

df.nsmallest(): Similar to the previous function, df.nsmallest() helps you in getting the several
smallest values in the dataset.
df.nsmallest(7, "Shipping_Cost(USD)")
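The selection and missing-value functions described above can be sketched on a small DataFrame. The order labels, names, and numbers below are hypothetical; filling the numeric column with its mean is just one possible choice, not a recommendation from the manual.

```python
import numpy as np
import pandas as pd

# Hypothetical orders with two missing entries (illustrative values only).
df = pd.DataFrame(
    {
        "Quantity": [10, 96, 45, 99, 30],
        "Sales_Manager": ["Asha", "Ravi", None, "Meera", "Ravi"],
        "Shipping_Cost": [5.0, np.nan, 7.5, 3.0, 9.0],
    },
    index=["o1", "o2", "o3", "o4", "o5"],
)

big_orders = df.query("Quantity > 95")           # rows where Quantity exceeds 95
by_label = df.loc["o2", "Quantity"]              # label-based selection
by_position = df.iloc[1, 0]                      # same cell, selected by integer position
n_managers = df["Sales_Manager"].nunique()       # missing values are not counted
missing_per_column = df.isnull().sum()           # number of missing values in each column

# Fill each column with its own replacement value.
filled = df.fillna({
    "Sales_Manager": "MissingInfo",
    "Shipping_Cost": df["Shipping_Cost"].mean(),
})

cheapest = df.nsmallest(2, "Shipping_Cost")      # two rows with the lowest shipping cost

print(big_orders)
print(missing_per_column)
print(filled)
```

Here df.loc["o2", "Quantity"] and df.iloc[1, 0] address the same cell, which shows the difference between label-based and position-based selection.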
Pandas Functions to Modify the Dataset
df.copy(): This is useful for copying the entire DataFrame in one go. It has only one
optional parameter (deep), which you probably never need to change.
df1 = df.iloc[0:10, :].copy()
df1

df.rename(): This is the simplest method to change selected column names. All
you need to do is pass a dictionary where the key is the old column name and the value is
the new column name.
df1.rename(columns = {"Shipping_Cost(USD)": "Shipping_Cost",
"Delivery_Time(Days)": "DeliveryTime_in_Days"}, inplace=True)

df.where(): This function checks the DataFrame for a given condition and replaces values
with NaN at all the locations where the condition is False.
condition = df1["Status"] == "Not Shipped"
df1.where(condition)

df.drop(): This function is used to remove specified rows or columns from a DataFrame.
The rows to be removed are identified by their labels or index, and columns are identified
by their column names.
df1.drop("OrderCode", axis=1)

Pandas Function to Understand the Relationship Between all Columns
df.corr(): This method is used to find the pairwise correlations between the numeric columns of
the DataFrame. When you do not specify a method, it returns
Pearson correlation coefficients for all the column pairs in the dataset. In recent pandas
versions, pass numeric_only=True if the DataFrame also contains non-numeric columns.
df1.corr()
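The modification functions above can be sketched as follows. The DataFrame is a hypothetical miniature of the sales data, and numeric_only=True assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical miniature of the sales data (values invented for illustration).
df = pd.DataFrame({
    "OrderCode": ["A1", "A2", "A3", "A4"],
    "Quantity": [10, 20, 30, 40],
    "Shipping_Cost(USD)": [2.0, 4.0, 6.0, 8.0],
    "Status": ["Shipped", "Not Shipped", "Shipped", "Not Shipped"],
})

df1 = df.iloc[0:3, :].copy()          # independent copy of the first three rows
df1 = df1.rename(columns={"Shipping_Cost(USD)": "Shipping_Cost"})

condition = df1["Status"] == "Not Shipped"
not_shipped = df1.where(condition)    # rows failing the condition become NaN

without_code = df1.drop("OrderCode", axis=1)   # returns a new DataFrame; df1 keeps the column

# Pearson correlation between the numeric columns only.
correlations = df1.corr(numeric_only=True)

print(not_shipped)
print(correlations)
```

Because df1 was created with copy(), renaming its column leaves the original df unchanged.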

R Language:
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
highly extensible.
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.

Methods and Attributes in R:

1. dim(): shows the dimensions of the data frame by row and column
2. str(): shows the structure of the data frame
3. summary(): provides summary statistics on the columns of the data frame
4. colnames(): shows the name of each column in the data frame
5. head(): shows the first 6 rows of the data frame
6. tail(): shows the last 6 rows of the data frame
7. View(): shows a spreadsheet-like display of the entire data frame

CONCLUSION: We have studied basic functions of Pandas and R language.

MULTIPLE CHOICE QUESTIONS:

1. What is the module libraries in pandas
a) Numpy
b) Pandas
c) Matplotlib
d) All of the above

2. Pandas stands for:
a) Panel Data Analytics
b) Panel data analysis
c) Panel data
d) Panel dashboard

3. ______ is an important library used for analyzing data.
a) Math
b) Random
c) Pandas
d) None of the above

4. Important data structure(s) of pandas is/are
a) Series
b) Data frame
c) Both of the above
d) None of the above

5. A Pandas Series can hold ______ data types.
a) float
b) integer
c) String
d) All of the above

REFERENCES:
1. Pandas Functions you Should Know for Data Analysis ([Link])
2. R: What is R? ([Link])


EXPERIMENT NO- 2

AIM: Simple Linear Regression in Python.

RESOURCES REQUIRED: H/W: P4 machine

S/W: Jupyter Notebook

THEORY:
Linear regression is a common method to model the relationship between a dependent
variable and one or more independent variables. Linear models are developed using the
parameters which are estimated from the data. Linear regression is useful in prediction
and forecasting where a predictive model is fit to an observed data set of values to
determine the response. Linear regression models are often fitted using the least-squares
approach where the goal is to minimize the error.


Consider a dataset where the independent attribute is represented by x and the
dependent attribute is represented by y.

The equation of a straight line is y = mx + b, where m is the slope and b is
the intercept.

In order to prepare a simple regression model of the given dataset, we need to calculate
the slope and intercept of the line which best fits the data points.
The mathematical formulas to calculate the slope and intercept are given below:
Slope = Sxy / Sxx, where Sxy is the sample covariance of x and y and Sxx is the sample
variance of x.
Intercept = ymean – slope * xmean
Let us use these relations to determine the linear regression for the above dataset. For
this we calculate xmean, ymean, Sxy, and Sxx as shown in the table.
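The slope and intercept formulas above can be sketched in plain Python. The five data points are hypothetical, since the manual's own dataset table is not reproduced here.

```python
# Hypothetical data points, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sums of cross-deviations and squared deviations. Covariance and variance share
# the same 1/(n - 1) factor, which cancels in the ratio, so it is omitted here.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(f"y = {slope:.2f}x + {intercept:.2f}")   # prints: y = 0.60x + 2.20
```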


Squared error = 10.8
Mean squared error = 3.28
Coefficient of Determination (R²) = 1 − 10.8 / 89.2 ≈ 0.879
A low value of error and a high value of R² signify that the linear regression fits the data well.
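The error measures can be computed in the same way. This sketch reads the manual's calculation as R² = 1 − (sum of squared errors) / (total sum of squares); the data points are hypothetical, and dividing by n for the mean squared error is an assumption (some texts use a different denominator).

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


# Hypothetical data points, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
predicted = [slope * xi + intercept for xi in x]

# Sum of squared errors between observed and predicted values.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Total sum of squares around the mean of y.
y_mean = sum(y) / len(y)
sst = sum((yi - y_mean) ** 2 for yi in y)

mse = sse / len(y)        # assumption: divide by n
r2 = 1 - sse / sst        # coefficient of determination

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, R^2 = {r2:.2f}")
```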
Conclusion: Linear Regression has been implemented successfully.

MULTIPLE CHOICE QUESTIONS:

1. Which of the following formulas is not a simple linear regression model?
a. Salary = a * Experience
b. Salary = a * Experience + b
c. Salary = a * Experience + b * Age

2. What is the function used in R to create a simple linear regressor?
a. lr
b. slr
c. lm
d. slm

3. What is the correct way of writing a simple linear regression equation in the formula
parameter in R?
a. Salary = YearsExperience
b. Salary ~ YearsExperience
c. Salary == YearsExperience
d. Salary = a * YearsExperience + b

4. We should use Simple Linear Regression to predict the winner of a football game.
a. True
b. False

5. Which of the following metrics can be used for evaluating regression models?
a. R Squared
b. Adjusted R Squared
c. F Statistics
d. RMSE / MSE / MAE

REFERENCES:
1. Linear Regression (Python Implementation) – GeeksforGeeks
2. 250+ TOP MCQs on Linear Regression and Answers 2023

EXPERIMENT NO- 3
AIM: Multiple Linear Regression in Python.

RESOURCES REQUIRED: H/W: P4 machine

S/W: Jupyter Notebook
THEORY:
Multiple Linear Regression:
Multiple Linear Regression attempts to model the relationship between two or more
features and a response by fitting a linear equation to observed data. The steps to
perform multiple linear regression are almost the same as those of simple linear regression;
the difference lies in the evaluation. We can use it to find out which factor has the
highest impact on the predicted output and how different variables relate to each other.
Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + …… + bn * xn
Here Y = dependent variable and x1, x2, x3, …… xn = multiple independent variables; b0 is
the intercept and b1, b2, …… bn are the regression coefficients.

Assumptions of the Regression Model:
Linearity: The relationship between dependent and independent variables should be
linear.
Homoscedasticity: The variance of the errors should be constant.
Multivariate normality: Multiple regression assumes that the residuals are normally
distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the
data.

Dummy Variable:
A multiple regression model often includes categorical data. Using
categorical data is a good way to include non-numeric data in the
regression model. Categorical data refers to data values that represent categories, that is,
a fixed and unordered set of values, for instance
gender (male/female).
In the regression model, these values can be represented by dummy variables. These
variables take the value 1 or 0, representing the presence or absence of a
category.

Dummy Variable Trap:

The Dummy Variable Trap is a condition in which two or more dummy variables are highly correlated. In
simple terms, one variable can be predicted from the
others. The solution to the Dummy Variable Trap is to drop one of the dummy variables.
So, if there are m dummy variables, then m-1 variables are used in the model.
D2 = 1 - D1
Here D2, D1 = dummy variables of a two-category feature
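A short sketch of dummy variables and the trap, assuming pandas is available. The gender column is a hypothetical example; drop_first=True is one common way to keep m-1 dummy variables.

```python
import pandas as pd

# Hypothetical categorical feature with two categories.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One dummy column per category: the two columns always sum to 1, so either
# one can be predicted from the other (the dummy variable trap).
all_dummies = pd.get_dummies(df["gender"], dtype=int)

# Dropping the first category keeps m - 1 = 1 dummy column.
reduced = pd.get_dummies(df["gender"], drop_first=True, dtype=int)

print(all_dummies)
print(reduced)
```

In the first table the female column equals 1 minus the male column in every row, which is exactly the redundancy that dropping one dummy removes.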

Methods of Building Models:

• All-in
• Backward Elimination
• Forward Selection
• Bidirectional Elimination
• Score Comparison

Backward Elimination:
Step #1: Select a significance level to stay in the model (e.g. SL = 0.05).
Step #2: Fit the full model with all possible predictors.
Step #3: Consider the predictor with the highest P-value. If P > SL, go to Step #4; otherwise
the model is ready.
Step #4: Remove the predictor.
Step #5: Fit the model without this variable, then return to Step #3.

Forward Selection:
Step #1: Select a significance level to enter the model (e.g. SL = 0.05).
Step #2: Fit all simple regression models y ~ x(n). Select the one with the lowest P-value.
Step #3: Keep this variable and fit all possible models with one extra predictor added to
the one(s) you already have.
Step #4: Consider the predictor with the lowest P-value. If P < SL, go to Step #3; otherwise
the model is ready (keep the previous model).
Steps Involved in any Multiple Linear Regression Model
Step #1: Data Pre-Processing
1. Importing the libraries.
2. Importing the data set.
3. Encoding the categorical data.
4. Avoiding the Dummy Variable Trap.
5. Splitting the data set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training Set.
Step #3: Predicting the Test Set results.
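The steps above can be sketched end to end. This is an illustration under stated assumptions, not the manual's own program: the data is hypothetical and noise-free, the split simply takes the first six rows for training (a random split is more usual), and the fit uses NumPy least squares in place of a dedicated regression library.

```python
import numpy as np
import pandas as pd

# Step 1.1-1.2: libraries are imported above; build a hypothetical data set.
data = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6, 7, 8],
    "city": ["A", "B", "A", "B", "A", "B", "A", "B"],
})
# Noise-free target so the recovered coefficients are easy to check.
data["salary"] = 30 + 5 * data["experience"] + 10 * (data["city"] == "B")

# Step 1.3-1.4: encode the categorical column, dropping one dummy to avoid the trap.
X = pd.get_dummies(data[["experience", "city"]], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)                 # column of ones for b0
y = data["salary"].to_numpy(dtype=float)

# Step 1.5: split into training and test sets (first 6 rows vs last 2 rows).
X_train, X_test = X.iloc[:6].to_numpy(dtype=float), X.iloc[6:].to_numpy(dtype=float)
y_train, y_test = y[:6], y[6:]

# Step 2: fit by ordinary least squares.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Step 3: predict the test set results.
y_pred = X_test @ coef

print(dict(zip(X.columns, coef.round(3))))
print("predicted:", y_pred.round(2), "actual:", y_test)
```

Because the target was generated without noise, the fitted coefficients match the generating values (30, 5, and 10) up to floating-point error.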


CONCLUSION: Thus, multiple linear regression is implemented successfully.

MULTIPLE CHOICE QUESTIONS:

1. Which of the following formulas is not a multiple linear regression model?
a. Salary = a * Experience + b * Age + c
b. Salary = a * Experience + b * Age + c * Level + d
c. Salary = a * Experience + b * Age^2
d. Salary = a * Experience + b * Age

2. We should use Multiple Linear Regression to predict a dependent variable that is
growing exponentially with time.
a. Yes
b. No

3. When there is more than one independent variable in the model, the linear
model is termed as __________.
a. Unimodal
b. Multiple model
c. Multiple Linear model
d. Multiple logistics model

4. The terms intercept and slope are usually called ____________.
a. Regressionists
b. Coefficients
c. Regressive
d. Regression Coefficients

5. What is predicting y for a value of x that is within the interval of points that we saw in
the original data called?
a. Regression
b. Extrapolation
c. Interpolation
d. Polation

REFERENCES:
1. ML | Multiple Linear Regression using Python – GeeksforGeeks.


2. 250+ TOP MCQs on Linear Regression and Answers 2023

EXPERIMENT NO- 4
AIM: Time Series Analysis in Python/R.

RESOURCES REQUIRED: H/W :- P4 machine

S/W :- Jupyter Notebook
THEORY:
A time series is a sequence of data points listed in time order, usually recorded at
successive, equally spaced points in time. Time-series analysis consists of methods for
analyzing such data in order to extract meaningful insights and other useful
characteristics. It is becoming increasingly important in many
industries, including finance, pharmaceuticals, social media, web services,
and research. Visualizations are essential for understanding time-series data. No data
analysis is complete without them, because
one good visualization can provide meaningful and interesting insights into the data.

Properties of Time series:

Seasonality: In time-series data, seasonality is the presence of variations that occur at
specific regular intervals of less than a year, such as weekly, monthly, or quarterly.
Resampling: In time-series analysis, resampling means changing the frequency of the
observations, for example aggregating daily values into weekly or monthly values.
Resampling by month or week and making bar plots is a very simple and widely used
method of finding seasonality. Here we are going to make a bar plot of monthly data for
2016 and 2017.
Differencing: Differencing computes the difference between values separated by a
specified interval. By default the interval is one period; other values can be specified. It is the
most popular method of removing trends from the data.
Shift: The shift function moves the data forward or backward by a specified number of
periods. By default it shifts the data by one period, so with daily data each row
receives the previous day’s value. This is helpful for viewing the previous day’s data and
today’s data side by side.
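
The resampling, differencing, and shift operations described above can be tried with pandas as in the sketch below. The daily series is synthetic (a plain linear trend over 2016 and 2017) and the variable names are illustrative assumptions. The experiment's actual data set is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering 2016 and 2017 with a linear trend.
idx = pd.date_range("2016-01-01", "2017-12-31", freq="D")
series = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Resampling: aggregate daily values into one mean value per month.
monthly_mean = series.resample("MS").mean()

# Differencing: value minus the value one period earlier (periods=1 is the default).
diffed = series.diff()

# Shift: move the data down by one period so each row sees the previous day's value.
shifted = series.shift(1)
side_by_side = pd.DataFrame({"today": series, "previous_day": shifted})

print(monthly_mean.head(3))
print(diffed.head(3))
print(side_by_side.head(3))
```

Because the synthetic series is a pure trend, differencing turns it into a constant series of ones, which illustrates how differencing removes a trend. With Matplotlib installed, monthly_mean.plot(kind="bar") would draw the monthly bar plot mentioned above.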

CONCLUSION: Thus, time series analysis and its properties are implemented
successfully.

MULTIPLE CHOICE QUESTIONS:

1. An orderly set of data arranged in accordance with their time of occurrence is called:
(a) Arithmetic series (b) Harmonic series (c) Geometric series (d) Time series

2. A time series consists of:

(a) Short-term variations (b) Long-term variations (c) Irregular variations (d) All of the
above

3. The secular trend is measured by the method of semi-averages when:

(a) Time series based on yearly values (b) Trend is linear (c) Time series consists of even
number of values (d) None of them


4. In time series seasonal variations can occur within a period of:
(a) Four years (b) Three years (c) One year (d) Nine years

5. Moving average method is used for measurement of trend when:

(a) Trend is linear (b) Trend is non linear (c) Trend is curvilinear (d) None of them


EXPERIMENT NO- 5
AIM: Implementation of ARIMA model in python / R.

RESOURCES REQUIRED: H/W :- P4 machine

S/W :- Jupyter Notebook
THEORY:
An autoregressive integrated moving average, or ARIMA, is a statistical analysis model
that uses time series data to either better understand the data set or to predict future
trends. A statistical model is autoregressive if it predicts future values based on past
values. For example, an ARIMA model might seek to predict a stock's future prices based
on its past performance or forecast a company's earnings based on past periods.

Understanding Autoregressive Integrated Moving Average (ARIMA)


An autoregressive integrated moving average model is a form of regression analysis that
gauges the strength of one dependent variable relative to other changing variables. The
model's goal is to predict future securities or financial market moves by examining the
differences between values in the series instead of the actual values.
An ARIMA model can be understood by outlining each of its components as follows:
• Autoregression (AR): refers to a model that shows a changing variable that
regresses on its own lagged, or prior, values.
• Integrated (I): represents the differencing of raw observations to allow the time
series to become stationary (i.e., data values are replaced by the difference
between the data values and the previous values).
• Moving average (MA): incorporates the dependency between an observation and
a residual error from a moving average model applied to lagged observations.

ARIMA Parameters
Each component in ARIMA functions as a parameter with a standard notation. For ARIMA
models, a standard notation would be ARIMA with p, d, and q, where integer values
substitute for the parameters to indicate the type of ARIMA model used. The parameters
can be defined as:
• p: the number of lag observations in the model, also known as the lag order.
• d: the number of times the raw observations are differenced; also known as the
degree of differencing.
• q: the size of the moving average window, also known as the order of the moving
average.
Just as a linear regression model is specified by the number and type of its terms, an
ARIMA model is specified by these three integers. A value of
zero (0) for a parameter means that the corresponding component
is not used in the model. This way, the ARIMA model can be constructed to
perform the function of an ARMA model, or even a simple AR, I, or MA model. ARIMA is a
method for forecasting future outcomes based on a historical time series. It
is based on the statistical concept of serial correlation, where past data points influence
future data points.
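
To make the roles of p, d, and q concrete, the sketch below hand-builds an ARIMA(1, 1, 0) forecast in plain Python: d = 1 means the series is differenced once, p = 1 means each difference is regressed on the previous difference, and q = 0 means no moving-average term is used. The function names and the toy series are assumptions made for illustration. This is not how a library fits ARIMA (statsmodels, for example, uses maximum likelihood), and it omits the MA part entirely.

```python
def difference(series):
    """I part with d = 1: replace each value by its change from the previous value."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(w):
    """AR part with p = 1: least-squares fit of w[t] = c + phi * w[t-1]."""
    x, y = w[:-1], w[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi

def forecast_arima_110(series, steps):
    """Forecast by predicting future differences and adding them back to the level."""
    w = difference(series)
    c, phi = fit_ar1(w)
    level, last_w = series[-1], w[-1]
    forecasts = []
    for _ in range(steps):
        last_w = c + phi * last_w       # next predicted difference
        level += last_w                 # undo the differencing (integrate)
        forecasts.append(level)
    return forecasts

# Hypothetical series whose differences (8, 4, 2, 1, 0.5) halve at every step.
series = [0.0, 8.0, 12.0, 14.0, 15.0, 15.5]
forecast = forecast_arima_110(series, steps=2)
print(forecast)
```

On this series the fitted AR coefficient is 0.5, so the next two predicted differences are 0.25 and 0.125.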


The following table lists other ARIMA traits that demonstrate good and bad
characteristics.

Pros
• Good for short-term forecasting
• Only needs historical data
• Models non-stationary data

Cons
• Not built for long-term forecasting
• Poor at predicting turning points
• Computationally expensive
• Parameters are subjective

CONCLUSION: Thus, the ARIMA model is implemented using Python libraries.

MULTIPLE CHOICE QUESTIONS:

1. How many AR and MA terms should be included for the time series by looking at the
above ACF and PACF plots?
a) AR(1)MA(0)
b) AR(0)MA(1)
c) AR(2)MA(1)
d) AR(1)MA(2)
e) Can’t Say

2. The length of a prediction interval for Yt+l from fitting a nonstationary ARIMA(p, d, q)
model generally
a) increases as l increases.
b) decreases as l increases.
c) becomes constant for l sufficiently large.
d) tends to zero as l increases

3. Which of the following statements is correct?

1. If autoregressive parameter (p) in an ARIMA model is 1, it means that there is no
autocorrelation in the series.
2. If moving average component (q) in an ARIMA model is 1, it means that there is
autocorrelation in the series with lag 1.
3. If integrated component (d) in an ARIMA model is 0, it means that the series is not
stationary.
a) Only 1
b) Both 1 and 2
c) Only 2
d) All of the statements

4. Which of the following is not a technique used in smoothing time series?

a) Nearest Neighbour Regression
b) Locally weighted scatter plot smoothing
c) Tree based models like (CART)
d) Smoothing Splines

5. An ARMA(p,q) (p, q are integers bigger than zero) model will have
a) An acf and pacf that both decline geometrically
b) An acf that declines geometrically and a pacf that is zero after p lags
c) An acf that declines geometrically and a pacf that is zero after q lags
d) An acf that is zero after p lags and a pacf that is zero after q lags



EXPERIMENT NO- 6
AIM: Visualization experiments in python using matplotlib Library.

RESOURCES REQUIRED: H/W :- P4 machine

S/W :- Jupyter Notebook


THEORY:
Matplotlib is a visualization library in Python for 2D plots of arrays. It is
a multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in 2002. One of
the greatest benefits of visualization is that it gives us visual access to huge amounts of
data in easily digestible form. Matplotlib offers several plot types, such as line, bar, scatter,
and histogram.
Installation: Matplotlib and most of its dependencies are available as wheel packages for
Windows, Linux and macOS. Run the following command to install the matplotlib
package:
python -m pip install -U matplotlib

Basic plots in Matplotlib

Matplotlib is a data visualization library in Python. The pyplot, a sub library of matplotlib,
is a collection of functions that helps in creating a variety of charts.
Line plot :
A line plot is drawn by connecting data points, each located by its x-axis and
y-axis values, with straight line segments. Line plots are the simplest form of representing data. In Matplotlib,
the plot() function represents this.

Bar Plot:
Bar plots are vertical or horizontal rectangular graphs used to compare data, for example
to gauge changes over a period represented on the other axis (mostly the X-axis).
Each bar can represent a single value or, when stacked, several values. The longer a bar
is, the greater the value it holds. In Matplotlib, we use the bar() or barh() function
to represent it.


Scatter plot:
Scatter plots (previously called XY plots) are used when comparing data
variables to determine the relationship between dependent and independent variables.
The data is shown as a collection of points, where
each point pairs a value of one variable (x) with a value of the other (y). In Matplotlib,
the scatter() function represents it.

Pie Plot:
A pie plot is a circular graph in which the data is represented as
segments, or slices, of a pie. Data analysts use it to represent
percentage or proportional data, where each pie slice represents an item or data
classification. In Matplotlib, the pie() function represents it.


Area plot:
Area plots, also known as stack plots, fill the area under one or more lines, showing
highs and lows as bumps and drops. They resemble line plots and help track the changes
over time for two or more related groups that together make up one whole category. In Matplotlib,
the stackplot() function represents it.

Histogram plot:
A histogram shows how the values of a numerical variable are distributed across bins, whereas a
bar graph compares values across separate entities. Histograms and bar plots look alike but are used in
different scenarios. In Matplotlib, the hist() function represents this.
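
The six plot types described above can be drawn side by side as in the following sketch. The data values and labels are made up for illustration, and the Agg backend line is only there so the script also runs without a display. It can be dropped in Jupyter Notebook.

```python
import matplotlib
matplotlib.use("Agg")                 # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample data.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [10, 14, 9, 17]
costs = [6, 8, 7, 9]
marks = [1, 2, 2, 3, 3, 3, 4, 4, 5]

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].plot(months, sales, marker="o")            # line plot
axes[0, 0].set_title("Line plot")

axes[0, 1].bar(months, sales)                         # bar plot (barh() for horizontal)
axes[0, 1].set_title("Bar plot")

axes[0, 2].scatter(costs, sales)                      # scatter plot
axes[0, 2].set_title("Scatter plot")

axes[1, 0].pie(sales, labels=months, autopct="%1.0f%%")   # pie plot
axes[1, 0].set_title("Pie plot")

axes[1, 1].stackplot([1, 2, 3, 4], sales, costs, labels=["sales", "costs"])  # area plot
axes[1, 1].legend(loc="upper left")
axes[1, 1].set_title("Area plot")

axes[1, 2].hist(marks, bins=5)                        # histogram
axes[1, 2].set_title("Histogram")

fig.tight_layout()
# plt.show()  # uncomment in an interactive session to display the figure
```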


CONCLUSION: Thus, the implementation of above plots is done using matplotlib library in
python.

MULTIPLE CHOICE QUESTIONS:

1. What is true about Data Visualization?
A. Data Visualization is used to communicate information clearly and efficiently to users
by the usage of information graphics such as tables and charts.
B. Data Visualization helps users in analyzing a large amount of data in a simpler way.
C. Data Visualization makes complex data more accessible, understandable, and usable.
D. All of the above

2. Data can be visualized using?

A. graphs
B. charts
C. maps
D. All of the above

3. Which one of the following is the most basic and commonly used technique?
A. Line charts
B. Scatter plots
C. Population pyramids
D. Area charts

4. Which of the following is a tool for checking normality?

A. qqline()
B. qline()
C. anova()
D. lm()

5. Which of the following lists names of variables in a [Link]?

A. par()
B. names()
C. barchart()
D. quantile()



EXPERIMENT NO- 7
AIM: Visualization experiments in python using plotly Library.

RESOURCES REQUIRED: H/W :- P4 machine

S/W :- Jupyter Notebook
THEORY:
The Plotly Python library is an interactive, open-source plotting library. It is a very helpful
tool for visualizing data and understanding it simply and easily. Its plotly.express
module is a high-level interface to Plotly that is easy to use. It can plot various types
of graphs and charts such as scatter plots, line charts, bar charts, box plots, histograms, and pie
charts. Why choose Plotly over other visualization tools or
libraries? Here are some reasons:
• Plotly has hover-tool capabilities that help us detect outliers or
anomalies in a large number of data points.
• Its charts are visually attractive and suit a wide range of audiences.
• It allows extensive customization of graphs, which makes plots
more meaningful and understandable for others.

Various plots using Plotly

Box plot: A box plot represents a five-number statistical summary: minimum, first
quartile, median, third quartile, and maximum.
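
The five numbers behind a box plot can be computed with Python's standard library, as in the sketch below. The helper name five_number_summary and the sample values are illustrative assumptions. Quartile conventions differ between tools, so the "inclusive" method used here may not match the quartiles a given plotting library draws.

```python
import statistics

def five_number_summary(values):
    """Return the minimum, quartiles, and maximum that a box plot displays."""
    data = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
# {'min': 1, 'q1': 3.0, 'median': 5.0, 'q3': 7.0, 'max': 9}
```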

Violin charts:
Violin plots are distribution charts similar to box plots that allow visualizing the
underlying distribution of the data through a mirrored kernel density line of that data.


With the violin function from the plotly express module you can create violin plots in
Python. You will need to input a numerical variable to y or specify the name of the
column of a data frame with the desired variable in order to create a vertical violin plot.

Horizontal violin plot

If you pass the variable to x instead of to y you will create a horizontal violin plot.

Violin plot with box plot inside:


Violin plots by group

Heatmaps:
A heatmap is a graphical representation of data that uses colors to visualize the
values of a matrix. Brighter colors, typically reddish ones, represent more common values or higher
activity, while darker colors represent less common values or lower activity.
A heatmap is also known as a shading matrix.

Bubble Chart
The bubble chart in Plotly is created using the scatter plot. It can be created using the
scatter() method of plotly.express. A bubble chart is a data visualization that
displays multiple circles (bubbles) in a two-dimensional plot, as in a scatter plot, with
the size of each bubble encoding an additional variable. A
bubble chart is primarily used to depict and show relationships between numeric
variables.

CONCLUSION: Thus, the implementation of above plots is done using plotly library in
python.

MULTIPLE CHOICE QUESTIONS:

1. What is Plotly?
a) A programming language
b) A data visualization library
c) A machine learning algorithm
d) A database management system

2. Which programming languages can be used with Plotly?

a) Python, R, MATLAB, and JavaScript
b) Python, Ruby, C++, and Swift
c) Java, Scala, Kotlin, and Groovy
d) PHP, Perl, Lua, and Pascal

3. Which of the following chart types is NOT available in Plotly?
a) Line chart
b) Bar chart
c) Scatter plot
d) Heat map

4. What is a Plotly trace?
a) A single data series in a plot
b) A function that creates a plot
c) A library of pre-made charts
d) An interactive element of a plot

5. Which of the following Plotly chart types is best for comparing multiple data
series?
a) Line chart
b) Bar chart
c) Scatter plot
d) Heat map

REFERENCES:
1. [Link]
2. Bubble chart using Plotly in Python - GeeksforGeeks
3. Plotly MCQs and Answers With Explanation | Plotly Quiz ([Link])


EXPERIMENT NO- 8
AIM: Visualization experiments in R using ggplot2 library.

RESOURCES REQUIRED: H/W:- P4 machine

S/W:- R studio /Jupyter Notebook
THEORY:
The ggplot2 package in the R programming language, which implements the Grammar of Graphics, is a
free, open-source, and easy-to-use visualization package widely used in R. It is a
powerful visualization package written by Hadley Wickham.
A plot is built from several layers. The layers are as follows:

Building Blocks of layers with the grammar of graphics

• Data: The data set itself
• Aesthetics: The data is mapped onto aesthetic attributes such as x-axis,
y-axis, color, fill, size, labels, alpha, shape, line width, line type
• Geometrics: How the data is displayed, using point, line, histogram, bar,
boxplot
• Facets: Displays subsets of the data using columns and rows
• Statistics: Binning, smoothing, descriptive, intermediate
• Coordinates: The space between data and display, using Cartesian, fixed, polar,
limits
• Themes: Non-data ink, such as fonts and backgrounds


Data Layer: In the data layer we define the source of the information to be
visualized. Let’s use the mtcars dataset, which ships with R.

Aesthetic Layer: Here we map the variables of the dataset onto aesthetics.

Geometric Layer: The geometric layer controls the essential elements, that is, how our
data is displayed using point, line, histogram, bar, boxplot.

Facet Layer: It is used to split the data into subsets of the entire dataset, and it
allows the subsets to be visualized as panels of the same plot. Here we separate rows
according to transmission type and columns according to cylinders.

Statistics Layer: In this layer, we transform our data using binning, smoothing,
descriptive, intermediate.

Coordinates Layer: In this layer, data coordinates are mapped onto the
plane of the graphic, and we adjust the axes and change the spacing of the
displayed data to control plot dimensions.

Theme Layer: This layer controls the finer points of display, like the font size and
background color properties.

ggplot2 provides various types of visualizations. More
parameters included in the package can be used, as the package gives greater
control over the visualization of data. Many packages can integrate with
ggplot2 to make the visualizations interactive and animated.

CONCLUSION: Thus, the implementation of above plots is done using ggplot2 library in R
programming.

MULTIPLE CHOICE QUESTIONS:


1. ______ grammar makes a clear distinction between your data and what gets displayed
on the screen or page.
a. ggplot1
b. ggplot2
c. [Link]
d. ggplot3

2. Which of the following is a plot to investigate the order in which observations
were recorded?
a. ggplot
b. ggsave
c. ggpcp
d. ggorder

3. ________ is used to create a plot to illustrate patterns of missing values.

a. ggmissplot
b. ggmissing
c. ggfluctuation
d. ggpcp

4. Which R data type is most appropriate for a categorical variable?

a. Numeric
b. Factor
c. Integer
d. Character

5. Which of the following opens the ggplot2 library?

a. [Link]("ggplot2")
b. library(package = "ggplot2")
c. summary(object = ggplot2)
d. open(x = ggplot2)

REFERENCES:
1. Data visualization with R and ggplot2 - GeeksforGeeks
2. Multiple Choice Questions | Online Resources ([Link])


Contents Beyond Syllabus

AIM: Text Analysis Using Turicreate
RESOURCES REQUIRED: H/W: - P4 machine
S/W: - Jupyter Notebook
THEORY:
Text is a group of words or sentences. Text analysis is the process of analyzing text and
extracting information from it. Text data is one of the biggest factors that
can make a company big or small. For example:
• On an e-commerce website people buy things. With text analysis, the
website can learn what its customers like, and through this data it
can raise its productivity.
• Voice assistants such as Alexa and Google Home Mini work using text analysis
and machine learning algorithms. Both are based on Natural Language Processing.
Text analysis can be done using text mining. Text data can be structured
or unstructured, and text mining techniques help us work with both.
Now let’s do some text analysis using Turicreate. We will build a model that classifies
whether a message is spam or ham.
Link for the dataset=[Link]
classification
Step 1: Import the Turicreate library.
Step 2: Load the data set.
Step 3: Explore the data.
Step 4: Add the word count to the data set.
The data has two columns, category and message. Adding the word count gives the
model a feature to learn from.


Step 5: Split the data into training and test sets.
Step 6: Build a model for classifying spam and ham.
Step 7: Check the accuracy of the model.
Step 8: Predict manually on a few messages from the test data to check whether the model gives the right
answer.
Step 9: Predict on the test data.
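
The workflow in Steps 2 to 9 can be illustrated without Turicreate by the standard-library sketch below. It is a stand-in, not Turicreate's API: the six messages are invented, the split is fixed rather than random, and the classifier is a small multinomial naive Bayes model written by hand, which may differ from the model Turicreate builds.

```python
import math
from collections import Counter

# Step 2: hypothetical labelled messages (category, message).
data = [
    ("spam", "win free prize now"),
    ("spam", "free cash prize"),
    ("ham", "lunch at noon today"),
    ("ham", "see you at lunch"),
    ("spam", "free prize today"),
    ("ham", "lunch today"),
]

def count_words(text):
    """Step 4: bag-of-words counts for one message."""
    return Counter(text.lower().split())

rows = [{"label": label, "text": text, "word_count": count_words(text)}
        for label, text in data]

# Step 5: split into training and test sets (fixed split for reproducibility).
train, test = rows[:4], rows[4:]

# Step 6: naive Bayes model, built from per-class word totals.
totals = {}
for row in train:
    totals.setdefault(row["label"], Counter()).update(row["word_count"])
vocab = {word for counts in totals.values() for word in counts}
priors = Counter(row["label"] for row in train)

def predict(text):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    scores = {}
    for label, counts in totals.items():
        n = sum(counts.values())
        score = math.log(priors[label] / len(train))
        for word, k in count_words(text).items():
            score += k * math.log((counts[word] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Steps 7 to 9: predict the test messages and check the accuracy.
predictions = [predict(row["text"]) for row in test]
accuracy = sum(p == row["label"] for p, row in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```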
CONCLUSION: Thus, we have implemented text analysis using the Turicreate library.

MULTIPLE CHOICE QUESTIONS:

1. By default, by how many spaces is code indented when using the Python IDLE?
A. 2 B. 4 C. 3 D. 5

2. Text Mining is:

A. Conceptual
B. Theoretical
C. Empirical
D. All of the above

3. Predictive text analytics tasks include

A. Prediction
B. Classification
C. Clustering
D. All of the above

4. Which of the following technique is not a part of flexible text matching?

A. Soundex
B. Metaphone
C. Edit Distance
D. Keyword Hashing

5. Text Mining can be used in:

A. Detecting spam model
B. Predicting stock Movements
C. None of the above
D. Both a and b

