Shivajirao S Jondhale College of Engineering, Dombivli (E)
Department of AIML Engineering
Laboratory Manual
DATA ANALYTICS AND
VISUALIZATION LAB
Subject Code: CSL601
Semester – VI
Prepared by
Prof. Rashmi K Mahajan
Department of Artificial Intelligence and
Machine Learning
Shivajirao S. Jondhale College of Engineering,
Dombivli (E)
Affiliated to University of Mumbai
COURSE: DATA ANALYTICS AND VISUALIZATION LAB
COURSE CODE: CSL 601
Semester-VI
INDEX
Sr. No.  Topic
1  Vision
2  Mission
3  Program Educational Objectives (PEOs)
4  Program Outcomes (POs)
5  Program Specific Outcomes (PSOs)
6  Syllabus
7  Course Objectives and Course Outcomes
8  List of Experiments
9  CO-PO Mapping Matrix and CO-PSO Mapping Matrix
TE-AIML-SEM-VI [DAV-Lab Manual] Prof. Rashmi Mahajan
VISION
To impart quality technical education in the department of Artificial Intelligence
and Machine Learning for creating competent and ethically strong engineers with
capabilities of accepting new challenges.
MISSION
To provide learners with the technical knowledge to build a lifelong learning career
in the Artificial Intelligence and Machine Learning domain.
To develop the ability among learners to analyze, design, and implement solutions to
engineering problems and real-world applications by providing novel Artificial
Intelligence and Machine Learning solutions.
To promote close interaction among industry, faculty and learners to enrich the
learning process and enhance career opportunities for learners.
Program Educational Objectives (PEO)
Impel Learners to acquire in-depth understanding of Artificial Intelligence & Machine
Learning that will enable them to pursue higher education or professional positions in
the field of engineering.
Prepare Learners to demonstrate technical skills and competency in the Artificial
Intelligence & Machine Learning field.
Inculcate in Learners, professional and ethical attitude, good leadership qualities and
commitment to social responsibilities.
Program Outcomes (POs)
Program Specific Outcomes (PSOs)
PSO1: Ability to understand the concepts and key issues in artificial intelligence and
its associated fields to achieve adequate perspectives in real-time applications.
PSO2: Ability to design and implement solutions for various domains using Machine
Learning and Deep Learning techniques.
University Syllabus for the lab
Lab Code Lab Name Credit
CSL601 DATA ANALYTICS AND VISUALIZATION LAB 1
Prerequisite: Basic Python
Lab Objectives:
1 To effectively use libraries for data analytics.
2 To understand the use of regression Techniques in data analytics applications.
3 To use time series models for prediction.
4 To introduce the concept of text analytics and its applications.
5 To apply suitable visualization techniques using R and Python.
Lab Outcomes:
At the end of the course, students will be able to -
1 Explore various data analytics Libraries in R and Python
2 Implement various Regression techniques for prediction.
3 Build various time series models on a given data set
4 Design Text Analytics Application on a given data set
5 Implement visualization techniques to given data sets using R.
6 Implement visualization techniques to given data sets using Python.
Suggested Experiments: Students are required to complete at least 08 experiments, preferably
using the R programming language or Python.
1 Getting introduced to data analytics libraries in Python and R
2 Simple Linear Regression in Python/R.
3 Multiple Linear Regression in Python/R.
4 Time Series Analysis in Python/R
5 Implementation of ARIMA model in python / R.
6 Text analytics: Implementation of Spam filter/Sentiment analysis in python/R.
7,8 Two visualization experiments in R using different Libraries
9,10 Two visualization experiments in python using different Libraries.
References:
1 Data Analytics using R, Bharati Motwani, Wiley Publications
2 Python for Data Analysis, 3rd Edition, Wes McKinney, O'Reilly Media, Inc.
3 Better Data Visualizations A Guide for Scholars, Researchers, and Wonks, Jonathan
Schwabish, Columbia University Press
Term Work:
1 Term work should consist of 08 experiments.
2 Journal must include at least 2 assignments based on Theory and Practical.
3 The final certification and acceptance of term work ensures satisfactory performance of
laboratory work and minimum passing marks in term work.
4 Total 25 Marks (Experiments: 15-marks, Attendance Theory & Practical: 05-marks,
Assignments: 05-marks)
Oral & Practical exam
Based on the entire syllabus
Course Objectives
1 To effectively use libraries for data analytics.
2 To understand the use of regression Techniques in data analytics applications.
3 To use time series models for prediction.
4 To introduce the concept of text analytics and its applications.
5 To apply suitable visualization techniques using R.
6 To apply suitable visualization techniques using Python.
Course Outcomes
At the end of the course, the learner will be able to :
1. Explore various data analytics libraries in R and Python.
2. Implement various regression techniques for prediction.
3. Build various time series models on a given data set.
4. Design a text analytics application on a given data set.
5. Implement visualization techniques on given data sets using R.
6. Implement visualization techniques on given data sets using Python.
LIST OF EXPERIMENTS
Expt. No.   Name of the Experiment                                            COs
1.          Getting introduced to data analytics libraries in Python and R.   CO1
2.          Simple Linear Regression in Python.                               CO2
3.          Multiple Linear Regression in Python.                             CO2
4.          Time Series Analysis in Python.                                   CO3
5.          Implementation of ARIMA model in Python.                          CO3
6.          Visualization experiments in Python using the matplotlib library. CO6
7.          Visualization experiments in Python using the plotly library.     CO6
8.          Visualization experiments in R using the ggplot2 library.         CO5
9.          Contents beyond syllabus                                          CO4
CO-PO Mapping Matrix
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 2 2
CO2 2 2 2 2
CO3 2 2 2
CO4 2 2 2
CO5 2 2 2
CO6 2 2 2
CO-PSO Mapping Matrix
PSO1 PSO2
CO1 2 2
CO2 2
CO3 2
CO4 2
CO5 2
CO6 2
EXPERIMENT NO- 1
AIM: Getting introduced to data analytics libraries in Python and R.
RESOURCES REQUIRED: H/W :- P4 machine
S/W :- Jupyter Notebook
THEORY:
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" has a reference to both
"Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant. Relevant data is
very important in data science.
Steps to use the library:
1. Installing Pandas
If Python and pip are already installed on the system, Pandas can be installed with this
command:
pip install pandas
2. Importing Pandas
Once Pandas is installed, import it in your application with the import keyword:
import pandas
3. Creating an alias
Pandas is usually imported under the pd alias:
import pandas as pd
4. Checking the Pandas version
The version string is stored in the __version__ attribute.
print(pd.__version__)
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a
table with rows and columns.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load the data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Named Indexes: With the index argument, you can name your own indexes.
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Load Files Into a DataFrame:
If your data sets are stored in a file, Pandas can load them into a DataFrame.
df = pd.read_csv('[Link]')
print(df)
Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files). CSV
files contain plain text and are a well-known format that can be read by everyone, including
Pandas. In our examples we will be using a CSV file called '[Link]'.
df = pd.read_csv('[Link]')
print(df.to_string())
Pandas Functions:
Let's first import the data into a pandas DataFrame df.
import pandas as pd
df = pd.read_csv("Dummy_Sales_Data_v1.csv")
df.head(): This function returns the first few rows of the dataset. By default, it
returns the first 5 rows. You can change this number by passing the desired number of
rows to df.head().
df.tail(): This function returns the last few rows of the dataset. By default, it
returns the last 5 rows and, as with df.head(), you can pass the desired number of rows
to df.tail().
df.sample(): This function returns a randomly selected row, column, or both from a
dataset. df.sample() takes 7 optional parameters, which means it can be run without any
argument.
df.info(): This function returns a quick summary of the DataFrame. This includes
information about column names and their respective data types, missing values, and
memory consumption of the DataFrame.
Pandas Function to get the Statistical Summary of the Dataset
df.describe(): This function returns descriptive statistics about the data. This includes
the minimum, maximum, mean (central tendency) and standard deviation (dispersion) of the
values in numerical columns, and the count of all non-null values in the data.
Pandas Functions to Select a Subset of the Dataset
df.query(): This function is used to query the DataFrame based on an expression. An
expression can be as simple as a single condition or as complex as a combination of
multiple conditions.
df.query("Quantity > 95")
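The inspection and selection functions described above can be tried on a small table. The following is a minimal sketch; the DataFrame below is an invented stand-in for the sales data (the manager names and values are assumptions, not taken from Dummy_Sales_Data_v1.csv).

```python
import pandas as pd

# Hypothetical stand-in for the sales data; all values are invented for illustration.
df = pd.DataFrame({
    "Sales_Manager": ["Asha", "Ravi", "Asha", "Meera", "Ravi", "Meera", "Asha"],
    "Quantity": [96, 40, 99, 75, 98, 12, 60],
    "Shipping_Cost(USD)": [12.5, 8.0, 15.0, 9.5, 14.0, 5.0, 10.0],
})

first_rows = df.head(3)                     # first 3 rows
last_rows = df.tail(2)                      # last 2 rows
random_rows = df.sample(2, random_state=0)  # 2 random rows, reproducible via random_state
summary = df.describe()                     # count, mean, std, min, quartiles, max

df.info()                                   # prints column names, dtypes, non-null counts

# Select the rows where Quantity exceeds 95
high_quantity = df.query("Quantity > 95")
print(high_quantity)
```

Passing random_state to df.sample() is optional; it is used here only so that repeated runs return the same rows.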
df.loc: This property of DataFrame returns the group of rows and columns identified by
their labels or names.
df.loc[100, 'Sales_Manager']
df.iloc: This is also a DataFrame property. It returns the same output as df.loc, but
uses row and column numbers (integer positions) instead of their labels.
df.iloc[[100, 200], [6, 3]]
Series.unique(): This function returns the unique values in a column or Series. It cannot
be applied to a complete DataFrame; it works only on a single selected column.
df["Sales_Manager"].unique()
df.nunique(): This method returns the number of unique records in each column. It can
also be used on a single column:
df["Sales_Manager"].nunique()
df.isnull(): This function helps you to check in which row and which column your data
has missing values.
From df.info() you already know which columns have missing values. df.isnull() returns
its output in Boolean form (True or False) for all the rows in all columns.
df.isnull()
df.fillna(): This function is used to replace missing values (NaN) in the DataFrame with
user-defined values. df.fillna() takes 1 required and 5 optional parameters.
df.fillna("MissingInfo")
df.sort_values(): This function arranges the entire DataFrame in ascending or
descending order based on a specified column. It takes exactly 1 required and 5 optional
parameters.
df.sort_values("Quantity")
df.value_counts(): This function returns how many times each value appears in a column,
so you need to pass the specific column name to it.
df.value_counts("Sales_Manager")
df.nlargest(): This function quickly returns the rows containing the largest values of a
specific column of the DataFrame.
df.nlargest(10, "Delivery_Time(Days)")
df.nsmallest(): Similar to the previous function, df.nsmallest() returns the rows
containing the smallest values of a specific column.
df.nsmallest(7, "Shipping_Cost(USD)")
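As a rough illustration of the missing-value, sorting and ranking functions above, the sketch below uses a tiny invented DataFrame (the column values are assumptions made only for this example). It fills each column with its own replacement value, which is one possible use of df.fillna().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing manager and one missing quantity.
df = pd.DataFrame({
    "Sales_Manager": ["Asha", "Ravi", None, "Asha", "Meera"],
    "Quantity": [96.0, 40.0, 99.0, np.nan, 75.0],
})

missing_per_column = df.isnull().sum()   # number of missing values in each column

# Replace missing values column by column
filled = df.fillna({"Sales_Manager": "MissingInfo", "Quantity": 0})

sorted_df = filled.sort_values("Quantity", ascending=False)  # largest quantity first
manager_counts = filled["Sales_Manager"].value_counts()      # rows per manager
top2 = filled.nlargest(2, "Quantity")                        # rows with the 2 largest quantities
bottom2 = filled.nsmallest(2, "Quantity")                    # rows with the 2 smallest quantities

print(missing_per_column)
print(sorted_df)
print(manager_counts)
```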
Pandas Functions to Modify the Dataset
df.copy(): This is useful for copying the entire DataFrame in one go. It has only one
optional parameter, which you will probably never need to use.
df1 = df.iloc[0:10, :].copy()
df1
df.rename(): This is the simplest method to change selected column names. All you need
to do is pass a dictionary where the key is the old column name and the value is the new
column name.
df1.rename(columns = {"Shipping_Cost(USD)": "Shipping_Cost",
"Delivery_Time(Days)":"DeliveryTime_in_Days"}, inplace=True)
df.where(): This function checks the DataFrame against a given condition and replaces the
values with NaN at all the locations where the condition is False.
condition = df1["Status"] == "Not Shipped"
df1.where(condition)
df.drop(): This function is used to remove specified rows or columns from a DataFrame.
The rows to be removed are identified by their labels or index, and columns are identified
by their column names.
df1.drop("OrderCode", axis=1)
Pandas Function to Understand the Relationship Between all Columns
df.corr(): This method is used to find pairwise correlations between all the columns of
the DataFrame. When you do not mention any specific column names, it returns Pearson
correlation coefficients for all the column pairs in the dataset.
df.corr()
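The modification functions can be sketched in the same way. The example below assumes a small invented order table (column values are hypothetical) and restricts the correlation to the numeric columns, which avoids errors on text columns in recent pandas versions.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
df = pd.DataFrame({
    "OrderCode": ["A1", "A2", "A3", "A4"],
    "Status": ["Shipped", "Not Shipped", "Shipped", "Not Shipped"],
    "Quantity": [10, 20, 30, 40],
    "Shipping_Cost(USD)": [5.0, 9.0, 16.0, 20.0],
})

df1 = df.copy()  # independent copy; changes to df1 do not affect df
df1 = df1.rename(columns={"Shipping_Cost(USD)": "Shipping_Cost"})

# Keep the rows where the condition holds; all other rows become NaN
condition = df1["Status"] == "Not Shipped"
not_shipped = df1.where(condition)

# Remove a column by name (axis=1 means columns)
df1 = df1.drop("OrderCode", axis=1)

# Pearson correlation between the numeric columns
corr = df1[["Quantity", "Shipping_Cost"]].corr()
print(not_shipped)
print(corr)
```

Here the results of rename() and drop() are assigned back to df1 instead of using inplace=True; both styles give the same table.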
R Language:
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
highly extensible.
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy,
and
• a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output
facilities.
Methods and Attributes in R:
1. dim(): shows the dimensions of the data frame by row and column
2. str(): shows the structure of the data frame
3. summary(): provides summary statistics on the columns of the data frame
4. colnames(): shows the name of each column in the data frame
5. head(): shows the first 6 rows of the data frame
6. tail(): shows the last 6 rows of the data frame
7. View(): shows a spreadsheet-like display of the entire data frame
CONCLUSION: We have studied basic functions of Pandas and R language.
MULTIPLE CHOICE QUESTIONS:
1. What is the module libraries in pandas
a) Numpy
b) Pandas
c) Matplotlib
d) All of the above
2. Pandas stands for-
a) Panel Data Analytics
b) Panel data analysis
c) Panel data
d) Panel dashboard
3. ________ is an important library used for analyzing data
a) Math
b) Random
c) Pandas
d) None of the above
4. Important data structure of pandas is/are
a) Series
b) Data frame
c) Both of the above
d) None of the above
5. Pandas series can have ----- data types.
a) float
b) integer
c) String
d) All of the above
REFERENCES:
1. Pandas Functions you Should Know for Data Analysis ([Link])
2. R: What is R? ([Link])
EXPERIMENT NO- 2
AIM: Simple Linear Regression in Python.
RESOURCES REQUIRED: H/W: P4 machine
S/W: Jupyter Notebook
THEORY:
Linear regression is a common method to model the relationship between a dependent
variable and one or more independent variables. Linear models are developed using the
parameters which are estimated from the data. Linear regression is useful in prediction
and forecasting where a predictive model is fit to an observed data set of values to
determine the response. Linear regression models are often fitted using the least-squares
approach where the goal is to minimize the error.
Consider a dataset where the independent attribute is represented by x and the
dependent attribute is represented by y.
The equation of a straight line is y = mx + b, where m is the slope and b is the
intercept.
To build a simple regression model of the given dataset, we need to calculate the slope
and intercept of the line which best fits the data points.
The formulas for the slope and intercept are given below:
Slope = Sxy/Sxx, where Sxy is the sample covariance of x and y and Sxx is the sample
variance of x.
Intercept = ymean – slope * xmean
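The two formulas above can be checked with a few lines of NumPy. This is a minimal sketch on invented data (the x and y values are assumptions, not the dataset used in the manual). Sxy and Sxx are computed as sums of deviations; the common divisor (n - 1) of the sample covariance and variance cancels in the ratio.

```python
import numpy as np

# Hypothetical data, chosen only to illustrate the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Sums of cross-deviations and squared deviations
s_xy = np.sum((x - x_mean) * (y - y_mean))
s_xx = np.sum((x - x_mean) ** 2)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

For this data the fitted line is y = 0.60 * x + 2.20, which agrees with what np.polyfit(x, y, 1) returns.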
Let us use these relations to determine the linear regression for the above dataset. For
this we calculate the xmean, ymean, Sxy, Sxx as shown in the table.
Squared error = 10.8
Mean squared error = 3.28
Coefficient of Determination (R2) = 1 - 10.8 / 89.2 ≈ 0.879
Low value of error and high value of R2 signify that the linear regression fits data well.
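The error measures can be computed as follows. This is a sketch on invented data, so its numbers differ from the values quoted above; it assumes the usual definitions, with R2 = 1 - (sum of squared errors) / (total sum of squares about the mean of y).

```python
import numpy as np

# Hypothetical data, not the dataset from the manual
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line (degree-1 polynomial fit returns slope, then intercept)
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

sse = np.sum((y - y_pred) ** 2)      # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
mse = sse / len(y)                   # mean squared error
r2 = 1 - sse / sst                   # coefficient of determination

print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, R2 = {r2:.2f}")
```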
Conclusion: Linear Regression has been implemented successfully.
MULTIPLE CHOICE QUESTIONS:
1. Which of the following formulas is not a simple linear regression model?
a. Salary = a * Experience
b. Salary = a * Experience + b
c. Salary = a * Experience + b * Age
2. What is the function used in R to create a simple linear regressor?
a. lr
b. slr
c. lm
d. slm
3. What is the correct way of writing a simple linear regression equation in the formula
parameter in R?
a. Salary = YearsExperience
b. Salary ~ YearsExperience
c. Salary == YearsExperience
d. Salary = a * YearsExperience + b
4. We should use Simple Linear Regression to predict the winner of a football game.
a. True
b. False
5. Which of the following metrics can be used for evaluating regression models?
a. R Squared
b. Adjusted R Squared
c. F Statistics
d. RMSE / MSE / MAE
REFERENCES:
1. Linear Regression (Python Implementation) – GeeksforGeeks
2. 250+ TOP MCQs on Linear Regression and Answers 2023
EXPERIMENT NO- 3
AIM: Multiple Linear Regression in Python.
RESOURCES REQUIRED: H/W: P4 machine
S/W: Jupyter Notebook
THEORY:
Multiple Linear Regression:
Multiple Linear Regression attempts to model the relationship between two or more
features and a response by fitting a linear equation to observed data. The steps to
perform multiple linear regression are similar to those of simple linear regression.
The difference lies in the evaluation. We can use it to find out which factor has the
highest impact on the predicted output and how different variables relate to each other.
Here: Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + …… + bn * xn
Y = Dependent variable and x1, x2, x3, …… xn = multiple independent variables
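To make the equation concrete, the coefficients b0, b1, ..., bn can be estimated by least squares. The sketch below uses NumPy only and simulated data; the true coefficients (4, 2, -3), the noise level and the random seed are assumptions chosen for illustration. In the lab a library such as scikit-learn would more commonly be used for the same fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two hypothetical features x1, x2 and a response generated from known coefficients
X = rng.normal(size=(n, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Add a column of ones so that the first coefficient is the intercept b0
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution of X_design @ coeffs ≈ y
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

print(f"Y = {b0:.2f} + {b1:.2f} * x1 + {b2:.2f} * x2")
```

Because the noise is small, the estimated coefficients should land close to the values used to generate the data.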
Assumptions of the Regression Model:
Linearity: The relationship between dependent and independent variables should be
linear.
Homoscedasticity: Constant variance of the errors should be maintained.
Multivariate normality: Multiple Regression assumes that the residuals are normally
distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the
data.
Dummy Variable:
In a Multiple Regression Model we often use categorical data. Using categorical data is
a good way to include non-numeric data in the regression model. Categorical data refers
to data values that represent categories, that is, values drawn from a fixed, unordered
set, for instance gender (male/female).
In the regression model, these values can be represented by Dummy Variables. These
variables take the value 1 or 0, representing the presence or absence of a category.
Dummy Variable Trap:
The Dummy Variable Trap is a condition in which two or more dummy variables are highly
correlated. In simple terms, one dummy variable can be predicted from the values of the
others. The solution to the Dummy Variable Trap is to drop one of the dummy variables.
So, if there are m dummy variables then m-1 variables are used in the model.
D2 = 1 - D1
Here D2, D1 = Dummy Variables
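A short sketch of dummy encoding with pandas follows. The "City" column and its values are hypothetical. With all m dummy columns, every row sums to 1, so any one column is determined by the others; passing drop_first=True keeps m-1 columns and avoids the trap.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"City": ["Mumbai", "Pune", "Nagpur", "Pune", "Mumbai"]})

# One dummy column per category (m = 3 columns)
all_dummies = pd.get_dummies(df["City"], dtype=int)

# Each row has exactly one 1, so the columns always add up to 1
row_sums = all_dummies.sum(axis=1)

# Drop the first category to keep m - 1 = 2 columns
safe_dummies = pd.get_dummies(df["City"], drop_first=True, dtype=int)

print(all_dummies)
print(safe_dummies)
```

The dropped category ("Mumbai" here, the first in sorted order) becomes the baseline: a row with 0 in every remaining column belongs to it.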
Methods of Building Models:
• All-in
• Backward-Elimination
• Forward Selection
• Bidirectional Elimination
• Score Comparison
Backward-Elimination:
Step #1: Select a significance level (SL) to stay in the model (e.g. SL = 0.05).
Step #2: Fit the full model with all possible predictors.
Step #3: Consider the predictor with the highest P-value. If P > SL go to Step #4, otherwise
the model is ready.
Step #4: Remove the predictor.
Step #5: Fit the model without this variable and return to Step #3.
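The backward-elimination loop can be sketched as below. This is an illustrative implementation under stated assumptions, not the manual's own code: the helper names ols_p_values and backward_elimination are hypothetical, the data is simulated, and the P-values come from the standard t-test on ordinary least squares coefficients computed with NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided p-values of the OLS coefficients of y regressed on X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)


def backward_elimination(X, y, names, sl=0.05):
    """Remove the predictor with the highest p-value until all are <= sl."""
    names = list(names)
    while len(names) > 1:                         # column 0 is the intercept, never removed
        p = ols_p_values(X, y)
        worst = 1 + int(np.argmax(p[1:]))         # highest p-value among the predictors
        if p[worst] <= sl:
            break                                 # Step #3: the model is ready
        X = np.delete(X, worst, axis=1)           # Step #4: remove the predictor
        del names[worst]                          # Step #5: refit on the next pass
    return names


# Simulated data: y depends on x1 and x2 but not on the "noise" column
rng = np.random.default_rng(1)
n = 200
x1, x2, noise = rng.normal(size=(3, n))
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2, noise])

kept = backward_elimination(X, y, ["const", "x1", "x2", "noise"])
print(kept)
```

The strong predictors x1 and x2 are retained; the unrelated column is usually removed, although with SL = 0.05 an irrelevant predictor can occasionally survive by chance.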
Forward-Selection :
Step #1: Select a significance level to enter the model (e.g. SL = 0.05)
Step #2: Fit all simple regression models y~ x(n). Select the one with the lowest P-value.
Step #3: Keep this variable and fit all possible models with one extra predictor added to
the one(s) you already have.
Step #4: Consider the predictor with the lowest P-value. If P < SL, go to Step #3, otherwise
the model is Ready.
Steps Involved in any Multiple Linear Regression Model
Step #1: Data Pre-Processing
1. Importing the libraries.
2. Importing the data set.
3. Encoding the categorical data.
4. Avoiding the Dummy Variable Trap.
5. Splitting the data set into training set and test set.
Step #2: Fitting Multiple Linear Regression to the training set.
Step #3: Predicting the test set results.
CONCLUSION: Thus, multiple linear regression is implemented successfully.
MULTIPLE CHOICE QUESTIONS:
1. Which of the following formulas is not a multiple linear regression model?
a. Salary = a * Experience + b * Age + c
b. Salary = a * Experience + b * Age + c * Level + d
c. Salary = a * Experience + b * Age^2
d. Salary = a * Experience + b * Age
2. We should use Multiple Linear Regression to predict a dependent variable that is
growing exponentially with time.
a. Yes
b. No
3. When there is more than one independent variable in the model, the linear
model is termed as __________.
a. Unimodal
b. Multiple model
c. Multiple Linear model
d. Multiple logistics model
4. The terms intercept and slope are usually called ____________.
a. Regressionists
b. Coefficients
c. Regressive
d. Regression Coefficients
5. What is predicting y for a value of x that is within the interval of points that we saw in
the original data called?
a. Regression
b. Extrapolation
c. Interpolation
d. Polation
REFERENCES:
1. ML | Multiple Linear Regression using Python – GeeksforGeeks.
2. 250+ TOP MCQs on Linear Regression and Answers 2023
EXPERIMENT NO- 4
AIM: Time Series Analysis in Python/R.
RESOURCES REQUIRED: H/W :- P4 machine
S/W :- Jupyter Notebook
THEORY:
A time series is a series of data points listed in time order, usually recorded at
successive, equally spaced points in time. Time-series analysis consists of methods for
analyzing time series data in order to extract meaningful insights and other useful
characteristics of the data. Time-series data analysis is becoming very important in many
industries, such as financial services, pharmaceuticals, social media companies, web
service providers, research, and many more. To understand time-series data, visualizations
are essential. No data analysis is complete without visualizations, because one good
visualization can provide meaningful and interesting insights into the data.
Properties of Time series:
Seasonality: In time-series data, seasonality is the presence of variations that occur at
specific regular time intervals of less than a year, such as weekly, monthly, or quarterly.
Resampling: In statistics, resampling means economically reusing a data sample to
improve the accuracy and quantify the uncertainty of a population parameter. For time
series, resampling means changing the frequency of the observations, for example
aggregating daily values into weekly or monthly values. Resampling by month or week and
drawing bar plots is a simple and widely used way of finding seasonality. Here we
make a bar plot of the monthly data for 2016 and 2017.
Differencing: Differencing computes the difference between values that are a specified
interval apart. By default the interval is one period; other intervals can be specified.
It is the most popular method to remove trends in the data.
Shift: The shift function moves the data forward or backward by a specified number of
periods. By default it shifts the data by one period, so with daily data each row receives
the previous day's value. This is helpful for seeing the previous day's data and today's
data side by side.
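The resampling, differencing and shift operations can be sketched with pandas as follows. The series below is invented for illustration (a straight-line trend of +2 per day over four weeks); it is not the 2016 and 2017 data mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: a linear trend of +2 per day over 4 weeks
dates = pd.date_range("2017-01-02", periods=28, freq="D")  # 2017-01-02 is a Monday
series = pd.Series(np.arange(28) * 2.0, index=dates)

# Resampling: aggregate the daily values into weekly means (weeks end on Sunday)
weekly_mean = series.resample("W").mean()

# Differencing: value minus the previous day's value; removes the linear trend
diffed = series.diff()

# Shift: move the values down by one day, so each row holds the previous day's value
shifted = series.shift(1)

side_by_side = pd.DataFrame({
    "today": series,
    "previous_day": shifted,
    "difference": diffed,
})
print(weekly_mean)
print(side_by_side.head())
```

After differencing, every value equals 2.0 because the trend is linear. The first row of the shifted and differenced series is NaN since it has no previous day. A bar plot of the resampled values could then be drawn with weekly_mean.plot(kind="bar").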
CONCLUSION: Thus, the time series analysis and their properties are implemented
successfully.
MULTIPLE CHOICE QUESTIONS:
1. An orderly set of data arranged in accordance with their time of occurrence is called:
(a) Arithmetic series (b) Harmonic series (c) Geometric series (d) Time series
2. A time series consists of:
(a) Short-term variations (b) Long-term variations (c) Irregular variations (d) All of the
above
3. The secular trend is measured by the method of semi-averages when:
(a) Time series based on yearly values (b) Trend is linear (c) Time series consists of even
number of values (d) None of them
4. In time series seasonal variations can occur within a period of:
(a) Four years (b) Three years (c) One year (d) Nine years
5. Moving average method is used for measurement of trend when:
(a) Trend is linear (b) Trend is non linear (c) Trend is curvilinear (d) None of them
EXPERIMENT NO- 5
AIM: Implementation of ARIMA model in python / R.
RESOURCES REQUIRED: H/W :- P4 machine
S/W :- Jupyter Notebook
THEORY:
An autoregressive integrated moving average, or ARIMA, is a statistical analysis model
that uses time series data to either better understand the data set or to predict future
trends. A statistical model is autoregressive if it predicts future values based on past
values. For example, an ARIMA model might seek to predict a stock's future prices based
on its past performance or forecast a company's earnings based on past periods.
Understanding Autoregressive Integrated Moving Average (ARIMA)
An autoregressive integrated moving average model is a form of regression analysis that
gauges the strength of one dependent variable relative to other changing variables. The
model's goal is to predict future securities or financial market moves by examining the
differences between values in the series instead of through actual values.
An ARIMA model can be understood by outlining each of its components as follows:
• Autoregression (AR): refers to a model that shows a changing variable that
regresses on its own lagged, or prior, values.
• Integrated (I): represents the differencing of raw observations to allow the time
series to become stationary (i.e., data values are replaced by the difference
between the data values and the previous values).
• Moving average (MA): incorporates the dependency between an observation and
a residual error from a moving average model applied to lagged observations.
ARIMA Parameters
Each component in ARIMA functions as a parameter with a standard notation. For ARIMA
models, a standard notation would be ARIMA with p, d, and q, where integer values
substitute for the parameters to indicate the type of ARIMA model used. The parameters
can be defined as:
• p: the number of lag observations in the model, also known as the lag order.
• d: the number of times the raw observations are differenced; also known as the
degree of differencing.
• q: the size of the moving average window, also known as the order of the moving
average.
For example, a linear regression model includes the number and type of terms. A value of
zero (0), which can be used as a parameter, would mean that particular component
should not be used in the model. This way, the ARIMA model can be constructed to
perform the function of an ARMA model, or even simple AR, I, or MA models. ARIMA is a
method for forecasting or predicting future outcomes based on a historical time series. It
is based on the statistical concept of serial correlation, where past data points influence
future data points.
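To show how the p, d and q parameters translate into computation, the following sketch builds an ARIMA(1, 1, 0) model by hand with NumPy: one round of differencing (d = 1), one autoregressive lag fitted by least squares (p = 1), and no moving-average term (q = 0). The simulated series, the true coefficient 0.6 and the seed are assumptions for illustration. In practice a library routine, for example the ARIMA class in statsmodels, would normally be used and would also handle q > 0.

```python
import numpy as np

rng = np.random.default_rng(42)
n, phi_true = 500, 0.6

# Simulate the data: the first differences follow an AR(1) process
diffs = np.zeros(n)
for t in range(1, n):
    diffs[t] = phi_true * diffs[t - 1] + rng.normal()
series = 100.0 + np.cumsum(diffs)      # integrating gives a non-stationary series

# I (d = 1): difference the raw observations once
w = np.diff(series)

# AR (p = 1): regress w[t] on w[t-1] by least squares (no intercept)
phi_hat = (w[1:] @ w[:-1]) / (w[:-1] @ w[:-1])

# One-step-ahead forecast: predict the next difference and add it to the last value
next_value = series[-1] + phi_hat * w[-1]

print(f"estimated AR coefficient: {phi_hat:.2f}")
print(f"last value: {series[-1]:.2f}, forecast: {next_value:.2f}")
```

The estimated coefficient should be close to the 0.6 used in the simulation, which illustrates that the model works on the differences between values rather than on the raw values.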
The following lists summarize the good and bad characteristics of ARIMA.
Pros
• Good for short-term forecasting
• Only needs historical data
• Models non-stationary data
Cons
• Not built for long-term forecasting
• Poor at predicting turning points
• Computationally expensive
• Parameters are subjective
CONCLUSION: Thus, the ARIMA model has been implemented using Python libraries.
MULTIPLE CHOICE QUESTIONS:
1. How many AR and MA terms should be included for the time series by looking at the
above ACF and PACF plots?
a) AR(1) MA(0)
b) AR(0) MA(1)
c) AR(2) MA(1)
d) AR(1) MA(2)
e) Can't Say
2. The length of a prediction interval for Yt+l from fitting a nonstationary ARIMA(p, d, q)
model generally
a) increases as l increases.
b) decreases as l increases.
c) becomes constant for l sufficiently large.
d) tends to zero as l increases
3. Which of the following statement is correct?
1. If autoregressive parameter (p) in an ARIMA model is 1, it means that there is no
autocorrelation in the series.
2. If moving average component (q) in an ARIMA model is 1, it means that there is
autocorrelation in the series with lag 1.
3. If integrated component (d) in an ARIMA model is 0, it means that the series is not
stationary.
a) Only 1
b) Both 1 and 2
c) Only 2
d) All of the statements
4. Which of the following is not a technique used in smoothing time series?
a) Nearest Neighbour Regression
b) Locally weighted scatter plot smoothing
c) Tree-based models (like CART)
d) Smoothing Splines
5. An ARMA(p,q) (p, q are integers bigger than zero) model will have
a) An acf and pacf that both decline geometrically
b) An acf that declines geometrically and a pacf that is zero after p lags
c) An acf that declines geometrically and a pacf that is zero after q lags
d) An acf that is zero after p lags and a pacf that is zero after q lags
EXPERIMENT NO- 6
AIM: Visualization experiments in Python using the Matplotlib library.
RESOURCES REQUIRED: H/W :- P4 machine
S/W :- Jupyter Notebook
THEORY:
Matplotlib is a Python visualization library for 2D plots of arrays. It is a multi-platform
data visualization library built on NumPy arrays and designed to work with the broader
SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest
benefits of visualization is that it presents large amounts of data in easily digestible
visuals. Matplotlib supports several plot types, such as line, bar, scatter, and histogram.
Installation: Matplotlib and most of its dependencies are available as wheel packages for
Windows, Linux and macOS. Run the following command to install the matplotlib package:
python -m pip install -U matplotlib
Basic plots in Matplotlib
pyplot, a submodule of Matplotlib, is a collection of functions that helps in creating a
variety of charts.
Line plot :
A line plot is drawn by connecting data points with straight line segments, where each
point marks the intersection of an x-axis value and a y-axis value. Line plots are the
simplest form of representing data. In Matplotlib, the plot() function represents this.
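A minimal sketch of a line plot with plot() follows. The temperature values and the output file name are made-up assumptions for illustration; in Jupyter the backend line can be dropped and plt.show() used instead of saving.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: average temperature for six months.
months = [1, 2, 3, 4, 5, 6]
temperature = [22.5, 24.0, 27.5, 30.0, 32.5, 31.0]

fig, ax = plt.subplots()
# plot() joins consecutive (x, y) points with straight segments.
(line,) = ax.plot(months, temperature, marker="o", label="Temperature")
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Line plot")
ax.legend()
fig.savefig("line_plot.png")
```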
Bar Plot:
Bar plots are vertical or horizontal rectangular graphs used to compare values across
categories or over a period shown on the other axis (usually the x-axis). Each bar can
hold a single value or, in a stacked bar, several values divided in proportion. The longer
the bar, the greater the value it holds. In Matplotlib, we use the bar() or barh() function
to represent it.
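The difference between bar() and barh() can be sketched as below. The category names and values are hypothetical, and the two-panel layout is just one way to show both orientations side by side.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: units sold per product.
categories = ["A", "B", "C", "D"]
values = [23, 17, 35, 29]

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))
bars = ax_v.bar(categories, values)     # vertical bars: height encodes the value
hbars = ax_h.barh(categories, values)   # horizontal bars: width encodes the value
ax_v.set_title("bar()")
ax_h.set_title("barh()")
fig.tight_layout()
fig.savefig("bar_plot.png")
```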
Scatter plot:
Scatter plots (previously called XY plots) compare data variables to determine the
relationship between a dependent and an independent variable. The data is expressed as
a collection of points, where the position of each point is given by one variable (x)
plotted against the other (y). In Matplotlib, the scatter() function represents it.
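A scatter plot of an independent variable against a dependent one can be sketched as follows. The study-hours data is invented for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: hours studied (independent) vs. marks obtained (dependent).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [35, 42, 50, 55, 63, 70, 74, 85]

fig, ax = plt.subplots()
points = ax.scatter(hours, marks)  # one marker per (x, y) pair, no connecting line
ax.set_xlabel("Hours studied")
ax.set_ylabel("Marks")
ax.set_title("Scatter plot")
fig.savefig("scatter_plot.png")
```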
Pie Plot:
A pie plot is a circular graph in which the data is represented as segments, or slices, of
a pie. Data analysts use it to represent percentage or proportional data, where each
slice represents an item or data category. In Matplotlib, the pie() function represents it.
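The sketch below draws a pie plot from proportional data with pie(). The labels and shares are assumed values chosen so that they sum to 100.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: share of students per programming language.
labels = ["Python", "R", "Java", "Other"]
shares = [40, 30, 20, 10]

fig, ax = plt.subplots()
# autopct prints the percentage of the whole on each slice.
ax.pie(shares, labels=labels, autopct="%1.1f%%")
ax.set_title("Pie plot")
fig.savefig("pie_plot.png")
```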
Area plot:
Area plots fill the region under one or more lines, showing highs and lows; when several
series are layered on top of each other they are also known as stack plots. They resemble
line plots and help track changes over time for two or more related groups that together
make up one whole category. In Matplotlib, the stackplot() function represents it.
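A stacked area plot of related groups over time can be sketched with stackplot() as below. The yearly sales figures are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: yearly sales of three related products.
years = [2019, 2020, 2021, 2022, 2023]
product_a = [10, 12, 15, 18, 20]
product_b = [5, 7, 6, 9, 11]
product_c = [3, 4, 6, 5, 8]

fig, ax = plt.subplots()
# Each series is drawn on top of the previous one, so the top edge is the total.
layers = ax.stackplot(years, product_a, product_b, product_c,
                      labels=["Product A", "Product B", "Product C"])
ax.set_xlabel("Year")
ax.set_ylabel("Sales")
ax.legend(loc="upper left")
fig.savefig("area_plot.png")
```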
Histogram plot:
A histogram shows how numerical data is distributed by grouping values into bins and
counting the values in each bin, whereas a bar graph compares discrete categories.
Histograms and bar plots look alike but are used in different scenarios. In Matplotlib,
the hist() function represents this.
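The binning behaviour of hist() can be sketched as follows. The heights are randomly generated sample data (an assumption for illustration), seeded so that the figure is reproducible.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical data: 200 heights drawn from a normal distribution.
random.seed(0)
heights = [random.gauss(165, 8) for _ in range(200)]

fig, ax = plt.subplots()
# hist() splits the value range into 10 equal-width bins and counts each bin.
counts, bin_edges, _ = ax.hist(heights, bins=10, edgecolor="black")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")
fig.savefig("histogram.png")
```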
CONCLUSION: Thus, the above plots have been implemented using the Matplotlib library in
Python.
MULTIPLE CHOICE QUESTIONS:
1. What is true about Data Visualization?
A. Data Visualization is used to communicate information clearly and efficiently to users
by the usage of information graphics such as tables and charts.
B. Data Visualization helps users in analyzing a large amount of data in a simpler way.
C. Data Visualization makes complex data more accessible, understandable, and usable.
D. All of the above
2. Data can be visualized using?
A. graphs
B. charts
C. maps
D. All of the above
3. Which one of the following is the most basic and commonly used technique?
A. Line charts
B. Scatter plots
C. Population pyramids
D. Area charts
4. Which of the following is a tool for checking normality?
A. qqline()
B. qline()
C. anova()
D. lm()
5. Which of the following lists names of variables in a [Link]?
A. par()
B. names()
C. barchart()
D. quantile()
EXPERIMENT NO- 7
AIM: Visualization experiments in Python using the Plotly library.
RESOURCES REQUIRED: H/W :- P4 machine
S/W :- Jupyter Notebook
THEORY:
The Plotly Python library is an interactive, open-source plotting library. It is a helpful
tool for visualizing and understanding data simply and easily. Plotly Express is a
high-level interface to Plotly that is easy to use. It can plot various types of graphs and
charts such as scatter plots, line charts, bar charts, box plots, histograms, and pie
charts. Reasons to choose Plotly over other visualization tools or libraries:
• Plotly has hover tool capabilities that help us detect outliers or anomalies in a
large number of data points.
• Its charts are visually attractive and accessible to a wide range of audiences.
• It allows extensive customization of graphs, which makes plots more meaningful
and understandable for others.
Various plots using Plotly
Box plot: A box plot represents the five-number statistical summary of a dataset:
minimum, first quartile, median, third quartile, and maximum.
Violin charts:
Violin plots are distribution charts similar to box plots that allow visualizing the
underlying distribution of the data through a mirrored kernel density line of that data.
The violin() function from the plotly express module creates violin plots in Python. Pass
a numerical variable to y, or specify the name of the data frame column holding the
desired variable, to create a vertical violin plot.
Horizontal violin plot
If you pass the variable to x instead of y, you create a horizontal violin plot.
Violin plot with box plot inside:
Violin plots by group
Heatmaps:
A heatmap is a graphical representation of data that uses colors to visualize the values
of a matrix. Brighter colors, typically reddish ones, represent more common values or
higher activity, while darker colors represent less common values or lower activity. A
heatmap is also known as a shading matrix.
Bubble Chart
The bubble chart in Plotly is created using the scatter plot. It can be created using the
scatter() method of [Link]. A bubble chart is a data visualization that displays multiple
circles (bubbles) in a two-dimensional plot, as in a scatter plot, with the size of each
bubble representing an additional variable. A bubble chart is primarily used to depict
and show relationships between numeric variables.
CONCLUSION: Thus, the above plots have been implemented using the Plotly library in
Python.
MULTIPLE CHOICE QUESTIONS:
1. What is Plotly?
a) A programming language
b) A data visualization library
c) A machine learning algorithm
d) A database management system
2. Which programming languages can be used with Plotly?
a) Python, R, MATLAB, and JavaScript
b) Python, Ruby, C++, and Swift
c) Java, Scala, Kotlin, and Groovy
d) PHP, Perl, Lua, and Pascal
3. Which of the following chart types is NOT available in Plotly?
a) Line chart
b) Bar chart
c) Scatter plot
d) Heat map
4. What is a Plotly trace?
a) A single data series in a plot
b) A function that creates a plot
c) A library of pre-made charts
d) An interactive element of a plot
5. Which of the following Plotly chart types is best for comparing multiple data series?
a) Line chart
b) Bar chart
c) Scatter plot
d) Heat map
EXPERIMENT NO- 8
AIM: Visualization experiments in R using the ggplot2 library.
RESOURCES REQUIRED: H/W:- P4 machine
S/W:- R studio /Jupyter Notebook
THEORY:
ggplot2 is a free, open-source, and easy-to-use visualization package for the R
programming language, based on the Grammar of Graphics. It is a powerful and widely
used visualization package written by Hadley Wickham.
A ggplot2 plot is built from several layers. The layers are as follows:
Building Blocks of layers with the grammar of graphics
• Data: the data set itself
• Aesthetics: the data is mapped onto aesthetic attributes such as x-axis,
y-axis, color, fill, size, labels, alpha, shape, line width, line type
• Geometrics: how the data is displayed, using point, line, histogram, bar,
boxplot
• Facets: displays subsets of the data using columns and rows
• Statistics: binning, smoothing, descriptive, intermediate
• Coordinates: the space between data and display, using Cartesian, fixed, polar,
limits
• Themes: non-data ink
Data Layer: In the data layer we define the source of the information to be
visualized. Let's use the built-in mtcars dataset (from R's datasets package).
Aesthetic Layer: Here we map the dataset variables onto certain aesthetics.
Geometric layer: The geometric layer controls the essential elements, that is, how
our data is displayed using point, line, histogram, bar, boxplot.
Facet Layer: It is used to split the data into subsets of the entire dataset and it
allows the subsets to be visualized on the same plot. Here we separate rows
according to transmission type and separate columns according to cylinders.
Statistics layer: In this layer, we transform our data using binning, smoothing,
descriptive, intermediate.
Coordinates layer: In this layer, data coordinates are mapped onto the plane of the
graphic, and we adjust the axes and change the spacing of the displayed data to
control plot dimensions.
Theme Layer: This layer controls the finer points of display, like the font size and
background color properties. ggplot2 provides various types of visualizations. The
package includes many more parameters, which give greater control over the
visualization of data. Many packages can integrate with the ggplot2 package to
make the visualizations interactive and animated.
CONCLUSION: Thus, the above plots have been implemented using the ggplot2 library in R
programming.
MULTIPLE CHOICE QUESTIONS:
1. ______ grammar makes a clear distinction between your data and what gets displayed
on the screen or page.
a. ggplot1
b. ggplot2
c. [Link]
d. ggplot3
2. Which of the following is a plot to investigate the order in which observations
were recorded?
a. ggplot
b. ggsave
c. ggpcp
d. ggorder
3. ________ is used to create a plot to illustrate patterns of missing values.
a. ggmissplot
b. ggmissing
c. ggfluctuation
d. ggpcp
4. Which R data type is most appropriate for a categorical variable?
a. Numeric
b. Factor
c. Integer
d. Character
5. Which of the following opens the ggplot2 library?
a. [Link]("ggplot2")
b. library(package = "ggplot2")
c. summary(object = ggplot2)
d. open(x = ggplot2)
Contents Beyond Syllabus
AIM: Text Analysis Using Turicreate
RESOURCES REQUIRED: H/W: - P4 machine
S/W: - Jupyter Notebook
THEORY:
Text is a group of words or sentences. Text analysis is the process of analyzing text and
extracting information from it. Text data can strongly influence the success of a
company. For example:
• On an e-commerce website people buy things. With text analysis, the website can
learn what its customers like, and through this data it can improve its
productivity.
• Voice assistants such as Alexa and Google Home Mini work using text analysis and
machine learning algorithms. Both are based on Natural Language Processing.
Text analysis can be done using text mining. Text data can be structured as well as
unstructured, and text mining techniques help us differentiate between them.
Now let's do some text analysis using Turicreate. We will build a model that classifies a
message as spam or ham.
Link for the dataset=[Link]
classification
Step 1: Import the Turicreate library.
Step 2: Load the data set.
Step 3: Explore the data.
Step 4: Add the word count to the data set.
The data has two columns, category and message. Adding the word count helps in
model feature selection.
Step 5: Split the data into a train set and a test set.
Step 6: Build a model for classifying messages as spam or ham.
Step 7: Check the accuracy of the model.
Step 8: Verify a prediction manually by checking a message from the test data to see
whether the model gives the right answer.
Step 9: Predict on the test data.
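The workflow of Steps 4 to 9 can be sketched without any external library, as below. This is a stand-in and not Turicreate's API: the tiny dataset is invented, the split is a fixed slice rather than a random one, and the classifier is a simple multinomial naive Bayes model swapped in only to illustrate how word counts become features.

```python
import math
from collections import Counter

# Tiny made-up dataset of (category, message) pairs; not the lab's dataset.
messages = [
    ("spam", "win a free prize now"),
    ("spam", "free cash offer claim now"),
    ("spam", "claim your free prize"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "see you at the lab today"),
    ("ham", "please send the lab report"),
    ("spam", "free prize offer"),
    ("ham", "lunch at the lab"),
]


def word_count(text):
    """Step 4: bag-of-words feature mapping each word to its frequency."""
    return Counter(text.lower().split())


# Step 5: hold out the last two rows as the test set (fixed split for reproducibility).
train, test = messages[:6], messages[6:]

# Step 6: train a multinomial naive Bayes model from the word counts.
class_totals = Counter(label for label, _ in train)
word_totals = {label: Counter() for label in class_totals}
for label, text in train:
    word_totals[label].update(word_count(text))
vocabulary = {word for counts in word_totals.values() for word in counts}


def predict(text):
    """Return the label with the highest log-probability for the message."""
    scores = {}
    for label in class_totals:
        total_words = sum(word_totals[label].values())
        score = math.log(class_totals[label] / len(train))  # class prior
        for word, n in word_count(text).items():
            # Laplace smoothing keeps unseen words from zeroing the probability.
            p = (word_totals[label][word] + 1) / (total_words + len(vocabulary))
            score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Steps 7 to 9: predict on the test set and measure accuracy.
predictions = [predict(text) for _, text in test]
accuracy = sum(p == label for p, (label, _) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```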
CONCLUSION: Thus, we have implemented text analysis using the Turicreate library.
MULTIPLE CHOICE QUESTIONS:
1. By default, by how many spaces is code indented when using Python IDLE?
A. 2 B. 4 C. 3 D. 5
2. Text Mining is:
A. Conceptual
B. Theoretical
C. Empirical
D. All of the above
3. Predictive text analytics tasks include
A. Prediction
B. Classification
C. Clustering
D. All of the above
4. Which of the following technique is not a part of flexible text matching?
A. Soundex
B. Metaphone
C. Edit Distance
D. Keyword Hashing
5. Text Mining can be used in:
A. Detecting spam model
B. Predicting stock Movements
C. None of the above
D. Both a and b