Python Notes

The document provides an overview of Python programming, highlighting its ease of learning, strong community support, and backing from major corporations like Google. It discusses the versatility of Python, its extensive libraries, and its applications in fields such as data science and automation. Additionally, it covers the use of Tkinter for building graphical user interfaces in Python, including examples of various widgets and event handling.

Uploaded by

royankales

Kisii University, School of Information Science & Technology, Computing Sciences Department

Lecturer: Dr. Ben Mariga Bogonko, Computing Sciences Department

Unit Name: Programming in Python

Python Environment Setup Using Python Executable Version 3.9 and JupyterLab IDE (PyCharm, Anaconda, Jupyter
Notebook, Spyder, Atom, Thonny, Eclipse)

Reasons for the Popularity of Python Programming Language

1. Easy to Learn and Use

Python is easy for beginners to learn and use. Its syntax is simple and uncluttered and reads close to natural
language, which makes it one of the most accessible programming languages available. As a result, Python programs
can usually be written and brought to a working state faster than equivalent programs in many other languages.

When Guido van Rossum began developing Python in the late 1980s, he designed it as a general-purpose language.
One of the main reasons for its popularity is its simple syntax, which even novice developers can read and
understand easily.

2. Mature and Supportive Python Community

Python was created more than 30 years ago, which is ample time for a programming-language community to grow and
mature enough to support developers from beginner to expert level. Plenty of documentation, guides and video
tutorials are available, so learners and developers of any age or skill level can find the support they need to
deepen their knowledge of Python. Many students are first introduced to computer science through Python, the same
language that is used for in-depth research projects.

3. Support from Renowned Corporate Community

Programming languages grow faster when corporate sponsors back them. For example, PHP is backed by Facebook,
Java by Oracle (originally Sun), and Visual Basic and C# by Microsoft. Python is heavily backed by Facebook,
Amazon Web Services, and especially Google.

Google adopted Python back in 2006 and has used it for many applications and platforms since then. Google has
devoted a great deal of institutional effort and money to training in Python and to the success of the language,
and has even created a dedicated portal for it. The list of support tools and documentation for Python keeps
growing.

4. Hundreds of Python Libraries and Frameworks

Thanks to its corporate sponsorship and its large, supportive community, Python has excellent libraries that
save time and effort in the initial cycle of development. Many cloud media services also offer cross-platform
support through library-like tools, which can be extremely beneficial.

Libraries with a specific focus are also available, such as nltk for natural language processing and scikit-learn
for machine learning applications.

There are many frameworks and libraries that are available for python language, such as:

• matplotlib for plotting charts and graphs
• SciPy for engineering applications, science, and mathematics
• Requests and BeautifulSoup for fetching web pages and parsing HTML and XML
• NumPy for scientific computing
• Django for server-side web development

5. Versatility, Efficiency, Reliability, and Speed

Ask Python developers, and most will agree that the language is efficient and reliable and that development in it
is fast. Python runs in nearly any kind of environment, and for most tasks its performance is adequate whichever
platform one is working on.

Another strength is versatility: Python can be used in many kinds of environments, such as mobile applications,
desktop applications, web development, hardware programming, and many more. This wide range of applications
makes it more attractive to use.

6. Big data, Machine Learning and Cloud Computing

Cloud computing, machine learning, and big data are some of the hottest trends in computer science right now,
and they help many organizations transform and improve their processes and workflows.

Python is, alongside R, one of the most popular tools for data science and analytics. Many data processing
workloads in organizations are powered by Python. Much research and development is done in Python because of its
many applications, including the ease of analyzing and organizing usable data.

In addition, hundreds of Python libraries are used in thousands of machine learning projects every day, such as
TensorFlow for neural networks and OpenCV for computer vision.

7. First-choice Language

Python is the first choice of many programmers and students, mainly because it is in high demand in the
development market. Students and developers always look to learn a language that is in high demand, and Python is
undoubtedly one of the most sought-after languages now.

Many programmers and data science students use Python for their development projects, and learning Python is an
important part of data science certification courses. Python can therefore open up plenty of career opportunities
for students. Because of its variety of applications, one can pursue different career options and need not remain
stuck in one.

8. The Flexibility of Python Language

Python is so flexible that it gives developers the chance to try something new. An expert in Python is not limited
to building the same kinds of things but can go on to make something different from before.

Python does not restrict developers from building any sort of application. Few other programming languages offer
this much freedom and flexibility from learning just one language.

9. Use of Python in Academics

Python is now treated as a core programming language in schools and colleges because of its countless uses in
artificial intelligence, deep learning, data science, and so on. It has become such a fundamental part of the
development world that schools and colleges cannot afford not to teach it.

This, in turn, produces more Python developers and programmers and so further expands its growth and
popularity.

10. Automation

Python helps a lot in automating tasks, as many tools and modules are available that make the work much easier.
One can reach an advanced level of automation with only a small amount of Python code.

Python is also a strong performance booster in the automation of software testing. Automation tools can be written
in surprisingly little time and with few lines of code.

GUI BUILDING IN PYTHON USING TKINTER

Modern computer applications are user-friendly. User interaction is not restricted to console-based I/O. They have a
more ergonomic graphical user interface (GUI) thanks to high speed processors and powerful graphics hardware.
These applications can receive inputs through mouse clicks and can enable the user to choose from alternatives with
the help of radio buttons, dropdown lists, and other GUI elements (or widgets).

Such applications are developed using one of the various graphics libraries available. A graphics library is a
software toolkit with a collection of classes that define the functionality of various GUI elements. These
graphics libraries are generally written in C/C++.

In Python, GUI elements and their functionality are defined in the tkinter module. The following code
demonstrates the steps in creating a UI.

from tkinter import *

window=Tk()
# add widgets here

window.title('Hello Python')
window.geometry("300x200+10+20")
window.mainloop()

First, import the tkinter module. Then set up the application object by calling the Tk() function. This creates a
top-level window (root) with a frame that has a title bar, a control box with the minimize and close buttons, and
a client area to hold other widgets.

The geometry() method defines the width, height and coordinates of the top-left corner of the frame as below (all
values are in pixels):

window.geometry("widthxheight+XPOS+YPOS")

The application object then enters an event listening loop by calling the mainloop() method. The application now
waits constantly for any event generated on the elements in it. The event could be text entered in a text field, a
selection made from a dropdown or radio button, single or double click actions of the mouse, etc. The
application's functionality involves executing appropriate callback functions in response to a particular type of
event. We shall discuss event handling later in this tutorial. The event loop terminates when the close button on
the title bar is clicked. The above code will create the following window:

[Figure: Python-Tkinter Window]
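
The layout of the geometry string described above can be made concrete with a small parser. This is only an illustrative sketch: parse_geometry is a hypothetical helper written for these notes, not part of tkinter, and it handles only the full "widthxheight+XPOS+YPOS" form.

```python
import re

def parse_geometry(spec):
    """Split a geometry string such as "300x200+10+20" into its four parts."""
    match = re.fullmatch(r"(\d+)x(\d+)([+-]\d+)([+-]\d+)", spec)
    if match is None:
        raise ValueError(f"not a full geometry string: {spec!r}")
    width, height, xpos, ypos = (int(group) for group in match.groups())
    return {"width": width, "height": height, "x": xpos, "y": ypos}

print(parse_geometry("300x200+10+20"))
# {'width': 300, 'height': 200, 'x': 10, 'y': 20}
```

So "300x200+10+20" asks for a window 300 pixels wide and 200 pixels high whose top-left corner sits 10 pixels from the left and 20 pixels from the top of the screen.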

All Tkinter widget classes are inherited from the Widget class. Let's add the most commonly used widgets.

Button

The button can be created using the Button class. The Button class constructor requires a reference to the main
window and to the options.

Signature: Button(window, attributes)

You can set the following important properties to customize a button:

• text : caption of the button
• bg : background colour
• fg : foreground colour
• font : font name and size
• image : to be displayed instead of text
• command : function to be called when clicked

Example: Button
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()

Label

A label can be created in the UI in Python using the Label class. The Label constructor requires the top-level
window object and option parameters. The option parameters are similar to those of the Button class.

The following adds a label in the window.

Example: Label
from tkinter import *
window=Tk()
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
lbl.place(x=60, y=50)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()

Here, the label's caption will be displayed in red colour using Helvetica font of 16 point size.

Entry

This widget renders a single-line text box for accepting the user input. For multi-line text input use the Text widget.
Apart from the properties already mentioned, the Entry class constructor accepts the following:

• bd : border size of the text box; default is 2 pixels.
• show : to convert the text box into a password field, set the show property to "*".

The following code adds the text field.

txtfld=Entry(window, bg='black', fg='white', bd=5)
txtfld.insert(0, "This is Entry Widget")  # initial content is set with insert(), not a text option

The following example creates a window with a button, label and entry field.

Example: Create Widgets

from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
lbl.place(x=60, y=50)
txtfld=Entry(window, bd=5)
txtfld.insert(0, "This is Entry Widget")  # initial content of the text box
txtfld.place(x=80, y=150)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()

The above example will create the following window.

[Figure: Create UI Widgets in Python-Tkinter]

Selection Widgets

Radiobutton: This widget displays a toggle button having an ON/OFF state. There may be more than one button,
but only one of them will be ON at a given time.

Checkbutton: This is also a toggle button. A rectangular check box appears before its caption. Its ON state is
displayed by the tick mark in the box which disappears when it is clicked to OFF.

Combobox: This class is defined in the ttk module of the tkinter package. It populates its drop-down list from a
collection data type, such as a tuple or a list, passed as the values parameter.

Listbox: Unlike Combobox, this widget displays the entire collection of string items. The user can select one or
multiple items.

The following example demonstrates the window with the selection widgets: Radiobutton, Checkbutton, Listbox and
Combobox:

Example: Selection Widgets

from tkinter import *
from tkinter.ttk import Combobox
window=Tk()

var = StringVar()
var.set("one")
data=("one", "two", "three", "four")
cb=Combobox(window, values=data)
cb.place(x=60, y=150)

lb=Listbox(window, height=5, selectmode='multiple')

for num in data:
    lb.insert(END,num)
lb.place(x=250, y=150)

v0=IntVar()
v0.set(1)
r1=Radiobutton(window, text="male", variable=v0,value=1)
r2=Radiobutton(window, text="female", variable=v0,value=2)
r1.place(x=100,y=50)
r2.place(x=180, y=50)

v1 = IntVar()
v2 = IntVar()
C1 = Checkbutton(window, text = "Cricket", variable = v1)
C2 = Checkbutton(window, text = "Tennis", variable = v2)
C1.place(x=100, y=100)
C2.place(x=180, y=100)

window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()

[Figure: Create UI in Python-Tkinter]
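
The selection widgets above are displayed but never read back. One possible way to read their state is sketched below, under some assumptions: chosen_labels() and main() are hypothetical names, the widgets are laid out with pack() for brevity, and tkinter is imported inside main() so that the small helper can be tried without opening a window.

```python
def chosen_labels(flags):
    """Return the labels whose Checkbutton variable holds 1 (ticked)."""
    return [label for label, value in flags.items() if value == 1]

def main():
    # Imported here so chosen_labels() can be used without a display.
    from tkinter import Tk, IntVar, Button, Checkbutton, Radiobutton
    from tkinter.ttk import Combobox

    window = Tk()
    gender = IntVar(value=1)
    Radiobutton(window, text="male", variable=gender, value=1).pack()
    Radiobutton(window, text="female", variable=gender, value=2).pack()

    sports = {"Cricket": IntVar(), "Tennis": IntVar()}
    for label, var in sports.items():
        Checkbutton(window, text=label, variable=var).pack()

    cb = Combobox(window, values=("one", "two", "three", "four"))
    cb.pack()

    def show():
        # Each widget's state is read through its variable or its get() method.
        flags = {label: var.get() for label, var in sports.items()}
        print("radio value:", gender.get())
        print("ticked:", chosen_labels(flags))
        print("combobox:", cb.get())

    Button(window, text="Show", command=show).pack()
    window.mainloop()

print(chosen_labels({"Cricket": 1, "Tennis": 0}))  # ['Cricket']
```

Calling main() opens the window; clicking Show prints the current selections to the console.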

Event Handling

An event is a notification received by the application object from various GUI widgets as a result of user
interaction. The application object is always anticipating events, as it runs an event listening loop. User
actions include a mouse button click or double click, a key pressed while the cursor is inside a text box, an
element gaining or losing focus, etc.

Events are expressed as strings in <modifier-type-qualifier> format.

Many events are written with only one of these parts, usually the type (or, for a key, the qualifier). The type
defines the class of the event.

The following table shows how the Tkinter recognizes different events:

Event               Modifier  Type        Qualifier  Action

<Button-1>                    Button      1          Left mouse button click.
<Button-2>                    Button      2          Middle mouse button click.
<Destroy>                     Destroy                Window is being destroyed.
<Double-Button-1>   Double    Button      1          Double-click of mouse button 1 (left).
<Enter>                       Enter                  Cursor enters window.
<Expose>                      Expose                 Window fully or partially exposed.
<KeyPress-a>                  KeyPress    a          The "a" key has been pressed.
<KeyRelease>                  KeyRelease             Any key has been released.
<Leave>                       Leave                  Cursor leaves window.
<Print>                                   Print      PRINT key has been pressed.
<FocusIn>                     FocusIn                Widget gains focus.
<FocusOut>                    FocusOut               Widget loses focus.

An event should be registered with one or more GUI widgets in the application. If it's not, it will be ignored. In
Tkinter, there are two ways to register an event with a widget. The first way is by using the bind() method and
the second way is by using the command parameter in the widget constructor.

bind() Method

The bind() method associates an event with a callback function so that, when the event occurs, the function is
called.

Syntax:
widget.bind(event, callback)

For example, to invoke the MyButtonClicked() function on left button click, use the following code:

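Example: Event Binding

from tkinter import *

def MyButtonClicked(event):
    print('Button clicked at', event.x, event.y)

window=Tk()
btn = Button(window, text='OK')
btn.bind('<Button-1>', MyButtonClicked)
btn.pack()
window.mainloop()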

The event object is characterized by many properties such as source widget, position coordinates, mouse button
number and event type. These can be passed to the callback function if required.
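
As a small sketch of how a callback can use these properties, the function below formats a few attributes that tkinter sets on a mouse event (num, x, y and widget). The name describe_event and the stand-in object are assumptions made for illustration, so that the function can be tried without opening a window.

```python
from types import SimpleNamespace

def describe_event(event):
    """Build a one-line summary from attributes tkinter sets on a mouse event."""
    return f"button {event.num} at ({event.x}, {event.y}) on {event.widget}"

# Stand-in object with the same attribute names as a tkinter event.
# In a real program tkinter passes the event itself, for example:
#     btn.bind('<Button-1>', lambda e: print(describe_event(e)))
fake = SimpleNamespace(num=1, x=12, y=8, widget=".!button")
print(describe_event(fake))  # button 1 at (12, 8) on .!button
```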

Command Parameter

Each widget primarily responds to a particular type of event. For example, a Button is a source of the button
click event, so it is bound to it by default. The constructors of many widget classes have an optional parameter
called command. This parameter is set to the callback function that will be invoked whenever the bound event
occurs. This method is more convenient than the bind() method.

btn = Button(window, text='OK', command=myEventHandlerFunction)

In the example given below, the application window has two text input fields and a third to display the result.
There are two button objects with the captions Add and Subtract. The user is expected to enter a number in each of
the first two Entry widgets. Their sum or difference is displayed in the third.

The first button (Add) is configured using the command parameter. Its value is the add() method in the class. The
second button uses the bind() method to register the left button click with the sub() method. Both methods read
the contents of the text fields with the get() method of the Entry widget, parse them to numbers, perform the
addition or subtraction, and display the result in the third text field using the insert() method.

Example:
from tkinter import *
class MyWindow:
    def __init__(self, win):
        self.lbl1=Label(win, text='First number')
        self.lbl2=Label(win, text='Second number')
        self.lbl3=Label(win, text='Result')
        self.t1=Entry(win, bd=3)
        self.t2=Entry(win)
        self.t3=Entry(win)
        self.lbl1.place(x=100, y=50)
        self.t1.place(x=200, y=50)
        self.lbl2.place(x=100, y=100)
        self.t2.place(x=200, y=100)
        # Add uses the command parameter; Subtract uses bind()
        self.b1=Button(win, text='Add', command=self.add)
        self.b2=Button(win, text='Subtract')
        self.b2.bind('<Button-1>', self.sub)
        self.b1.place(x=100, y=150)
        self.b2.place(x=200, y=150)
        self.lbl3.place(x=100, y=200)
        self.t3.place(x=200, y=200)
    def add(self):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1+num2
        self.t3.insert(END, str(result))
    def sub(self, event):
        # callbacks registered with bind() receive the event object
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1-num2
        self.t3.insert(END, str(result))

window=Tk()
mywin=MyWindow(window)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()

The above example creates the following UI.

[Figure: UI in Python-Tkinter]

Thus, you can create the UI using Tkinter in Python.
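
In the calculator example, int() raises a ValueError when a field is empty or does not contain a whole number. One possible way to guard against that is to move the parsing into a helper that returns the text for the result field. The helper name compute and the message text are assumptions, not part of the original example.

```python
def compute(first, second, operation):
    """Return the text for the result field; operation is "add" or "sub"."""
    try:
        num1, num2 = int(first), int(second)
    except ValueError:
        # Raised by int() for empty or non-numeric input.
        return "Enter whole numbers"
    return str(num1 + num2 if operation == "add" else num1 - num2)

print(compute("7", "5", "add"))  # 12
print(compute("7", "5", "sub"))  # 2
print(compute("7", "x", "add"))  # Enter whole numbers
```

Inside add(), the result could then be shown with self.t3.insert(END, compute(self.t1.get(), self.t2.get(), "add")).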

Data Wrangling/Data Cleaning/Data Munging, Data Analysis, in Python Programming Language

Data cleaning/Data Wrangling is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.

When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data
is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset.
But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way
every time.

What is the difference between data cleaning and data transformation?

Data cleaning is the process that removes data that does not belong in your dataset. Data transformation is the
process of converting data from one format or structure into another. Transformation processes can also be referred
to as data wrangling, or data munging, transforming and mapping data from one "raw" data form into another format
for warehousing and analyzing. This article focuses on the processes of cleaning that data.

Python is an easy-to-learn programming language, which makes it the most preferred choice for beginners in Data
Science, Data Analytics, and Machine Learning. It also has a great community of online learners and excellent data-
centric libraries.

With so much data being generated, it becomes important that the data we use for Data Science applications like
Machine Learning and Predictive Modeling is clean. But what do we mean by clean data? And what makes data
dirty in the first place?

Dirty data simply means data that is erroneous. Duplicate records, incomplete or outdated data, and improper
parsing can make data dirty. This data needs to be cleaned. Data cleaning (or data cleansing) refers to the process
of “cleaning” this dirty data, by identifying errors in the data and then rectifying them.

Data cleaning is an important step in any Machine Learning project, and we will cover some basic data cleaning
techniques (in Python) in this article.

Cleaning Data in Python

We will learn more about data cleaning in Python with the help of an example dataset: the Russian housing
dataset from the Kaggle repository.

We will start by importing the required libraries.

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter magic command; omit it when running outside a notebook
%matplotlib inline

Download the data, and then read it into a Pandas DataFrame by using the read_csv() function, and specifying the
file path. Then use the shape attribute to check the number of rows and columns in the dataset. The code for this is
as below:

df = pd.read_csv('housing_data.csv')
df.shape

The dataset has 30,471 rows and 292 columns.

We will now separate the numeric columns from the categorical columns.

# select numerical columns

df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
# select non-numeric columns
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values

We are now through with the preliminary steps. We can now move on to data cleaning. We will start by identifying
columns that contain missing values and try to fix them.

Missing values

We will start by calculating the percentage of values missing in each column, and then storing this information in a
DataFrame.

# % of values missing in each column

values_list = list()
cols_list = list()
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())*100
    cols_list.append(col)
    values_list.append(pct_missing)
pct_missing_df = pd.DataFrame()
pct_missing_df['col'] = cols_list
pct_missing_df['pct_missing'] = values_list

The DataFrame pct_missing_df now contains the percentage of missing values in each column along with the
column names.
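
The same table can also be built without an explicit loop. The sketch below uses a tiny made-up frame (the name toy and its values are assumptions, not the housing data) to show the idea: isnull() marks each cell True or False, and the column mean of those flags is the share of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "full_sq": [43.0, 34.0, np.nan, 89.0],
    "state": [np.nan, np.nan, np.nan, 2.0],
    "sub_area": ["A", "B", "C", "D"],
})

# Mean of the True/False flags per column, scaled to a percentage.
pct_missing = toy.isnull().mean() * 100
pct_missing_df = pct_missing.rename("pct_missing").rename_axis("col").reset_index()
print(pct_missing_df)
```

Here full_sq is 25% missing, state is 75% missing and sub_area has no missing values.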

We can also create a visual out of this information for better understanding using the code below:

pct_missing_df.loc[pct_missing_df.pct_missing > 0].plot(kind='bar', figsize=(12,8))

plt.show()

The output after execution of the above line of code should look like this:

It is clear that some columns have very few values missing, while other columns have a substantial % of values
missing. We will now fix these missing values.

There are a number of ways in which we can fix these missing values. Some of them are:

Drop Observations

One way is to drop the observations that contain a null value in any of the columns. This works when the
percentage of missing values in each column is very small. We will drop observations that contain nulls in those
columns that have less than 0.5% nulls. These columns are metro_min_walk, metro_km_walk,
railroad_station_walk_km, railroad_station_walk_min, and ID_railroad_station_walk.

less_missing_values_cols_list = list(pct_missing_df.loc[(pct_missing_df.pct_missing < 0.5) &
                                     (pct_missing_df.pct_missing > 0), 'col'].values)
df.dropna(subset=less_missing_values_cols_list, inplace=True)

This will reduce the number of records in our dataset to 30,446 records.

Remove columns (Features)

Another way to tackle missing values in a dataset would be to drop those columns or features that have a significant
percentage of values missing. Such columns don’t contain a lot of information and can be dropped altogether from
the dataset. In our case, let us drop all those columns that have more than 40% values missing in them. These
columns would be build_year, state, hospital_beds_raion, cafe_sum_500_min_price_avg,
cafe_sum_500_max_price_avg, and cafe_avg_price_500.

# dropping columns with more than 40% null values

_40_pct_missing_cols_list = list(pct_missing_df.loc[pct_missing_df.pct_missing > 40, 'col'].values)
df.drop(columns=_40_pct_missing_cols_list, inplace=True)

The number of features in our dataset is now 286.
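
Both strategies, dropping rows for sparsely missing columns and dropping columns that are mostly empty, can be tried on a small frame. This is a sketch with made-up values; the frame toy and the 20% row threshold are assumptions chosen so that a ten-row example behaves like the 0.5% and 40% thresholds used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": range(1, 11),
    "metro_km": [1.2, 0.8, np.nan, 2.5, 3.1, 0.4, 1.9, 2.2, 0.7, 1.1],  # 10% missing
    "build_year": [np.nan] * 6 + [1990, 2001, 2010, 1985],              # 60% missing
})
pct_missing = toy.isnull().mean() * 100

# Columns with few gaps: drop the affected rows.
row_cols = pct_missing[(pct_missing > 0) & (pct_missing < 20)].index.tolist()
# Columns with many gaps: drop the whole column.
drop_cols = pct_missing[pct_missing > 40].index.tolist()

cleaned = toy.dropna(subset=row_cols).drop(columns=drop_cols)
print(cleaned.shape)  # (9, 2)
```

Unlike the inplace=True calls above, this version returns a new frame and leaves toy unchanged, which makes it easier to compare the data before and after.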

Impute Missing Values

There is still missing data left in our dataset. We will now impute the missing values in each numerical column with
the median value of that column.

df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
for col in numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0: # impute values only for columns that have missing values
        med = df[col].median() # impute with the median
        df[col] = df[col].fillna(med)

Missing values in numerical columns are now fixed. In the case of categorical columns, we will replace missing
values with the mode values of that column.

df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0: # impute values only for columns that have missing values
        mod = df[col].describe()['top'] # impute with the most frequently occurring value
        df[col] = df[col].fillna(mod)

All missing values in our dataset have now been treated. We can verify this by running the following piece of code:

df.isnull().sum().sum()

If the output is zero, it means that there are no missing values left in our dataset now.

We can also replace missing values with a particular value (like -9999 or ‘missing’) which will indicate the fact that
the data was missing in this place. This can be a substitute for missing value imputation.
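
The median and mode imputation described above can be checked on a small frame. This is a sketch with made-up values; the frame toy is an assumption, and mode() is used here in place of describe()['top'] as another way to get the most frequent value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "life_sq": [22.0, np.nan, 30.0, 38.0],
    "product_type": ["Investment", "OwnerOccupier", None, "Investment"],
})

for col in toy.columns:
    if toy[col].isnull().any():
        if pd.api.types.is_numeric_dtype(toy[col]):
            fill = toy[col].median()        # numeric: median of the observed values
        else:
            fill = toy[col].mode().iloc[0]  # categorical: most frequent value
        toy[col] = toy[col].fillna(fill)

print(toy)
```

The missing life_sq value becomes 30.0 (the median of 22, 30 and 38) and the missing product_type becomes "Investment".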

Outliers

An outlier is an unusual observation that lies away from the majority of the data. Outliers can affect the performance
of a Machine Learning model significantly. Hence, it becomes important to identify outliers and treat them.

Let us take the ‘life_sq’ column as an example. We will first use the describe() method to look at the descriptive
statistics and see if we can gather any information from it.

df.life_sq.describe()

The output will look like this:

count 30446.000000
mean 33.482658
std 46.538609
min 0.000000
25% 22.000000
50% 30.000000
75% 38.000000
max 7478.000000
Name: life_sq, dtype: float64

From the output, it is clear that something is not correct. The max value seems to be abnormally large compared to
the mean and median values. Let us make a boxplot of this data to get a better idea.

df.life_sq.plot(kind='box', figsize=(12, 8))

plt.show()

The output will look like this:

It is clear from the boxplot that the observation corresponding to the maximum value (7478) is an outlier in this data.
Descriptive statistics, boxplots, and scatter plots help us in identifying outliers in the data.

We can deal with outliers just like we dealt with missing values. We can either drop the observations that we think
are outliers, or we can replace the outliers with suitable values, or we can perform some sort of transformation on
the data (like log or exponential). In our case, let us drop the record where the value of ‘life_sq’ is 7478.

# removing the outlier value in life_sq column

df = df.loc[df.life_sq < 7478]
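
The text removes one hand-picked value. A rule-based alternative, which is not the method used above, is the 1.5 × IQR rule that boxplots commonly use to draw their whiskers. The sketch below applies it to a short made-up series; the numbers are assumptions, not the real column.

```python
import pandas as pd

life_sq = pd.Series([22, 25, 30, 33, 38, 7478], name="life_sq")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = life_sq.quantile(0.25), life_sq.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = life_sq[(life_sq < lower) | (life_sq > upper)]
print(outliers.tolist())  # [7478]
```

Whether a flagged value should be dropped, replaced or kept still depends on the data; the rule only points out candidates.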

Duplicate records

Data can sometimes contain duplicate values. It is important to remove duplicate records from your dataset before
you proceed with any Machine Learning project. In our data, since the ID column is a unique identifier, we will drop
duplicate records by considering all but the ID column.

# dropping duplicates by considering all columns other than ID

cols_other_than_id = list(df.columns)[1:]
df.drop_duplicates(subset=cols_other_than_id, inplace=True)

This drops the duplicate records. By checking the shape attribute, you can confirm that duplicate records have
actually been dropped. The number of observations is now 30,434.
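
Before dropping anything, it can help to count how many rows would be treated as duplicates. The following sketch uses a small made-up frame (toy and its values are assumptions) in which rows 1 and 3 differ only in their id.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "full_sq": [43, 34, 43, 89],
    "price_doc": [5850000, 6000000, 5850000, 13100000],
})

cols_other_than_id = [c for c in toy.columns if c != "id"]

# duplicated() flags every repeat after the first occurrence.
print(toy.duplicated(subset=cols_other_than_id).sum())  # 1

deduped = toy.drop_duplicates(subset=cols_other_than_id)
print(deduped["id"].tolist())  # [1, 2, 4]
```

Selecting the columns by name, rather than by position as in [1:], keeps the code correct even if the ID column is not the first one.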

Fixing Datatype

Values in a dataset are often not stored in the correct data type. This can create problems in later stages: we
may not get the desired output, or we may get errors during execution. One common data type error is with dates.
Dates are often parsed as objects (plain strings) by Pandas, even though Pandas has a separate data type for dates,
datetime64.

We will first check the data type of the timestamp column in our data.

df.timestamp.dtype

This returns the data type ‘object’. We now know the timestamp is not stored correctly. To fix this, let’s convert the
timestamp column to the DateTime format.

# converting timestamp to datetime format

df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d')

We now have the timestamp in the correct format. Similarly, there can be columns where integers are stored as
objects. Identifying such features and correcting the data type is important before you proceed on to Machine
Learning. Fortunately for us, we don’t have any such issue in our dataset.
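
When a column does contain values that cannot be converted, both pd.to_datetime() and pd.to_numeric() accept errors="coerce", which replaces such values with missing markers instead of raising an error. The sketch below shows this on a made-up frame; toy and its values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": ["2011-08-20", "2011-08-23", "not a date"],
    "floor": ["4", "3", "5"],
})

# errors="coerce" turns unparseable entries into NaT/NaN instead of raising.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d", errors="coerce")
toy["floor"] = pd.to_numeric(toy["floor"], errors="coerce")

print(toy.dtypes)
print(toy["timestamp"].isna().sum())  # 1
```

The coerced entries then show up as missing values and can be handled with the techniques from the Missing values section.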

Web Scraping
Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can
extract, parse, download and organize useful information from the web automatically.

Python Modules for Web Scraping

In other words, instead of manually saving data from websites, web scraping software automatically loads and
extracts data from multiple websites as required.

In this section, we discuss useful Python libraries for web scraping.

Requests

Requests is a simple and efficient HTTP library for accessing web pages. With the help of Requests, we can get
the raw HTML of web pages, which can then be parsed to retrieve the data. Requests is not part of the standard
library, so it must be installed before use.

Urllib3

It is another Python library that can be used for retrieving data from URLs similar to the requests library.

Selenium

Selenium is an open source automated testing suite for web applications across different browsers and platforms.
It is not a single tool but a suite of software. There are Selenium bindings for Python, Java, C#, Ruby and
JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings.

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome,
Remote etc. The current supported Python versions are 2.7, 3.5 and above.

Scrapy

Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page
with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 licensed under BSD, with a
milestone 1.0 releasing in June 2015. It provides us all the tools we need to extract, process and structure the data
from websites.

Analyzing a web page means understanding its structure. Why is this important for web scraping? Let us look at
this in detail.

Web page Analysis

Web page analysis is important because, without it, we cannot know in which form (structured or unstructured) we
are going to receive the data from that web page after extraction. We can do web page analysis in the following
ways:

Viewing Page Source

This is a way to understand how a web page is structured by examining its source code. To do this, right-click the
page and select the View page source option. We then get the data of interest from that web page in the form of
HTML. The main concern is the whitespace and formatting, which make the raw source difficult to read.

Inspecting Page Source by Clicking Inspect Element Option

This is another way of analyzing a web page. The difference is that it resolves the issue of formatting and
whitespace in the source code of the web page. To do this, right-click and select the Inspect or Inspect element
option from the menu. It provides information about a particular area or element of that web page.

Different Ways to Extract Data from Web Page

The following methods are mostly used for extracting data from a web page −

Regular Expression

Regular expressions are a small, highly specialized language embedded in Python. We can use them through Python's
re module. They are also called REs, regexes or regex patterns. With the help of regular expressions, we can
specify rules for the possible set of strings we want to match in the data.

Example

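In the following example, we are going to scrape data about India from [Link] after
matching the contents of <td> with the help of a regular expression.

import re
import urllib.request

# The URL is elided in the source notes; substitute the page to be scraped.
response = urllib.request.urlopen('[Link]')
html = response.read()    # raw bytes of the page
text = html.decode()      # decode the bytes to a string (UTF-8 by default)
# Non-greedy match of everything inside each <td class="w2p_fw"> cell
re.findall('<td class="w2p_fw">(.*?)</td>', text)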

Output

The corresponding output will be as shown here −

[
'<img src="/places/static/images/flags/[Link]" />',
'3,287,590 square kilometres',
'1,173,108,018',
'IN',
'India',
'New Delhi',
'<a href="/places/default/continent/AS">AS</a>',
'.in',
'INR',
'Rupee',
'91',
'######',
'^(\\d{6})$',
'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
'<div>
<a href="/places/default/iso/CN">CN </a>
<a href="/places/default/iso/NP">NP </a>
<a href="/places/default/iso/MM">MM </a>
<a href="/places/default/iso/BT">BT </a>
<a href="/places/default/iso/PK">PK </a>
<a href="/places/default/iso/BD">BD </a>
</div>'
]

Observe that in the above output you can see the details about country India by using regular expression.
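
The same pattern can be tried offline. The sketch below is a minimal illustration, assuming a small hand-written
HTML fragment (the sample_html string is made up for this example and is not taken from the site above). It shows
how the non-greedy group (.*?) captures the contents of each matching <td> cell.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
sample_html = (
    '<tr><td class="w2p_fl">Area:</td>'
    '<td class="w2p_fw">3,287,590 square kilometres</td></tr>'
    '<tr><td class="w2p_fl">Capital:</td>'
    '<td class="w2p_fw">New Delhi</td></tr>'
)

# (.*?) is non-greedy, so each match stops at the first closing </td>.
pattern = re.compile(r'<td class="w2p_fw">(.*?)</td>')
cells = pattern.findall(sample_html)
print(cells)
```

Only the cells whose class is w2p_fw are returned; the label cells (class w2p_fl) do not match the pattern.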

Beautiful Soup

Suppose we want to collect all the hyperlinks from a web page. We can use a parser called BeautifulSoup, which
is described in more detail at [Link]. In simple words,
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is typically used together with
requests, because it cannot fetch a web page by itself and needs an input document to create a soup object. You
can use the following Python script to gather the title of a web page and its hyperlinks.

Example

Note that in this example, we are extending the above example implemented with the requests Python module. We
use r.text to create a soup object, which is then used to fetch details such as the title of the web page.

First, we need to import necessary Python modules −

import requests
from bs4 import BeautifulSoup

The following line of code uses requests to make an HTTP GET request for the URL [Link] −

r = requests.get('[Link]')

Now we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')

print(soup.title)         # the whole <title> tag
print(soup.title.text)    # only the text inside the tag

Output

The corresponding output will be as shown here −

<title>Learn and Grow with Aditi Agarwal</title>

Learn and Grow with Aditi Agarwal
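
The script above prints only the title, while the text also mentions hyperlinks. The following is a minimal
sketch of collecting both from an in-memory document. It is an assumption-laden illustration: it uses the standard
library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the LinkCollector class
and the page string are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text is added to the title only while inside <title>...</title>.
        if self._in_title:
            self.title += data


# Hypothetical document; in practice this would be r.text from requests.
page = ('<html><head><title>Sample Page</title></head><body>'
        '<a href="/about">About</a> <a href="/contact">Contact</a>'
        '</body></html>')

collector = LinkCollector()
collector.feed(page)
print(collector.title)
print(collector.links)
```

With BeautifulSoup itself, the same result is usually obtained by looping over soup.find_all('a') and reading
each tag's href attribute.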

Example: Data extraction using lxml and requests

In the following example, we are scraping a particular element of the web page from [Link] by
using lxml and requests −

First, we need to import the requests and html from lxml library as follows −

import requests
from lxml import html

Now we need to provide the URL of the web page to scrape −

url = '[Link]'

Now we need to provide the path (Xpath) to particular element of that web page −

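path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content                # page body as bytes
source_code = html.fromstring(byte_string)    # parse into an element tree
tree = source_code.xpath(path)                # list of matching elements
print(tree[0].text_content())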

Output

The corresponding output will be as shown here −

The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.

Data Analysis

Introduction to Web Scraping

To process the data that has been scraped, we must store the data on our local machine in a particular format like
spreadsheet (CSV), JSON or sometimes in databases like MySQL.

CSV and JSON Data Processing

First, we are going to write the information grabbed from a web page into a CSV file (spreadsheet). Let us start
with a simple example in which we grab the information using the BeautifulSoup module, as we did earlier, and
then write that text into a CSV file using Python's csv module.

First, we need to import the necessary Python libraries as follows −

import requests
from bs4 import BeautifulSoup
import csv

The following line of code uses requests to make an HTTP GET request for the URL [Link] −

r = requests.get('[Link]')

Now, we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')

Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named [Link].

f = csv.writer(open('[Link]', 'w', newline=''))

f.writerow(['Title'])              # header row
f.writerow([soup.title.text])      # the page title as the only data row

After running this script, the textual information or the title of the webpage will be saved in the above mentioned
CSV file on your local machine.

Similarly, we can save the collected information in a JSON file. The following easy-to-understand Python script
grabs the same information as the last script, but this time the information is saved in [Link] by using
Python's json module.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('[Link]')
soup = BeautifulSoup(r.text, 'lxml')
with open('[Link]', 'wt') as outfile:
    # json.dump serializes the title directly; calling json.dumps first
    # would encode the string twice
    json.dump(soup.title.text, outfile)

After running this script, the grabbed information i.e. title of the webpage will be saved in the above mentioned text
file on your local machine.
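
The two scripts above need network access. As a self-contained sketch of just the storage step, the code below
writes a title to a CSV file and a JSON file and reads both back. The title value, the temporary directory and
the file names dataprocessing.csv and dataprocessing.json are assumptions made for this illustration.

```python
import csv
import json
import os
import tempfile

# Hypothetical scraped value; in the scripts above this is soup.title.text.
title = 'Learn and Grow with Aditi Agarwal'

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'dataprocessing.csv')
json_path = os.path.join(out_dir, 'dataprocessing.json')

# newline='' lets the csv module control line endings itself.
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title'])
    writer.writerow([title])

# json.dump writes the object straight to the open file.
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump({'title': title}, f)

# Read both files back to confirm what was stored.
with open(csv_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
with open(json_path, encoding='utf-8') as f:
    loaded = json.load(f)

print(rows)
print(loaded)
```

Using with blocks closes each file as soon as the block ends, so the data is flushed to disk before it is read
back.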

Data Processing using AWS S3

Sometimes we may want to save scraped data in our local storage for archive purposes. But what if we need to
store and analyze this data at a massive scale? The answer is the cloud storage service named Amazon S3 or AWS S3
(Simple Storage Service). Basically, AWS S3 is an object store built to store and retrieve any amount of data
from anywhere.

We can follow the following steps for storing data in AWS S3 −

Step 1 − First we need an AWS account which will provide us the secret keys for using in our Python script while
storing the data. It will create a S3 bucket in which we can store our data.

Step 2 − Next, we need to install boto3 Python library for accessing S3 bucket. It can be installed with the help of
the following command −

pip install boto3

Step 3 − Next, we can use the following Python script for scraping data from web page and saving it to AWS S3
bucket.

First, we need to import the Python libraries. Here we are working with requests for scraping and boto3 for
saving data to the S3 bucket.

import requests
import boto3

Now we can scrape the data from our URL.

data = requests.get("Enter the URL").text

Now, for storing data in the S3 bucket, we need to create an S3 client as follows −

s3 = boto3.client('s3')
bucket_name = "our-content"

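The next lines of code create the S3 bucket and upload the scraped data to it −

s3.create_bucket(Bucket = bucket_name, ACL = 'public-read')

# Key is the object's name inside the bucket and must not be empty;
# 'scraped-page.html' is a placeholder name chosen for this example.
s3.put_object(Bucket = bucket_name, Key = 'scraped-page.html', Body = data, ACL = "public-read")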

Now you can check the bucket with name our-content from your AWS account.

Data Processing using MySQL

Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link
[Link]

With the help of following steps, we can scrape and process data into MySQL table −

Step 1 − First, by using MySQL we need to create a database and a table in which we want to save our scraped
data. For example, we are creating the table with the following query −

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000), PRIMARY KEY(id));

Step 2 − Next, we need to deal with Unicode. Depending on the MySQL version and configuration, the default
character set may not cover full Unicode (MySQL 8.0 defaults to utf8mb4, but older versions do not). The
following commands change the default character set for the database, for the table and for both of the columns.

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL which can be installed with the help
of the following command:

pip install PyMySQL

Step 4 − Now, our database named Scrap, created earlier, is ready to store the data scraped from the web in the
table named Scrap_pages. In our example we are going to scrape data from Wikipedia and save it into our
database.

First, we need to import the required Python modules.

from urllib.request import urlopen

from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

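Now, make a connection, that is, integrate the database with Python.

# The host value is elided in the source notes; substitute the MySQL server address.
conn = pymysql.connect(host='[Link]', user='root', passwd=None, db='mysql',
                       charset='utf8mb4')
cur = conn.cursor()
cur.execute("USE scrap")
# Seed with a number; recent Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

def store(title, content):
    # Leave the %s placeholders unquoted; PyMySQL escapes and quotes the values itself
    cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)', (title, content))
    cur.connection.commit()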

Now, connect with Wikipedia and get data from it.

def getLinks(articleUrl):
    # The base URL is elided in the source notes
    html = urlopen('[Link]' + articleUrl)
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
    store(title, content)
    # Keep only internal article links: they start with /wiki/ and contain no colon
    return bs.find('div', {'id':'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
try:
    while len(links) > 0:
        # Follow a randomly chosen link from the current article
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)

Lastly, we need to close both cursor and connection.

finally:
    cur.close()
    conn.close()

This will save the data gathered from Wikipedia into the table named scrap_pages. If you are familiar with MySQL
and web scraping, then the above code should not be tough to understand.
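
The parameterized INSERT is the part of this script that most often goes wrong, so here is a minimal sketch of
it that runs without a MySQL server. It is an assumption, not the setup used above: the standard library's sqlite3
module with an in-memory database stands in for MySQL, and sqlite3 writes placeholders as ? where PyMySQL uses %s.
The sample title and content are made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE scrap_pages ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)'
)


def store(title, content):
    # Placeholders are left unquoted; the driver escapes the values.
    cur.execute(
        'INSERT INTO scrap_pages (title, content) VALUES (?, ?)',
        (title, content),
    )
    conn.commit()


try:
    # Quotes inside the values do not break the statement.
    store('Kevin "Bacon"', "It's an article with quotes.")
    cur.execute('SELECT title, content FROM scrap_pages')
    rows = cur.fetchall()
    print(rows)
finally:
    # Close the cursor and the connection whether or not an error occurred.
    cur.close()
    conn.close()
```

Passing the values as a separate tuple, rather than formatting them into the SQL string, is what protects the
query from quoting problems and SQL injection.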

Data processing using PostgreSQL

PostgreSQL, developed by a worldwide team of volunteers, is an open-source relational database management
system (RDBMS). Processing the scraped data using PostgreSQL is similar to doing so with MySQL. There are two
differences: first, the commands differ from those of MySQL, and second, we use the psycopg2 Python library to
integrate it with Python.

If you are not familiar with PostgreSQL, then you can learn it at [Link]. With
the help of the following command we can install the psycopg2 Python library:

pip install psycopg2

Introduction to Natural Language Processing

You can perform text analysis by using the Python library called Natural Language Toolkit (NLTK). Before
proceeding to the concepts of NLTK, let us understand the relation between text analysis and web scraping.

Analyzing the words in a text tells us which words are important, which words are unusual, and how words are
grouped. This analysis eases the task of web scraping.

Getting started with NLTK

The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying
and tagging parts of speech found in natural-language text such as English.

Installing NLTK

You can use the following command to install NLTK in Python −

pip install nltk

If you are using Anaconda, then the conda package for NLTK can be installed by using the following command −

conda install -c anaconda nltk

Downloading NLTK’s Data

After installing NLTK, we have to download its preset text repositories. Before downloading them, we need to
import NLTK with the help of the import command as follows −

import nltk

Now, with the help of the following command, NLTK data can be downloaded −

nltk.download()

Downloading all available NLTK packages will take some time, but installing all of them is recommended.

Installing Other Necessary packages

We also need some other Python packages like gensim and pattern for doing text analysis as well as building
natural language processing applications by using NLTK.

gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the
following command −

pip install gensim

pattern − Used to make gensim package work properly. It can be installed by the following command −

pip install pattern

Tokenization

The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can
be words, numbers or punctuation marks. It is also called word segmentation.

Example

NLTK module provides different packages for tokenization. We can use these packages as per our requirement.
Some of the packages are described here −

sent_tokenize package − This package will divide the input text into sentences. You can use the following
command to import this package −

from nltk.tokenize import sent_tokenize

word_tokenize package − This package will divide the input text into words. You can use the following command
to import this package −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package − This package will divide the input text into words and punctuation marks, each as
a separate token. You can use the following command to import this package −

from nltk.tokenize import WordPunctTokenizer
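
To make the idea of tokens concrete without installing NLTK or its data, the sketch below approximates sentence
and word-plus-punctuation tokenization with the standard re module. This is only a rough stand-in under simple
assumptions: it is not NLTK's algorithm, it does not handle abbreviations such as "Dr.", and the sample text is
made up.

```python
import re

text = "Web scraping is fun. It needs care, though!"

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)

# Runs of word characters and single punctuation marks become separate tokens.
tokens = re.findall(r'\w+|[^\w\s]', text)

print(sentences)
print(tokens)
```

The second pattern treats every punctuation mark as its own token, which is the behaviour described for the
WordPunctTokenizer package above.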

Stemming

In any language, there are different forms of a word. A language includes lots of variations due to grammatical
reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as
well as for web scraping projects, it is important for machines to understand that these different words have the
same base form. Hence it can be useful to extract the base forms of the words while analyzing the text.

This can be achieved by stemming which may be defined as the heuristic process of extracting the base forms of the
words by chopping off the ends of words.

NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some
of these packages are described here −

PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form.
You can use the following command to import this package −

from nltk.stem import PorterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after
stemming.

LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base
form. You can use the following command to import this package −

from nltk.stem import LancasterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer then the output would be the word ‘writ’
after stemming.

SnowballStemmer package − Snowball’s algorithm is used by this Python stemming package to extract the base
form. You can use the following command to import this package −

from nltk.stem import SnowballStemmer

For example, after giving the word ‘writing’ as the input to this stemmer then the output would be the word ‘write’
after stemming.

Lemmatization

Another way to extract the base form of words is by lemmatization, normally aiming to remove inflectional endings
by using vocabulary and morphological analysis. The base form of any word after lemmatization is called lemma.

NLTK module provides following packages for lemmatization −

WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as a
noun or as a verb. You can use the following command to import this package −

from nltk.stem import WordNetLemmatizer

Chunking

Chunking, which means dividing the data into small chunks, is one of the important processes in natural language
processing for identifying parts of speech and short phrases such as noun phrases. Chunking labels groups of
tokens, and with its help we can get the structure of the sentence.

Example

In this example, we are going to implement Noun-Phrase chunking by using NLTK Python module. NP chunking is
a category of chunking which will find the noun phrases chunks in the sentence.

Steps for implementing noun phrase chunking

We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition

In the first step we will define the grammar for chunking. It would consist of the rules which we need to follow.

Step 2 − Chunk parser creation

Now, we will create a chunk parser. It would parse the grammar and give the output.

Step 3 − The Output

In this last step, the output would be produced in a tree format.

First, we need to import the NLTK package as follows −

import nltk

Next, we need to define the sentence. Here DT is the determiner, VBP the verb, JJ the adjective, IN the
preposition and NN the noun.

sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

Next, we are giving the grammar in the form of regular expression.

grammar = "NP:{<DT>?<JJ>*<NN>}"

Now, the next line of code defines a parser for the grammar.

parser_chunking = nltk.RegexpParser(grammar)

Now, the parser will parse the sentence, and we store the result in a variable.

Output = parser_chunking.parse(sentence)

With the help of the following code, we can draw our output in the form of a tree.

Output.draw()

Bag of Words (BoW) Model: Extracting and Converting Text into Numeric Form

Bag of Words (BoW), a useful model in natural language processing, is basically used to extract features from
text. After the features are extracted, they can be used for modeling in machine learning algorithms, because
raw text cannot be used directly in ML applications.

Working of BoW Model

Initially, the model extracts a vocabulary from all the words in the documents. Then, using a document-term
matrix, it builds a model. In this way, the BoW model represents each document as a bag of words only, and the
order or structure is discarded.

Example

Suppose we have the following two sentences −

Sentence1 − This is an example of Bag of Words model.

Sentence2 − We can extract features by using Bag of Words model.

Now, by considering these two sentences, we have the following 14 distinct words −

• This
• is
• an
• example
• bag
• of
• words
• model
• we
• can
• extract
• features
• by
• using

Building a Bag of Words Model in Python

Let us look into the following Python script, which builds a BoW model using the CountVectorizer class from
scikit-learn.

First, import the following package −

from sklearn.feature_extraction.text import CountVectorizer

Next, define the set of sentences −

Sentences = ['This is an example of Bag of Words model.',
             'We can extract features by using Bag of Words model.']
vector_count = CountVectorizer()
# Learn the vocabulary and build the document-term matrix
features_text = vector_count.fit_transform(Sentences).todense()
print(vector_count.vocabulary_)

Output

It shows that we have 14 distinct words in the above two sentences −

{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using': 11
}
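
To see what the vocabulary and the document-term matrix contain, the following minimal sketch rebuilds them
with the standard library only. It is an illustration, not scikit-learn's implementation: the tokenize helper is
a hypothetical name, and it assumes the same basic rules (lowercasing, tokens of at least two word characters,
vocabulary sorted alphabetically).

```python
from collections import Counter
import re

sentences = [
    'This is an example of Bag of Words model.',
    'We can extract features by using Bag of Words model.',
]


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())


# Vocabulary: every distinct word, sorted, mapped to a column index.
all_words = sorted({word for s in sentences for word in tokenize(s)})
vocabulary = {word: index for index, word in enumerate(all_words)}

# Document-term matrix: one row per sentence, one count per vocabulary word.
matrix = []
for s in sentences:
    counts = Counter(tokenize(s))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has 14 entries, one per vocabulary word. The word "of" appears twice in the first sentence, so its
column holds 2 in the first row, which is the only information about the sentence that the model keeps.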

Topic Modeling: Identifying Patterns in Text Data

Generally, documents are grouped into topics, and topic modeling is a technique to identify the patterns in a
text that correspond to a particular topic. In other words, topic modeling is used to uncover abstract themes or
hidden structure in a given set of documents.

You can use topic modeling in following scenarios −

Text Classification

Classification can be improved by topic modeling because it groups similar words together rather than using each
word separately as a feature.

Recommender Systems

We can build recommender systems by using similarity measures.

Topic Modeling Algorithms

We can implement topic modeling by using the following algorithms −

Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms and uses probabilistic graphical
models for implementing topic modeling.

Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based upon linear algebra and uses
the concept of SVD (Singular Value Decomposition) on the document-term matrix.

Non-Negative Matrix Factorization (NMF) − It is also based upon linear algebra, like LSA.

The above mentioned algorithms would have the following elements −

• Number of topics: Parameter
• Document-Word Matrix: Input
• WTM (Word Topic Matrix) & TDM (Topic Document Matrix): Output

Python File I/O - Read and Write Files

In Python, the IO module provides methods of three types of IO operations; raw binary files, buffered binary files,
and text files. The canonical way to create a file object is by using the open() function.

Any file operations can be performed in the following three steps:

1. Open the file to get the file object using the built-in open() function. There are different access modes,
which you can specify while opening a file using the open() function.
2. Perform read, write, append operations using the file object retrieved from the open() function.
3. Close and dispose the file object.

Reading File

File object includes the following methods to read data from the file.

• read(chars): reads the specified number of characters starting from the current position.
• readline(): reads the characters starting from the current reading position up to a newline character.
• readlines(): reads all lines until the end of file and returns a list object.

The following C:\[Link] file will be used in all the examples of reading and writing files.

C:\[Link]
This is the first line.
This is the second line.
This is the third line.

The following example performs the read operation using the read(chars) method.

Example: Reading a File

>>> f = open(r'C:\[Link]') # opening a file
>>> lines = f.read() # reading a file
>>> lines
'This is the first line.\nThis is the second line.\nThis is the third line.'
>>> f.close() # closing file object

Above, f = open(r'C:\[Link]') opens the file in the default read mode and returns a file object. The r prefix
makes the path a raw string, so the backslash is not treated as an escape character. f.read() reads all the
content until EOF as a string. If you specify the char size argument in the read(chars) method, then it will
read that many chars only. f.close() will flush and close the stream.

Reading a Line

The following example demonstrates reading a line from the file.

Example: Reading Lines

>>> f = open(r'C:\[Link]') # opening a file
>>> line1 = f.readline() # reading a line
>>> line1
'This is the first line.\n'
>>> line2 = f.readline() # reading a line
>>> line2
'This is the second line.\n'
>>> line3 = f.readline() # reading a line
>>> line3
'This is the third line.'
>>> line4 = f.readline() # reading a line
>>> line4
''
>>> f.close() # closing file object

The file is opened in the default 'r' (read) mode. Each call to readline() returns the next line and moves the
reading position forward. Once EOF is reached, readline() returns an empty string.

Reading All Lines

The following reads all lines using the readlines() function, which returns them as a list of strings.

Example: Reading All Lines

>>> f = open(r'C:\[Link]') # opening a file
>>> lines = f.readlines() # reading all lines
>>> lines
['This is the first line.\n', 'This is the second line.\n', 'This is the third line.']
>>> f.close() # closing file object

The file object has an inbuilt iterator. The following program reads the given file line by line until StopIteration is
raised, i.e., the EOF is reached.

Example: File Iterator

f = open(r'C:\[Link]')
while True:
    try:
        line = next(f)      # raises StopIteration at EOF
        print(line)
    except StopIteration:
        break
f.close()

Use the for loop to read a file easily.

Example: Read File using the For Loop

f = open(r'C:\[Link]')
for line in f:
    print(line, end='')     # each line already ends with a newline
f.close()

Output
This is the first line.
This is the second line.
This is the third line.
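
All the reading examples call close() by hand. A common alternative is the with statement, which closes the file
automatically. The sketch below is a minimal illustration; the temporary directory and the file name sample.txt
are assumptions made so that the example can run anywhere.

```python
import os
import tempfile

# Hypothetical sample file created in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('This is the first line.\n'
            'This is the second line.\n'
            'This is the third line.')

# The with block closes the file automatically when it ends.
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)
print(f.closed)
```

After the block ends, f.closed is True even if an exception was raised inside the block, so no explicit close()
call is needed.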

Writing to a File

The file object provides the following methods to write to a file.

• write(s): Write the string s to the stream and return the number of characters written.
• writelines(lines): Write a list of lines to the stream. Each line must have a separator at the end of it.

Create a new File and Write

The following creates a new file if it does not exist or overwrites an existing file.

Example: Create or Overwrite to Existing File

>>> f = open(r'C:\[Link]','w')
>>> f.write("Hello") # writing to file
5
>>> f.close()

# reading file
>>> f = open(r'C:\[Link]','r')
>>> f.read()
'Hello'
>>> f.close()

In the above example, the f = open(r'C:\[Link]','w') statement opens [Link] in write mode; open() returns the
file object, which is assigned to the variable f. 'w' specifies that the file should be writable. Because 'w' mode
truncates the file on opening, f.write("Hello") replaces any existing content of the file. It returns the number
of characters written, which is 5 in the above example. In the end, f.close() closes the file object.

Appending to an Existing File

The following appends the content at the end of the existing file by passing 'a' or 'a+' mode in the open() method.

Example: Append to Existing File

>>> f = open(r'C:\[Link]','a')
>>> f.write(" World!")
7
>>> f.close()

# reading file
>>> f = open(r'C:\[Link]','r')
>>> f.read()
'Hello World!'
>>> f.close()

Write Multiple Lines

Python provides the writelines() method to save the contents of a list object in a file. Since the newline character is
not automatically written to the file, it must be provided as a part of the string.

Example: Write Lines to File

>>> lines=["Hello world.\n", "Welcome to TutorialsTeacher.\n"]
>>> f=open(r"D:\[Link]", "w")
>>> f.writelines(lines)
>>> f.close()

A file opened in "w" or "a" mode can only be written to and cannot be read from. Similarly, "r" mode
allows reading only and not writing. To both read and append in the same session, use "a+" mode.
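
The "a+" mode mentioned above has one detail worth showing: after appending, the position is at the end of the
file, so it must be moved back before reading. The following is a minimal sketch; the temporary directory and the
file name notes.txt are assumptions made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'notes.txt')  # hypothetical file name

with open(path, 'w') as f:
    f.write('Hello')

with open(path, 'a+') as f:
    f.write(' World!')   # in append mode, writes go to the end of the file
    f.seek(0)            # move the position back to the start before reading
    content = f.read()

print(content)
```

Without the seek(0) call, read() would start at the end of the file and return an empty string.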

Writing to a Binary File

The open() function opens a file in text format by default. To open a file in binary format, add 'b' to the mode
parameter. Hence the "rb" mode opens the file in binary format for reading, while the "wb" mode opens the file in
binary format for writing. Unlike text files, binary files are not human-readable. When opened using any text editor,
the data is unrecognizable.

The following code stores a list of numbers in a binary file. The list is first converted into a byte array
before writing. The built-in function bytearray() returns a byte representation of the object.

Example: Write to a Binary File

f = open("[Link]", "wb")
num = [5, 10, 15, 20, 25]
arr = bytearray(num)     # each number must be between 0 and 255
f.write(arr)
f.close()
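
Reading the data back uses the "rb" mode described above. The sketch below is a minimal round trip under stated
assumptions: the temporary directory and the file name numbers.bin are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'numbers.bin')  # hypothetical file name
num = [5, 10, 15, 20, 25]

with open(path, 'wb') as f:
    f.write(bytearray(num))     # each value must be in the range 0..255

with open(path, 'rb') as f:
    raw = f.read()              # a bytes object

restored = list(raw)            # iterating over bytes yields integers
print(raw)
print(restored)
```

Because each number is stored as a single byte, the file is exactly five bytes long, and converting the bytes
object to a list recovers the original numbers.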

Exception Handling in Python

The cause of an exception is often external to the program itself. For example, an incorrect input, a malfunctioning
IO device etc. Because the program abruptly terminates on encountering an exception, it may cause damage to
system resources, such as files. Hence, the exceptions should be properly handled so that an abrupt termination of
the program is prevented.

Python uses try and except keywords to handle exceptions. Both keywords are followed by indented blocks.

Syntax:
try:
    # statements in try block
except:
    # executed when error in try block

The try: block contains one or more statements which are likely to encounter an exception. If the statements in this
block are executed without an exception, the subsequent except: block is skipped.

If the exception does occur, the program flow is transferred to the except: block. The statements in the except: block
are meant to handle the cause of the exception appropriately. For example, returning an appropriate error message.

You can specify the type of exception after the except keyword. The subsequent block will be executed only if the
specified exception occurs. There may be multiple except clauses with different exception types in a single try
block. If the type of exception doesn't match any of the except blocks, it will remain unhandled and the program will
terminate.

The rest of the statements after the except block will continue to be executed, regardless if the exception is
encountered or not.

The following example raises an exception when we try to divide an integer by a string.

Example: try...except blocks

try:
    a = 5
    b = '0'
    print(a/b)
except:
    print('Some error occurred.')
print("Out of try except blocks.")

Output
Some error occurred.
Out of try except blocks.

As noted above, naming a specific exception type after the except keyword restricts the handler to that type. In
the following example only a TypeError is caught; any other exception would remain unhandled and terminate the
program.

Example: Catch Specific Error Type

try:
    a = 5
    b = '0'
    print(a + b)
except TypeError:
    print('Unsupported operation')
print("Out of try except blocks")

Output
Unsupported operation
Out of try except blocks

As mentioned above, a single try block may have multiple except blocks. The following example uses two except
blocks to process two different exception types:

Example: Multiple except Blocks

try:
    a = 5
    b = 0
    print(a/b)
except TypeError:
    print('Unsupported operation')
except ZeroDivisionError:
    print('Division by zero not allowed')
print('Out of try except blocks')

Output
Division by zero not allowed
Out of try except blocks

However, if variable b is set to '0', TypeError will be encountered and processed by corresponding except block.
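
Both cases can be exercised in one place by wrapping the division in a function. The sketch below is a minimal
illustration; safe_divide is a hypothetical helper name, and the as clause is used to get access to the exception
object itself.

```python
def safe_divide(a, b):
    """Return a / b, or a short error description if the division fails."""
    try:
        return a / b
    except ZeroDivisionError as err:
        # str(err) is the message carried by the exception object.
        return 'ZeroDivisionError: ' + str(err)
    except TypeError as err:
        # type(err).__name__ gives the exception class name.
        return type(err).__name__ + ' raised'


print(safe_divide(5, 2))
print(safe_divide(5, 0))
print(safe_divide(5, '0'))
```

Only the first matching except block runs, so the order of the blocks matters when the exception types are
related by inheritance.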

else and finally

In Python, keywords else and finally can also be used along with the try and except clauses. While the except block
is executed if the exception occurs inside the try block, the else block gets processed if the try block is found to be
exception free.

Syntax:
try:
    # statements in try block
except:
    # executed when error in try block
else:
    # executed if try block is error-free
finally:
    # executed irrespective of whether an exception occurred or not

The finally block consists of statements which should be processed regardless of an exception occurring in the try
block or not. As a consequence, the error-free try block skips the except clause and enters the finally block before
going on to execute the rest of the code. If, however, there's an exception in the try block, the appropriate except
block will be processed, and the statements in the finally block will be processed before proceeding to the rest of the
code.

The example below accepts two numbers from the user and performs their division. It demonstrates the uses of else
and finally blocks.

Example: try, except, else, finally blocks
try:
    print('try block')
    x = int(input('Enter a number: '))
    y = int(input('Enter another number: '))
    z = x/y
except ZeroDivisionError:
    print("except ZeroDivisionError block")
    print("Division by 0 not accepted")
else:
    print("else block")
    print("Division = ", z)
finally:
    print("finally block")
    x = 0
    y = 0
print("Out of try, except, else and finally blocks.")

The first run is a normal case. The output of the else and finally blocks is displayed because the try block is
error-free.

Output
try block
Enter a number: 10
Enter another number: 2
else block
Division = 5.0
finally block
Out of try, except, else and finally blocks.

The second run is a case of division by zero, hence, the except block and the finally block are executed, but the else
block is not executed.

Output
try block
Enter a number: 10
Enter another number: 0
except ZeroDivisionError block
Division by 0 not accepted
finally block
Out of try, except, else and finally blocks.

In the third run case, an uncaught exception occurs. The finally block is still executed but the program terminates
and does not execute the program after the finally block.

Output
try block
Enter a number: 10
Enter another number: xyz
finally block
Traceback (most recent call last):
File "C:\python36\codes\[Link]", line 3, in <module>
y=int(input('Enter another number: '))
ValueError: invalid literal for int() with base 10: 'xyz'

Typically, the finally clause is the ideal place for cleanup operations, for example closing a file
regardless of any errors in the read/write operations. This will be dealt with in the next chapter.
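
The cleanup idea can be sketched as follows. This is only an illustration: the helper name read_first_number and the temporary file are assumptions made for the example, not part of the notes above. The finally clause closes the file whether or not the conversion to int fails.

```python
import os
import tempfile


def read_first_number(path):
    """Return the first line of the file as an int; always close the file."""
    f = open(path)
    try:
        return int(f.readline())
    finally:
        # Runs on success and on error alike, so the file never stays open
        f.close()
        print("file closed:", f.closed)


# Demo with a temporary file holding text that is not a number (made-up data)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("abc\n")

try:
    read_first_number(tmp.name)
except ValueError as err:
    print("could not parse:", err)
finally:
    os.remove(tmp.name)
```

In everyday code the with statement gives the same guarantee more compactly, but the explicit try/finally form shows what happens underneath.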

Raise an Exception

Python also provides the raise keyword to be used in the context of exception handling. It causes an exception to be
generated explicitly. Built-in errors are raised implicitly. However, a built-in or custom exception can be forced
during execution.

The following code accepts a number from the user. The try block raises a ValueError exception if the number is
outside the allowed range.

Example: Raise an Exception

try:
    x=int(input('Enter a number upto 100: '))
    if x > 100:
        raise ValueError(x)
except ValueError:
    print(x, "is out of allowed range")
else:
    print(x, "is within the allowed range")

Output
Enter a number upto 100: 200
200 is out of allowed range
Enter a number upto 100: 50
50 is within the allowed range

Here, the raised exception is of type ValueError. However, you can also define your own custom exception type and raise it.
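
A custom exception is simply a class that inherits from Exception. The sketch below is one possible way to write it; the names OutOfRangeError and check_range are hypothetical and chosen only for this illustration.

```python
class OutOfRangeError(Exception):
    """Hypothetical custom exception for values above the allowed limit."""

    def __init__(self, value, limit=100):
        super().__init__(f"{value} is out of allowed range (limit {limit})")
        self.value = value
        self.limit = limit


def check_range(x, limit=100):
    """Return x unchanged, or raise OutOfRangeError if it exceeds the limit."""
    if x > limit:
        raise OutOfRangeError(x, limit)
    return x


for candidate in (50, 200):
    try:
        check_range(candidate)
    except OutOfRangeError as err:
        print(err)
    else:
        print(candidate, "is within the allowed range")
```

Catching OutOfRangeError instead of ValueError means that an unrelated ValueError (for example from int('xyz')) is no longer mistaken for a range problem.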

Testing Linear Regression Assumptions in Python

Checking model assumptions is like commenting code. Everybody should be doing it often, but it sometimes ends
up being overlooked in reality. A failure to do either can result in a lot of time being confused, going down rabbit
holes, and can have pretty serious consequences from the model not being interpreted correctly.

Linear regression is a fundamental tool that has distinct advantages over other regression algorithms. Due to its
simplicity, it’s an exceptionally quick algorithm to train, which typically makes it a good baseline algorithm for
common regression scenarios. More importantly, models trained with linear regression are the most interpretable
kind of regression models available - meaning it’s easier to take action from the results of a linear regression model.
However, if the assumptions are not satisfied, the interpretation of the results will not always be valid. This can be
very dangerous depending on the application.

This post contains code for tests on the assumptions of linear regression and examples with both a real-world dataset
and a toy dataset.

The Data

For our real-world dataset, we’ll use the Boston house prices dataset from the late 1970’s. The toy dataset will be
created using scikit-learn’s make_regression function which creates a dataset that should perfectly satisfy all of our
assumptions.

One thing to note is that I’m assuming outliers have been removed in this blog post. This is an important part of any
exploratory data analysis (which isn’t being performed in this post in order to keep it short) that should happen in
real world scenarios, and outliers in particular will cause significant issues with linear regression. See Anscombe’s
Quartet for examples of outliers causing issues with fitting linear regression models.

Here are the variable descriptions for the Boston housing dataset straight from the documentation:

• CRIM: Per capita crime rate by town
• ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS: Proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
• NOX: Nitric oxides concentration (parts per 10 million)
• RM: Average number of rooms per dwelling
• AGE: Proportion of owner-occupied units built prior to 1940
• DIS: Weighted distances to five Boston employment centers
• RAD: Index of accessibility to radial highways
• TAX: Full-value property-tax rate per $10,000
• PTRATIO: Pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: % lower status of the population
• MEDV: Median value of owner-occupied homes in $1,000’s

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
%matplotlib inline

"""
Real-world data of Boston housing prices
Additional Documentation: [Link]

Attributes:
data: Features/predictors
target: Target/label/response variable
feature_names: Abbreviations of names of features
"""
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston = datasets.load_boston()

"""
Artificial linear data using the same number of features and observations as the
Boston housing prices dataset for assumption test comparison
"""
linear_X, linear_y = datasets.make_regression(n_samples=boston.data.shape[0],
                                              n_features=boston.data.shape[1],
                                              noise=75, random_state=46)

# Setting feature names to X1, X2, X3, etc. if they are not defined
linear_feature_names = ['X'+str(feature+1) for feature in range(linear_X.shape[1])]

Now that the data is loaded in, let’s preview it:

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['HousePrice'] = boston.target

df.head()

Initial Setup

Before we test the assumptions, we’ll need to fit our linear regression models. I have a master function for
performing all of the assumption testing at the bottom of this post that does this automatically, but to abstract the
assumption tests out to view them independently we’ll have to re-write the individual tests to take the trained model
as a parameter.

from sklearn.linear_model import LinearRegression

# Fitting the model
boston_model = LinearRegression()
boston_model.fit(boston.data, boston.target)

# Returning the R^2 for the model
boston_r2 = boston_model.score(boston.data, boston.target)
print('R^2: {0}'.format(boston_r2))

R^2: 0.7406077428649428

# Fitting the model
linear_model = LinearRegression()
linear_model.fit(linear_X, linear_y)

# Returning the R^2 for the model
linear_r2 = linear_model.score(linear_X, linear_y)
print('R^2: {0}'.format(linear_r2))

R^2: 0.873743725796525

def calculate_residuals(model, features, label):
    """
    Creates predictions on the features with the model and calculates residuals
    """
    predictions = model.predict(features)
    df_results = pd.DataFrame({'Actual': label, 'Predicted': predictions})
    # Residual = actual - predicted (signed, so no abs() on either term)
    df_results['Residuals'] = df_results['Actual'] - df_results['Predicted']

    return df_results

We’re all set, so onto the assumption testing!

The Assumptions

I) Linearity Assumption

This assumes that there is a linear relationship between the predictors (i.e., independent variables or features) and the
response variable (i.e., dependent variable or label). This also assumes that the predictors are additive.

Why it can happen: There may simply not be a linear relationship in the data. Modeling is about trying to
estimate a function that explains a process, and linear regression would not be a fitting estimator (pun intended) if
there is no linear relationship.

What it will affect: The predictions will be extremely inaccurate because our model is underfitting. This is a serious
violation that should not be ignored.

How to detect it: If there is only one predictor, this is pretty easy to test with a scatter plot. Most cases aren’t so
simple, so we’ll have to modify this by using a scatter plot to see our predicted values versus the actual values (in
other words, view the residuals). Ideally, the points should lie on or around a diagonal line on the scatter plot.

How to fix it: Either add polynomial terms to some of the predictors or apply nonlinear transformations. If
those do not work, try adding additional variables to help capture the relationship between the predictors and the
label.
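
To see why a polynomial term can help, here is a small sketch in plain Python. It is a single-predictor simplification on made-up data: the helper names fit_line and r_squared are assumptions for this illustration, and the predictor x is swapped for x**2 rather than added alongside it.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """R^2 of the least-squares line of ys on xs."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data where y depends on x squared, not on x itself
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [2 * x ** 2 + 1 for x in xs]

r2_raw = r_squared(xs, ys)                        # x as the predictor
r2_squared = r_squared([x ** 2 for x in xs], ys)  # x**2 as the predictor
print('R^2 with x:   ', round(r2_raw, 3))
print('R^2 with x**2:', round(r2_squared, 3))
```

The straight line in x explains none of the variation, while the same linear fit on x**2 explains all of it. The model is still linear in its coefficients; only the feature changed.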

def linear_assumption(model, features, label):
    """
    Linearity: Assumes that there is a linear relationship between the predictors and
               the response variable. If not, either a quadratic term or another
               algorithm should be used.
    """
    print('Assumption 1: Linear Relationship between the Target and the Feature', '\n')

    print('Checking with a scatter plot of actual vs. predicted.',
          'Predictions should follow the diagonal line.')

    # Calculating residuals for the plot
    df_results = calculate_residuals(model, features, label)

    # Plotting the actual vs predicted values
    # (seaborn 0.9+ uses height; older releases called this parameter size)
    sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False, height=7)

    # Plotting the diagonal line
    line_coords = np.arange(df_results.min().min(), df_results.max().max())
    plt.plot(line_coords, line_coords,  # X and y points
             color='darkorange', linestyle='--')
    plt.title('Actual vs. Predicted')
    plt.show()

We’ll start with our linear dataset:

linear_assumption(linear_model, linear_X, linear_y)

Assumption 1: Linear Relationship between the Target and the Feature

Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.

We can see a relatively even spread around the diagonal line.

Now, let’s compare it to the Boston dataset:

linear_assumption(boston_model, boston.data, boston.target)

Assumption 1: Linear Relationship between the Target and the Feature

Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.

We can see in this case that there is not a perfect linear relationship. Our predictions are biased towards lower values
in both the lower end (around 5-10) and especially at the higher values (above 40).

II) Normality of the Error Terms

More specifically, this assumes that the error terms of the model are normally distributed. Linear regressions other
than Ordinary Least Squares (OLS) may also assume normality of the predictors or the label, but that is not the case
here.

Why it can happen: This can actually happen if either the predictors or the label are significantly non-normal.
Other potential reasons could include the linearity assumption being violated or outliers affecting our model.

What it will affect: A violation of this assumption could cause issues with either shrinking or inflating our
confidence intervals.

How to detect it: There are a variety of ways to do so, but we’ll look at both a histogram and the p-value from the
Anderson-Darling test for normality.

How to fix it: It depends on the root cause, but there are a few options. Nonlinear transformations of the variables,
excluding specific variables (such as long-tailed variables), or removing outliers may solve this problem.
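
As a quick numeric complement to the histogram, the skewness and excess kurtosis of the residuals can be computed by hand; both are close to 0 for normally distributed data. This is a rough sketch on made-up residuals, not a replacement for the Anderson-Darling test, and the function name is an assumption for this illustration.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Made-up residuals: one symmetric set and one with a long right tail
symmetric = [-2, -1, 0, 1, 2]
long_tailed = [-1, -1, -1, -1, 4]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_tail, kurt_tail = skewness_and_excess_kurtosis(long_tailed)
print('symmetric   skewness:', skew_sym)
print('long-tailed skewness:', skew_tail)
```

A clearly positive skewness, as in the long-tailed set, points to a right tail and is a hint that a transformation of the variables may be worth trying.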

def normal_errors_assumption(model, features, label, p_value_thresh=0.05):
    """
    Normality: Assumes that the error terms are normally distributed. If they are not,
    nonlinear transformations of variables may solve this.

    This assumption being violated primarily causes issues with the confidence intervals
    """
    from statsmodels.stats.diagnostic import normal_ad
    print('Assumption 2: The error terms are normally distributed', '\n')

    # Calculating residuals for the Anderson-Darling test
    df_results = calculate_residuals(model, features, label)

    print('Using the Anderson-Darling test for normal distribution')

    # Performing the test on the residuals
    p_value = normal_ad(df_results['Residuals'])[1]
    print('p-value from the test - below 0.05 generally means non-normal:', p_value)

    # Reporting the normality of the residuals
    if p_value < p_value_thresh:
        print('Residuals are not normally distributed')
    else:
        print('Residuals are normally distributed')

    # Plotting the residuals distribution
    # (histplot replaces distplot, which is deprecated since seaborn 0.11)
    plt.subplots(figsize=(12, 6))
    plt.title('Distribution of Residuals')
    sns.histplot(df_results['Residuals'], kde=True)
    plt.show()

    print()
    if p_value > p_value_thresh:
        print('Assumption satisfied')
    else:
        print('Assumption not satisfied')
        print()
        print('Confidence intervals will likely be affected')
        print('Try performing nonlinear transformations on variables')

As with our previous assumption, we’ll start with the linear dataset:

normal_errors_assumption(linear_model, linear_X, linear_y)

Assumption 2: The error terms are normally distributed

Using the Anderson-Darling test for normal distribution

p-value from the test - below 0.05 generally means non-normal: 0.335066045847
Residuals are normally distributed

Assumption satisfied

Now let’s run the same test on the Boston dataset:

normal_errors_assumption(boston_model, boston.data, boston.target)

Assumption 2: The error terms are normally distributed

Using the Anderson-Darling test for normal distribution

p-value from the test - below 0.05 generally means non-normal: 7.78748286642e-25
Residuals are not normally distributed

Assumption not satisfied

Confidence intervals will likely be affected

Try performing nonlinear transformations on variables

This isn’t ideal, and we can see that our model is biasing towards under-estimating.

III) No Multicollinearity among Predictors

This assumes that the predictors used in the regression are not correlated with each other. This won’t render our
model unusable if violated, but it will cause issues with the interpretability of the model.

Why it can happen: A lot of data is just naturally correlated. For example, if trying to predict a house price with
square footage, the number of bedrooms, and the number of bathrooms, we can expect to see correlation between
those three variables because bedrooms and bathrooms make up a portion of square footage.

What it will affect: Multicollinearity causes issues with the interpretation of the coefficients. Specifically, you can
interpret a coefficient as “an increase of 1 in this predictor results in a change of (coefficient) in the response
variable, holding all other predictors constant.” This becomes problematic when multicollinearity is present because
we can’t hold correlated predictors constant. Additionally, it increases the standard error of the coefficients, which
results in them potentially showing as statistically insignificant when they might actually be significant.

How to detect it: There are a few ways, but we will use a heatmap of the correlation as a visual aid and examine the
variance inflation factor (VIF).

How to fix it: This can be fixed by either removing predictors with a high variance inflation factor (VIF) or
performing dimensionality reduction.
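
To make the VIF less of a black box: with exactly two predictors, regressing one on the other gives an R^2 equal to the squared Pearson correlation r, so the VIF reduces to 1 / (1 - r^2). The sketch below uses that special case on made-up house data; the function names and the numbers are assumptions for this illustration only.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF in the two-predictor case: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Made-up data: square footage and bedroom count move together
square_feet = [800, 1000, 1200, 1500, 1800, 2200]
bedrooms = [1, 2, 2, 3, 3, 4]
vif_houses = vif_two_predictors(square_feet, bedrooms)

# Made-up data: two predictors with zero correlation
vif_unrelated = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 3, 1, 2])

print('VIF, correlated predictors:', round(vif_houses, 2))
print('VIF, unrelated predictors: ', round(vif_unrelated, 2))
```

With more than two predictors the same idea applies, except that each predictor is regressed on all of the others, which is what variance_inflation_factor does for each column.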

def multicollinearity_assumption(model, features, label, feature_names=None):
    """
    Multicollinearity: Assumes that predictors are not correlated with each other. If there is
                       correlation among the predictors, then either remove predictors with high
                       Variance Inflation Factor (VIF) values or perform dimensionality reduction

                       This assumption being violated causes issues with interpretability of the
                       coefficients and the standard errors of the coefficients.
    """
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    print('Assumption 3: Little to no multicollinearity among predictors')

    # Plotting the heatmap
    plt.figure(figsize = (10,8))
    sns.heatmap(pd.DataFrame(features, columns=feature_names).corr(), annot=True)
    plt.title('Correlation of Variables')
    plt.show()

    print('Variance Inflation Factors (VIF)')
    print('> 10: An indication that multicollinearity may be present')
    print('> 100: Certain multicollinearity among the variables')
    print('-------------------------------------')

    # Gathering the VIF for each variable
    VIF = [variance_inflation_factor(features, i) for i in range(features.shape[1])]
    for idx, vif in enumerate(VIF):
        print('{0}: {1}'.format(feature_names[idx], vif))

    # Gathering and printing total cases of possible or definite multicollinearity
    possible_multicollinearity = sum([1 for vif in VIF if vif > 10])
    definite_multicollinearity = sum([1 for vif in VIF if vif > 100])
    print()
    print('{0} cases of possible multicollinearity'.format(possible_multicollinearity))
    print('{0} cases of definite multicollinearity'.format(definite_multicollinearity))
    print()

    if definite_multicollinearity == 0:
        if possible_multicollinearity == 0:
            print('Assumption satisfied')
        else:
            print('Assumption possibly satisfied')
            print()
            print('Coefficient interpretability may be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')
    else:
        print('Assumption not satisfied')
        print()
        print('Coefficient interpretability will be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')

Starting with the linear dataset:

multicollinearity_assumption(linear_model, linear_X, linear_y, linear_feature_names)

Assumption 3: Little to no multicollinearity among predictors

Variance Inflation Factors (VIF)
> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables
-------------------------------------
X1: 1.030931170297102
X2: 1.0457176802992108
X3: 1.0418076962011933
X4: 1.0269600632251443
X5: 1.0199882018822783
X6: 1.0404194675991594
X7: 1.0670847781889177
X8: 1.0229686036798158
X9: 1.0292923730360835
X10: 1.0289003332516535
X11: 1.052043220821624
X12: 1.0336719449364813
X13: 1.0140788728975834

0 cases of possible multicollinearity

0 cases of definite multicollinearity

Assumption satisfied

Everything looks peachy keen. Onto the Boston dataset:

multicollinearity_assumption(boston_model, boston.data, boston.target, boston.feature_names)

Assumption 3: Little to no multicollinearity among predictors

Variance Inflation Factors (VIF)

> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables
-------------------------------------
CRIM: 2.0746257632525675
ZN: 2.8438903527570782
INDUS: 14.484283435031545
CHAS: 1.1528909172683364
NOX: 73.90221170812129
RM: 77.93496867181426
AGE: 21.38677358304778
DIS: 14.699368125642422
RAD: 15.154741587164747
TAX: 61.226929320337554
PTRATIO: 85.0273135204276
B: 20.066007061121244
LSTAT: 11.088865100659874

10 cases of possible multicollinearity
0 cases of definite multicollinearity

Assumption possibly satisfied

Coefficient interpretability may be problematic

Consider removing variables with a high Variance Inflation Factor (VIF)

This isn’t quite as egregious as our normality assumption violation, but there is possible multicollinearity for most of
the variables in this dataset.

IV) No Autocorrelation of the Error Terms

This assumes no autocorrelation of the error terms. Autocorrelation being present typically indicates that we are
missing some information that should be captured by the model.

Why it can happen: In a time series scenario, there could be information about the past that we aren’t capturing. In
a non-time series scenario, our model could be systematically biased by either under or over predicting in certain
conditions. Lastly, this could be a result of a violation of the linearity assumption.

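What it will affect: The coefficient estimates themselves remain unbiased, but their standard errors become unreliable (typically too small under positive autocorrelation), so significance tests and confidence intervals can be misleading.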

How to detect it: We will perform a Durbin-Watson test to determine if either positive or negative autocorrelation is
present. Alternatively, you could create plots of residual autocorrelations.

How to fix it: A simple fix of adding lag variables can fix this problem. Alternatively, interaction terms, additional
variables, or additional transformations may fix this.
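
The Durbin-Watson statistic itself is easy to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula in plain Python on made-up residuals; the function name durbin_watson_stat is an assumption for this illustration (the notes use durbin_watson from statsmodels).

```python
def durbin_watson_stat(residuals):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals that drift slowly: neighbours are similar (positive autocorrelation)
drifting = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]

# Made-up residuals that flip sign every step (negative autocorrelation)
alternating = [1, -1, 1, -1, 1, -1]

d_drifting = durbin_watson_stat(drifting)
d_alternating = durbin_watson_stat(alternating)
print('Drifting residuals:   ', round(d_drifting, 3))
print('Alternating residuals:', round(d_alternating, 3))
```

Slowly drifting residuals push the statistic towards 0, sign-flipping residuals push it towards 4, and values near 2 indicate little autocorrelation.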

def autocorrelation_assumption(model, features, label):
    """
    Autocorrelation: Assumes that there is no autocorrelation in the residuals. If there is
                     autocorrelation, then there is a pattern that is not explained due to
                     the current value being dependent on the previous value.
                     This may be resolved by adding a lag variable of either the dependent
                     variable or some of the predictors.
    """
    from statsmodels.stats.stattools import durbin_watson
    print('Assumption 4: No Autocorrelation', '\n')

    # Calculating residuals for the Durbin-Watson test
    df_results = calculate_residuals(model, features, label)

    print('\nPerforming Durbin-Watson Test')
    print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
    print('0 to 2< is positive autocorrelation')
    print('>2 to 4 is negative autocorrelation')
    print('-------------------------------------')
    durbinWatson = durbin_watson(df_results['Residuals'])
    print('Durbin-Watson:', durbinWatson)
    if durbinWatson < 1.5:
        print('Signs of positive autocorrelation', '\n')
        print('Assumption not satisfied')
    elif durbinWatson > 2.5:
        print('Signs of negative autocorrelation', '\n')
        print('Assumption not satisfied')
    else:
        print('Little to no autocorrelation', '\n')
        print('Assumption satisfied')

Testing with our ideal dataset:

autocorrelation_assumption(linear_model, linear_X, linear_y)

Assumption 4: No Autocorrelation

Performing Durbin-Watson Test

Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data
0 to 2< is positive autocorrelation
>2 to 4 is negative autocorrelation
-------------------------------------
Durbin-Watson: 2.00345051385
Little to no autocorrelation

Assumption satisfied

And with our Boston dataset:

autocorrelation_assumption(boston_model, boston.data, boston.target)

Assumption 4: No Autocorrelation

Performing Durbin-Watson Test

Assumption not satisfied

We’re having signs of positive autocorrelation here, but we should expect this since we know our model is
consistently under-predicting and our linearity assumption is being violated. Since this isn’t a time series dataset, lag
variables aren’t possible. Instead, we should look into either interaction terms or additional transformations.

V) Homoscedasticity/Heteroscedasticity

This assumes homoscedasticity, which is the same variance within our error terms. Heteroscedasticity, the violation
of homoscedasticity, occurs when we don’t have an even variance across the error terms.

Why it can happen: Our model may be giving too much weight to a subset of the data, particularly where the error
variance was the largest.

What it will affect: Significance tests for coefficients due to the standard errors being biased. Additionally, the
confidence intervals will be either too wide or too narrow.

How to detect it: Plot the residuals and see if the variance appears to be uniform.

How to fix it: Heteroscedasticity (can you tell I like the scedasticity words?) can be solved either by using weighted
least squares regression instead of the standard OLS or transforming either the dependent or highly skewed
variables. Performing a log transformation on the dependent variable is not a bad place to start.
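
Besides eyeballing the plot, a rough numeric check is to compare the residual variance for low fitted values with the variance for high fitted values. The sketch below does this on made-up numbers. It is an informal check in the spirit of splitting the sample in two, not a formal test, and the function name variance_ratio is an assumption for this illustration.

```python
from statistics import pvariance


def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return pvariance(high) / pvariance(low)


fitted = [1, 2, 3, 4, 5, 6, 7, 8]

# Made-up residuals with the same spread everywhere
even_spread = [1, -1, 1, -1, 1, -1, 1, -1]

# Made-up residuals that fan out as the fitted value grows
fanning_out = [0.5, -0.5, 0.5, -0.5, 2, -2, 2, -2]

ratio_even = variance_ratio(fitted, even_spread)
ratio_fan = variance_ratio(fitted, fanning_out)
print('Variance ratio, even spread:', ratio_even)
print('Variance ratio, fanning out:', ratio_fan)
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests heteroscedasticity and is a cue to try weighted least squares or a transformation of the dependent variable.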

def homoscedasticity_assumption(model, features, label):
    """
    Homoscedasticity: Assumes that the errors exhibit constant variance
    """
    print('Assumption 5: Homoscedasticity of Error Terms', '\n')

    print('Residuals should have relative constant variance')

    # Calculating residuals for the plot
    df_results = calculate_residuals(model, features, label)

    # Plotting the residuals
    plt.subplots(figsize=(12, 6))
    ax = plt.subplot(111)  # To remove spines
    plt.scatter(x=df_results.index, y=df_results.Residuals, alpha=0.5)
    plt.plot(np.repeat(0, df_results.index.max()), color='darkorange', linestyle='--')
    ax.spines['right'].set_visible(False)  # Removing the right spine
    ax.spines['top'].set_visible(False)  # Removing the top spine
    plt.title('Residuals')
    plt.show()

Plotting the residuals of our ideal dataset:

homoscedasticity_assumption(linear_model, linear_X, linear_y)

Assumption 5: Homoscedasticity of Error Terms

Residuals should have relative constant variance

There don’t appear to be any obvious problems with that.

Next, looking at the residuals of the Boston dataset:

homoscedasticity_assumption(boston_model, boston.data, boston.target)

Assumption 5: Homoscedasticity of Error Terms

Residuals should have relative constant variance

We can’t see a fully uniform variance across our residuals, so this is potentially problematic. However, we know
from our other tests that our model has several issues and is under predicting in many cases.

Conclusion

We can clearly see that a linear regression model on the Boston dataset violates a number of assumptions which
cause significant problems with the interpretation of the model itself. It’s not uncommon for assumptions to be
violated on real-world data, but it’s important to check them so we can either fix them and/or be aware of the flaws
in the model for the presentation of the results or the decision making process.

It is dangerous to make decisions on a model that has violated assumptions because those decisions are effectively
being formulated on made-up numbers. Not only that, but it also provides a false sense of security due to trying to be
empirical in the decision making process. Empiricism requires due diligence, which is why these assumptions exist
and are stated up front. Hopefully this code can help ease the due diligence process and make it less painful.

Code for the Master Function

This function performs all of the assumption tests listed in this blog post:

def linear_regression_assumptions(features, label, feature_names=None):
    """
    Tests a linear regression on the model to see if assumptions are being met
    """
    from sklearn.linear_model import LinearRegression

    # Setting feature names to X1, X2, X3, etc. if they are not defined
    if feature_names is None:
        feature_names = ['X'+str(feature+1) for feature in range(features.shape[1])]

    print('Fitting linear regression')

    # Multi-threading if the dataset is a size where doing so is beneficial
    if features.shape[0] < 100000:
        model = LinearRegression(n_jobs=-1)
    else:
        model = LinearRegression()

    model.fit(features, label)

    # Returning linear regression R^2 and coefficients before performing diagnostics
    r2 = model.score(features, label)
    print()
    print('R^2:', r2, '\n')
    print('Coefficients')
    print('-------------------------------------')
    print('Intercept:', model.intercept_)

    for feature in range(len(model.coef_)):
        print('{0}: {1}'.format(feature_names[feature], model.coef_[feature]))

    print('\nPerforming linear regression assumption testing')

    # Creating predictions and calculating residuals for assumption tests
    predictions = model.predict(features)
    df_results = pd.DataFrame({'Actual': label, 'Predicted': predictions})
    # Residual = actual - predicted (signed, so no abs() on either term)
    df_results['Residuals'] = df_results['Actual'] - df_results['Predicted']

    def linear_assumption():
        """
        Linearity: Assumes there is a linear relationship between the predictors and
                   the response variable. If not, either a polynomial term or another
                   algorithm should be used.
        """
        print('\n' + '=' * 87)
        print('Assumption 1: Linear Relationship between the Target and the Features')

        print('Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.')

        # Plotting the actual vs predicted values
        # (seaborn 0.9+ uses height; older releases called this parameter size)
        sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False, height=7)

        # Plotting the diagonal line
        line_coords = np.arange(df_results.min().min(), df_results.max().max())
        plt.plot(line_coords, line_coords,  # X and y points
                 color='darkorange', linestyle='--')
        plt.title('Actual vs. Predicted')
        plt.show()
        print('If non-linearity is apparent, consider adding a polynomial term')

    def normal_errors_assumption(p_value_thresh=0.05):
        """
        Normality: Assumes that the error terms are normally distributed. If they are not,
        nonlinear transformations of variables may solve this.

        This assumption being violated primarily causes issues with the confidence intervals
        """
        from statsmodels.stats.diagnostic import normal_ad
        print('\n' + '=' * 87)
        print('Assumption 2: The error terms are normally distributed')
        print()

        print('Using the Anderson-Darling test for normal distribution')

        # Performing the test on the residuals
        p_value = normal_ad(df_results['Residuals'])[1]
        print('p-value from the test - below 0.05 generally means non-normal:', p_value)

        # Reporting the normality of the residuals
        if p_value < p_value_thresh:
            print('Residuals are not normally distributed')
        else:
            print('Residuals are normally distributed')

        # Plotting the residuals distribution
        # (histplot replaces distplot, which is deprecated since seaborn 0.11)
        plt.subplots(figsize=(12, 6))
        plt.title('Distribution of Residuals')
        sns.histplot(df_results['Residuals'], kde=True)
        plt.show()

    def multicollinearity_assumption():
        """
        Multicollinearity: Assumes that predictors are not correlated with each other. If there is
                           correlation among the predictors, then either remove predictors with high
                           Variance Inflation Factor (VIF) values or perform dimensionality reduction

                           This assumption being violated causes issues with interpretability of the
                           coefficients and the standard errors of the coefficients.
        """
        from statsmodels.stats.outliers_influence import variance_inflation_factor
        print('\n' + '=' * 87)
        print('Assumption 3: Little to no multicollinearity among predictors')

        # Plotting the heatmap
        plt.figure(figsize = (10,8))
        sns.heatmap(pd.DataFrame(features, columns=feature_names).corr(), annot=True)
        plt.title('Correlation of Variables')
        plt.show()

        print('Variance Inflation Factors (VIF)')
        print('> 10: An indication that multicollinearity may be present')
        print('> 100: Certain multicollinearity among the variables')
        print('-------------------------------------')

        # Gathering the VIF for each variable
        VIF = [variance_inflation_factor(features, i) for i in range(features.shape[1])]
        for idx, vif in enumerate(VIF):
            print('{0}: {1}'.format(feature_names[idx], vif))

        # Gathering and printing total cases of possible or definite multicollinearity
        possible_multicollinearity = sum([1 for vif in VIF if vif > 10])
        definite_multicollinearity = sum([1 for vif in VIF if vif > 100])
        print()
        print('{0} cases of possible multicollinearity'.format(possible_multicollinearity))
        print('{0} cases of definite multicollinearity'.format(definite_multicollinearity))
        print()

        if definite_multicollinearity == 0:
            if possible_multicollinearity == 0:
                print('Assumption satisfied')
            else:
                print('Assumption possibly satisfied')
                print()
                print('Coefficient interpretability may be problematic')
                print('Consider removing variables with a high Variance Inflation Factor (VIF)')
        else:
            print('Assumption not satisfied')
            print()
            print('Coefficient interpretability will be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')

    def autocorrelation_assumption():
        """
        Autocorrelation: Assumes that there is no autocorrelation in the residuals. If there is
                         autocorrelation, then there is a pattern that is not explained due to
                         the current value being dependent on the previous value.
                         This may be resolved by adding a lag variable of either the dependent
                         variable or some of the predictors.
        """
        from statsmodels.stats.stattools import durbin_watson
        print('\n' + '=' * 87)
        print('Assumption 4: No Autocorrelation')
        print('\nPerforming Durbin-Watson Test')
        print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
        print('0 to 2< is positive autocorrelation')
        print('>2 to 4 is negative autocorrelation')
        print('-------------------------------------')
        durbinWatson = durbin_watson(df_results['Residuals'])
        print('Durbin-Watson:', durbinWatson)
        if durbinWatson < 1.5:
            print('Signs of positive autocorrelation', '\n')
            print('Assumption not satisfied', '\n')
            print('Consider adding lag variables')
        elif durbinWatson > 2.5:
            print('Signs of negative autocorrelation', '\n')
            print('Assumption not satisfied', '\n')
            print('Consider adding lag variables')
        else:
            print('Little to no autocorrelation', '\n')
            print('Assumption satisfied')

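    def homoscedasticity_assumption():
        """
        Homoscedasticity: Assumes that the errors exhibit constant variance
        """
        print('\n' + '=' * 87)
        print('Assumption 5: Homoscedasticity of Error Terms')
        print('Residuals should have relative constant variance')

        # Plotting the residuals (same plot as the standalone function above)
        plt.subplots(figsize=(12, 6))
        ax = plt.subplot(111)  # To remove spines
        plt.scatter(x=df_results.index, y=df_results.Residuals, alpha=0.5)
        plt.plot(np.repeat(0, df_results.index.max()), color='darkorange', linestyle='--')
        ax.spines['right'].set_visible(False)  # Removing the right spine
        ax.spines['top'].set_visible(False)  # Removing the top spine
        plt.title('Residuals')
        plt.show()

    linear_assumption()
    normal_errors_assumption()
    multicollinearity_assumption()
    autocorrelation_assumption()
    homoscedasticity_assumption()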

Summary of Assumptions of Linear Regression

1. Linearity – There should be a linear relationship between the dependent and independent variables. This is the
most essential assumption of linear regression. Visually, it can be checked by making a scatter plot of the
dependent variable against each independent variable.

2. Homoscedasticity – Constant Error Variance, i.e., the variance of the error term is the same across all values of the
independent variable. It can be easily checked by making a scatter plot between Residual and Fitted Values. If there
is no trend then the variance of error term is constant.

import seaborn as sns
[Link](x ="expected",
y = "residual", data = result)

A close observation of the above plot shows that the variance of residual term is relatively more for higher fitted
values. Note: In many real-life scenarios, it is practically difficult to ensure all assumptions of linear regression will
hold 100%

3. Normal Error – The error term should be normally distributed. QQ plot is a good way of checking normality. If
the plot forms a line that is roughly straight then we can assume there is normality.

import statsmodels.api as sm

sm.qqplot(result["residual"], ylabel="Residual Quantiles")

4. No Autocorrelation of residual – This is typically applicable to time series data. Autocorrelation means the
current value of Yt is dependent on historic value of Yt-n with n as lag period. Durbin-Watson test is a quick way to
find if there is any autocorrelation.
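
As a minimal sketch of what the Durbin-Watson test computes, the statistic can be written in plain Python as the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The residual values below are made up for illustration, and the cut-offs 1.5 and 2.5 are the ones used in the assumption check earlier in these notes; values near 2 suggest little autocorrelation.

```python
def durbin_watson_statistic(residuals):
    """Return sum((e_t - e_{t-1})^2) / sum(e_t^2) for a sequence of residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: the first series drifts slowly (positive autocorrelation),
# the second flips sign at every step (negative autocorrelation).
drifting = [1.0, 0.9, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_drifting = durbin_watson_statistic(drifting)
dw_alternating = durbin_watson_statistic(alternating)

for name, dw in [("drifting", dw_drifting), ("alternating", dw_alternating)]:
    if dw < 1.5:
        verdict = "signs of positive autocorrelation"
    elif dw > 2.5:
        verdict = "signs of negative autocorrelation"
    else:
        verdict = "little to no autocorrelation"
    print(name, round(dw, 3), verdict)
```

The statistic always lies between 0 and 4, which is why the check uses a band around 2.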

5. No Perfect Multi-Collinearity – Multi-collinearity occurs when two or more independent variables are highly
correlated. It is checked with the Variance Inflation Factor (VIF). As a strict rule of thumb, no variable in the
model should have a VIF above 2; cut-offs of 5 or 10 are also commonly used.

6. Exogeneity – Exogeneity is a standard assumption of regression and it means that each X variable does not
depend on the dependent variable Y, rather Y depends on the Xs and on Error (e). In simple terms X is completely
unaffected by Y.

7. Sample Size – In linear regression, it is desirable that the number of records be at least 10 times the number of
independent variables, to avoid the curse of dimensionality.

Skewness & Kurtosis

What is Skewness and how do we detect it?

If you ask Mother Nature what her favorite probability distribution is, the answer will be 'Normal'. The reason is
that chance (random) causes influence every known variable on earth. But what if a process is also under the
influence of assignable (significant) causes? These distort the shape of the distribution, and that is when we need a
measure like skewness to capture the distortion. Below is a visual of a normal distribution, also known as a bell
curve. It is a symmetrical graph with all measures of central tendency in the middle.

(Author, 2021)

But what if we encounter an asymmetrical distribution, how do we detect the extent of asymmetry? Let’s see
visually what happens to the measures of central tendency when we encounter such graphs.

( Author, 2021)

Notice how these central tendency measures tend to spread when the normal distribution is distorted. For the
nomenclature just follow the direction of the tail — For the left graph since the tail is to the left, it is left-skewed
(negatively skewed) and the right graph has the tail to the right, so it is right-skewed (positively skewed).

How about deriving a measure that captures the horizontal distance between the mode and the mean of the
distribution? Intuitively, the higher the skewness, the further apart these measures will be. This gives the formula
for skewness:

Skewness = (Mean - Mode) / Standard Deviation

Division by the standard deviation enables relative comparison among distributions on the same standard scale.
Since the mode is not a recommended measure of central tendency for small data sets, a more robust formula for
skewness replaces the mode with a value derived from the median and the mean:

Mode = 3 × Median - 2 × Mean*

*approximately, for skewed distributions

Replacing the value of the mode in the formula for skewness, we get:

Skewness = 3 × (Mean - Median) / Standard Deviation

(Author, 2021)
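
The median-based skewness measure described above, 3 × (mean - median) / standard deviation, can be tried out with Python's built-in statistics module. This is only an illustrative sketch: the function name and the three small data sets are made up, and the sample standard deviation is used as the divisor.

```python
import statistics


def pearson_median_skewness(values):
    """Skewness estimate: 3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)
    return 3 * (mean - median) / std_dev


# Hypothetical data sets
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]         # long tail to the right
left_skewed = [-15, -4, -3, -3, -3, -2, -2, -1]  # mirror image, tail to the left

for name, data in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    print(name, round(pearson_median_skewness(data), 3))
```

The sign of the result follows the direction of the tail: zero for the symmetric data, positive for the right-skewed data and negative for the left-skewed data.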

What is Kurtosis and how do we capture it?

Think of punching or pulling the normal distribution curve from the top, what impact will it have on the shape of the
distribution? Let’s visualize:

(Author, 2021)

There are two things to notice: the peak of the curve and the tails of the curve. The kurtosis measure captures this
phenomenon. The formula for kurtosis is more involved (it is based on the fourth moment), so we will stick to the
concept and its visual clarity. A normal distribution has a kurtosis of 3 and is called mesokurtic. Distributions with
kurtosis greater than 3 are called leptokurtic, and those with kurtosis less than 3 are called platykurtic.
Traditionally, a greater value has been read as more peakedness. Kurtosis ranges from 1 to infinity. Because the
kurtosis of a normal distribution is 3, we can calculate excess kurtosis by taking the normal distribution as the zero
reference. Excess kurtosis then varies from -2 to infinity.

Excess Kurtosis for Normal Distribution = 3 - 3 = 0

The lowest value of Excess Kurtosis occurs when Kurtosis is 1: 1 - 3 = -2
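
The moment-based calculation mentioned above can be sketched in a few lines of plain Python: kurtosis is the fourth central moment divided by the square of the variance, and excess kurtosis subtracts 3. The function names and the two data sets are illustrative assumptions; the population (not sample) form of the moments is used.

```python
def kurtosis(values):
    """Population kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / second_moment ** 2


def excess_kurtosis(values):
    """Kurtosis measured relative to the normal distribution (which has 3)."""
    return kurtosis(values) - 3


# Hypothetical data sets
two_point = [-1, 1] * 50              # all mass at two points, no tails
with_outliers = [0] * 98 + [-10, 10]  # two extreme values create heavy tails

print(kurtosis(two_point), excess_kurtosis(two_point))
print(kurtosis(with_outliers), excess_kurtosis(with_outliers))
```

The two-point data set reaches the minimum kurtosis of 1 (excess kurtosis -2), while just two outliers among a hundred values push the kurtosis far above 3.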

(Author, 2021)

The topic of kurtosis has been controversial for decades. For years kurtosis was linked with peakedness, but the
current consensus is that outliers (fatter tails) govern kurtosis far more than the values near the mean (the peak).

We can conclude from the above discussion that the horizontal push or pull distortion of a normal distribution curve
is captured by the skewness measure, and the vertical push or pull distortion is captured by the kurtosis measure.
The impact of outliers dominates kurtosis, which follows from the fourth-order moment in its formula.

Matplotlib Violin Plot - violinplot() Function

This section covers the violin plot and how to create one using the violinplot() function in the Matplotlib library.

The violin plot shows the probability density of the data at different values and is quite similar to the Matplotlib
box plot.

• Violin plots are essentially a combination of box plots and histograms.
• A violin plot usually portrays the distribution, median and interquartile range of the data.
• The interquartile range and median are the statistical information provided by a box plot, whereas the
distribution is the information provided by a histogram.
• Violin plots are also used to compare the distribution of a variable across different categories, like box plots.
• Violin plots are more informative than box plots because they show the full distribution of the data.

Here is a figure showing common components of the Box Plot and Violin Plot:

Creation of the Violin Plot

The violinplot() method is used for the creation of the violin plot.

The syntax required for the method is as follows:

violinplot(dataset, positions=None, vert=True, widths=0.5, showmeans=False, showextrema=True,
showmedians=False, quantiles=None, points=100, bw_method=None, *, data=None)

Parameters

The parameters of this function are described below:

• dataset

This parameter is the input data: an array or a sequence of vectors.

• positions

This parameter sets the positions of the violins. The ticks and limits are set automatically to match the
positions. It is array-like, with default [1, 2, …, n].

• vert

This parameter takes a boolean value. If it is True (the default), a vertical plot is created; otherwise, a
horizontal plot is created.

• showmeans

This parameter takes a boolean value with False as its default. If it is True, the means are rendered.

• showextrema

This parameter takes a boolean value with True as its default. If it is True, the extrema are rendered.

• showmedians

This parameter takes a boolean value with False as its default. If it is True, the medians are rendered.

• quantiles

This is an array-like parameter with None as its default. If it is not None, it gives a list of floats in the
interval [0, 1] for each violin, which are the quantiles rendered for that violin.

• points

This is a scalar that defines the number of points at which each Gaussian kernel density estimate is
evaluated. The default is 100.

• bw_method

This parameter is the method used to calculate the estimator bandwidth, for which there are several
options. The default rule is Scott's Rule, but you can choose 'silverman', a scalar constant, or a callable.

Now it's time to look at an example to make these concepts clear:

Violin Plot Basic Example:

Below is a simple example that creates violin plots for four different collections of data.

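import matplotlib.pyplot as plt
import numpy as np

# Fix the random seed so that the plot is reproducible
np.random.seed(10)

# Four normally distributed collections: normal(mean, standard deviation, size)
collectn_1 = np.random.normal(120, 10, 200)
collectn_2 = np.random.normal(150, 30, 200)
collectn_3 = np.random.normal(50, 20, 200)
collectn_4 = np.random.normal(100, 25, 200)

data_to_plot = [collectn_1, collectn_2, collectn_3, collectn_4]

# Create a figure and an axes that fills it
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])

# Draw one violin per collection
bp = ax.violinplot(data_to_plot)
plt.show()

The output is a figure with four violins, one for each collection.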

Sentiment Analysis Using Python

In today’s digital age, platforms like Twitter, Goodreads, and Amazon overflow with people’s
opinions, making it crucial for organizations to extract insights from this massive volume of data.
Sentiment Analysis in Python offers a powerful solution to this challenge. This technique, a subset
of Natural Language Processing (NLP), involves classifying texts into sentiments such as positive,
negative, or neutral. By employing various Python libraries and models, analysts can automate this
process efficiently.

Sentiment Analysis is a use case of Natural Language Processing (NLP) and comes under the
category of text classification. To put it simply, Sentiment Analysis involves classifying a text into
various sentiments, such as positive or negative, Happy, Sad or Neutral, etc. Thus, the ultimate goal
of sentiment analysis is to decipher the underlying mood, emotion, or sentiment of a text. This is also
referred to as Opinion Mining.

How Does Sentiment Analysis Work?

Sentiment analysis in Python typically works by employing natural language processing (NLP) techniques to analyze
and understand the sentiment expressed in text. The process involves several steps:

• Text Preprocessing: The text is cleaned by removing irrelevant information, such as special characters,
punctuation, and stopwords, from the text data.

• Tokenization: The text is divided into individual words or tokens to facilitate analysis.

• Feature Extraction: Relevant features are extracted from the text, such as words, n-grams, or even parts of
speech.

• Sentiment Classification: Machine learning algorithms or pre-trained models are used to classify the sentiment
of each text instance. Researchers achieve this through supervised learning, where they train models on labeled
data, or through pre-trained models that have learned sentiment patterns from large datasets.

• Post-processing: The sentiment analysis results may undergo additional processing, such as aggregating
sentiment scores or applying threshold rules to classify sentiments as positive, negative, or neutral.

• Evaluation: Researchers assess the performance of the sentiment analysis model using evaluation metrics, such
as accuracy, precision, recall, or F1 score.
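
The steps above can be sketched end to end in plain Python. This is a toy illustration rather than a usable analyzer: the lexicon, the stopword list, the 0.1 threshold and the three labelled sentences are all made-up assumptions, and a simple word-score average stands in for a trained classifier.

```python
import re

# Hypothetical toy lexicon and stopword list, for illustration only
LEXICON = {"awesome": 1.0, "great": 0.8, "good": 0.5,
           "terrible": -1.0, "boring": -0.6, "bad": -0.5}
STOPWORDS = {"the", "was", "so", "a", "is", "here", "it", "and"}


def preprocess(text):
    """Text preprocessing: lowercase and keep only letters and spaces."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def tokenize(text):
    """Tokenization: split on whitespace and drop stopwords."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def score(tokens):
    """Feature extraction and scoring: average lexicon score, unknown words count as 0."""
    if not tokens:
        return 0.0
    return sum(LEXICON.get(tok, 0.0) for tok in tokens) / len(tokens)


def classify(text, threshold=0.1):
    """Post-processing: apply a threshold rule to turn the score into a label."""
    value = score(tokenize(preprocess(text)))
    if value > threshold:
        return "positive"
    if value < -threshold:
        return "negative"
    return "neutral"


# Evaluation on a tiny hand-labelled sample
labelled = [
    ("The movie was so awesome!", "positive"),
    ("The food here tastes terrible.", "negative"),
    ("The package arrived on Tuesday.", "neutral"),
]
predictions = [classify(text) for text, _ in labelled]
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, labelled)) / len(labelled)
print(predictions, accuracy)
```

Each function corresponds to one step of the list, so any of them can be swapped for a library implementation without changing the overall flow.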

Types of Sentiment Analysis

Various types of sentiment analysis can be performed, depending on the specific focus and objective of the analysis.
Some common types include:

• Document-Level Sentiment Analysis: This type of analysis determines the overall sentiment expressed in a
document, such as a review or an article. It aims to classify the entire text as positive, negative, or neutral.

• Sentence-Level Sentiment Analysis: Here, the sentiment of each sentence within a document is analyzed. This
type provides a more granular understanding of the sentiment expressed in different parts of the text.

• Aspect-Based Sentiment Analysis: This approach focuses on identifying and extracting the sentiment associated
with specific aspects or entities mentioned in the text. For example, in a product review, the sentiment towards
different features of the product (e.g., performance, design, usability) can be analyzed separately.

• Entity-Level Sentiment Analysis: This type of analysis identifies the sentiment expressed towards specific
entities or targets mentioned in the text, such as people, companies, or products. It helps understand the sentiment
associated with different entities within the same document.

• Comparative Sentiment Analysis: This approach involves comparing the sentiment between different entities or
aspects mentioned in the text. It aims to identify the relative sentiment or preferences expressed towards various
entities or features.

Sentiment Analysis Use Cases

Sentiment analysis can give organizations insights that help them make data-driven decisions. Some common use
cases are:

• Social Media Monitoring for Brand Management: Brands can use sentiment analysis to gauge public perception
of the brand. For example, a company can gather all tweets that mention or tag the company and perform sentiment
analysis on them to learn how the public views it.

• Product/Service Analysis: Brands and organizations can perform sentiment analysis on customer reviews to see
how well a product or service is doing in the market and make future decisions accordingly.

• Stock Price Prediction: Predicting whether a company's stock will go up or down is crucial for investors. One
input to such predictions is sentiment analysis of news headlines that contain the company's name. If the headlines
about a particular organization have a positive sentiment, its stock price may be expected to rise, and vice versa.

Ways to Perform Sentiment Analysis in Python

Python is one of the most powerful tools for data science tasks, and it offers many ways to perform sentiment
analysis. The most popular ones are listed here:

• Using TextBlob

• Using VADER

• Using Bag of Words Vectorization-based Models

• Using LSTM-based Models

• Using Transformer-based Models

Note: For the demonstrations of methods 3 and 4 (Bag of Words Vectorization-based Models and LSTM-based
Models), a sentiment analysis dataset has been used. It comprises more than 5,000 text excerpts labelled as positive,
negative or neutral. The dataset is available under a Creative Commons license.

Using TextBlob

TextBlob is a Python library for Natural Language Processing. Using TextBlob for sentiment analysis is quite
simple. It takes text as input and can return polarity and subjectivity as outputs.

• Polarity determines the sentiment of the text. Its values lie in [-1, 1], where -1 denotes a highly negative
sentiment and 1 denotes a highly positive sentiment.

• Subjectivity determines whether a text input is factual information or a personal opinion. Its values lie in
[0, 1], where a value closer to 0 denotes a piece of factual information and a value closer to 1 denotes a personal
opinion.

Here are the steps to perform sentiment analysis in Python using TextBlob.

Step 1: Installation

pip install textblob

Step 2: Importing TextBlob

from textblob import TextBlob

Step 3: Code Implementation for Sentiment Analysis Using TextBlob

Writing code for sentiment analysis using TextBlob is fairly simple. Just import the TextBlob class, pass it the text
to be analyzed, and read the appropriate attributes as follows:

from textblob import TextBlob

text_1 = "The movie was so awesome."
text_2 = "The food here tastes terrible."

# Determining the polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity

# Determining the subjectivity
s_1 = TextBlob(text_1).sentiment.subjectivity
s_2 = TextBlob(text_2).sentiment.subjectivity

print("Polarity of Text 1 is", p_1)
print("Polarity of Text 2 is", p_2)
print("Subjectivity of Text 1 is", s_1)
print("Subjectivity of Text 2 is", s_2)

Output

Polarity of Text 1 is 1.0
Polarity of Text 2 is -1.0
Subjectivity of Text 1 is 1.0
Subjectivity of Text 2 is 1.0

Using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analyzer that is
specifically attuned to sentiment expressed in social media text. Just like TextBlob, its usage in Python is simple,
as the steps and the code example below show.

Step 1: Installation

pip install vaderSentiment

Step 2: Importing the SentimentIntensityAnalyzer class from VADER

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Step 3: Code for Sentiment Analysis Using VADER

First, we need to create an object of the SentimentIntensityAnalyzer class; then we need to pass the text to the
polarity_scores() method of the object as follows:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()
text_1 = "The book was a perfect balance between writing style and plot."
text_2 = "The pizza tastes terrible."
sent_1 = sentiment.polarity_scores(text_1)
sent_2 = sentiment.polarity_scores(text_2)
print("Sentiment of text 1:", sent_1)
print("Sentiment of text 2:", sent_2)

Output:

Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

As we can see, the polarity_scores() method returns a dictionary of sentiment scores for the text to be analyzed.
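
To turn such a dictionary into a single label, a threshold rule on the compound score is commonly applied. The sketch below uses the cut-offs suggested in the VADER documentation (0.05 and -0.05); the function name is hypothetical, and the two dictionaries are copied from the output above so that the snippet runs without the library installed.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a label using its compound score."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the output shown above
sent_1 = {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
sent_2 = {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

print(label_from_compound(sent_1))
print(label_from_compound(sent_2))
```

The threshold is a parameter so that it can be tightened or relaxed for a particular data set.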

Using Bag of Words Vectorization-Based Models

In the two approaches discussed so far, i.e. TextBlob and VADER, we have simply used Python libraries to perform
sentiment analysis. Now we'll discuss an approach in which we train our own model for the task. The steps involved
in performing sentiment analysis using the Bag of Words Vectorization method are as follows:

• Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stopword
removal, and stemming/lemmatization).

• Create a Bag of Words for the pre-processed text data using the Count Vectorization or TF-IDF Vectorization
approach.

• Train a suitable classification model on the processed data for sentiment classification.
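
To make the Bag of Words idea concrete before using a library, here is a small plain-Python sketch of count vectorization. The three sentences are made up, and splitting on whitespace is a simplification of real tokenization; each document becomes a vector holding the count of every vocabulary word.

```python
from collections import Counter

# Hypothetical mini-corpus, already lowercased
documents = [
    "the food was good",
    "the service was bad",
    "good food good service",
]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in documents for word in doc.split()})


def count_vector(doc):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


matrix = [count_vector(doc) for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: only the counts remain, which is what the name "bag" refers to.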

Code for Sentiment Analysis using Bag of Words Vectorization Approach:

To build a sentiment analysis model in Python using the BOW Vectorization approach, we need a labeled dataset. As
stated earlier, the dataset used for this demonstration has been obtained from Kaggle. We have simply used sklearn's
CountVectorizer to create the BOW. Afterwards, we trained a Multinomial Naive Bayes classifier, for which an
accuracy score of 0.84 was obtained.

# Loading the dataset
import pandas as pd
data = pd.read_csv('Finance_data.csv')

# Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['sentences'])

# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'], test_size=0.25,
                                                    random_state=5)

# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", accuracy_score)

Output:

Accuracy Score: 0.9111675126903553

The trained classifier can be used to predict the sentiment of any given text input.

Using LSTM-Based Models

Though we were able to obtain a decent accuracy score with the Bag of Words Vectorization method, it might fail to
yield the same results when dealing with larger datasets. This gives rise to the need to employ deep learning-based
models for training the sentiment analysis model.

For NLP tasks, we generally use RNN-based models since they are designed to deal with sequential data. Here, we'll
train an LSTM (Long Short-Term Memory) model using TensorFlow with Keras. The steps to perform sentiment
analysis using LSTM-based models are as follows:

• Pre-process the text of the training data (text pre-processing involves normalization, tokenization, stopword
removal, and stemming/lemmatization).

• Import Tokenizer from the Keras text preprocessing module, create it, and fit it to the entire training text.
Integer sequences are generated using texts_to_sequences() and stored after padding to equal length. These
sequences are numerical representations of the text. They are needed because raw text cannot be fed to the model
directly.

• Build the model using TensorFlow, including input, LSTM, and dense layers. Dropouts and hyperparameters are
adjusted for accuracy. In inner layers, we use ReLU or LeakyReLU activation functions to avoid vanishing gradient
problems, while in the output layer, we use Softmax or Sigmoid activation functions.
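
The second step (turning text into padded integer sequences) can be illustrated without Keras. The helper names to_sequences and pad_left below are hypothetical stand-ins written for this sketch; they only mimic the general behaviour of texts_to_sequences() and pad_sequences(), with index 0 reserved for padding and zeros added at the front.

```python
from collections import Counter

# Hypothetical mini-corpus, already cleaned and lowercased
texts = [
    "the food was good",
    "bad service",
    "good food and good service today",
]

# More frequent words get smaller indices; 0 is reserved for padding
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


def pad_left(sequences, maxlen=None):
    """Left-pad every sequence with zeros (or trim from the front) to a common length."""
    maxlen = maxlen or max(len(seq) for seq in sequences)
    return [[0] * (maxlen - len(seq)) + seq[-maxlen:] for seq in sequences]


sequences = to_sequences(texts, word_index)
padded = pad_left(sequences)

for row in padded:
    print(row)
```

After padding, every row has the same length, so the rows can be stacked into a single array and passed to an embedding layer.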

Code for Sentiment Analysis Using LSTM-based Model

Here, we have used the same dataset as in the BOW approach. A training accuracy of 0.90 was obtained.

# Importing necessary libraries
import nltk
import pandas as pd
from textblob import Word
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split

# Loading the dataset
data = pd.read_csv('Finance_data.csv')

# Pre-processing the text
def cleaning(df, stop_words):
    # Converting to lowercase
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w.lower() for w in x.split()))
    # Removing the digits/numbers
    df['sentences'] = df['sentences'].str.replace(r'\d', '', regex=True)
    # Removing stop words
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
    # Lemmatization
    df['sentences'] = df['sentences'].apply(lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))
    return df

nltk.download('stopwords')  # the stopword list must be downloaded once
stop_words = stopwords.words('english')
data_cleaned = cleaning(data, stop_words)

# Converting the text to padded integer sequences using the tokenizer
tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data_cleaned['sentences'].values)
X = tokenizer.texts_to_sequences(data_cleaned['sentences'].values)
X = pad_sequences(X)

# One-hot encoding the labels and splitting the data
# (assumption: labels are in the 'feedback' column; test_size and random_state are assumed values)
y = pd.get_dummies(data_cleaned['feedback']).values.astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model building
model = Sequential()
model.add(Embedding(500, 120, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352, activation='LeakyReLU'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Model training
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)

# Model testing
model.evaluate(X_test, y_test)

Using Transformer-Based Models

Transformer-based models are among the most advanced natural language processing techniques. They follow an
encoder-decoder-based architecture and employ self-attention to yield impressive results. Although one can always
build a transformer model from scratch, it is quite a tedious task. Instead, we can use pre-trained transformer models
available on Hugging Face. Hugging Face is an open-source AI community that offers a multitude of pre-trained
models for NLP applications. You can use these models as they are or fine-tune them for specific tasks.

Step 1: Installation

pip install transformers

Step 2: Importing transformers

import transformers

Step 3: Code for Sentiment Analysis Using Transformer-based Models

To perform any task using transformers, we first need to import the pipeline function from transformers. Then, a
pipeline object is created by calling pipeline() with the task to be performed as an argument (sentiment analysis in
our case). We can also specify the model to use for the task. Here, since we have not specified a model, the
distilbert-base-uncased-finetuned-sst-2-english model is used by default for sentiment analysis. The list of available
tasks and models can be found on the Hugging Face website.

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "t was the worst of times."]
sentiment_pipeline(data)

Output

[{'label': 'POSITIVE', 'score': 0.999457061290741}, {'label': 'NEGATIVE', 'score': 0.9987301230430603}]

What is the best Python library for sentiment analysis?

There is no single best library for sentiment analysis in Python; the choice depends on your needs. Here is a quick
comparison:

NLTK: Powerful, versatile, good for multiple NLP tasks, but complex for sentiment analysis.

TextBlob: Beginner-friendly, simple interface for sentiment analysis (polarity, subjectivity).

Pattern: More comprehensive analysis (comparatives, superlatives, fact/opinion), steeper learning curve.

Polyglot: Fast, multilingual support (136+ languages), ideal for multiple languages.

Key Takeaways

• Python provides a versatile environment for performing sentiment analysis tasks due to its rich ecosystem of
libraries and frameworks.

• We explored multiple approaches, including TextBlob, VADER, Bag of Words, LSTM, and Transformer-based
models, to analyze sentiment in textual data.

• The process involves text preprocessing, tokenization, feature extraction, and applying machine learning or deep
learning models to classify sentiments.

• We applied these methods to examples such as reviews and short text excerpts to classify sentiments as positive,
negative, or neutral.

• Sentiment analysis helps organizations monitor brand perception, analyze customer feedback, and make
data-driven decisions.

• With advancements in natural language processing, sentiment analysis in Python continues to evolve, offering
more accurate and sophisticated methods for understanding textual sentiment.

Object Oriented Programming
Object-Oriented Programming (OOP) empowers developers to build modular, maintainable and scalable applications.
OOP is a way of organizing code that uses objects and classes to represent real-world entities and their behavior. In
OOP, an object is a thing that holds specific data (attributes) and can perform certain actions using methods.
• Organizes code into classes and objects.
• Supports encapsulation to group data and methods together.
• Enables inheritance for reusability and hierarchy.
• Allows polymorphism for flexible method implementation.
• Improves modularity, scalability and maintainability.

Class
A class is a blueprint for creating objects. It defines a set of attributes and methods that the created objects
(instances) can have. Some points on Python classes:
• Classes are created with the keyword class.
• Attributes are the variables that belong to a class.
• Attributes are always public and can be accessed using the dot (.) operator. Example: Dog.species

Creating a Class

Here, the class keyword indicates that we are creating a class, followed by the name of the class (Dog in this case).
class Dog:
    species = "Canine"  # Class attribute

    def __init__(self, name, age):
        self.name = name  # Instance attribute
        self.age = age  # Instance attribute
Explanation:
• class Dog: creates a class named Dog, which acts as a blueprint for dog objects.
• species is a class attribute, meaning it is shared by all instances of the class.
• __init__() is a constructor method that runs automatically when a new object is created. It is used to initialize
object data.
• self refers to the current object, allowing each object to store and access its own data.
• self.name and self.age are instance attributes, unique to each Dog object created from the class.

Objects
An object is an instance of a class. It represents a specific implementation of the class and holds its own data. An
object consists of:
• State: It is represented by the attributes and reflects the properties of an object.
• Behavior: It is represented by the methods of an object and reflects the response of an object to other objects.
• Identity: It gives a unique name to an object and enables one object to interact with other objects.

Creating Object

Creating an object in Python involves instantiating a class to create a new instance of that class. This process is also
referred to as object instantiation.
class Dog:
    species = "Canine"  # Class attribute

    def __init__(self, name, age):
        self.name = name  # Instance attribute
        self.age = age  # Instance attribute

# Creating an object of the Dog class
dog1 = Dog("Buddy", 3)

print(dog1.name)
print(dog1.species)

Output
Buddy
Canine
Explanation:
• dog1 = Dog("Buddy", 3): Creates an object of the Dog class with name as "Buddy" and age as 3.
• dog1.name: Accesses the instance attribute name of the dog1 object.
• dog1.species: Accesses the class attribute species through the dog1 object.

Four Pillars of OOP in Python

The Four Pillars of Object-Oriented Programming (OOP) form the foundation for designing structured, reusable,
and maintainable software.

1. Inheritance

Inheritance allows a class (child class) to acquire properties and methods of another class (parent class). It supports
hierarchical classification and promotes code reuse.

2. Polymorphism

Polymorphism in Python means "same operation, different behavior." It allows functions or methods with the same
name to work differently depending on the type of object they are acting upon.
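
A minimal sketch of this idea follows; the Cat class and the describe() function are illustrative names that do not appear elsewhere in these notes. The same call, speak(), produces different results depending on the object it is called on.

```python
class Dog:
    def speak(self):
        return "Woof"


class Cat:
    def speak(self):
        return "Meow"


def describe(animal):
    # Works with any object that provides a speak() method
    return f"{type(animal).__name__} says {animal.speak()}"


for pet in (Dog(), Cat()):
    print(describe(pet))
```

The describe() function never checks the type of its argument; it relies only on the object having a speak() method.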

3. Encapsulation

Encapsulation is the bundling of data (attributes) and methods (functions) within a class, restricting access to some
components to control interactions. A class is an example of encapsulation, as it bundles all of its data, that is, its
variables and member functions.
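
The sketch below illustrates encapsulation with a hypothetical BankAccount class. Python does not enforce private access; a leading double underscore is a convention backed by name mangling, so the balance is meant to be changed only through the methods of the class.

```python
class BankAccount:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.__balance = balance  # stored internally as _BankAccount__balance

    def deposit(self, amount):
        # The method controls how the balance may change
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self.__balance += amount

    @property
    def balance(self):
        # Read-only view of the balance
        return self.__balance


account = BankAccount("Asha", 100)
account.deposit(50)
print(account.balance)
print(hasattr(account, "__balance"))
```

Outside the class, account.__balance does not exist under that name, which discourages code from bypassing deposit().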

4. Data Abstraction
Abstraction hides the internal implementation details while exposing only the necessary functionality. It helps focus
on "what to do" rather than "how to do it."
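
One common way to express abstraction in Python is the abc module from the standard library. In the sketch below, Shape, Circle and Rectangle are illustrative names: Shape states what every shape must do (area), and each subclass decides how.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self):
        """Return the area of the shape."""


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


for shape in (Circle(1), Rectangle(2, 3)):
    print(type(shape).__name__, round(shape.area(), 2))
```

Trying to create a Shape directly raises a TypeError, because the abstract method has no implementation.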
Class Properties

Properties are variables that belong to a class. They store data for each object created from the class.

Create a class with properties:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("Emil", 36)

print(p1.name)
print(p1.age)

Access Properties

You can access object properties using dot notation:

class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

car1 = Car("Toyota", "Corolla")

print(car1.brand)
print(car1.model)

Class Methods

Methods are functions that belong to a class. They define the behavior of objects created from the class.

Create a method in a class:
class Person:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print("Hello, my name is " + self.name)

p1 = Person("Emil")
p1.greet()

Python Inheritance

Inheritance allows us to define a class that inherits all the methods and properties from another class.

Parent class is the class being inherited from, also called base class.

Child class is the class that inherits from another class, also called derived class.

Create a Parent Class

Any class can be a parent class, so the syntax is the same as creating any other class:

Example: Create a class named Person, with firstname and lastname properties, and a printname method:

class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def printname(self):
        print(self.firstname, self.lastname)

# Use the Person class to create an object, and then execute the printname method:

x = Person("John", "Doe")
x.printname()

Create a Child Class

To create a class that inherits the functionality from another class, send the parent class as a parameter when creating
the child class:

Example: Create a class named Student, which will inherit the properties and methods from the Person class:

class Student(Person):
    pass

Note: Use the pass keyword when you do not want to add any other properties or methods to the class.

Use the Student class to create an object, and then execute the printname method:

x = Student("Mike", "Olsen")
x.printname()

Add the __init__() Function

So far we have created a child class that inherits the properties and methods from its parent.

We want to add the __init__() function to the child class (instead of the pass keyword).

Note: The __init__() function is called automatically every time the class is being used to create a new object.

Example: Add the __init__() function to the Student class:

class Student(Person):
    def __init__(self, fname, lname):
        # add properties etc.
        pass

When you add the __init__() function, the child class will no longer inherit the parent's __init__() function.

Note: The child's __init__() function overrides the inheritance of the parent's __init__() function

To keep the inheritance of the parent's __init__() function, add a call to the parent's __init__() function:

class Student(Person):
    def __init__(self, fname, lname):
        Person.__init__(self, fname, lname)

Now we have successfully added the __init__() function, and kept the inheritance of the parent class, and we are
ready to add functionality in the __init__() function.

Use the super() Function

Python also has a super() function, which lets the child class call the methods of its parent class without naming
the parent explicitly:

Example:

class Student(Person):
    def __init__(self, fname, lname):
        super().__init__(fname, lname)

By using the super() function, you do not have to use the name of the parent class. The call above runs the parent's
__init__() function, so a Student object receives the same properties as a Person object.

Add Properties

Add a property called graduationyear to the Student class:

class Student(Person):
    def __init__(self, fname, lname):
        super().__init__(fname, lname)
        self.graduationyear = 2019

In the example above, the year 2019 should be a variable, and passed into the Student class when creating student
objects. To do so, add another parameter in the __init__() function:

Example: Add a year parameter, and pass the correct year when creating objects:

class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year

x = Student("Mike", "Olsen", 2019)

Example: Add a method called welcome to the Student class:

class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year

    def welcome(self):
        print("Welcome", self.firstname, self.lastname, "to the class of", self.graduationyear)

If you add a method in the child class with the same name as a function in the parent class, the inheritance of the
parent method will be overridden.
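
The overriding rule can be sketched as follows. The describe() method is a hypothetical addition used only for illustration (it is not part of the Person class in these notes). The child's version replaces the parent's version for Student objects, but it can still reach the parent's version through super():

```python
class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def describe(self):  # hypothetical method, for illustration only
        return self.firstname + " " + self.lastname


class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year

    def describe(self):
        # Same name as the parent's method, so this version is used for Student objects.
        # super().describe() still calls the parent's implementation.
        return super().describe() + ", class of " + str(self.graduationyear)


p = Person("John", "Doe")
s = Student("Mike", "Olsen", 2019)
print(p.describe())
print(s.describe())
```

Calling the parent's version through super() is optional. Leave it out when the child should replace the behavior completely rather than extend it.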

Python Polymorphism

The word "polymorphism" means "many forms", and in programming it refers to methods/functions/operators with
the same name that can be executed on many objects or classes.

Function Polymorphism

An example of a Python function that can be used on different objects is the len() function.

String

For strings len() returns the number of characters:

x = "Hello World!"

print(len(x))

Tuple

For tuples len() returns the number of items in the tuple:

mytuple = ("apple", "banana", "cherry")

print(len(mytuple))

Dictionary

For dictionaries len() returns the number of key/value pairs in the dictionary:

thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}

print(len(thisdict))
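
len() works on all of these types because each of them defines a special method named __len__(). As a minimal sketch, your own class can take part in the same polymorphism by defining that method. The Playlist class below is a hypothetical example and is not used elsewhere in these notes:

```python
class Playlist:  # hypothetical class, for illustration only
    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        # len(obj) calls obj.__len__() behind the scenes.
        return len(self.songs)


mylist = Playlist(["Song A", "Song B", "Song C"])
print(len(mylist))
```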

Class Polymorphism

Polymorphism is often used in Class methods, where we can have multiple classes with the same method name.

For example, say we have three classes: Car, Boat, and Plane, and they all have a method called move():

Different classes with the same method:

class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Drive!")

class Boat:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Sail!")

class Plane:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Fly!")

car1 = Car("Ford", "Mustang")        # Create a Car object
boat1 = Boat("Ibiza", "Touring 20")  # Create a Boat object
plane1 = Plane("Boeing", "747")      # Create a Plane object

for x in (car1, boat1, plane1):
    x.move()

Look at the for loop at the end. Because of polymorphism we can execute the same method for all three classes.

Inheritance Class Polymorphism

What about a parent class whose child classes define methods with the same name? Can we use polymorphism there?

Yes. If we use the example above and make a parent class called Vehicle, and make Car, Boat, Plane child classes
of Vehicle, the child classes inherit the Vehicle methods, but can override them:

Example: Create a class called Vehicle and make Car, Boat, Plane child classes of Vehicle:

class Vehicle:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Move!")

class Car(Vehicle):
    pass

class Boat(Vehicle):
    def move(self):
        print("Sail!")

class Plane(Vehicle):
    def move(self):
        print("Fly!")

car1 = Car("Ford", "Mustang")        # Create a Car object
boat1 = Boat("Ibiza", "Touring 20")  # Create a Boat object
plane1 = Plane("Boeing", "747")      # Create a Plane object

for x in (car1, boat1, plane1):
    print(x.brand)
    print(x.model)
    x.move()

Child classes inherit the properties and methods from the parent class.

In the example above you can see that the Car class is empty, but it inherits brand, model, and move() from Vehicle.

The Boat and Plane classes also inherit brand, model, and move() from Vehicle, but they both override
the move() method.

Because of polymorphism we can execute the same method for all classes.

Encapsulation

Encapsulation is about protecting data inside a class.

It means keeping data (properties) and methods together in a class, while controlling how the data can be accessed
from outside the class.

This prevents accidental changes to your data and hides the internal details of how your class works.

Private Properties

In Python, you can make properties private by using a double underscore __ prefix:

Example: Create a private class property named __age:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age  # Private property

p1 = Person("Emil", 25)
print(p1.name)
print(p1.__age)  # This will cause an error (AttributeError)

Note: Private properties cannot be accessed directly from outside the class.

Get Private Property Value

To access a private property, you can create a getter method:

Example

Use a getter method to access a private property:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

    def get_age(self):
        return self.__age

p1 = Person("Tobias", 25)
print(p1.get_age())

Set Private Property Value

To modify a private property, you can create a setter method.

The setter method can also validate the value before setting it:

Example

Use a setter method to change a private property:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

    def get_age(self):
        return self.__age

    def set_age(self, age):
        if age > 0:
            self.__age = age
        else:
            print("Age must be positive")

p1 = Person("Tobias", 25)
print(p1.get_age())

p1.set_age(26)
print(p1.get_age())

Why Use Encapsulation?

Encapsulation provides several benefits:

• Data Protection: Prevents accidental modification of data
• Validation: You can validate data before setting it
• Flexibility: Internal implementation can change without affecting external code
• Control: You have full control over how data is accessed and modified

Example

Use encapsulation to protect and validate data:

class Student:
    def __init__(self, name):
        self.name = name
        self.__grade = 0

    def set_grade(self, grade):
        if 0 <= grade <= 100:
            self.__grade = grade
        else:
            print("Grade must be between 0 and 100")

    def get_grade(self):
        return self.__grade

    def get_status(self):
        if self.__grade >= 60:
            return "Passed"
        else:
            return "Failed"

student = Student("Emil")
student.set_grade(85)
print(student.get_grade())
print(student.get_status())
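
As an alternative to explicit get_grade() and set_grade() methods, Python's built-in @property decorator lets you keep the same validation while reading and writing the value with ordinary attribute syntax. The sketch below is one possible variant of the Student class above, not a replacement for it. Raising ValueError instead of printing a message is an assumption made here so that the calling code can react to the problem:

```python
class Student:
    def __init__(self, name):
        self.name = name
        self.__grade = 0

    @property
    def grade(self):
        # Read access: student.grade
        return self.__grade

    @grade.setter
    def grade(self, value):
        # Write access: student.grade = value, with the same range check as set_grade()
        if 0 <= value <= 100:
            self.__grade = value
        else:
            raise ValueError("Grade must be between 0 and 100")


student = Student("Emil")
student.grade = 85
print(student.grade)

try:
    student.grade = 150  # rejected by the setter
except ValueError as err:
    print(err)
```

After the rejected assignment, the stored grade is still 85, because the setter raised the error before changing anything.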

Protected Properties

Python also has a convention for protected properties using a single underscore _ prefix:

Example

Create a protected property:

class Person:
    def __init__(self, name, salary):
        self.name = name
        self._salary = salary  # Protected property

p1 = Person("Linus", 50000)
print(p1.name)
print(p1._salary)  # Can access, but shouldn't

Note: A single underscore _ is just a convention. It tells other programmers that the property is intended for internal
use, but Python doesn't enforce this restriction.

Private Methods

You can also make methods private using the double underscore prefix:

Example

Create a private method:

class Calculator:
    def __init__(self):
        self.result = 0

    def __validate(self, num):
        if not isinstance(num, (int, float)):
            return False
        return True

    def add(self, num):
        if self.__validate(num):
            self.result += num
        else:
            print("Invalid number")

calc = Calculator()
calc.add(10)
calc.add(5)
print(calc.result)
# calc.__validate(5)  # This would cause an error

Note: Just like private properties with double underscores, private methods cannot be called directly from outside
the class. The __validate method can only be used by other methods inside the class.

Name Mangling

Name mangling is how Python implements private properties and methods.

When you use double underscores __, Python automatically renames it internally by adding _ClassName in front.

For example, __age becomes _Person__age.

Example

See how Python mangles the name:

class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

p1 = Person("Emil", 30)

# This is how Python mangles the name:
print(p1._Person__age)  # Not recommended!

While you can access private properties using the mangled name, it's not recommended. It defeats the purpose of
encapsulation.
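
To see the renaming for yourself, you can inspect the object's attribute dictionary with the built-in vars() function. This is a small sketch for exploration only. It reuses the Person class from the example above:

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age


p1 = Person("Emil", 30)

# vars() returns the instance's attribute dictionary.
# The private property is stored under its mangled name.
attributes = vars(p1)
print(attributes)

# The unmangled name does not exist on the object; the mangled one does.
print(hasattr(p1, "__age"))
print(hasattr(p1, "_Person__age"))
```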

File Handling
The key function for working with files in Python is the open() function.
The open() function takes two main parameters: filename and mode.
There are four different modes for opening a file:
"r" - Read - Default value. Opens a file for reading, raises an error if the file does not exist
"a" - Append - Opens a file for appending, creates the file if it does not exist
"w" - Write - Opens a file for writing, creates the file if it does not exist
"x" - Create - Creates the specified file, raises an error if the file exists
In addition, you can specify whether the file should be handled in text or binary mode:
"t" - Text - Default value. Text mode
"b" - Binary - Binary mode (e.g. images)

Syntax
To open a file for reading it is enough to specify the name of the file:

f = open("[Link]")

The code above is the same as:

f = open("[Link]", "rt")

Because "r" for read, and "t" for text are the default values, you do not need to specify them.

Open a File on the Server

Assume we have the following file, located in the same folder as Python:

To open the file, use the built-in open() function.

The open() function returns a file object, which has a read() method for reading the content of the file:

f = open("[Link]")
print(f.read())

If the file is located in a different location, you will have to specify the file path, like this:

Example

Open a file in a different location:

f = open("C:\\Users\\USER\\Desktop\\ML\\[Link]")
print(f.read())

Using the with statement

You can also use the with statement when opening a file:

Example

Using the with keyword:

with open("[Link]") as f:
    print(f.read())

Then you do not have to worry about closing your files; the with statement takes care of that.
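
You can confirm this behavior with the closed attribute of the file object. The sketch below creates its own throwaway file in a temporary folder so that it does not depend on any existing file. The file name example.txt is a hypothetical choice:

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on an existing one.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")  # hypothetical file name

with open(path, "w") as f:
    f.write("Hello!")

with open(path) as f:
    content = f.read()
    open_inside = f.closed   # False: the file is still open inside the block

closed_after = f.closed      # True: the with statement closed the file

print(content)
print(open_inside, closed_after)
```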
Close Files

It is a good practice to always close the file when you are done with it.
If you are not using the with statement, you must write a close statement in order to close the file:

Example

Close the file when you are finished with it:

f = open("[Link]")
print(f.read())
f.close()

Read Only Parts of the File

By default the read() method returns the whole text, but you can also specify how many characters you want to
return:

Example

Return the first 5 characters of the file:

with open("[Link]") as f:
    print(f.read(5))

Read Lines

You can return one line by using the readline() method:

Example

Read one line of the file:

with open("[Link]") as f:
    print(f.readline())

By calling readline() two times, you can read the first two lines.

By looping through the lines of the file, you can read the whole file, line by line:

Example

Loop through the file line by line:

with open("[Link]") as f:
    for x in f:
        print(x)
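
The sketch below shows both ideas on a small throwaway file: two readline() calls in a row, and a loop over the lines. The file name lines.txt and its content are assumptions made for this illustration. Note that every line read from a file keeps its trailing newline character, so print() would add a blank line after each one unless you strip it:

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "lines.txt")  # hypothetical file name

with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# Two readline() calls return the first two lines, newline characters included.
with open(path) as f:
    first = f.readline()
    second = f.readline()

# Looping over the file yields one line at a time; rstrip("\n") removes the newline.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

print(first, end="")
print(second, end="")
print(lines)
```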

Write to an Existing File

To write to an existing file, you must add a parameter to the open() function:
"a" - Append - will append to the end of the file
"w" - Write - will overwrite any existing content

Example

Open the file "[Link]" and append content to the file:

with open("[Link]", "a") as f:
    f.write("Now the file has more content!")

# Open and read the file after the appending:
with open("[Link]") as f:
    print(f.read())

Overwrite Existing Content

To overwrite the existing content of the file, use the "w" parameter:

Example

Open the file "[Link]" and overwrite the content:

with open("[Link]", "w") as f:
    f.write("Woops! I have deleted the content!")

# Open and read the file after the overwriting:
with open("[Link]") as f:
    print(f.read())

Create a New File

To create a new file in Python, use the open() function, with one of the following parameters:
"x" - Create - will create a file, raises an error if the file exists
"a" - Append - will create a file if the specified file does not exist
"w" - Write - will create a file if the specified file does not exist

Example

Create a new file called "[Link]":

f = open("[Link]", "x")

Delete a File

To delete a file, you must import the os module, and run its os.remove() function:

Example

Remove the file "[Link]":

import os
os.remove("[Link]")

Check if File Exists

To avoid getting an error, you might want to check if the file exists before you try to delete it:

Example

Check if file exists, then delete it:

import os
if os.path.exists("[Link]"):
    os.remove("[Link]")
else:
    print("The file does not exist")
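
Another common pattern is to attempt the deletion directly and handle the FileNotFoundError that os.remove() raises when the file is missing. The sketch below wraps this in a helper function. The names delete_file and old_report.txt are hypothetical, and the file lives in a temporary folder so that nothing important is touched:

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "old_report.txt")  # hypothetical file name


def delete_file(file_path):
    """Try to delete file_path; return True if it was deleted, False if it was missing."""
    try:
        os.remove(file_path)
        return True
    except FileNotFoundError:
        return False


# Create the file, then try to delete it twice: the second attempt finds nothing.
with open(path, "x") as f:
    f.write("temporary")

first_attempt = delete_file(path)
second_attempt = delete_file(path)
print(first_attempt, second_attempt)
```

This version needs only one file system operation, so the file cannot disappear between the check and the deletion.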

Delete Folder

To delete an entire folder, use the os.rmdir() function:

Example

Remove the folder "myfolder":

import os
os.rmdir("myfolder")

Note: You can only remove empty folders.

Machine Learning with Multiple Linear Regression

Before implementing multiple linear regression, it is essential to ensure that the following assumptions are met:

1. Linearity: The relationship between the dependent variable and independent variables is linear.
2. Independence of Errors: Residuals (errors) are independent of each other. This is often verified using
the Durbin-Watson test.
3. Homoscedasticity: The variance of residuals is constant across all levels of the independent variables. A
residual plot can help verify this.
4. No Multicollinearity: Independent variables are not highly correlated. Variance Inflation Factor (VIF) is
commonly used to detect multicollinearity.
5. Normality of Residuals: Residuals should follow a normal distribution. This can be checked using a Q-Q
plot.
6. Outlier Influence: Outliers or high-leverage points should not disproportionately influence the model.

These assumptions ensure that the regression model is valid and the results are reliable. Failing to meet these
assumptions may lead to biased or misleading results.
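
Assumption 2 mentions the Durbin-Watson test. Its statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The following is a minimal pure-Python sketch of that formula, using made-up residuals rather than residuals from a fitted model:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residual sequences, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]            # signs flip every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # long runs with the same sign

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(dw_alternating)
print(dw_trending)
```

The alternating sequence gives a value above 2 (negative autocorrelation), while the sequence with long runs gives a value well below 2 (positive autocorrelation).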

Preprocess the Data

In this section, you will learn to use the Multiple Linear Regression model in Python to predict house prices based
on features from the California Housing Dataset. You’ll learn how to preprocess data, fit a regression model, and
evaluate its performance while addressing common challenges like multicollinearity, outliers, and feature selection.

Step 1 - Load the Dataset

You will use the California Housing Dataset, a popular dataset for regression tasks. This dataset contains 8 features
describing census block groups in California, together with the corresponding median house value, expressed in
units of $100,000.

First, let’s install the necessary packages:

pip install numpy pandas matplotlib seaborn scikit-learn statsmodels

Then load the dataset:

from sklearn.datasets import fetch_california_housing  # Loads the California Housing dataset.
import pandas as pd  # pandas for data manipulation and analysis.
import numpy as np   # numpy for numerical computing.

# Load the California Housing dataset using the fetch_california_housing function.
housing = fetch_california_housing()

# Convert the dataset's data into a pandas DataFrame, using the feature names as column headers.
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Add the target variable 'MedHouseValue' to the DataFrame, using the dataset's target values.
housing_df['MedHouseValue'] = housing.target

# Display the first few rows of the DataFrame to get an overview of the dataset.
print(housing_df.head())

You should observe the following output of the dataset:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseValue
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23          4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22          3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24          3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25          3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25          3.422

Here is what each of the attributes means:

Variable      Description
MedInc        Median income in block
HouseAge      Median house age in block
AveRooms      Average number of rooms
AveBedrms     Average number of bedrooms
Population    Block population
AveOccup      Average house occupancy
Latitude      House block latitude
Longitude     House block longitude

Check for Missing Values

Ensure that there are no missing values in the dataset, since missing values might affect the analysis.

print(housing_df.isnull().sum())

MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseValue 0
dtype: int64

Feature Selection

Let’s first create a correlation matrix to understand the dependencies between the variables.

correlation_matrix = housing_df.corr()
print(correlation_matrix['MedHouseValue'])

Output:

MedInc 0.688075
HouseAge 0.105623

AveRooms 0.151948
AveBedrms -0.046701
Population -0.024650
AveOccup -0.023737
Latitude -0.144160
Longitude -0.045967
MedHouseValue 1.000000

You can analyze the above correlations to select the dependent and independent variables for the regression
model. Each value measures the strength of the linear relationship between one feature and MedHouseValue.

MedHouseValue is the dependent variable, as it is the variable we are trying to predict. The remaining columns are
candidate independent variables, and their correlations with MedHouseValue differ widely in strength.

This walkthrough uses the following three independent variables:

• MedInc: This variable has a strong positive correlation (0.688075) with MedHouseValue, indicating that as
median income increases, median house value also tends to increase.
• AveRooms: This variable has a weak positive correlation (0.151948) with MedHouseValue, suggesting
that as the average number of rooms per household increases, median house value tends to increase slightly.
• AveOccup: This variable has a very weak negative correlation (-0.023737) with MedHouseValue, so on its own
it carries almost no linear information about house value. It is kept to illustrate a model with several predictors.

Note that HouseAge (0.105623) and Latitude (-0.144160) correlate more strongly with MedHouseValue than
AveOccup does, so they are reasonable candidates if you want to extend the model.

By selecting these independent variables, you can build a regression model that captures the relationships between
these variables and MedHouseValue, allowing us to make predictions about median house value based on median
income, average number of rooms, and average occupancy.

You can also plot the correlation matrix in Python using the code below:

import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'housing_df' is the DataFrame containing the data

# Plotting the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(housing_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

We shall focus on a few key features for simplicity based on the above, such as MedInc (median
income), AveRooms (average rooms per household), and AveOccup (average occupancy per household).

selected_features = ['MedInc', 'AveRooms', 'AveOccup']

X = housing_df[selected_features]
y = housing_df['MedHouseValue']

The above code block selects specific features from the housing_df DataFrame for analysis. The selected features
are MedInc, AveRooms, and AveOccup, which are stored in the selected_features list.
The DataFrame housing_df is then subset to include only these selected features, and the result is stored in X (a DataFrame).
The target variable MedHouseValue is extracted from housing_df and stored in y (a Series).

Scaling Features

We shall use standardization to put all features on the same scale, which makes their coefficients directly
comparable.

Standardization is a preprocessing technique that scales numerical features to have a mean of 0 and a standard
deviation of 1. This is essential for machine learning models that are sensitive to the scale of the input features,
such as regularized or distance-based models, because it prevents features with large ranges from dominating the
model. For ordinary least squares regression, scaling does not change the predictions or the R-squared value, but it
lets you compare the magnitudes of the coefficients across features.
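
The calculation itself is simple: subtract the feature's mean from each value and divide by the feature's standard deviation. The sketch below does this by hand for one made-up feature, using only the standard library. It uses the population standard deviation (dividing by n), which is what scikit-learn's StandardScaler uses by default. The sample values are assumptions chosen for illustration:

```python
import statistics

# Made-up values for a single feature, for illustration only.
values = [2.0, 4.0, 4.0, 6.0, 9.0]

mean = statistics.fmean(values)
# pstdev is the population standard deviation (divides by n).
std = statistics.pstdev(values)

# z-score: (value - mean) / standard deviation
scaled = [(v - mean) / std for v in values]

print(scaled)
print(statistics.fmean(scaled))   # approximately 0
print(statistics.pstdev(scaled))  # approximately 1
```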

from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
X_scaled = scaler.fit_transform(X)

# Print the scaled data
print(X_scaled)

The output represents the scaled values of the features MedInc, AveRooms, and AveOccup after applying the
StandardScaler. Each feature is now centered around 0 with a standard deviation of 1, ensuring all features are on the
same scale.
The first row [ 2.34476576 0.62855945 -0.04959654] indicates that for the first data point, the scaled MedInc value
is 2.34476576, AveRooms is 0.62855945, and AveOccup is -0.04959654. Similarly, the second row [ 2.33223796
0.32704136 -0.09251223] represents the scaled values for the second data point, and so on.

The values in the printed excerpt range from approximately -1.14259331 to 2.34476576. The features are now
standardized and comparable. This is essential for machine learning models that are sensitive to the scale of input
features, as it prevents features with large ranges from dominating the model.

Implement Multiple Linear Regression

Now that you are done with data preprocessing, let’s implement multiple linear regression in Python.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Split the data: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize the LinearRegression model and fit it to the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Use the model to predict the target variable for the test set.
y_pred = model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

The train_test_split function is used to split the data into training and testing sets. Here, 80% of the data is used for
training and 20% for testing.

The model is evaluated using Mean Squared Error and R-squared. Mean Squared Error (MSE) measures the average
of the squares of the errors or deviations.

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s
explained by an independent variable or variables in a regression model.
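
To make the two definitions concrete, the sketch below computes both metrics by hand on a few made-up numbers. The function names mse_by_hand and r2_by_hand are hypothetical. In practice you would use mean_squared_error and r2_score from scikit-learn:

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
good_predictions = [2.5, 5.5, 7.0, 8.0]
bad_predictions = [9.0, 7.0, 5.0, 3.0]   # worse than always predicting the mean

print(mse_by_hand(actual, good_predictions))
print(r2_by_hand(actual, good_predictions))
print(r2_by_hand(actual, bad_predictions))
```

The last call returns a negative value: R-squared drops below 0 when the predictions are worse than simply predicting the mean of the actual values.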

Output:

Mean Squared Error: 0.7006855912225249

R-squared: 0.4652924370503557

The output above provides two key metrics to evaluate the performance of the multiple linear regression model:

Mean Squared Error (MSE): 0.7006855912225249. The MSE measures the average squared difference between
the predicted and actual values of the target variable. A lower MSE indicates better model performance, with 0
meaning perfect predictions. Because MedHouseValue is expressed in units of $100,000, it helps to take the square
root: the root mean squared error is about 0.837, so a typical prediction error is roughly $83,700.

R-squared (R2): 0.4652924370503557. R-squared measures the proportion of the variance in the dependent variable
that is predictable from the independent variables. A value of 1 is perfect prediction and 0 means the model does no
better than always predicting the mean; on test data the value can even be negative for a model that does worse than
that. In this case, about 46.53% of the variance in the target variable is explained by the three features. The model
captures part of the relationship, but more than half of the variance remains unexplained.

Let’s check out some important plots:

# Residual Plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()

# Predicted vs Actual Plot
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs Actual Values')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=4)
plt.show()

Using Statsmodels

The Statsmodels library in Python is a powerful tool for statistical analysis. It provides a wide range of statistical
models and tests, including linear regression, time series analysis, and nonparametric methods.
In the context of multiple linear regression, statsmodels can be used to fit a linear model to the data, and then
perform various statistical tests and analyses on the model. This can be particularly useful for understanding the
relationships between the independent and dependent variables, and for making predictions based on the model.

import statsmodels.api as sm

# Add a constant (intercept) column to the training features
X_train_sm = sm.add_constant(X_train)
model_sm = sm.OLS(y_train, X_train_sm).fit()
print(model_sm.summary())

# Q-Q Plot for residuals
sm.qqplot(model_sm.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()

Output:

==============================================================================
Dep. Variable:         MedHouseValue   R-squared:                        0.485
Model:                           OLS   Adj. R-squared:                   0.484
Method:                Least Squares   F-statistic:                      5173.
Date:               Fri, 17 Jan 2025   Prob (F-statistic):                0.00
Time:                       09:40:54   Log-Likelihood:                 -20354.
No. Observations:              16512   AIC:                          4.072e+04
Df Residuals:                  16508   BIC:                          4.075e+04
Df Model:                          3
Covariance Type:           nonrobust
==============================================================================
                   coef    std err          t      P>|t|     [0.025     0.975]
------------------------------------------------------------------------------
const            2.0679      0.006    320.074      0.000      2.055      2.081
x1               0.8300      0.007    121.245      0.000      0.817      0.843
x2              -0.1000      0.007    -14.070      0.000     -0.114     -0.086
x3              -0.0397      0.006     -6.855      0.000     -0.051     -0.028
==============================================================================
Omnibus:                    3981.290   Durbin-Watson:                    1.983
Prob(Omnibus):                 0.000   Jarque-Bera (JB):             11583.284
Skew:                          1.260   Prob(JB):                          0.00
Kurtosis:                      6.239   Cond. No.                          1.42
==============================================================================

Here is the summary of the above tables:

Model Summary
The model is an Ordinary Least Squares (OLS) regression model, which is a type of linear regression model. The
dependent variable is MedHouseValue, and the model has an R-squared value of 0.485, indicating that about 48.5%
of the variation in MedHouseValue can be explained by the independent variables. This value is computed on the
training data, which is why it differs from the test-set R-squared of about 0.465 reported earlier. The adjusted
R-squared value is 0.484, which is a modified version of R-squared that penalizes the model for including additional
independent variables.
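
The penalty follows the standard formula: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p is the number of independent variables. The sketch below applies it to small made-up numbers and then to the rounded values from the summary. The function name adjusted_r2 is hypothetical:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Small made-up case: with few observations the penalty is clearly visible.
small_case = adjusted_r2(0.5, n_samples=11, n_features=2)
print(small_case)

# With 16512 observations and 3 predictors, as in the summary, the penalty is tiny.
# The input 0.485 is the rounded R-squared from the summary, so the result is approximate.
large_case = adjusted_r2(0.485, n_samples=16512, n_features=3)
print(large_case)
```

This is why the two R-squared values in the summary are almost identical: the correction only becomes noticeable when the number of predictors is large relative to the number of observations.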

Model Fit

The model was fit using the Least Squares method. The F-statistic is 5173, and the probability of observing an
F-statistic at least this extreme if all slope coefficients were truly zero (the null hypothesis) is approximately 0.
This means the independent variables are jointly statistically significant. It does not by itself mean that the model
fits well; that is what R-squared measures.

Model Coefficients

The model coefficients are as follows. Because the model was fitted on the standardized array X_train, statsmodels
labels the features x1, x2, and x3, and one unit of a feature corresponds to one standard deviation of the original
variable:

• The constant term is 2.0679. This is the predicted MedHouseValue when all standardized features are 0, that is,
when each feature is at its mean.
• The coefficient for x1 (in this case MedInc) is 0.8300, indicating that for every one-standard-deviation increase
in MedInc, the predicted MedHouseValue increases by approximately 0.83 units, assuming all other independent
variables are held constant.
• The coefficient for x2 (in this case AveRooms) is -0.1000, indicating that for every one-standard-deviation
increase in AveRooms, the predicted MedHouseValue decreases by approximately 0.10 units, assuming all other
independent variables are held constant.
• The coefficient for x3 (in this case AveOccup) is -0.0397, indicating that for every one-standard-deviation
increase in AveOccup, the predicted MedHouseValue decreases by approximately 0.04 units, assuming all other
independent variables are held constant.

Model Diagnostics

The model diagnostics are as follows:

• The Omnibus test statistic is 3981.290 with a p-value of 0.000, indicating that the residuals are not normally
distributed.
• The Durbin-Watson statistic is 1.983. Values close to 2 indicate that there is no significant autocorrelation in the
residuals.
• The Jarque-Bera test statistic is 11583.284 with a p-value of 0.00, which also indicates that the residuals are not
normally distributed.
• The skewness of the residuals is 1.260, indicating that the residuals are skewed to the right.
• The kurtosis of the residuals is 6.239, indicating that the residuals are leptokurtic (i.e., they have a higher
peak and heavier tails than a normal distribution, whose kurtosis is 3).
• The condition number is 1.42. A value this low indicates that the independent variables are not strongly collinear,
so the coefficient estimates are numerically stable.

Handling Multicollinearity

Multicollinearity is a common issue in multiple linear regression, where two or more independent variables are
highly correlated with each other. This can lead to unstable and unreliable estimates of the coefficients.
To detect and handle multicollinearity, you can use the Variance Inflation Factor. The VIF measures how much the
variance of an estimated regression coefficient increases if your predictors are correlated. A VIF of 1 means that
there is no correlation between a given predictor and the other predictors. VIF values exceeding 5 or 10 indicate a
problematic amount of collinearity.

In the code block below, let’s calculate the VIF for each independent variable in our model. If any VIF value is
above 5, you should consider removing the variable from the model.
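
For a given predictor, the VIF is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. In the special case of exactly two predictors, this R² equals the squared Pearson correlation between them, which makes the idea easy to check by hand. The sketch below covers only that two-predictor case, with made-up data and hypothetical function names. For real models, use variance_inflation_factor from statsmodels:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)


# Made-up predictors, for illustration only.
uncorrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
nearly_collinear = vif_two_predictors([1, 2, 3, 4, 5], [1, 2, 3, 4, 6])

print(uncorrelated)      # no correlation, so no inflation
print(nearly_collinear)  # strong correlation, so a large VIF
```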

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each column of the scaled feature matrix
vif_data = pd.DataFrame()
vif_data['Feature'] = selected_features
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
print(vif_data)

# Bar Plot for VIF Values
vif_data.plot(kind='bar', x='Feature', y='VIF', legend=False)
plt.title('Variance Inflation Factor (VIF) by Feature')
plt.ylabel('VIF Value')
plt.show()

The Output:
Feature VIF
0 MedInc 1.120166
1 AveRooms 1.119797
2 AveOccup 1.000488

The VIF values for each feature are as follows:

• MedInc: The VIF value is 1.120166, indicating a very low correlation with the other independent variables.
• AveRooms: The VIF value is 1.119797, indicating a very low correlation with the other independent variables.
• AveOccup: The VIF value is 1.000488, indicating practically no correlation with the other independent variables.

These VIF values are all well below 5, indicating that there is no significant multicollinearity between the
independent variables in the model. This suggests that the coefficient estimates are stable and are not significantly
affected by multicollinearity.

Cross-Validation Techniques

Cross-validation is a technique used to evaluate the performance of a machine learning model. It is a resampling
procedure used to evaluate a model if we have a limited data sample. The procedure has a single parameter
called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is
often called k-fold cross-validation.
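
The splitting step can be sketched in pure Python. Each fold serves once as the test set while the remaining folds form the training set. The function below is a hypothetical illustration that produces contiguous folds without shuffling. It is not scikit-learn's implementation:

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k contiguous test folds (no shuffling)."""
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.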

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print("Cross-Validation Scores:", scores)
print("Mean CV R^2:", scores.mean())

# Line Plot for Cross-Validation Scores

plt.plot(range(1, 6), scores, marker='o', linestyle='--')
plt.xlabel('Fold')
plt.ylabel('R-squared')
plt.title('Cross-Validation R-squared Scores')
plt.show()

Output:
Cross-Validation Scores: [0.42854821 0.37096545 0.46910866 0.31191043 0.51269138]
Mean CV R^2: 0.41864482644003276

The cross-validation scores indicate how well the model performs on unseen data. The scores range from
0.31191043 to 0.51269138, indicating that the model’s performance varies across different folds. A higher score
indicates better performance.

The mean CV R^2 score is 0.41864482644003276, which suggests that, on average, the model explains about
41.86% of the variance in the target variable. This is a moderate level of explanation, indicating that the model is
somewhat effective in predicting the target variable but may benefit from further improvement or refinement.

These scores can be used to evaluate the model’s generalizability and identify potential areas for improvement.
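
The k-fold procedure described above can be sketched without any library. The function below is a minimal illustration, assuming contiguous, unshuffled folds; the name k_fold_indices is hypothetical. Each fold serves once as the test set while the remaining folds form the training set.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: test={test}, train size={len(train)}")
```

With 10 samples and k = 5, every sample appears in exactly one test fold, which is why the five scores together reflect performance on data the model did not see during fitting.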

Feature selection methods

The Recursive Feature Elimination method is a feature selection technique that recursively eliminates the least
important features until a specified number of features is reached. This method is particularly useful when dealing
with a large number of features and the goal is to select a subset of the most informative features.
In the provided code, you first import the RFE class from sklearn.feature_selection. Then create an instance
of RFE with a specified estimator (in this case, LinearRegression) and set n_features_to_select to 2, indicating that
we want to select the top 2 features.
Next, we fit the RFE object to our scaled features X_scaled and target variable y. The support_ attribute of
the RFE object returns a boolean mask indicating which features were selected.
To visualize the ranking of features, you create a DataFrame with the feature names and their corresponding
rankings. The ranking_ attribute of the RFE object returns the ranking of each feature, with lower values indicating
more important features. You then plot a bar chart of the feature rankings, sorted by their ranking values. This plot
helps us understand the relative importance of each feature in the model.
from sklearn.feature_selection import RFE

rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_scaled, y)
print("Selected Features:", rfe.support_)

# Bar Plot of Feature Rankings

feature_ranking = pd.DataFrame({
'Feature': selected_features,
'Ranking': rfe.ranking_
})
feature_ranking.sort_values(by='Ranking').plot(kind='bar', x='Feature', y='Ranking', legend=False)
plt.title('Feature Ranking (Lower is Better)')
plt.ylabel('Ranking')
plt.show()

Output:
Selected Features: [ True True False]

Based on the above chart, the two most suitable features are MedInc and AveRooms. This agrees with the model’s output above: the dependent variable, MedHouseValue, depends mostly on MedInc and AveRooms.

Feature | Simple Linear Regression | Multiple Linear Regression
Number of Independent Variables | One | More than one
Model Equation | y = β0 + β1x + ε | y = β0 + β1x1 + β2x2 + … + βnxn + ε
Assumptions | Same as multiple linear regression, but with a single independent variable | Same as simple linear regression, but with additional assumptions for multiple independent variables
Interpretation of Coefficients | The change in the target variable for a unit change in the independent variable, while holding all other variables constant (not applicable in simple linear regression) | The change in the target variable for a unit change in one independent variable, while holding all other independent variables constant
Model Complexity | Less complex | More complex
Model Flexibility | Less flexible | More flexible
Overfitting Risk | Lower | Higher
Interpretability | Easier to interpret | More challenging to interpret
Applicability | Suitable for simple relationships | Suitable for complex relationships with multiple factors
Example | Predicting house prices based on the number of bedrooms | Predicting house prices based on the number of bedrooms, square footage, and location

Model Building, Identification and Evaluation in Python

Fernando, a professor who wants to buy a car, uses a Simple Linear Regression model to estimate the price of the car.

His regression model predicts price based on the engine size: one dependent variable is predicted using one independent variable.

Multivariate Regression Models

The simple linear regression model was formulated as:

price = β0 + β1 x engine size

The statistical package computed the parameters. The linear equation is estimated as:

price = -6870.1 + 156.9 x engine size

The model was evaluated on two fronts:

 Robustness- using hypothesis testing

 Accuracy- using the coefficient of determination a.k.a R-squared

Recall that R-squared is the fraction of the variance in the actual values that the model explains, compared with a baseline that always predicts the mean of the actual values. On the training set this value lies between 0 and 1. The higher it is, the better the model can explain the variance. The R-squared for the model created by Fernando is 0.7503, i.e. 75.03%, on the training set. It means that the model can explain just over 75% of the variation.
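
As a minimal sketch of how R-squared is computed, the function below evaluates 1 - SS_res / SS_tot for the fitted line price = -6870.1 + 156.9 x engine size. The engine sizes and prices are hypothetical values made up for illustration, so the printed number is not the 0.7503 reported above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # error of the model
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # error of the mean baseline
    return 1 - ss_res / ss_tot

def predict_price(engine_size):
    """Fitted simple linear regression from the notes."""
    return -6870.1 + 156.9 * engine_size

# Hypothetical observations (not the data used in the notes)
engine_sizes = [100, 120, 150, 200]
actual_prices = [9000, 12500, 16000, 25000]

predicted_prices = [predict_price(size) for size in engine_sizes]
print("R-squared:", round(r_squared(actual_prices, predicted_prices), 4))
```

A model that always predicts the mean price scores exactly 0, and a model that reproduces every price scores exactly 1.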

However, Fernando wants to make it better.

He contemplates:

 What if I can feed the model with more inputs? Will it improve the accuracy?

Fernando decides to enhance the model by feeding the model with more input data i.e. more independent variables.
He has now entered into the world of the multivariate regression model.

The Concept:

Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.

Recall that linear implies the following: arranged in or extending along a straight or nearly straight line. Linear
suggests that the relationship between dependent and independent variable can be expressed in a straight line.

The equation of the line is y = mx + c. One dimension is y-axis, another dimension is x-axis. It can be plotted in a
two-dimensional plane. It looks something like this:

The generalization of this relationship can be expressed as:

y = f(x).

It doesn’t mean anything fancy. All it means is:

Define y as a function of x. i.e. define the dependent variable as a function of the independent variable.

What if the dependent variable needs to be expressed in terms of more than one independent variable? The
generalized function becomes:

y = f(x, z) i.e. express y as some function/combination of x and z.

There are three dimensions now y-axis, x-axis and z-axis. It can be plotted as:

Now there are two input dimensions (x and z), and we want to express y as a combination of both.

For a simple linear regression model, a straight line expresses y as a function of x. Now we have an additional dimension (z). What happens when an additional dimension is added to a line? It becomes a plane.

The plane is the function that expresses y as a function of x and z. Extrapolating the linear regression equation, it
can now be expressed as:

y = m1.x + m2.z + c

• y is the dependent variable, i.e. the variable that needs to be estimated and predicted.
• x is the first independent variable, i.e. the variable that is controllable. It is the first input.
• m1 is the slope of x. It determines how steeply the plane tilts along the x direction.
• z is the second independent variable, i.e. the variable that is controllable. It is the second input.
• m2 is the slope of z. It determines how steeply the plane tilts along the z direction.
• c is the intercept. A constant that determines the value of y when x and z are 0.

This is the genesis of the multivariate linear regression model. More than one input variable is used to estimate the target. A model with two input variables can be expressed as:

y = β0 + β1.x1 + β2.x2

Let us take it a step further. What if we had three variables as inputs? Human visualization capabilities are limited here: we can only visualize three dimensions. In the machine learning world, there can be many dimensions. A model with three input variables can be expressed as:

y = β0 + β1.x1 + β2.x2 + β3.x3

A generalized equation for the multivariate regression model can be:

y = β0 + β1.x1 + β2.x2 + … + βn.xn
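
The generalized equation is simply an intercept plus a weighted sum of the inputs. The sketch below shows this for any number of inputs; the function name predict and the numbers are hypothetical.

```python
def predict(intercept, coefficients, inputs):
    """Return beta0 + beta1*x1 + ... + betan*xn."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one coefficient per input")
    return intercept + sum(beta * x for beta, x in zip(coefficients, inputs))

# One input gives a line, two inputs give a plane (hypothetical numbers).
print(predict(1.0, [2.0], [3.0]))            # 1 + 2*3
print(predict(1.0, [2.0, 0.5], [3.0, 4.0]))  # 1 + 2*3 + 0.5*4
```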

Model Formulation:

Now that there is familiarity with the concept of a multivariate linear regression model let us get back to Fernando.

Fernando reaches out to his friend for more data. He asks him to provide more data on other characteristics of the
cars.

The following were the data points he already had:

 make: make of the car.

 fuelType: type of fuel used by the car.
 nDoor: number of doors.
 engineSize: size of the engine of the car.
 price: the price of the car.

He gets additional data points. They are:

 horsePower: horse power of the car.

 peakRPM: Revolutions per minute around peak power output.
 length: length of the car.
 width: width of the car.
 height: height of the car.

Fernando now wants to build a model that predicts the price based on the additional data points.

The multivariate regression model that he formulates is:

Estimate price as a function of engine size, horse power, peakRPM, length, width and height.

=> price = f(engine size, horse power, peak RPM, length, width, height)

=> price = β0 + β1. engine size + β2. horse power + β3. peak RPM + β4. length + β5. width + β6. height

Model Building:

Fernando inputs these data into his statistical package. The package computes the parameters. The output is the
following:

The multivariate linear regression model provides the following equation for the price estimation.

price = -85090 + 102.85 * engineSize + 43.79 * horse power + 1.52 * peak RPM - 37.91 * length + 908.12 *
width + 364.33 * height

Model Interpretation:

The interpretation of multivariate model provides the impact of each independent variable on the dependent variable
(target).

Remember, the equation provides an estimation of the average value of price. Each coefficient is interpreted with
all other predictors held constant.

Let us now interpret the coefficients.

 Engine Size: With all other predictors held constant, if the engine size is increased by one unit, the average
price increases by $102.85.
 Horse Power: With all other predictors held constant, if the horse power is increased by one unit, the
average price increases by $43.79.
 Peak RPM: With all other predictors held constant, if the peak RPM is increased by one unit, the average
price increases by $1.52.

 Length: With all other predictors held constant, if the length is increased by one unit, the average price
decreases by $37.91 (length has a -ve coefficient).
 Width: With all other predictors held constant, if the width is increased by one unit, the average price
increases by $908.12
 Height: With all other predictors held constant, if the height is increased by one unit, the average price
increases by $364.33
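
The phrase "with all other predictors held constant" can be checked numerically. The sketch below uses the coefficients from the equation above; the car itself is a hypothetical example. Increasing only the width by one unit changes the estimate by exactly the width coefficient.

```python
INTERCEPT = -85090
COEFFICIENTS = {
    "engineSize": 102.85,
    "horsePower": 43.79,
    "peakRPM": 1.52,
    "length": -37.91,
    "width": 908.12,
    "height": 364.33,
}

def estimate_price(car):
    """Average price estimated by the multivariate model."""
    return INTERCEPT + sum(COEFFICIENTS[name] * car[name] for name in COEFFICIENTS)

# Hypothetical car, used only to illustrate the interpretation
car = {"engineSize": 130, "horsePower": 111, "peakRPM": 5000,
       "length": 168.8, "width": 64.1, "height": 48.8}
wider_car = dict(car, width=car["width"] + 1)  # same car, one unit wider

difference = estimate_price(wider_car) - estimate_price(car)
print("Change in estimated price:", round(difference, 2))
```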

Model Evaluation

The model is built. It is interpreted. Are all the coefficients important? Which ones are more significant? How much variation does the model explain? The statistical package provides the metrics to evaluate the model. Let us evaluate the model now.

Recall the discussion on the definition of t-stat, p-value and coefficient of determination. Those concepts apply in multivariate regression models too. The evaluation of the model is as follows:

• coefficients: All coefficients are non-zero (the coefficient for length is negative). This implies that every variable has some estimated effect on the average price.
• t-value: Except for length, the t-values for all coefficients are significantly above zero. For length, the t-stat is -0.70. It implies that the length of the car may not have an impact on the average price.
• p-value: The probability of observing a t-stat this extreme purely by chance is quite low for all of the variables except for length. The p-value for length is 0.4854. This implies that the probability that the observed t-stat arose by chance is 48.54%. This number is quite high.

Recall the discussion of how R-squared helps to explain the variation in the model. When more variables are added to the model, R-squared never decreases; it either stays the same or increases. However, there has to be a balance. Adjusted R-squared strives to keep that balance. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. It compensates for the addition of variables and only increases if the new term improves the model more than would be expected by chance.

• Adjusted R-squared: The adjusted R-squared value is 0.811. This implies that the model can explain 81.1% of the variation seen in the training data. It is better than the previous model (75.03%).
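
The usual formula for the adjustment is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to hypothetical numbers (they are not taken from the car model) to show how a predictor that barely raises R-squared can lower the adjusted value.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors inputs."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical comparison: a sixth predictor lifts R-squared only from 0.800 to 0.801
before = adjusted_r_squared(0.800, n_samples=50, n_predictors=5)
after = adjusted_r_squared(0.801, n_samples=50, n_predictors=6)
print(round(before, 4), round(after, 4))
```

Here R-squared went up slightly, yet the adjusted value went down, which signals that the extra variable did not earn its place in the model.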

Based on these evaluations, Fernando concludes the following:

• All variables except for the length of the car have an impact on the price.
• The length of the car does not have a significant impact on the price.
• The model explains 81.1% of the variation in the data.

Conclusion:

Fernando has a better model now. However, he is perplexed. He now knows that the length of the car does not significantly affect the price.

He wonders:

How can one select the best set of variables for model building? Is there any method to choose the best subsets of
variables?

In the next part of this series, we will discuss variable selection methods.

Complex Real-World Challenges in ML Models

Complex real-world challenges require complex models to be built to give predictions with the utmost accuracy. However, such models do not end up being highly interpretable. In this article, we will look into the relationship between complexity, accuracy and interpretability.

Model Accuracy vs Interpretability

In the real world, while working on any problem, it is important to understand the trade-off between model accuracy and model interpretability. Business users want data scientists to build models with higher accuracy, while data scientists face the problem of explaining to them how these models make predictions.

What is more important: having a model that gives the best accuracy on unseen data, or understanding the predictions even when the accuracy is poor? Below we have a comparison of the accuracy of traditional models against their ability to be interpreted.

Accuracy vs Interpretability

The graph shows some of the most used machine learning algorithms and how interpretable they are. The complexity increases in terms of how the machine learning model works underneath. It can be a parametric model (Linear Models) or a non-parametric model (K-Nearest Neighbour), a simple decision tree (CART) or an ensemble model (bagging method: Random Forest; boosting method: Gradient Boosting Trees). Complex models mostly give better accuracy in their predictions. However, interpreting them is more difficult.

Model Complexity and Accuracy

Typical accuracy-complexity trade-off

The goal of any supervised machine learning algorithm is to achieve low bias and low variance. However, this is rarely possible in real life, and we have a trade-off between bias and variance.

Linear Regression assumes linearity when in reality the relationship may be quite complex. These simplifying assumptions give high bias (both train and test errors are high) and the model tends to underfit. High bias can be reduced by using more complex functions or adding more features. That is when complexity increases and accuracy increases. At a certain point, the model becomes too complex and tends to overfit the training data, i.e. low bias but high variance on test data. Complex models like Decision Trees tend to overfit.

Machine learning models usually have a tendency to overfit; a resampling technique such as cross-validation helps to detect this and to choose models that perform better on unseen data.

Importance of Model Interpretability

In use cases when the impact of the prediction is high, understanding “Why” a certain prediction is made is really
important. Knowing the ‘why’ can help you learn more about the problem, the data and the reason why a model
might fail.

Reasons to learn about interpretability:

1. Curiosity & Learning

2. Safety Measure — Ensure learning is error-free
3. Debugging to detect Bias in model training
4. Interpretability increases social acceptance
5. Debug and audit Machine learning models

Implementation:

Dataset — Bike Rental Prediction

The Bike Rental dataset can be found in the UCI Machine Learning Repository:
[Link]

This dataset contains daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in
Washington D.C., along with weather and seasonal information.

Goal: Predict how many bikes will be rented depending on the weather and the day.

Input Variables:

1. Total_count (target): Count of total rental bikes, including both casual and registered
2. Yr: Year (0: 2011, 1: 2012)
3. Month: Month (1 to 12)
4. Hr: Hour (0 to 23)
5. Temp: Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), with t_min = -8 and t_max = +39 (only in hourly scale)
6. Atemp: Normalized feeling temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), with t_min = -16 and t_max = +50 (only in hourly scale)
7. Humidity: Normalized humidity. The values are divided by 100 (max)
8. Windspeed: Normalized wind speed. The values are divided by 67 (max)
9. Holiday: Whether the day is a holiday or not
10. Weekday: Day of the week
11. Workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
12. Season: Season (1: winter, 2: spring, 3: summer, 4: fall)
13. Weather:

 1: Clear, Few clouds, Partly cloudy, Partly cloudy

 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

Features:

Index(['month', 'hr', 'workingday', 'temp', 'atemp', 'humidity', 'windspeed', 'total_count', 'season_Fall', 'season_Spring',
'season_Summer', 'season_Winter', 'weather_1', 'weather_2', 'weather_3', 'weather_4', 'weekday_0', 'weekday_1',
'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'holiday', 'year'], dtype='object')
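
The min-max scaling described for Temp and Atemp in the variable list can be sketched as follows. The helper names are hypothetical; the limits t_min = -8 and t_max = +39 are the ones stated for Temp. The inverse function is useful when a normalized value has to be read back as degrees Celsius.

```python
TEMP_MIN, TEMP_MAX = -8, 39  # limits stated for the Temp variable

def normalize(value, minimum, maximum):
    """Min-max scaling: map [minimum, maximum] onto [0, 1]."""
    return (value - minimum) / (maximum - minimum)

def denormalize(scaled, minimum, maximum):
    """Inverse of normalize: recover the value in original units."""
    return scaled * (maximum - minimum) + minimum

print(normalize(15.5, TEMP_MIN, TEMP_MAX))    # midpoint of the range
print(denormalize(0.24, TEMP_MIN, TEMP_MAX))  # a normalized reading back in Celsius
```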

Exploratory Data Analysis:

Bike rides increase over time

The number of bike rides increases over the 2 years from 2011 to 2012.

Correlation matrix

Windspeed and humidity have slightly negative correlation. Temp and atemp carry the same information and hence
are highly positively correlated. So for building the model, we can use either temp or atemp.

Histogram of target: on most days, bike rides have been around 20–30 rides/hr

Preprocessing:

Features like casual and registered are dropped because together they make up total_count. Similarly, atemp carries the same information as temp, so one of the two is dropped to reduce multicollinearity. Categorical features are transformed with one-hot encoding (OneHotEncoding) into a format that works better with regression models.
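
One-hot encoding turns each category into its own 0/1 column. The sketch below is a minimal pure-Python illustration with a hypothetical helper, one_hot_encode; it produces column names in the same style as season_Fall and season_Spring in the feature list above.

```python
def one_hot_encode(values, prefix):
    """Return one dict per value with a 0/1 entry for every category seen."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Hypothetical sample of the season column
seasons = ["Winter", "Spring", "Summer", "Fall", "Spring"]
encoded = one_hot_encode(seasons, "season")
for row in encoded:
    print(row)
```

Each row has exactly one column set to 1, so the model never treats the category codes as ordered numbers.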

Model Implementations:

We will go through models of increasing complexity and see how interpretability decreases.

1. Multivariate Linear Regression (Linear, Monotonicity)

2. Decision Tree Regressor
3. Gradient Boosting Regressor

Multivariate Linear Regression:

Linear regression involving multiple variables is called “multiple linear regression” or “multivariate linear
regression”.


The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression that involves more than one explanatory variable.

Regression comes with some assumptions that often do not hold in real-world datasets.

1. Linearity
2. Homoscedasticity (Constant variance)
3. Independence
4. Fixed features
5. Absence of multicollinearity

Linear Regression implementation:

Linear Regression results:

Mean Squared Error: 19592.4703292543
R score: 0.40700134640548247
Mean Absolute Error: 103.67180228987019
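
The two error metrics reported above can be computed by hand as shown in the sketch below. The actual and predicted counts are hypothetical numbers, not values from the bike rental model. Note that the MSE is in squared units, which is why it is so much larger than the MAE in the results.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical rental counts
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 35]
print("MSE:", mean_squared_error(actual, predicted))
print("MAE:", mean_absolute_error(actual, predicted))
```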

Using Cross-validation:

Interpret Multiple Linear Regression:

Linear models are easier to interpret: we can look at the coefficient of each variable to understand its effect on the prediction, and at the intercept.

Intercept of the equation (β0):

The intercept represents the value of y (the target) when all of the features are zero (x = 0).

18.01100142944577

Coefficients corresponding to each feature: they help us understand the effect of each feature on the target outcome.

This means that an increase in “temp” by one unit increases bike rides by 211.05 units, with the other features held constant. The same applies to the rest of the features.

Decision Tree Regressor:

Decision trees work by iteratively splitting the data into distinct subsets in a greedy fashion. For regression trees, the splits are chosen to minimize either the MSE (mean squared error) or the MAE (mean absolute error) within all of the subsets.

CART — Classification and Regression Trees:

CART takes a feature and determines which cut-off point minimizes the variance of y for a regression task. The variance tells us how much the y values in a node are spread around their mean value. The split chosen is the feature and cut-off that minimize the average variance across the resulting subsets, weighted by subset size.
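
The search for a cut-off point can be sketched for a single feature as follows. This is a simplified illustration rather than scikit-learn's implementation: it tries the midpoint between every pair of neighbouring feature values and keeps the one with the lowest size-weighted variance. The data and function names are hypothetical.

```python
def variance(values):
    """Population variance of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_variance) of the best single cut on one feature."""
    best = None
    unique = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical data: rentals jump sharply between hour 3 and hour 7
hours = [1, 2, 3, 7, 8, 9]
rentals = [10, 12, 11, 200, 210, 205]
threshold, weighted_variance = best_split(hours, rentals)
print("Best cut-off:", threshold, "weighted variance:", round(weighted_variance, 2))
```

A full tree repeats this search over every feature and then recursively inside each resulting subset.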

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=15,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=10,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')

Decision tree results:

The Decision Tree Regressor gives a better fit to the data.

Mean Squared Error: 10880.635297455
R score: 0.6706795022162286
Mean Absolute Error: 73.76311613574498

Decision tree split:

The decision tree fits the data better than Linear Regression. The R-squared value is about 0.67.

Using Cross-Validation:

Decision tree graph:

Decision Tree Regressor output

Interpret Decision Trees:

Feature Importance:

The importance of a feature is based on how much it reduces the variance, summed over all the splits in which the feature was used. A feature might be used for more than one split or not at all. We can add up the contributions for each of the p features and get an interpretation of how much each feature has contributed to a prediction.

We can see that hr, temp, year, workingday and season_Spring are the features that were used to split the decision tree.

Decision Tree Regressor — Feature Importance Bar chart

Gradient Boosting Regressor:

Boosting is an ensemble technique in which the predictors are not made independently, but sequentially. Gradient Boosting uses decision trees as weak models.

Boosting is a method of converting weak learners into strong learners by training many models in a gradual, additive and sequential manner, minimizing a loss function (i.e. squared error for regression problems) in the final model.

Because of its boosting technique, GBR often achieves better accuracy than other regression models. It is one of the most widely used regression algorithms in competitions.
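
The sequential, additive idea can be sketched in a few lines. The code below is a simplified illustration, not scikit-learn's GradientBoostingRegressor: it assumes squared-error loss, a single feature, and one-split trees (stumps) as the weak learners. Each stump is fitted to the residuals of the current ensemble, which for squared error are the negative gradient of the loss. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    unique = sorted(set(xs))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Return a prediction function built from sequentially fitted stumps."""
    base = sum(ys) / len(ys)              # start from the mean of the target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]  # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(stump(x) for stump in stumps)

def training_mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical non-linear data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 7, 12, 20, 31, 45, 60, 80]
weak = gradient_boost(xs, ys, n_estimators=1)
strong = gradient_boost(xs, ys, n_estimators=100)
print("MSE after 1 stump:", round(training_mse(weak, xs, ys), 2))
print("MSE after 100 stumps:", round(training_mse(strong, xs, ys), 2))
```

The training error falls as more stumps are added, but the final prediction is the sum of a hundred small corrections, which is exactly what makes the model hard to read.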

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1,
loss='ls', max_depth=6, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None,
presort='deprecated', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0,
warm_start=False)

The result from GBR is as below:

Mean Squared Error: 1388.8979420780786

R score: 0.9579626971080454
Mean Absolute Error: 23.81293483364058

The Gradient Boosting Regressor gives us the best R-squared value, about 0.958. However, this model is very difficult to interpret.

Interpret Ensemble Model:

Ensemble models definitely fall into the category of “Black Box” models since they are composed of many
potentially complex individual models.

Each tree is trained sequentially on the errors left by the trees before it, and the final prediction combines all of them, so gaining a full understanding of the decision process by examining each individual tree is infeasible.

Model’s Goodness of Fit Test

Both the KMO test and Bartlett’s test of sphericity are commonly used to verify the feasibility of the data for Exploratory Factor Analysis (EFA).

• The Kaiser-Meyer-Olkin (KMO) test assesses sampling adequacy by measuring the proportion of variance in the items that may be common variance. Values ranging between .80 and 1.00 indicate sampling adequacy (Cerny & Kaiser, 1977).
• Bartlett’s test of sphericity examines whether a correlation matrix is significantly different from the identity matrix, in which diagonal elements are unities and all off-diagonal elements are zeros (Bartlett, 1950). Significant results indicate that variables in the correlation matrix are suitable for factor analysis.

Classification of fit indices: Absolute and Comparative

• The logic behind absolute fit indices is essentially to test how well the model specified by the researcher reproduces the observed data. Commonly used absolute fit statistics include the χ2 fit statistic, RMSEA and SRMR.
• In contrast, comparative fit indices are based on a different logic, i.e. they assess how well a model specified by a researcher fits the observed sample data relative to a null model (i.e., a model that is based on the assumption that all observed variables are not correlated) (Miles & Shevlin, 2007). Popular comparative model fit indices are the CFI and TLI.

The χ2 fit statistic

• The χ2 measures the discrepancy between the observed and the implied covariance matrices.
• The χ2 fit statistic is very popular and frequently reported in both CFA and SEM studies.
• However, it is notoriously sensitive to large sample sizes and increased model complexity (i.e. models with a large number of indicators and degrees of freedom). Therefore, the current practice is to report it mostly for historical reasons, and it is rarely used to make decisions about the adequacy of model fit.

The RMSEA

• The Root Mean Square Error of Approximation (RMSEA) provides information as to how well the model, with unknown but optimally chosen parameter estimates, would fit the population covariance matrix (Byrne, 1998).
• It is a very commonly used fit statistic.
• One of its key advantages is that confidence intervals can be calculated around its value.
• Values below .060 indicate close fit (Hu & Bentler, 1999). Values up to .080 are commonly accepted as adequate.

The SRMR

• The Standardized Root Mean Square Residual (SRMR) is the square root of the difference between the residuals of the sample covariance matrix and the hypothesized covariance model.
• As SRMR is standardized, its values range between 0 and 1. Commonly, models with values below the .05 threshold are considered to indicate good fit (Byrne, 1998). Also, values up to .08 are acceptable (Hu & Bentler, 1999).

The CFI and TLI

• Two comparative fit indices commonly reported are the Comparative Fit Index (CFI) and the Tucker Lewis Index (TLI). The indices are similar; however, note that the CFI is normed while the TLI is not. Therefore, the CFI’s values range between zero and one, whereas the TLI’s values may fall below zero or be above one (Hair et al., 2013).
• For CFI and TLI, values above .95 are indicative of good fit (Hu & Bentler, 1999). In practice, CFI and TLI values from .90 to .95 are considered acceptable.
• Note that the TLI is non-normed, so its values can go above 1.00.

Note:

Further to the aforementioned information, Hoyle (2012) provides an excellent succinct summary of numerous fit
indices. This table includes, for example, information on the indices' theoretical range, sensitivity to varying sample
size and model complexity. Note that, in contrast to the indices introduced above, a great number of other indices
exist, as illustrated in Hoyle's table. Yet, the frequency of their use is decreasing for various reasons. For example,
RMR is non-normed and thus it is hard to interpret. Here these indices are shown below simply for everyone's
general awareness, i.e. the fact that they exist, who developed them and what their statistical properties are.
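
As a rough illustration of how three of the indices discussed above relate to the χ2 statistic, the sketch below computes RMSEA, CFI and TLI from a model χ2 and a baseline (null model) χ2 using their commonly quoted formulas. It is a sketch under stated assumptions: some software divides by N instead of N - 1 in the RMSEA, and the input values are hypothetical rather than results from a fitted model.

```python
import math

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (N - 1 in the denominator)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index, bounded between 0 and 1."""
    model_misfit = max(chi2_model - df_model, 0.0)
    worst_misfit = max(chi2_model - df_model, chi2_base - df_base, 0.0)
    return 1.0 if worst_misfit == 0 else 1.0 - model_misfit / worst_misfit

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker Lewis Index (non-normed, can fall outside 0 to 1)."""
    base_ratio = chi2_base / df_base
    model_ratio = chi2_model / df_model
    return (base_ratio - model_ratio) / (base_ratio - 1.0)

# Hypothetical fit results
chi2_model, df_model, n = 85.0, 40, 300
chi2_base, df_base = 900.0, 55

print("RMSEA:", round(rmsea(chi2_model, df_model, n), 3))
print("CFI:", round(cfi(chi2_model, df_model, chi2_base, df_base), 3))
print("TLI:", round(tli(chi2_model, df_model, chi2_base, df_base), 3))
```

With these made-up inputs the RMSEA falls between the .060 and .080 guidelines, and the CFI and TLI fall in the .90 to .95 band described above.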

# Simple orientation to programming, basic mathematical packages

import statistics
import math
print("Welcome to the world of data science! Artificial Intelligence, Machine Learning and Deep Learning ")
data = [34, 67, 89, 12, 43, 23, 123]
x = min(data)
print("The minimum value here is : ", x)
print("Feel most welcome! Volume of the cylinder loading...")
#Enter values of the input variables
radius=float(input("Kindly enter the radius of the cylinder: "))
height=float(input("Kindly enter the height of the cylinder: "))
#Lets implement the process here
volume = math.pi * pow(radius, 2) * height
print("The volume of the cylinder is %.2f" %volume)

55 pages
History and Features of Python Language
No ratings yet
History and Features of Python Language
24 pages
Introduction to Scientific Python Programming
No ratings yet
Introduction to Scientific Python Programming
18 pages
Introduction to Python Programming
No ratings yet
Introduction to Python Programming
28 pages
Quick Start Guide to Python Programming
No ratings yet
Quick Start Guide to Python Programming
198 pages
Python Programming Overview and Uses
No ratings yet
Python Programming Overview and Uses
7 pages
Coffee Production and Quality Overview
No ratings yet
Coffee Production and Quality Overview
20 pages
Stakeholder Feasibility Analysis for Bistro
No ratings yet
Stakeholder Feasibility Analysis for Bistro
3 pages
Welcome To The Forest: Author: Bhavna Menon Illustrator: Kavita Singh Kale
No ratings yet
Welcome To The Forest: Author: Bhavna Menon Illustrator: Kavita Singh Kale
25 pages
Two-Dimensional Cutting Stock Review
No ratings yet
Two-Dimensional Cutting Stock Review
14 pages
ASTM D3359 Adhesion Tape Test Methods
No ratings yet
ASTM D3359 Adhesion Tape Test Methods
8 pages
Loan Packages and Complaints Overview
No ratings yet
Loan Packages and Complaints Overview
10 pages
Bioseparations in Two-Phase Aqueous Micellar Systems
No ratings yet
Bioseparations in Two-Phase Aqueous Micellar Systems
2 pages
Lemon Tree Guitar Chords and Tab
No ratings yet
Lemon Tree Guitar Chords and Tab
3 pages
SAEP-1152: Concrete Mix Approval Guide
No ratings yet
SAEP-1152: Concrete Mix Approval Guide
1 page
Leadership Skills Questionnaire for Study
No ratings yet
Leadership Skills Questionnaire for Study
3 pages
Writing Improvement Exercises Guide
No ratings yet
Writing Improvement Exercises Guide
2 pages
Caliper Profile: Selection & Development Guide
No ratings yet
Caliper Profile: Selection & Development Guide
4 pages
Symmetric Key Cryptography Exercises
No ratings yet
Symmetric Key Cryptography Exercises
2 pages
Forged Steel Fittings Certification
No ratings yet
Forged Steel Fittings Certification
15 pages
Conversation Analysis and Early Childhood Education: The Co-Production of Knowledge and Relationships
100% (4)
Conversation Analysis and Early Childhood Education: The Co-Production of Knowledge and Relationships
15 pages
Nintendo vs PlayStation Business Models
No ratings yet
Nintendo vs PlayStation Business Models
12 pages
Yamaha Sport Touring Motorcycles 2019
No ratings yet
Yamaha Sport Touring Motorcycles 2019
48 pages
Biomass Feedstock: Types and Uses
No ratings yet
Biomass Feedstock: Types and Uses
8 pages
Women in Indian Carp Culture - Final Draft For Circulation-28.5.11.
No ratings yet
Women in Indian Carp Culture - Final Draft For Circulation-28.5.11.
31 pages
ICT History Timeline in the Philippines
No ratings yet
ICT History Timeline in the Philippines
9 pages
Jio Jio Customer Support Executive Chat Voice Work From Home Earn Upto 40k February 6 2026
No ratings yet
Jio Jio Customer Support Executive Chat Voice Work From Home Earn Upto 40k February 6 2026
1 page
EPS Assignment-421
No ratings yet
EPS Assignment-421
16 pages
Codigos Cics PDF
No ratings yet
Codigos Cics PDF
586 pages
Teamwork Skills for Food Services NC-II
No ratings yet
Teamwork Skills for Food Services NC-II
3 pages
Reliabilityweb Uptime Element Chart
No ratings yet
Reliabilityweb Uptime Element Chart
1 page
Auto Cad
No ratings yet
Auto Cad
140 pages
Motivasi dan Kinerja Karyawan Koperasi
No ratings yet
Motivasi dan Kinerja Karyawan Koperasi
18 pages
OJT Insights in Civil Engineering
No ratings yet
OJT Insights in Civil Engineering
2 pages
FATF: Global Standards Against Money Laundering
No ratings yet
FATF: Global Standards Against Money Laundering
5 pages
Roxul Comfortboard 80: Insulated Sheathing
No ratings yet
Roxul Comfortboard 80: Insulated Sheathing
1 page