Python Notes
Python Environment Setup Using Python Executable version 3.9 and Jupyterlab IDE (Pycharm, Anaconda, Jupyter
Notebook, Spyder, Atom, Thonny, Eclipse)
Python is easy for beginners to learn and use. It is one of the most accessible programming languages available
because its syntax is simple and uncluttered and reads much like natural language. Because it is easy to learn and
use, Python code can often be written and run with less effort than equivalent code in many other programming
languages.
When Guido van Rossum began creating Python in the late 1980s, he designed it to be a general-purpose
language. One of the main reasons for Python's popularity is its simple syntax, which can be read and understood
even by novice developers.
Python was created more than 30 years ago, which is plenty of time for a programming-language community to
grow and mature enough to support developers from beginner to expert level. Plenty of documentation, guides,
and video tutorials are available, so learners and developers of any skill level or age can get the support they
need to improve their knowledge of Python.
Many students are first introduced to computer science through Python, the same language that is used
for in-depth research projects.
Programming languages grow faster when corporate sponsors back them. For example, PHP is backed by Facebook,
Java by Oracle and Sun, and Visual Basic and C# by Microsoft. Python is heavily backed by
Facebook, Amazon Web Services, and especially Google.
Google adopted Python back in 2006 and has used it for many applications and platforms since then.
Google has devoted a lot of institutional effort and money to the training and success of the language
and has even created a dedicated portal for Python. The list of support tools and documentation
for Python keeps growing.
Thanks to its corporate sponsorship and its large, supportive community, Python has excellent libraries that
save time and effort in the initial cycle of development. Many cloud media services also offer
cross-platform support through library-like tools, which can be extremely beneficial.
Libraries with a specific focus are also available, such as nltk for natural language processing and scikit-learn
for machine learning applications.
Many frameworks and libraries are available for Python, such as:
matplotlib for plotting charts and graphs
SciPy for engineering applications, science, and mathematics
BeautifulSoup for HTML and XML parsing, and Requests for fetching web pages
NumPy for scientific computing
Django for server-side web development
Ask any Python developer, and they will agree that the language is efficient and reliable to work with and that
programs can usually be developed in it much faster than in many other languages. Python can be used in nearly
any kind of environment, and it behaves consistently irrespective of the platform one is working on.
Another strength of Python is its versatility: it can be used in many kinds of environments,
such as mobile applications, desktop applications, web development, hardware programming, and many more. This
wide range of applications makes Python more attractive to use.
Cloud Computing, Machine Learning, and Big Data are some of the hottest trends in the computer science world
right now, and they help many organizations transform and improve their processes and workflows.
Python is the second most popular tool after R for data science and analytics. Many data processing
workloads in organizations are powered by Python. Much research and development is done in Python because
of its many applications, including the ease of analyzing and organizing usable data.
In addition, hundreds of Python libraries, such as TensorFlow for neural networks and OpenCV for computer
vision, are used in thousands of machine learning projects every day.
7. First-choice Language
Python is the first choice for many programmers and students, mainly because it is in high demand in the
development market. Students and developers always look forward to learning a language that is in high demand,
and Python is undoubtedly one of the most sought-after languages in the market now.
Many programmers and data science students use Python for their development projects. Learning
Python is an important part of data science certification courses. In this way, Python can
provide plenty of career opportunities for students. Because of its variety of applications, one can
pursue different career options and need not remain stuck in one.
Python is so flexible that it gives the developer the chance to try something new. An expert in Python is not
limited to building similar kinds of things but can also go on to make something different from before.
Python doesn't restrict developers from building any sort of application. Few other programming languages offer
this kind of freedom and flexibility from learning just one language.
9. Use of python in academics
Python is now treated as a core programming language in schools and colleges because of its countless
uses in Artificial Intelligence, Deep Learning, Data Science, and so on. It has become such a fundamental part of
the development world that schools and colleges cannot afford not to teach it.
This produces more Python developers and programmers, which further expands the language's growth and
popularity.
10. Automation
Python helps a lot in automating tasks because many tools and modules are available that make
things much easier. One can reach an advanced level of automation with just a small amount of
Python code.
Python also boosts productivity in the automation of software testing. Automation tools can be written in
surprisingly little time and in few lines of code.
GUI BUILDING IN PYTHON USING TKINTER
Modern computer applications are user-friendly. User interaction is not restricted to console-based I/O. They have a
more ergonomic graphical user interface (GUI) thanks to high speed processors and powerful graphics hardware.
These applications can receive inputs through mouse clicks and can enable the user to choose from alternatives with
the help of radio buttons, dropdown lists, and other GUI elements (or widgets).
Such applications are developed using one of various graphics libraries available. A graphics library is a software
toolkit having a collection of classes that define a functionality of various GUI elements. These graphics libraries are
generally written in C/C++.
GUI elements and their functionality are defined in the Tkinter module. The following code demonstrates the steps
in creating a UI.
from tkinter import *
window=Tk()
# add widgets here
window.title('Hello Python')
window.geometry("300x200+10+20")
window.mainloop()
First of all, import the tkinter module. After importing, set up the application object by calling the Tk() function.
This will create a top-level window (root) having a frame with a title bar, control box with the minimize and close
buttons, and a client area to hold other widgets. The geometry() method defines the width, height and coordinates of
the top left corner of the frame as below (all values are in pixels): window.geometry("widthxheight+XPOS+YPOS")
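The geometry string format can be explored without opening a window. The sketch below uses two hypothetical
helpers, make_geometry() and parse_geometry(), which are not part of Tkinter, and it assumes non-negative offsets
only:

```python
import re

# Hypothetical helpers (not part of tkinter) for the "WIDTHxHEIGHT+XPOS+YPOS" format.
GEOMETRY_PATTERN = re.compile(r"^(\d+)x(\d+)\+(\d+)\+(\d+)$")


def make_geometry(width, height, xpos, ypos):
    """Build a geometry string such as "300x200+10+20" (all values in pixels)."""
    return f"{width}x{height}+{xpos}+{ypos}"


def parse_geometry(geometry):
    """Split a geometry string back into its four integer parts."""
    match = GEOMETRY_PATTERN.match(geometry)
    if match is None:
        raise ValueError(f"not a WIDTHxHEIGHT+XPOS+YPOS string: {geometry!r}")
    width, height, xpos, ypos = (int(part) for part in match.groups())
    return {"width": width, "height": height, "xpos": xpos, "ypos": ypos}


print(make_geometry(300, 200, 10, 20))
print(parse_geometry("400x300+10+10"))
```

The string returned by make_geometry() could be passed directly to the geometry() method of the window.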
The application object then enters an event listening loop by calling the mainloop() method. The application is now
constantly waiting for any event generated on the elements in it. The event could be text entered in a text field, a
selection made from the dropdown or radio button, single/double click actions of mouse, etc. The application's
functionality involves executing appropriate callback functions in response to a particular type of event. We shall
discuss event handling later in this tutorial. The event loop will terminate as and when the close button on the title
bar is clicked. The above code will create the following window:
Python-Tkinter Window
All Tkinter widget classes are inherited from the Widget class. Let's add the most commonly used widgets.
Button
The button can be created using the Button class. The Button class constructor requires a reference to the main
window and to the options.
You can set the following important properties to customize a button:
Example: Button
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Label
A label can be created in the UI in Python using the Label class. The Label constructor requires the top-level
window object and options parameters. Option parameters are similar to the Button object.
Example: Label
from tkinter import *
window=Tk()
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
lbl.place(x=60, y=50)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Here, the label's caption will be displayed in red colour using Helvetica font of 16 point size.
Entry
This widget renders a single-line text box for accepting the user input. For multi-line text input use the Text widget.
Apart from the properties already mentioned, the Entry class constructor accepts the following:
The following example creates a window with a button, label and entry field.
from tkinter import *
window=Tk()
btn=Button(window, text="This is Button widget", fg='blue')
btn.place(x=80, y=100)
lbl=Label(window, text="This is Label widget", fg='red', font=("Helvetica", 16))
lbl.place(x=60, y=50)
txtfld=Entry(window, text="This is Entry Widget", bd=5)
txtfld.place(x=80, y=150)
window.title('Hello Python')
window.geometry("300x200+10+10")
window.mainloop()
Selection Widgets
Radiobutton: This widget displays a toggle button having an ON/OFF state. There may be more than one button,
but only one of them will be ON at a given time.
Checkbutton: This is also a toggle button. A rectangular check box appears before its caption. Its ON state is
displayed by the tick mark in the box which disappears when it is clicked to OFF.
Combobox: This class is defined in the ttk module of the tkinter package. It populates the drop-down list from a
collection data type, such as a tuple or a list, passed as the values parameter.
Listbox: Unlike Combobox, this widget displays the entire collection of string items. The user can select one or
multiple items.
The following example demonstrates the window with the selection widgets: Radiobutton, Checkbutton, Listbox and
Combobox:
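from tkinter import *
from tkinter.ttk import Combobox
window=Tk()
var = StringVar()
var.set("one")
data=("one", "two", "three", "four")
cb=Combobox(window, values=data)
cb.place(x=60, y=150)
# Listbox showing the same items (size and position chosen for illustration)
lb=Listbox(window, height=5, selectmode='multiple')
for num in data:
    lb.insert(END, num)
lb.place(x=250, y=150)
v0=IntVar()
v0.set(1)
r1=Radiobutton(window, text="male", variable=v0,value=1)
r2=Radiobutton(window, text="female", variable=v0,value=2)
r1.place(x=100,y=50)
r2.place(x=180, y=50)
v1 = IntVar()
v2 = IntVar()
C1 = Checkbutton(window, text = "Cricket", variable = v1)
C2 = Checkbutton(window, text = "Tennis", variable = v2)
C1.place(x=100, y=100)
C2.place(x=180, y=100)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()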
Create UI in Python-Tkinter
Event Handling
An event is a notification received by the application object from various GUI widgets as a result of user interaction.
The Application object is always anticipating events as it runs an event listening loop. User's actions include mouse
button click or double click, keyboard key pressed while control is inside the text box, certain element gains or goes
out of focus etc.
Events are expressed as strings in <modifier-type-qualifier> format.
Many events are represented by the qualifier alone. The type defines the class of the event.
The following table shows how Tkinter recognizes different events:
An event should be registered with one or more GUI widgets in the application. If it's not, it will be ignored. In
Tkinter, there are two ways to register an event with a widget. First way is by using the bind() method and the
second way is by using the command parameter in the widget constructor.
Bind() Method
The bind() method associates an event with a callback function so that, when the event occurs, the function is called.
Syntax:
Widget.bind(event, callback)
For example, to invoke the MyButtonClicked() function on left button click, use the following code:
The event object is characterized by many properties such as source widget, position coordinates, mouse button
number and event type. These can be passed to the callback function if required.
Command Parameter
Each widget primarily responds to a particular type of event. For example, Button is a source of the Button event,
so it is bound to it by default. The constructors of many widget classes have an optional parameter called command.
This parameter is set to the callback function that will be invoked whenever the bound event occurs. This
approach is more convenient than the bind() method.
In the example given below, the application window has two text input fields and a third one to display the result.
There are two button objects with the captions Add and Subtract. The user is expected to enter a number in each of
the first two Entry widgets. Their sum or difference is displayed in the third.
The first button (Add) is configured using the command parameter. Its value is the add() method of the class. The
second button uses the bind() method to register the left button click with the sub() method. Both methods read the
contents of the text fields with the get() method of the Entry widget, parse them to numbers, perform the
addition or subtraction, and display the result in the third text field using the insert() method.
Example:
from tkinter import *
class MyWindow:
    def __init__(self, win):
        # labels and text fields
        self.lbl1=Label(win, text='First number')
        self.lbl2=Label(win, text='Second number')
        self.lbl3=Label(win, text='Result')
        self.t1=Entry(bd=3)
        self.t2=Entry()
        self.t3=Entry()
        self.lbl1.place(x=100, y=50)
        self.t1.place(x=200, y=50)
        self.lbl2.place(x=100, y=100)
        self.t2.place(x=200, y=100)
        # Add uses the command parameter; Subtract uses bind()
        self.b1=Button(win, text='Add', command=self.add)
        self.b2=Button(win, text='Subtract')
        self.b2.bind('<Button-1>', self.sub)
        self.b1.place(x=100, y=150)
        self.b2.place(x=200, y=150)
        self.lbl3.place(x=100, y=200)
        self.t3.place(x=200, y=200)
    def add(self):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1+num2
        self.t3.insert(END, str(result))
    def sub(self, event):
        self.t3.delete(0, 'end')
        num1=int(self.t1.get())
        num2=int(self.t2.get())
        result=num1-num2
        self.t3.insert(END, str(result))
window=Tk()
mywin=MyWindow(window)
window.title('Hello Python')
window.geometry("400x300+10+10")
window.mainloop()
UI in Python-Tkinter
Data cleaning/Data Wrangling is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data
is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute
way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset.
But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way
every time.
Data cleaning is the process that removes data that does not belong in your dataset. Data transformation is the
process of converting data from one format or structure into another. Transformation processes can also be referred
to as data wrangling, or data munging, transforming and mapping data from one "raw" data form into another format
for warehousing and analyzing. This article focuses on the processes of cleaning that data.
Python is an easy-to-learn programming language, which makes it the most preferred choice for beginners in Data
Science, Data Analytics, and Machine Learning. It also has a great community of online learners and excellent data-
centric libraries.
With so much data being generated, it becomes important that the data we use for Data Science applications like
Machine Learning and Predictive Modeling is clean. But what do we mean by clean data? And what makes data
dirty in the first place?
Dirty data simply means data that is erroneous. Duplicate records, incomplete or outdated data, and improper
parsing can make data dirty. This data needs to be cleaned. Data cleaning (or data cleansing) refers to the process of
“cleaning” this dirty data by identifying errors in the data and then rectifying them.
Data cleaning is an important step in any Machine Learning project, and we will cover some basic data cleaning
techniques (in Python) in this article.
We will learn more about data cleaning in Python with the help of a sample dataset. We will use the Russian housing
dataset from the Kaggle repository.
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Download the data, and then read it into a Pandas DataFrame by using the read_csv() function, and specifying the
file path. Then use the shape attribute to check the number of rows and columns in the dataset. The code for this is
as below:
df = pd.read_csv('housing_data.csv')
df.shape
We will now separate the numeric columns from the categorical columns.
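As a minimal sketch of this separation, assuming a small hypothetical DataFrame named toy in place of the housing
data, select_dtypes() can split the columns by data type:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataset.
toy = pd.DataFrame({
    "full_sq": [43.0, 34.0, 89.0],
    "floor": [4, 3, 9],
    "sub_area": ["Bibirevo", "Tverskoe", "Mitino"],
})

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```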
We are now through with the preliminary steps. We can now move on to data cleaning. We will start by identifying
columns that contain missing values and try to fix them.
Missing values
We will start by calculating the percentage of values missing in each column, and then storing this information in a
DataFrame.
The DataFrame pct_missing_df now contains the percentage of missing values in each column along with the
column names.
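One possible way to build such a DataFrame is sketched below. The toy data and the column labels column_name and
pct_missing are assumptions made for illustration; only the name pct_missing_df comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing cells.
toy = pd.DataFrame({
    "life_sq": [27.0, np.nan, 50.0, np.nan],
    "floor": [4.0, 3.0, np.nan, 9.0],
    "sub_area": ["Bibirevo", "Tverskoe", "Mitino", "Mitino"],
})

# isnull() marks missing cells; the mean of those booleans per column
# is the fraction of missing values, scaled here to a percentage.
pct_missing = toy.isnull().mean() * 100
pct_missing_df = pd.DataFrame({
    "column_name": toy.columns,
    "pct_missing": pct_missing.values,
})
print(pct_missing_df)
```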
We can also create a visual out of this information for better understanding using the code below:
The output after execution of the above line of code should look like this:
It is clear that some columns have very few values missing, while other columns have a substantial % of values
missing. We will now fix these missing values.
There are a number of ways in which we can fix these missing values. Some of them are:
Drop Observations
One way could be to drop those observations that contain a null value in any of the columns. This works
when the percentage of missing values in each column is very low. We will drop observations that contain nulls
in those columns that have less than 0.5% nulls. These columns are metro_min_walk, metro_km_walk,
railroad_station_walk_km, railroad_station_walk_min, and ID_railroad_station_walk.
This will reduce the number of records in our dataset to 30,446 records.
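A minimal sketch of dropping such observations with dropna(), assuming a hypothetical toy DataFrame in which only
metro_min_walk is treated as a low-missing column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only metro_min_walk is treated as a "low-missing" column.
toy = pd.DataFrame({
    "metro_min_walk": [13.6, np.nan, 7.0, 3.2],
    "full_sq": [43.0, 34.0, np.nan, 89.0],
})

low_missing_cols = ["metro_min_walk"]

# subset= restricts the null check to the listed columns, so rows with
# nulls in other columns (here full_sq) are kept for later treatment.
cleaned = toy.dropna(subset=low_missing_cols)
print(cleaned.shape)
```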
Another way to tackle missing values in a dataset would be to drop those columns or features that have a significant
percentage of values missing. Such columns don’t contain a lot of information and can be dropped altogether from
the dataset. In our case, let us drop all those columns that have more than 40% values missing in them. These
columns would be build_year, state, hospital_beds_raion, cafe_sum_500_min_price_avg,
cafe_sum_500_max_price_avg, and cafe_avg_price_500.
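The column-dropping step can be sketched as follows. The toy DataFrame is hypothetical; the 40% threshold is the
one used in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: build_year is 75% missing, full_sq is 25% missing.
toy = pd.DataFrame({
    "build_year": [np.nan, np.nan, 1990.0, np.nan],
    "full_sq": [43.0, 34.0, np.nan, 89.0],
    "floor": [4, 3, 5, 9],
})

# Percentage of missing values per column.
pct_missing = toy.isnull().mean() * 100

# Keep only the labels of columns above the 40% threshold, then drop them.
cols_to_drop = pct_missing[pct_missing > 40].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```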
There is still missing data left in our dataset. We will now impute the missing values in each numerical column with
the median value of that column.
df_numeric = df.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
for col in numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0: # impute values only for columns that have missing values
        med = df[col].median() # impute with the median
        df[col] = df[col].fillna(med)
Missing values in numerical columns are now fixed. In the case of categorical columns, we will replace missing
values with the mode values of that column.
df_non_numeric = df.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
for col in non_numeric_cols:
    missing = df[col].isnull()
    num_missing = np.sum(missing)
    if num_missing > 0: # impute values only for columns that have missing values
        mod = df[col].describe()['top'] # impute with the most frequently occurring value
        df[col] = df[col].fillna(mod)
All missing values in our dataset have now been treated. We can verify this by running the following piece of code:
df.isnull().sum().sum()
If the output is zero, it means that there are no missing values left in our dataset now.
We can also replace missing values with a particular value (like -9999 or ‘missing’) which will indicate the fact that
the data was missing in this place. This can be a substitute for missing value imputation.
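A minimal sketch of this sentinel approach, assuming a hypothetical toy DataFrame; fillna() accepts a dictionary
that maps each column to its own replacement value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "life_sq": [27.0, np.nan, 50.0],
    "sub_area": ["Bibirevo", None, "Mitino"],
})

# A numeric sentinel for the numeric column, a text sentinel for the categorical one.
filled = toy.fillna({"life_sq": -9999, "sub_area": "missing"})
print(filled)
```

Sentinel values such as -9999 distort statistics like the mean, so they normally have to be excluded or handled
explicitly in later analysis.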
Outliers
An outlier is an unusual observation that lies away from the majority of the data. Outliers can affect the performance
of a Machine Learning model significantly. Hence, it becomes important to identify outliers and treat them.
Let us take the ‘life_sq’ column as an example. We will first use the describe() method to look at the descriptive
statistics and see if we can gather any information from it.
df.life_sq.describe()
count 30446.000000
mean 33.482658
std 46.538609
min 0.000000
25% 22.000000
50% 30.000000
75% 38.000000
max 7478.000000
Name: life_sq, dtype: float64
From the output, it is clear that something is not correct. The max value seems to be abnormally large compared to
the mean and median values. Let us make a boxplot of this data to get a better idea.
It is clear from the boxplot that the observation corresponding to the maximum value (7478) is an outlier in this data.
Descriptive statistics, boxplots, and scatter plots help us in identifying outliers in the data.
We can deal with outliers just like we dealt with missing values. We can either drop the observations that we think
are outliers, or we can replace the outliers with suitable values, or we can perform some sort of transformation on
the data (like log or exponential). In our case, let us drop the record where the value of ‘life_sq’ is 7478.
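Dropping that record can be sketched with a boolean mask. The toy DataFrame below is hypothetical and merely
reuses the value 7478 from the text:

```python
import pandas as pd

# Hypothetical toy data containing the extreme value discussed in the text.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "life_sq": [22.0, 30.0, 7478.0, 38.0],
})

# Keep every row whose life_sq differs from the outlier value.
cleaned = toy[toy["life_sq"] != 7478.0]
print(cleaned["life_sq"].describe())
```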
Duplicate records
Data can sometimes contain duplicate values. It is important to remove duplicate records from your dataset before
you proceed with any Machine Learning project. In our data, since the ID column is a unique identifier, we will drop
duplicate records by considering all but the ID column.
This drops the duplicate records. By using the shape attribute, you can check that the duplicate records
have actually been dropped. The number of observations is now 30,434.
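One way to drop duplicates while ignoring the identifier is sketched below, assuming a hypothetical toy DataFrame
whose unique identifier column is named id:

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are identical apart from the id.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "full_sq": [43.0, 43.0, 34.0],
    "sub_area": ["Bibirevo", "Bibirevo", "Mitino"],
})

# Compare rows on every column except the unique identifier.
cols_to_check = [col for col in toy.columns if col != "id"]
deduped = toy.drop_duplicates(subset=cols_to_check, keep="first")
print(deduped.shape)
```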
Fixing Datatype
Often in the dataset, values are not stored in the correct data type. This can create problems in later stages: we
may not get the desired output, or we may get errors during execution. One common data type error is with dates.
Dates are often parsed as objects (strings) by Pandas, although Pandas has a separate data type for dates, called
DateTime (datetime64).
We will first check the data type of the timestamp column in our data.
df.timestamp.dtype
This returns the data type ‘object’. We now know the timestamp is not stored correctly. To fix this, let’s convert the
timestamp column to the DateTime format.
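The conversion can be sketched with pd.to_datetime(). The toy DataFrame and the date format string below are
assumptions for illustration, not details taken from the dataset:

```python
import pandas as pd

# Hypothetical toy data with dates stored as text.
toy = pd.DataFrame({"timestamp": ["2011-08-20", "2011-08-23", "2011-08-27"]})
print(toy["timestamp"].dtype)  # a text dtype before conversion

# Parse the strings into proper datetime values; format= makes the expected layout explicit.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d")
print(toy["timestamp"].dtype)  # a datetime64 dtype after conversion
```

Once converted, date parts are available through the .dt accessor, for example toy["timestamp"].dt.year.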
We now have the timestamp in the correct format. Similarly, there can be columns where integers are stored as
objects. Identifying such features and correcting the data type is important before you proceed on to Machine
Learning. Fortunately for us, we don’t have any such issue in our dataset.
Web Scraping
Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can
extract, parse, download and organize useful information from the web automatically. In other words, instead of
manually saving the data from websites, the web scraping software will automatically load and extract data from
multiple websites as per our requirement.
In this section, we are going to discuss useful Python libraries for web scraping.
Requests
It is a simple python web scraping library. It is an efficient HTTP library used for accessing web pages. With the
help of Requests, we can get the raw HTML of web pages which can then be parsed for retrieving the data. Before
using requests, let us understand its installation.
Urllib3
It is another Python library that can be used for retrieving data from URLs similar to the requests library.
Selenium
It is an open source automated testing suite for web applications across different browsers and platforms. It is not a
single tool but a suite of software. We have selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we
are going to perform web scraping by using selenium and its Python bindings. You can learn more about Selenium
with Java on the link Selenium.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome,
Remote etc. The current supported Python versions are 2.7, 3.5 and above.
Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page
with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 licensed under BSD, with a
milestone 1.0 releasing in June 2015. It provides us all the tools we need to extract, process and structure the data
from websites.
Analyzing a web page means understanding its structure. Why is this important for web scraping? Let us understand
this in detail.
Web page analysis is important because without it we cannot know in which form (structured or unstructured) we
are going to receive the data from that web page after extraction. We can do web page analysis in the
following ways −
This is a way to understand how a web page is structured by examining its source code. To implement this, we need
to right click the page and then must select the View page source option. Then, we will get the data of our interest
from that web page in the form of HTML. But the main concern is about whitespaces and formatting which is
difficult for us to format.
This is another way of analyzing web page. But the difference is that it will resolve the issue of formatting and
whitespaces in the source code of web page. You can implement this by right clicking and then selecting the Inspect
or Inspect element option from menu. It will provide the information about particular area or element of that web
page.
The following methods are mostly used for extracting data from a web page −
Regular Expression
Regular expressions are a small, highly specialized programming language embedded in Python and available through
the re module. They are also called REs, regexes, or regex patterns. With the help of regular expressions, we can
specify rules for the possible set of strings we want to match in the data.
Example
In the following example, we are going to scrape data about India from [Link] after
matching the contents of <td> with the help of regular expression.
import re
import urllib.request
response = urllib.request.urlopen('[Link]')
html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>',text)
Output
[
'<img src="/places/static/images/flags/[Link]" />',
'3,287,590 square kilometres',
'1,173,108,018',
'IN',
'India',
'New Delhi',
'<a href="/places/default/continent/AS">AS</a>',
'.in',
'INR',
'Rupee',
'91',
'######',
'^(\\d{6})$',
'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
'<div>
<a href="/places/default/iso/CN">CN </a>
<a href="/places/default/iso/NP">NP </a>
<a href="/places/default/iso/MM">MM </a>
<a href="/places/default/iso/BT">BT </a>
<a href="/places/default/iso/PK">PK </a>
<a href="/places/default/iso/BD">BD </a>
</div>'
]
Observe that in the above output you can see the details about country India by using regular expression.
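The pattern used above can also be tried offline. The sketch below applies it to a hypothetical inline HTML
fragment instead of a downloaded page, so the names html_text and cell_pattern are illustrative only:

```python
import re

# Hypothetical HTML fragment imitating the table cells matched in the example above.
html_text = (
    '<tr><td class="w2p_fl">Country:</td><td class="w2p_fw">India</td></tr>'
    '<tr><td class="w2p_fl">Capital:</td><td class="w2p_fw">New Delhi</td></tr>'
)

# (.*?) is a non-greedy group: it captures as little as possible up to the next </td>.
cell_pattern = re.compile(r'<td class="w2p_fw">(.*?)</td>')
values = cell_pattern.findall(html_text)
print(values)
```

Only the cells with class w2p_fw are captured; the label cells with class w2p_fl do not match the pattern.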
Beautiful Soup
Suppose we want to collect all the hyperlinks from a web page. For this we can use a parser called BeautifulSoup,
which is described in more detail at [Link] In simple words,
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It can be used with requests,
because it needs an input (document or url) to create a soup object, as it cannot fetch a web page by itself. You can
use the following Python script to gather the title of a web page and its hyperlinks.
Example
Note that in this example, we are extending the above example implemented with the requests Python module. We are
using r.text for creating a soup object which will further be used to fetch details like the title of the webpage.
import requests
from bs4 import BeautifulSoup
In the following line of code we use requests to make a GET HTTP request for the url:
[Link]
r = requests.get('[Link]')
Now we need to create a Soup object as follows −
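A minimal offline sketch of this step is shown here. It assumes a hypothetical page_source string in place of the
text of a real response, and it uses the html.parser parser bundled with Python:

```python
from bs4 import BeautifulSoup

# Hypothetical page source; in the text this would come from r.text.
page_source = """
<html><head><title>Example page</title></head>
<body>
<a href="/first">First link</a>
<a href="https://example.com/second">Second link</a>
</body></html>
"""

# "html.parser" ships with Python; "lxml" can be used instead if it is installed.
soup = BeautifulSoup(page_source, "html.parser")

page_title = soup.title.text                              # text inside <title>
links = [a.get("href") for a in soup.find_all("a")]       # href of every <a> tag

print(page_title)
print(links)
```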
Output
In the following example, we are scraping a particular element of the web page from [Link] by
using lxml and requests −
First, we need to import the requests and html from lxml library as follows −
import requests
from lxml import html
url = '[Link]'
Now we need to provide the path (XPath) to a particular element of that web page −
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())
Output
The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.
Data Analysis
To process the data that has been scraped, we must store the data on our local machine in a particular format like
spreadsheet (CSV), JSON or sometimes in databases like MySQL.
CSV and JSON Data Processing
First, we are going to write the information grabbed from a web page into a CSV file or a spreadsheet. Let us
first understand this through a simple example in which we grab the information using the BeautifulSoup module,
as we did earlier, and then use the Python csv module to write that textual information into a CSV file.
import requests
from bs4 import BeautifulSoup
import csv
In the following line of code, we use requests to make a GET HTTP request for the url:
[Link]
r = requests.get('[Link]')
Now, with the help of next lines of code, we will write the grabbed data into a CSV file named [Link].
After running this script, the textual information or the title of the webpage will be saved in the above mentioned
CSV file on your local machine.
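The writing step can be sketched with the standard csv module. The value page_title and the file name
scraped_title.csv are hypothetical stand-ins, and the file is written to a temporary directory so the sketch
leaves nothing behind in the working directory:

```python
import csv
import os
import tempfile

# Hypothetical scraped value and output file name.
page_title = "Example page"
csv_path = os.path.join(tempfile.mkdtemp(), "scraped_title.csv")

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["title"])      # header row
    writer.writerow([page_title])   # one data row

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))
print(rows)
```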
Similarly, we can save the collected information in a JSON file. The following is an easy to understand Python
script for doing the same in which we are grabbing the same information as we did in last Python script, but this
time the grabbed information is saved in [Link] by using JSON Python module.
import requests
from bs4 import BeautifulSoup
import json
r = requests.get('[Link]')
soup = BeautifulSoup(r.text, 'lxml')
y = json.dumps(soup.title.text)
with open('[Link]', 'wt') as outfile:
    json.dump(y, outfile)
After running this script, the grabbed information i.e. title of the webpage will be saved in the above mentioned text
file on your local machine.
Sometimes we may want to save scraped data in our local storage for archival purposes. But what if we need
to store and analyze this data at a massive scale? The answer is a cloud storage service named Amazon S3 or AWS S3
(Simple Storage Service). Basically, AWS S3 is an object store built to store and retrieve any amount of
data from anywhere.
Step 1 − First we need an AWS account, which will provide us the secret keys to use in our Python script while
storing the data. It will create an S3 bucket in which we can store our data.
Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the
following command −
pip install boto3
Step 3 − Next, we can use the following Python script for scraping data from web page and saving it to AWS S3
bucket.
First, we need to import the Python libraries. Here we use requests for scraping and boto3 for saving data to
the S3 bucket.
import requests
import boto3
s3 = [Link]('s3')
bucket_name = "our-content"
Now you can check the bucket with name our-content from your AWS account.
Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link
[Link]
With the help of the following steps, we can scrape and process data into a MySQL table −
Step 1 − First, by using MySQL we need to create a database and a table in which we want to save our scraped
data. For example, we create the table with the following query −
Step 2 − Next, we need to deal with Unicode. Depending on the MySQL version and configuration, the default
character set may not handle full Unicode. We need to turn on this feature with the following commands, which
change the default character set for the database, for the table and for both of the columns.
Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the
following command:
pip install PyMySQL
Step 4 − Now our database named Scrap, created earlier, is ready to store the data scraped from the web in the
table named Scrap_pages. In our example we scrape data from Wikipedia and save it into our database.
def getLinks(articleUrl):
    html = urlopen('[Link]
    bs = BeautifulSoup(html, '[Link]')
    title = [Link]('h1').get_text()
    content = [Link]('div', {'id':'mw-content-text'}).find('p').get_text()
    store(title, content)
    return [Link]('div', {'id':'bodyContent'}).findAll('a',href=[Link]('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
try:
    while len(links) > 0:
        newArticle = links[[Link](0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
finally:
    [Link]()
    [Link]()
This will save the data gathered from Wikipedia into the table named scrap_pages. If you are familiar with MySQL
and web scraping, then the above code should not be tough to understand.
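The store(title, content) helper called in the code above is not shown in these notes. The following is a minimal sketch of what such a helper might look like. It swaps in Python's built-in sqlite3 module with an in-memory database instead of MySQL so that it runs without a server; the table layout, column names and sample values are all assumptions.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: one row per scraped page.
cur.execute(
    "CREATE TABLE scrap_pages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT NOT NULL, "
    "content TEXT NOT NULL)"
)

def store(title, content):
    """Insert one scraped page; placeholders keep the values safely quoted."""
    cur.execute(
        "INSERT INTO scrap_pages (title, content) VALUES (?, ?)",
        (title, content),
    )
    conn.commit()

# Hypothetical scraped values, including non-ASCII text.
store("Kevin Bacon", "First paragraph of the article.")
store("Café", "Unicode text such as é is stored as-is.")

cur.execute("SELECT title FROM scrap_pages ORDER BY id")
titles = [row[0] for row in cur.fetchall()]
print(titles)

cur.close()
conn.close()
```

With PyMySQL the overall shape stays the same, but the placeholder in the query is written as %s rather than ?.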
PostgreSQL, developed by a worldwide team of volunteers, is an open source relational database management
system (RDBMS). Processing the scraped data with PostgreSQL is similar to doing so with MySQL. There
are two changes: first, the commands differ from those of MySQL, and second, we use the psycopg2
Python library to integrate it with Python.
If you are not familiar with PostgreSQL then you can learn it at [Link] The
psycopg2 Python library can be installed with the following command:
pip install psycopg2
Introduction to Natural Language Processing
You can perform text analysis by using the Python library called the Natural Language Toolkit (NLTK). Before
proceeding to the concepts of NLTK, let us understand the relation between text analysis and web scraping.
Analyzing the words in a text tells us which words are important, which words are unusual, and how words are
grouped. This analysis eases the task of web scraping.
The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying
and tagging parts of speech found in natural-language text such as English.
Installing NLTK
If you are using Anaconda, then the conda package for NLTK can be installed by using the following command −
conda install -c anaconda nltk
After installing NLTK, we have to download its preset text repositories. But before downloading them, we need
to import NLTK with the import command as follows −
import nltk
Now, with the help of following command NLTK data can be downloaded −
[Link]()
Installing all available NLTK packages will take some time, but it is recommended to install all of them.
We also need some other Python packages, such as gensim and pattern, for doing text analysis as well as
building natural language processing applications with NLTK.
gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the
following command −
pip install gensim
pattern − Used to make the gensim package work properly. It can be installed by the following command −
pip install pattern
Tokenization
The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can
be words, numbers or punctuation marks. It is also called word segmentation.
Example
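As a rough illustration of the idea, the following sketch splits a sentence into word and punctuation tokens with a regular expression from the standard library. It is not NLTK itself: the pattern is an assumption chosen to behave like a word/punctuation tokenizer, and the sample sentence is made up.

```python
import re

# Assumed pattern: runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def tokenize(text):
    """Return the word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("NLTK breaks text into tokens, doesn't it?")
print(tokens)
```

Note how the apostrophe in "doesn't" becomes its own token; tokenizers differ in how they treat such cases, which is one reason NLTK offers several of them.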
The NLTK module provides different packages for tokenization. We can use these packages as per our requirement.
Some of the packages are described here −
sent_tokenize package − This package divides the input text into sentences. You can use the following
command to import this package −
from nltk.tokenize import sent_tokenize
word_tokenize package − This package divides the input text into words. You can use the following command
to import this package −
from nltk.tokenize import word_tokenize
WordPunctTokenizer package − This package divides the input text into words and punctuation marks. You can
use the following command to import this package −
from nltk.tokenize import WordPunctTokenizer
Stemming
In any language, there are different forms of a word. A language includes lots of variations for grammatical
reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as
well as for web scraping projects, it is important for machines to understand that these different words have the same
base form. Hence it can be useful to extract the base forms of the words while analyzing the text.
This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of
words by chopping off their ends.
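To make the "chopping off the ends" idea concrete, here is a deliberately crude suffix stripper. It is a toy sketch, not Porter's, Lancaster's or Snowball's algorithm: the suffix list, the function name toy_stem and the minimum stem length are all assumptions chosen for illustration.

```python
# Illustrative suffix list only; real stemmers such as Porter's use ordered
# rule sets with extra conditions on the remaining stem.
SUFFIXES = ("ization", "ing", "ic", "es", "s")

def toy_stem(word, min_stem_length=3):
    """Chop the first listed suffix that matches, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["democratic", "democratization", "writing", "democracy"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Two of the three related words collapse to the same stem, while "democracy" is left untouched because no listed suffix matches. This shows why stemming is called a heuristic: its results depend entirely on the rules chosen.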
The NLTK module provides different packages for stemming. We can use these packages as per our requirement.
Some of these packages are described here −
PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form.
You can use the following command to import this package −
from nltk.stem.porter import PorterStemmer
For example, given the word ‘writing’ as input, this stemmer outputs the word ‘write’.
LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base
form. You can use the following command to import this package −
from [Link] import LancasterStemmer
For example, given the word ‘writing’ as input, this stemmer outputs the word ‘writ’.
SnowballStemmer package − The Snowball algorithm is used by this Python stemming package to extract the base
form. You can use the following command to import this package −
from nltk.stem.snowball import SnowballStemmer
For example, given the word ‘writing’ as input, this stemmer outputs the word ‘write’.
Lemmatization
Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings
by using vocabulary and morphological analysis. The base form of a word after lemmatization is called its lemma.
WordNetLemmatizer package − It extracts the base form of the word depending upon whether it is used as a
noun or as a verb. You can use the following command to import this package −
from nltk.stem import WordNetLemmatizer
Chunking
Chunking, which means dividing the data into small chunks, is one of the important processes in natural language
processing to identify the parts of speech and short phrases like noun phrases. Chunking is to do the labeling of
tokens. We can get the structure of the sentence with the help of chunking process.
Example
In this example, we are going to implement noun-phrase (NP) chunking by using the NLTK Python module. NP
chunking is a category of chunking which finds the noun-phrase chunks in a sentence.
We need to follow the steps given below for implementing noun-phrase chunking −
In the first step, we define the grammar for chunking. It consists of the rules which we need to follow.
Next, we create a chunk parser. It parses the grammar and gives the output.
First, we need to import the NLTK package as follows −
import nltk
Next, we need to define the sentence. Here DT is the determiner, VBP the verb, JJ the adjective, IN the
preposition and NN the noun.
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, next line of code will define a parser for parsing the grammar.
parser_chunking = [Link](grammar)
parser_chunking.parse(sentence)
Output = parser_chunking.parse(sentence)
With the help of following code, we can draw our output in the form of a tree as shown below.
[Link]()
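To see what the grammar NP:{<DT>?<JJ>*<NN>} selects, the matching can be emulated in plain Python. This is a sketch of the rule's logic only, not NLTK's parser: the function np_chunks and the tagged sample sentence are hypothetical, since the sentence definition is not shown in these notes.

```python
def np_chunks(tagged):
    """Find chunks matching: optional DT, any number of JJ, then one NN."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":               # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":     # required noun closes the chunk
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1                             # no match here; try the next token
    return chunks

# Hypothetical tagged sentence as (word, part-of-speech tag) pairs.
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

phrases = [" ".join(word for word, _ in chunk) for chunk in np_chunks(sentence)]
print(phrases)
```

The verbs and the preposition fall outside every chunk, which is how chunking exposes the noun phrases of the sentence.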
Bag of Words (BoW) Model: Extracting and Converting the Text into Numeric Form
Bag of Words (BoW), a useful model in natural language processing, is used to extract features from text. After
the features are extracted, they can be used to train machine learning algorithms, because raw text cannot be
fed to ML applications directly.
Working of BoW Model
Initially, the model extracts a vocabulary from all the words in the documents. Then, using a document-term matrix,
it builds a model. In this way, the BoW model represents each document as a bag of words only; the order and
structure of the words are discarded.
Example
Now, by considering these two sentences, we have the following 14 distinct words −
This
is
an
example
bag
of
words
model
we
can
extract
features
by
using
Let us look into the following Python script which will build a BoW model in NLTK.
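The vocabulary and document-term matrix can be sketched in plain Python. This is a minimal sketch under stated assumptions: the two sentences are reconstructed guesses that contain exactly the 14 words listed above, the tokenizer is a simple lowercase-and-split, and vocabulary indices are assigned in alphabetical order.

```python
from collections import Counter

# Assumed example sentences, chosen to contain the 14 distinct words listed.
sentences = [
    "This is an example of bag of words model",
    "We can extract features by using bag of words model",
]

# Lowercase and split on whitespace: a deliberately simple tokenizer.
tokenized = [sentence.lower().split() for sentence in sentences]

# Vocabulary: each distinct word mapped to its index in alphabetical order.
vocabulary = {
    word: index
    for index, word in enumerate(sorted({w for doc in tokenized for w in doc}))
}

# Document-term matrix: one row per sentence, one count per vocabulary word.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is the numeric form of one sentence. The word "of" appears twice in the first sentence, so its column holds 2 there.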
Output
{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using':11
}
Generally, documents are grouped into topics, and topic modeling is a technique for identifying the patterns in a text
that correspond to a particular topic. In other words, topic modeling is used to uncover abstract themes or hidden
structure in a given set of documents.
Text Classification
Classification can be improved by topic modeling because it groups similar words together rather than using each
word separately as a feature.
Recommender Systems
Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms; it uses probabilistic graphical
models to implement topic modeling.
Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based upon linear algebra and applies
SVD (Singular Value Decomposition) to the document-term matrix.
Non-Negative Matrix Factorization (NMF) − Like LSA, it is also based upon linear algebra.
Python File I/O - Read and Write Files
In Python, the io module supports three types of I/O: raw binary files, buffered binary files, and text files.
The canonical way to create a file object is by using the open() function. File operations are performed in the
following three steps:
1. Open the file to get the file object using the built-in open() function. There are different access modes,
which you can specify while opening a file using the open() function.
2. Perform read, write, append operations using the file object retrieved from the open() function.
3. Close and dispose of the file object.
Reading File
File object includes the following methods to read data from the file.
read(chars): reads the specified number of characters starting from the current position.
readline(): reads the characters starting from the current reading position up to a newline character.
readlines(): reads all lines until the end of file and returns a list object.
The following C:\[Link] file will be used in all the examples of reading and writing files.
C:\[Link]
This is the first line.
This is the second line.
This is the third line.
The following example performs the read operation using the read(chars) method.
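A minimal sketch of read(chars) follows. Instead of the C:\ path used in these notes, it writes the three sample lines to a temporary file first; the file name sample.txt and the variable names are assumptions.

```python
import os
import tempfile

# Hypothetical sample file written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("This is the first line.\n"
            "This is the second line.\n"
            "This is the third line.")

f = open(path)            # default mode is 'r' (read, text)
first_four = f.read(4)    # reads only the next 4 characters
rest = f.read()           # reads from the current position to EOF
f.close()                 # release the file handle

print(repr(first_four))
print(repr(rest))
```

The second read() continues from where the first one stopped, which shows that the file object keeps track of the current position.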
Above, f = open('C:\[Link]') opens the [Link] in the default read mode from the current directory and returns
a file object. [Link]() function reads all the content until EOF as a string. If you specify the char size argument in the
read(chars) method, then it will read that many chars only. [Link]() will flush and close the stream.
Reading a Line
'This is the second line.\n'
>>> line3 = [Link]() # reading a line
>>> line3
'This is the third line.'
>>> line4 = [Link]() # reading a line
>>> line4
''
>>> [Link]() # closing file object
As you can see, we have to open the file in 'r' mode. The readline() method will return the first line, and then will
point to the second line in the file.
The file object has an inbuilt iterator. The following program reads the given file line by line until StopIteration is
raised, i.e., the EOF is reached.
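The iterator behaviour can be sketched as follows, using a temporary file in place of the sample file (the path and variable names are assumptions). The explicit next() loop shows the StopIteration signal; the for loop afterwards does the same work implicitly.

```python
import os
import tempfile

# Hypothetical sample file written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("This is the first line.\n"
            "This is the second line.\n"
            "This is the third line.")

lines = []
f = open(path)
while True:
    try:
        line = next(f)        # the file object is its own iterator
    except StopIteration:     # raised once EOF is reached
        break
    lines.append(line)
f.close()

# A for loop handles StopIteration for us and reads the same lines.
with open(path) as f:
    same_lines = [line for line in f]

print(lines)
```

In everyday code the for loop form is usually preferred, since it is shorter and the with statement closes the file automatically.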
Writing to a File
write(s): Write the string s to the stream and return the number of characters written.
writelines(lines): Write a list of lines to the stream. Each line must have a separator at the end of it.
The following creates a new file if it does not exist or overwrites an existing file.
# reading file
>>> f = open('C:\[Link]','r')
>>> [Link]()
'Hello'
>>> [Link]()
In the above example, the f=open("[Link]","w") statement opens [Link] in write mode; the open() function
returns the file object and assigns it to a variable f. 'w' specifies that the file should be writable. Next,
[Link]("Hello") overwrites any existing content of the [Link] file. It returns the number of characters written to
the file, which is 5 in the above example. In the end, [Link]() closes the file object.
The following appends content at the end of an existing file by passing 'a' or 'a+' mode to the open() function.
# reading file
>>> f = open('C:\[Link]','r')
>>> [Link]()
'Hello World!'
>>> [Link]()
Python provides the writelines() method to save the contents of a list object in a file. Since the newline character is
not automatically written to the file, it must be provided as a part of the string.
A file opened in "w" or "a" mode can only be written to and cannot be read from. Similarly, "r" mode
allows reading only, not writing. To perform read and append operations on the same file object, use "a+" mode.
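The following sketch combines writelines() with the "a+" mode. The temporary file path and the sample strings are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

lines = ["Hello world.\n", "Welcome to Python.\n"]
with open(path, "w") as f:
    f.writelines(lines)           # strings are written as-is; no newline is added

with open(path, "a+") as f:       # append mode that also allows reading
    f.write("Appended line.\n")   # always lands at the end of the file
    f.seek(0)                     # move back to the start before reading
    content = f.read()

print(content)
```

Without the seek(0) call the read would start at the end of the file and return an empty string.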
Writing to a Binary File
The open() function opens a file in text format by default. To open a file in binary format, add 'b' to the mode
parameter. Hence the "rb" mode opens the file in binary format for reading, while the "wb" mode opens the file in
binary format for writing. Unlike text files, binary files are not human-readable. When opened using any text editor,
the data is unrecognizable.
The following code stores a list of numbers in a binary file. The list is first converted into a byte array before
writing. The built-in function bytearray() returns a mutable sequence of bytes built from the object.
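A minimal sketch of this binary round trip is shown below. The file name and the number list are assumptions; note that bytearray() accepts only integers from 0 to 255.

```python
import os
import tempfile

# Hypothetical binary file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "numbers.bin")

numbers = [5, 10, 15, 20, 25]
data = bytearray(numbers)         # each value must be in the range 0..255

with open(path, "wb") as f:       # 'wb' = write, binary
    written = f.write(data)       # returns the number of bytes written

with open(path, "rb") as f:       # 'rb' = read, binary
    restored = list(f.read())     # bytes -> list of integers

print(written, restored)
```

Reading the file back in "rb" mode returns a bytes object, and converting it to a list recovers the original numbers.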
Exception Handling in Python
The cause of an exception is often external to the program itself, for example an incorrect input or a malfunctioning
IO device. Because the program terminates abruptly on encountering an unhandled exception, it may cause damage
to system resources, such as files. Hence, exceptions should be properly handled so that an abrupt termination of
the program is prevented.
Python uses try and except keywords to handle exceptions. Both keywords are followed by indented blocks.
Syntax:
try:
    #statements in try block
except:
    #executed when error in try block
The try: block contains one or more statements which are likely to encounter an exception. If the statements in this
block are executed without an exception, the subsequent except: block is skipped.
If the exception does occur, the program flow is transferred to the except: block. The statements in the except: block
are meant to handle the cause of the exception appropriately. For example, returning an appropriate error message.
You can specify the type of exception after the except keyword. The subsequent block will be executed only if the
specified exception occurs. There may be multiple except clauses with different exception types in a single try
block. If the type of exception doesn't match any of the except blocks, it will remain unhandled and the program will
terminate.
The statements after the except block continue to be executed, regardless of whether an exception was
encountered or not.
The following example raises a TypeError when we try to add an integer and a string. Because the except clause
names TypeError, its block runs only for that exception type.
try:
    a=5
    b='0'
    print (a+b)
except TypeError:
    print('Unsupported operation')
print ("Out of try except blocks")
Output
Unsupported operation
Out of try except blocks
As mentioned above, a single try block may have multiple except blocks. The following example uses two except
blocks to process two different exception types:
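A minimal sketch with two except blocks follows. Wrapping the division in a function named safe_divide is an assumption made so that each case can be tried in turn; the messages are illustrative.

```python
def safe_divide(a, b):
    """Divide a by b, reporting which except block handled a failure."""
    try:
        return a / b
    except TypeError:
        return "Unsupported operation"
    except ZeroDivisionError:
        return "Division by zero not allowed"

print(safe_divide(5, 0))     # handled by the ZeroDivisionError block
print(safe_divide(5, '0'))   # handled by the TypeError block
print(safe_divide(5, 2))     # no exception, so the quotient is returned
```

Python checks the except clauses from top to bottom and runs only the first one whose exception type matches.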
However, if variable b is set to '0', TypeError will be encountered and processed by corresponding except block.
In Python, keywords else and finally can also be used along with the try and except clauses. While the except block
is executed if the exception occurs inside the try block, the else block gets processed if the try block is found to be
exception free.
Syntax:
try:
    #statements in try block
except:
    #executed when error in try block
else:
    #executed if try block is error-free
finally:
    #executed irrespective of exception occurred or not
The finally block consists of statements which should be processed regardless of an exception occurring in the try
block or not. As a consequence, the error-free try block skips the except clause and enters the finally block before
going on to execute the rest of the code. If, however, there's an exception in the try block, the appropriate except
block will be processed, and the statements in the finally block will be processed before proceeding to the rest of the
code.
The example below accepts two numbers from the user and performs their division. It demonstrates the uses of else
and finally blocks.
Example: try, except, else, finally blocks
try:
    print('try block')
    x=int(input('Enter a number: '))
    y=int(input('Enter another number: '))
    z=x/y
except ZeroDivisionError:
    print("except ZeroDivisionError block")
    print("Division by 0 not accepted")
else:
    print("else block")
    print("Division = ", z)
finally:
    print("finally block")
    x=0
    y=0
print ("Out of try, except, else and finally blocks." )
The first run is a normal case. The output of the else and finally blocks is displayed because the try block is error-free.
Output
try block
Enter a number: 10
Enter another number: 2
else block
Division = 5.0
finally block
Out of try, except, else and finally blocks.
The second run is a case of division by zero, hence, the except block and the finally block are executed, but the else
block is not executed.
Output
try block
Enter a number: 10
Enter another number: 0
except ZeroDivisionError block
Division by 0 not accepted
finally block
Out of try, except, else and finally blocks.
In the third run, an uncaught exception occurs. The finally block is still executed, but the program terminates
and does not execute the code after the finally block.
Output
try block
Enter a number: 10
Enter another number: xyz
finally block
Traceback (most recent call last):
File "C:\python36\codes\[Link]", line 3, in <module>
y=int(input('Enter another number: '))
ValueError: invalid literal for int() with base 10: 'xyz'
Typically, the finally clause is the ideal place for cleanup operations, for example closing a file
irrespective of errors in the read/write operations. This will be dealt with in the next chapter.
Raise an Exception
Python also provides the raise keyword to be used in the context of exception handling. It causes an exception to be
generated explicitly. Built-in errors are raised implicitly. However, a built-in or custom exception can be forced
during execution.
The following code accepts a number from the user. The try block raises a ValueError exception if the number is
outside the allowed range.
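The idea can be sketched as below. To keep the example runnable without keyboard input, the number is passed to a function instead of being read with input(); the function name check_range and the parameter upper are assumptions.

```python
def check_range(x, upper=100):
    """Raise ValueError for out-of-range input and report the outcome."""
    try:
        if x > upper:
            raise ValueError(x)   # force the exception explicitly
    except ValueError:
        return f"{x} is out of allowed range"
    else:
        return f"{x} is within the allowed range"

print(check_range(200))
print(check_range(50))
```

The raise statement transfers control to the matching except block exactly as an implicitly raised built-in error would.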
Output
Enter a number upto 100: 200
200 is out of allowed range
Enter a number upto 100: 50
50 is within the allowed range
Here, the raised exception is of type ValueError. However, you can also define a custom exception type to be raised.
Testing Linear Regression Assumptions in Python
Checking model assumptions is like commenting code. Everybody should be doing it often, but it sometimes ends
up being overlooked in reality. A failure to do either can result in a lot of time being confused, going down rabbit
holes, and can have pretty serious consequences from the model not being interpreted correctly.
Linear regression is a fundamental tool that has distinct advantages over other regression algorithms. Due to its
simplicity, it’s an exceptionally quick algorithm to train, which typically makes it a good baseline algorithm for
common regression scenarios. More importantly, models trained with linear regression are the most interpretable
kind of regression models available, meaning it’s easier to take action from the results of a linear regression model.
However, if the assumptions are not satisfied, the interpretation of the results will not always be valid. This can be
very dangerous depending on the application.
This post contains code for tests on the assumptions of linear regression and examples with both a real-world dataset
and a toy dataset.
The Data
For our real-world dataset, we’ll use the Boston house prices dataset from the late 1970’s. The toy dataset will be
created using scikit-learn’s make_regression function which creates a dataset that should perfectly satisfy all of our
assumptions.
One thing to note is that I’m assuming outliers have been removed in this blog post. This is an important part of any
exploratory data analysis (which isn’t being performed in this post in order to keep it short) that should happen in
real world scenarios, and outliers in particular will cause significant issues with linear regression. See Anscombe’s
Quartet for examples of outliers causing issues with fitting linear regression models.
Here are the variable descriptions for the Boston housing dataset straight from the documentation:
ZN: Proportion of residential land zoned for lots over 25,000 [Link].
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
import numpy as np
import pandas as pd
import [Link] as plt
import seaborn as sns
from sklearn import datasets
%matplotlib inline
"""
Real-world data of Boston housing prices
Additional Documentation: [Link]
Attributes:
data: Features/predictors
label: Target/label/response variable
feature_names: Abbreviations of names of features
"""
# Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston = datasets.load_boston()
"""
Artificial linear data using the same number of features and observations as the
Boston housing prices dataset for assumption test comparison
"""
linear_X, linear_y = datasets.make_regression(n_samples=[Link][0],
                                               n_features=[Link][1],
                                               noise=75, random_state=46)
# Setting feature names to x1, x2, x3, etc. if they are not defined
linear_feature_names = ['X'+str(feature+1) for feature in range(linear_X.shape[1])]
df = [Link]([Link], columns=boston.feature_names)
df['HousePrice'] = [Link]
[Link]()
Initial Setup
Before we test the assumptions, we’ll need to fit our linear regression models. I have a master function for
performing all of the assumption testing at the bottom of this post that does this automatically, but to abstract the
assumption tests out to view them independently we’ll have to re-write the individual tests to take the trained model
as a parameter.
return df_results
The Assumptions
I) Linearity Assumption
This assumes that there is a linear relationship between the predictors (also called independent variables or features)
and the response variable (also called the dependent variable or label). This also assumes that the predictors are
additive.
Why it can happen: There may simply not be a linear relationship in the data. Modeling is about trying to
estimate a function that explains a process, and linear regression would not be a fitting estimator (pun intended) if
there is no linear relationship.
What it will affect: The predictions will be extremely inaccurate because our model is underfitting. This is a serious
violation that should not be ignored.
How to detect it: If there is only one predictor, this is pretty easy to test with a scatter plot. Most cases aren’t so
simple, so we’ll have to modify this by using a scatter plot to see our predicted values versus the actual values (in
other words, view the residuals). Ideally, the points should lie on or around a diagonal line on the scatter plot.
How to fix it: Either add polynomial terms to some of the predictors or apply nonlinear transformations. If
those do not work, try adding additional variables to help capture the relationship between the predictors and the
label.
Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.
We can see a relatively even spread around the diagonal line.
Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.
We can see in this case that there is not a perfect linear relationship. Our predictions are biased towards lower values
in both the lower end (around 5-10) and especially at the higher values (above 40).
II) Normality of the Error Terms
More specifically, this assumes that the error terms of the model are normally distributed. Linear regressions other
than Ordinary Least Squares (OLS) may also assume normality of the predictors or the label, but that is not the case
here.
Why it can happen: This can actually happen if either the predictors or the label are significantly non-normal.
Other potential reasons could include the linearity assumption being violated or outliers affecting our model.
What it will affect: A violation of this assumption could cause issues with either shrinking or inflating our
confidence intervals.
How to detect it: There are a variety of ways to do so, but we’ll look at both a histogram and the p-value from the
Anderson-Darling test for normality.
How to fix it: It depends on the root cause, but there are a few options. Nonlinear transformations of the variables,
excluding specific variables (such as long-tailed variables), or removing outliers may solve this problem.
def normal_errors_assumption(model, features, label, p_value_thresh=0.05):
    """
    Normality: Assumes that the error terms are normally distributed. If they are not,
    nonlinear transformations of variables may solve this.
    This assumption being violated primarily causes issues with the confidence intervals
    """
    from [Link] import normal_ad
    print('Assumption 2: The error terms are normally distributed', '\n')
    print()
    if p_value > p_value_thresh:
        print('Assumption satisfied')
    else:
        print('Assumption not satisfied')
        print()
        print('Confidence intervals will likely be affected')
        print('Try performing nonlinear transformations on variables')
As with our previous assumption, we’ll start with the linear dataset:
Assumption satisfied
Assumption not satisfied
This isn’t ideal, and we can see that our model is biasing towards under-estimating.
III) No Multicollinearity Among Predictors
This assumes that the predictors used in the regression are not correlated with each other. This won’t render our
model unusable if violated, but it will cause issues with the interpretability of the model.
Why it can happen: A lot of data is just naturally correlated. For example, if trying to predict a house price with
square footage, the number of bedrooms, and the number of bathrooms, we can expect to see correlation between
those three variables because bedrooms and bathrooms make up a portion of square footage.
What it will affect: Multicollinearity causes issues with the interpretation of the coefficients. Specifically, you can
interpret a coefficient as “an increase of 1 in this predictor results in a change of (coefficient) in the response
variable, holding all other predictors constant.” This becomes problematic when multicollinearity is present because
we can’t hold correlated predictors constant. Additionally, it increases the standard error of the coefficients, which
results in them potentially showing as statistically insignificant when they might actually be significant.
How to detect it: There are a few ways, but we will use a heatmap of the correlation as a visual aid and examine the
variance inflation factor (VIF).
How to fix it: This can be fixed by either removing predictors with a high variance inflation factor (VIF) or
performing dimensionality reduction.
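To show where a VIF value comes from, here is a small sketch for the simplest case of two predictors, where VIF = 1 / (1 - R^2) and R^2 is the squared correlation between them. This is a plain Python illustration, not the code used in this post: the function names and the housing-style numbers are made up, and with more than two predictors each R^2 would come from regressing one predictor on all the others.

```python
from statistics import mean

def r_squared_simple(x, y):
    """R^2 from regressing y on a single predictor x (squared correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors both share the same value."""
    return 1.0 / (1.0 - r_squared_simple(x1, x2))

# Hypothetical predictors: bedrooms roughly tracks square footage.
square_feet = [800, 1000, 1200, 1500, 1800, 2200]
bedrooms = [1, 2, 2, 3, 3, 4]
unrelated = [3, 1, 4, 1, 5, 2]

vif_correlated = vif_two_predictors(square_feet, bedrooms)
vif_unrelated = vif_two_predictors(square_feet, unrelated)
print(round(vif_correlated, 2), round(vif_unrelated, 2))
```

The strongly correlated pair lands above the "possible multicollinearity" threshold of 10 used in this post, while the unrelated pair stays close to 1.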
def multicollinearity_assumption(model, features, label, feature_names=None):
    """
    Multicollinearity: Assumes that predictors are not correlated with each other. If there is
    correlation among the predictors, then either remove predictors with high
    Variance Inflation Factor (VIF) values or perform dimensionality reduction
    """
    if definite_multicollinearity == 0:
        if possible_multicollinearity == 0:
            print('Assumption satisfied')
        else:
            print('Assumption possibly satisfied')
            print()
            print('Coefficient interpretability may be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')
    else:
        print('Assumption not satisfied')
        print()
        print('Coefficient interpretability will be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')
Variance Inflation Factors (VIF)
> 10: An indication that multicollinearity may be present
> 100: Certain multicollinearity among the variables
-------------------------------------
X1: 1.030931170297102
X2: 1.0457176802992108
X3: 1.0418076962011933
X4: 1.0269600632251443
X5: 1.0199882018822783
X6: 1.0404194675991594
X7: 1.0670847781889177
X8: 1.0229686036798158
X9: 1.0292923730360835
X10: 1.0289003332516535
X11: 1.052043220821624
X12: 1.0336719449364813
X13: 1.0140788728975834
Assumption satisfied
Everything looks peachy keen. Onto the Boston dataset:
10 cases of possible multicollinearity
0 cases of definite multicollinearity
This isn’t quite as egregious as our normality assumption violation, but there is possible multicollinearity for most of
the variables in this dataset.
IV) No Autocorrelation of the Error Terms
This assumes no autocorrelation of the error terms. Autocorrelation being present typically indicates that we are
missing some information that should be captured by the model.
Why it can happen: In a time series scenario, there could be information about the past that we aren’t capturing. In
a non-time series scenario, our model could be systematically biased by either under or over predicting in certain
conditions. Lastly, this could be a result of a violation of the linearity assumption.
How to detect it: We will perform a Durbin-Watson test to determine if either positive or negative correlation is
present. Alternatively, you could create plots of residual autocorrelations.
How to fix it: Adding lag variables is a simple fix for this problem. Alternatively, interaction terms, additional
variables, or additional transformations may fix this.
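The Durbin-Watson statistic itself is easy to compute by hand: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 suggesting no autocorrelation. The sketch below is a plain Python illustration under stated assumptions: the residual lists are made up, and the 1.5 and 2.5 cut-offs are a rule of thumb rather than a formal test.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def describe(statistic, low=1.5, high=2.5):
    """Map the statistic to a verdict using assumed rule-of-thumb cut-offs."""
    if statistic < low:
        return "Signs of positive autocorrelation"
    if statistic > high:
        return "Signs of negative autocorrelation"
    return "Little to no autocorrelation"

# Hypothetical residuals: a run of positives followed by a run of negatives,
# and a series that flips sign at every step.
trending = [1.0, 1.2, 0.9, 1.1, -1.0, -1.1, -0.9, -1.2]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

for name, series in (("trending", trending), ("alternating", alternating)):
    statistic = durbin_watson(series)
    print(name, round(statistic, 2), describe(statistic))
```

Residuals that stay on one side for long stretches push the statistic toward 0, while residuals that keep flipping sign push it toward 4.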
print('Assumption not satisfied')
else:
print('Little to no autocorrelation', '\n')
print('Assumption satisfied')
Assumption satisfied
We’re having signs of positive autocorrelation here, but we should expect this since we know our model is
consistently under-predicting and our linearity assumption is being violated. Since this isn’t a time series dataset, lag
variables aren’t possible. Instead, we should look into either interaction terms or additional transformations.
V) Homoscedasticity/Heteroscedasticity
This assumes homoscedasticity, which is the same variance within our error terms. Heteroscedasticity, the violation
of homoscedasticity, occurs when we don’t have an even variance across the error terms.
Why it can happen: Our model may be giving too much weight to a subset of the data, particularly where the error
variance was the largest.
What it will affect: Significance tests for coefficients due to the standard errors being biased. Additionally, the
confidence intervals will be either too wide or too narrow.
How to detect it: Plot the residuals and see if the variance appears to be uniform.
How to fix it: Heteroscedasticity (can you tell I like the scedasticity words?) can be solved either by using weighted
least squares regression instead of the standard OLS or transforming either the dependent or highly skewed
variables. Performing a log transformation on the dependent variable is not a bad place to start.
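A residual plot can be complemented by a quick numeric check. The sketch below orders the residuals by fitted value, splits them into a lower and an upper half, and compares the two variances. This is only a simplified version of the idea behind formal tests such as Goldfeld-Quandt, not the test itself; the function name variance_ratio and the data are hypothetical.

```python
import statistics


def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    return statistics.pvariance(high) / statistics.pvariance(low)


# Hypothetical data: the spread of the residuals grows with the fitted value
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]

ratio = variance_ratio(fitted, residuals)
print(ratio)  # a ratio far from 1 hints at heteroscedasticity
```

A ratio close to 1 is what roughly constant variance would look like; how far from 1 counts as a problem is a judgment call that this sketch does not settle.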
There don’t appear to be any obvious problems with that.
We can’t see a fully uniform variance across our residuals, so this is potentially problematic. However, we know
from our other tests that our model has several issues and is under predicting in many cases.
Conclusion
We can clearly see that a linear regression model on the Boston dataset violates a number of assumptions which
cause significant problems with the interpretation of the model itself. It’s not uncommon for assumptions to be
violated on real-world data, but it’s important to check them so we can either fix them and/or be aware of the flaws
in the model for the presentation of the results or the decision making process.
It is dangerous to make decisions on a model that has violated assumptions because those decisions are effectively
being formulated on made-up numbers. Not only that, but it also provides a false sense of security due to trying to be
empirical in the decision making process. Empiricism requires due diligence, which is why these assumptions exist
and are stated up front. Hopefully this code can help ease the due diligence process and make it less painful.
This function performs all of the assumption tests listed in this blog post:
# Setting feature names to X1, X2, X3, etc. if they are not defined
if feature_names is None:
    feature_names = ['X'+str(feature+1) for feature in range(features.shape[1])]
model.fit(features, label)
def linear_assumption():
    """
    Linearity: Assumes there is a linear relationship between the predictors and
               the response variable. If not, either a polynomial term or another
               algorithm should be used.
    """
    print('\n=======================================================================================')
    print('Assumption 1: Linear Relationship between the Target and the Features')
    print('Checking with a scatter plot of actual vs. predicted. Predictions should follow the diagonal line.')
def normal_errors_assumption(p_value_thresh=0.05):
    """
    Normality: Assumes that the error terms are normally distributed. If they are not,
    nonlinear transformations of variables may solve this.

    This assumption being violated primarily causes issues with the confidence intervals
    """
    from statsmodels.stats.diagnostic import normal_ad
    print('\n=======================================================================================')
    print('Assumption 2: The error terms are normally distributed')
    print()
    print()
    if p_value > p_value_thresh:
        print('Assumption satisfied')
    else:
        print('Assumption not satisfied')
        print()
        print('Confidence intervals will likely be affected')
        print('Try performing nonlinear transformations on variables')
def multicollinearity_assumption():
    """
    Multicollinearity: Assumes that predictors are not correlated with each other. If there is
                       correlation among the predictors, then either remove predictors with high
                       Variance Inflation Factor (VIF) values or perform dimensionality reduction
    """
    # Plotting the heatmap
    plt.figure(figsize = (10,8))
    sns.heatmap(pd.DataFrame(features, columns=feature_names).corr(), annot=True)
    plt.title('Correlation of Variables')
    plt.show()
    if definite_multicollinearity == 0:
        if possible_multicollinearity == 0:
            print('Assumption satisfied')
        else:
            print('Assumption possibly satisfied')
            print()
            print('Coefficient interpretability may be problematic')
            print('Consider removing variables with a high Variance Inflation Factor (VIF)')
    else:
        print('Assumption not satisfied')
        print()
        print('Coefficient interpretability will be problematic')
        print('Consider removing variables with a high Variance Inflation Factor (VIF)')
def autocorrelation_assumption():
    """
    Autocorrelation: Assumes that there is no autocorrelation in the residuals. If there is
                     autocorrelation, then there is a pattern that is not explained due to
                     the current value being dependent on the previous value.
                     This may be resolved by adding a lag variable of either the dependent
                     variable or some of the predictors.
    """
    from statsmodels.stats.stattools import durbin_watson
    print('\n=======================================================================================')
    print('Assumption 4: No Autocorrelation')
    print('\nPerforming Durbin-Watson Test')
    print('Values of 1.5 < d < 2.5 generally show that there is no autocorrelation in the data')
    print('0 to 2< is positive autocorrelation')
    print('>2 to 4 is negative autocorrelation')
    print('-------------------------------------')
    durbinWatson = durbin_watson(df_results['Residuals'])
    print('Durbin-Watson:', durbinWatson)
    if durbinWatson < 1.5:
        print('Signs of positive autocorrelation', '\n')
        print('Assumption not satisfied', '\n')
        print('Consider adding lag variables')
    elif durbinWatson > 2.5:
        print('Signs of negative autocorrelation', '\n')
        print('Assumption not satisfied', '\n')
        print('Consider adding lag variables')
    else:
        print('Little to no autocorrelation', '\n')
        print('Assumption satisfied')
def homoscedasticity_assumption():
    """
    Homoscedasticity: Assumes that the errors exhibit constant variance
    """
    print('\n=======================================================================================')
    print('Assumption 5: Homoscedasticity of Error Terms')
    print('Residuals should have relative constant variance')

linear_assumption()
normal_errors_assumption()
multicollinearity_assumption()
autocorrelation_assumption()
homoscedasticity_assumption()
1. Linearity – There should be a linear relationship between the dependent and independent variables. This is the
most logical and essential assumption of Linear Regression. Visually, it can be checked by making a scatter plot
between the dependent and independent variables.
2. Homoscedasticity – Constant Error Variance, i.e., the variance of the error term is the same across all values of the
independent variable. It can be easily checked by making a scatter plot between Residual and Fitted Values. If there
is no trend, then the variance of the error term is constant.
import seaborn as sns
[Link](x ="expected",
y = "residual", data = result)
A close observation of the above plot shows that the variance of residual term is relatively more for higher fitted
values. Note: In many real-life scenarios, it is practically difficult to ensure all assumptions of linear regression will
hold 100%
3. Normal Error – The error term should be normally distributed. QQ plot is a good way of checking normality. If
the plot forms a line that is roughly straight then we can assume there is normality.
import statsmodels.api as sm
4. No Autocorrelation of residual – This is typically applicable to time series data. Autocorrelation means the
current value of Yt is dependent on historic value of Yt-n with n as lag period. Durbin-Watson test is a quick way to
find if there is any autocorrelation.
6. Exogeneity – Exogeneity is a standard assumption of regression and it means that each X variable does not
depend on the dependent variable Y, rather Y depends on the Xs and on Error (e). In simple terms X is completely
unaffected by Y.
7. Sample Size – In linear regression, it is desirable that the number of records should be at least 10 or more times
the number of independent variables to avoid the curse of dimensionality.
Skewness & Kurtosis
If you ask Mother Nature what her favorite probability distribution is, the answer will be ‘Normal’. The reason
behind it is the existence of chance/random causes that influence every known variable on earth. What if a process
is under the influence of assignable/significant causes as well? This is surely going to modify (distort) the shape of
the distribution, and that’s when we need a measure like skewness to capture it. Below is a normal distribution
visual, also known as a bell curve. It is a symmetrical graph with all measures of central tendency in the middle.
(Author, 2021)
But what if we encounter an asymmetrical distribution, how do we detect the extent of asymmetry? Let’s see
visually what happens to the measures of central tendency when we encounter such graphs.
( Author, 2021)
Notice how these central tendency measures tend to spread when the normal distribution is distorted. For the
nomenclature just follow the direction of the tail — For the left graph since the tail is to the left, it is left-skewed
(negatively skewed) and the right graph has the tail to the right, so it is right-skewed (positively skewed).
How about deriving a measure that captures the horizontal distance between the Mode and the Mean of the
distribution? It’s intuitive to think that the higher the skewness, the more apart these measures will be. So let’s jump
to the formula for skewness now:
Division by Standard Deviation enables the relative comparison among distributions on the same standard scale.
Since mode calculation as a central tendency for small data sets is not recommended, so to arrive at a more robust
formula for skewness we will replace mode with the derived calculation from the median and the mean.
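The two measures described above correspond to Pearson's first and second skewness coefficients: (mean - mode) / standard deviation, and 3 * (mean - median) / standard deviation. The second follows from the empirical relationship mode ≈ 3 * median - 2 * mean. A minimal sketch using only the standard library, with made-up sample data and the sample standard deviation as one possible choice:

```python
import statistics


def pearson_first_skewness(data):
    """(mean - mode) / standard deviation."""
    return (statistics.mean(data) - statistics.mode(data)) / statistics.stdev(data)


def pearson_second_skewness(data):
    """3 * (mean - median) / standard deviation."""
    return 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)


# Hypothetical sample with a long tail to the right
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Its mirror image has the tail to the left
left_skewed = [-x for x in right_skewed]

first_right = pearson_first_skewness(right_skewed)
second_right = pearson_second_skewness(right_skewed)
second_left = pearson_second_skewness(left_skewed)
print(first_right, second_right)   # both positive: right-skewed
print(second_left)                 # negative: left-skewed
```

The sign of either coefficient follows the direction of the tail, which matches the naming convention described earlier.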
( Author, 2021)
Think of punching or pulling the normal distribution curve from the top, what impact will it have on the shape of the
distribution? Let’s visualize:
(Author, 2021)
So there are two things to notice: the peak of the curve and the tails of the curve. The kurtosis measure is
responsible for capturing this phenomenon. The formula for kurtosis calculation is complex (4th moment in the
moment-based calculation) so we will stick to the concept and its visual clarity. A normal distribution has a kurtosis
of 3 and is called mesokurtic. Distributions with kurtosis greater than 3 are called leptokurtic and those with kurtosis
less than 3 are called platykurtic. So the greater the value, the greater the peakedness. Kurtosis ranges from 1 to
infinity. As the kurtosis measure for a normal distribution is 3, we can calculate excess kurtosis by subtracting 3,
which keeps zero as the reference for a normal distribution. Excess kurtosis then varies from -2 to infinity.
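For readers who want to see the moment-based calculation, the sketch below computes kurtosis as the fourth central moment divided by the squared variance (population form), and excess kurtosis by subtracting 3. The function names and sample data are illustrative only.

```python
def kurtosis(data):
    """Fourth central moment divided by the squared variance (population form)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


def excess_kurtosis(data):
    """Kurtosis measured relative to the normal distribution's value of 3."""
    return kurtosis(data) - 3


# Two equally likely values: the lowest kurtosis a distribution can have
two_point = [-1, 1, -1, 1]
# Hypothetical sample with a single far outlier
with_outlier = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]

print(kurtosis(two_point))           # 1.0
print(excess_kurtosis(two_point))    # -2.0
print(kurtosis(with_outlier))        # well above 3, driven by the outlier
```

The two-point sample reproduces the lower bounds quoted above (1 for kurtosis, -2 for excess kurtosis), and the second sample shows how one outlier pushes the value up.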
(Author, 2021)
The topic of kurtosis has been controversial for decades. It has long been linked with peakedness, but the ultimate
verdict is that outliers (fatter tails) govern the kurtosis effect far more than the values near the mean (peak).
So we can conclude from the above discussions that the horizontal push or pull distortion of a normal distribution
curve gets captured by the Skewness measure and the vertical push or pull distortion gets captured by the Kurtosis
measure. Also, it is the impact of outliers that dominate the kurtosis effect which has its roots of proof sitting in the
fourth-order moment-based formula. I hope this blog helped you clarify the idea of Skewness & Kurtosis in a
simplified manner, watch out for more similar blogs in the future.
The Violin Plot is used to indicate the probability density of data at different values and it is quite similar to the
Matplotlib Box Plot.
Here is a figure showing common components of the Box Plot and Violin Plot:
Creation of the Violin Plot
The violinplot() method is used for the creation of the violin plot.
Parameters
dataset
This parameter denotes the array or sequence of vectors. It is the input data.
positions
This parameter is used to set the positions of the violins. In this, the ticks and limits are set automatically in
order to match the positions. It is an array-like structured data with the default as = [1, 2, …, n].
vert
This parameter contains the boolean value. If the value of this parameter is set to true then it will create a
vertical plot, otherwise, it will create a horizontal plot.
showmeans
This parameter contains a boolean value with False as its default value. If the value of this parameter is
True, then the means are rendered.
showextrema
This parameter contains a boolean value with True as its default value. If the value of this parameter is
True, then the extrema are rendered.
showmedians
This parameter contains a boolean value with False as its default value. If the value of this parameter is
True, then the medians are rendered.
quantiles
This is an array-like data structure with None as its default value. If the value of this parameter is not None,
it sets a list of floats in the interval [0, 1] for each violin, which stand for the quantiles that will be
rendered for that violin.
points
It is scalar in nature and is used to define the number of points to evaluate each of the Gaussian kernel
density estimations.
bw_method
This method is used to calculate the estimator bandwidth, for which there are many different ways of
calculation. The default rule used is Scott's Rule, but you can choose ‘silverman’, a scalar constant, or a
callable.
Now it's time to dive into some examples in order to clear up the concepts:
Below we have a simple example where we will create violin plots for different collections of data.
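import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
collectn_1 = np.random.normal(120, 10, 200)
collectn_2 = np.random.normal(150, 30, 200)
collectn_3 = np.random.normal(50, 20, 200)
collectn_4 = np.random.normal(100, 25, 200)
# Combine the four collections into one list
data_to_plot = [collectn_1, collectn_2, collectn_3, collectn_4]
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
bp = ax.violinplot(data_to_plot)
plt.show()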
The output will be as follows:
Sentiment Analysis Using Python
In today’s digital age, platforms like Twitter, Goodreads, and Amazon overflow with people’s
opinions, making it crucial for organizations to extract insights from this massive volume of data.
Sentiment Analysis in Python offers a powerful solution to this challenge. This technique, a subset
of Natural Language Processing (NLP), involves classifying texts into sentiments such as positive,
negative, or neutral. By employing various Python libraries and models, analysts can automate this
process efficiently.
Sentiment Analysis is a use case of Natural Language Processing (NLP) and comes under the
category of text classification. To put it simply, Sentiment Analysis involves classifying a text into
various sentiments, such as positive or negative, Happy, Sad or Neutral, etc. Thus, the ultimate goal
of sentiment analysis is to decipher the underlying mood, emotion, or sentiment of a text. This is also
referred to as Opinion Mining.
Sentiment analysis in Python typically works by employing natural language processing (NLP) techniques
to analyze and understand the sentiment expressed in text. The process involves several steps:
Text Preprocessing: The text cleaning process involves removing irrelevant information, such as
Tokenization: The text is divided into individual words or tokens to facilitate analysis.
Feature Extraction: The text extraction process involves extracting relevant features from the text,
Sentiment Classification: Machine learning algorithms or pre-trained models are used to classify the
sentiment of each text instance. Researchers achieve this through supervised learning, where they
train models on labeled data, or through pre-trained models that have learned sentiment patterns from
large datasets.
Post-processing: The sentiment analysis results may undergo additional processing, such as
aggregating sentiment scores or applying threshold rules to classify sentiments as positive, negative,
or neutral.
Evaluation: Researchers assess the performance of the sentiment analysis model using evaluation
Various types of sentiment analysis can be performed, depending on the specific focus and objective of the
Document-Level Sentiment Analysis: This type of analysis determines the overall sentiment
expressed in a document, such as a review or an article. It aims to classify the entire text as positive,
negative, or neutral.
Sentence-Level Sentiment Analysis: Here, the sentiment of each sentence within a document is
analyzed. This type provides a more granular understanding of the sentiment expressed in different
text parts.
Aspect-Based Sentiment Analysis: This approach focuses on identifying and extracting the
sentiment associated with specific aspects or entities mentioned in the text. For example, in a product
review, the sentiment towards different features of the product (e.g., performance, design, usability)
Entity-Level Sentiment Analysis: This type of analysis identifies the sentiment expressed towards
specific entities or targets mentioned in the text, such as people, companies, or products. It helps
understand the sentiment associated with different entities within the same document.
Comparative Sentiment Analysis: This approach involves comparing the sentiment between
different entities or aspects mentioned in the text. It aims to identify the relative sentiment or
Sentiment Analysis Use Cases
We just saw how sentiment analysis can empower organizations with insights that can help them make data-
driven decisions. Now, let’s peep into some more use cases of sentiment analysis:
Social Media Monitoring for Brand Management: Brands can use sentiment analysis to gauge
their Brand’s public outlook. For example, a company can gather all Tweets with the company’s
mention or tag and perform sentiment analysis to learn the company’s public outlook.
reviews to see how well a product or service is doing in the market and make future decisions
accordingly.
Stock Price Prediction: Predicting whether the stocks of a company will go up or down is crucial
for investors. One signal comes from performing sentiment analysis on news headlines of
articles containing the company’s name. If the headlines about a particular organization
carry a positive sentiment, its stock price may be expected to rise, and vice versa.
Python is one of the most powerful tools for data science tasks, and it offers a
multitude of ways to perform sentiment analysis. The most popular ones are listed here:
Using Text Blob
Using Vader
Using Bag of Words Vectorization-based Models
Using LSTM-based Models
Using Transformer-based Models
Note: For the purpose of demonstrations of methods 3 & 4 (Using Bag of Words Vectorization-based Models
and Using LSTM-based Models ) sentiment analysis has been used. It comprises more than 5000 text labelled
as positive, negative or neutral. The dataset lies under the Creative Commons license.
Text Blob is a Python library for Natural Language Processing. Using Text Blob for sentiment analysis is
quite simple. It takes text as an input and can return polarity and subjectivity as outputs.
Polarity determines the sentiment of the text. Its values lie in [-1,1] where -1 denotes a highly
Subjectivity determines whether a text input is a factual information or a personal opinion. Its value
lies between [0,1] where a value closer to 0 denotes a piece of factual information and a value closer
Here are the steps to perform sentiment analysis in Python using Text Blob.
Step 1: Installation
pip install textblob
Writing code for sentiment analysis using TextBlob is fairly simple. Just import the TextBlob object and pass
# Determining the Polarity
p_1 = TextBlob(text_1).sentiment.polarity
p_2 = TextBlob(text_2).sentiment.polarity
Output
Using VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analyzer that is
tuned for social media text. Just like Text Blob, its usage in Python is pretty simple. We’ll see its usage
Step 1: Installation
pip install vaderSentiment
Firstly, we need to create an object of the SentimentIntensityAnalyzer class; then we need to pass the text to
Output:
Sentiment of text 1: {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
Sentiment of text 2: {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}
As we can see, a VaderSentiment object returns a dictionary of sentiment scores for the text to be analyzed.
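The compound value in these dictionaries is the one usually turned into a label. A common convention, suggested in the VADER project documentation, treats scores of 0.05 and above as positive and -0.05 and below as negative. The helper below is a hypothetical sketch of that post-processing step applied to the two outputs shown above; it does not call the library itself.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a sentiment label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the output above
scores_1 = {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compound': 0.5719}
scores_2 = {'neg': 0.508, 'neu': 0.492, 'pos': 0.0, 'compound': -0.4767}

label_1 = label_from_compound(scores_1)
label_2 = label_from_compound(scores_2)
print(label_1)  # positive
print(label_2)  # negative
```

The threshold is left as a parameter because the right cutoff depends on the data and how much neutral text it contains.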
In the two approaches discussed as yet i.e. Text Blob and Vader, we have simply used Python libraries to
perform sentiment analysis. Now we’ll discuss an approach wherein we’ll train our own model for the task.
The steps involved in performing sentiment analysis using the Bag of Words Vectorization method are as
follows:
Pre-Process the text of training data (Text pre-processing involves Normalization, Tokenization,
Create a Bag of Words for the pre-processed text data using the Count Vectorization or TF-IDF
Vectorization approach.
Train a suitable classification model on the processed data for sentiment classification.
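To make the Bag of Words step concrete, here is a small sketch of count vectorization written in plain Python. It only illustrates the idea and is not a replacement for sklearn's vectorizers; the function names and the two example sentences are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


docs = ["Good movie, good cast", "Bad movie"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['bad', 'cast', 'good', 'movie']
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. A classifier is then trained on these vectors.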
To build a sentiment analysis in python model using the BOW Vectorization Approach we need a labeled
dataset. As stated earlier, the dataset used for this demonstration has been obtained from Kaggle. We have
simply used sklearn’s count vectorizer to create the BOW. After, we trained a Multinomial Naive Bayes
import pandas as pd
data = pd.read_csv('Finance_data.csv')
# Pre-processing and Bag of Words vectorization using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1,1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['sentences'])
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data['feedback'], test_size=0.25,
                                                    random_state=5)
# Training the model
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)
# Calculating the accuracy score of the model
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ", accuracy_score)
Output:
The trained classifier can be used to predict the sentiment of any given text input.
Though we were able to obtain a decent accuracy score with the Bag of Words Vectorization method, it might
fail to yield the same results when dealing with larger datasets. This gives rise to the need to employ deep
learning-based models for training the sentiment analysis model.
For NLP tasks, we generally use RNN-based models since they are designed to deal with sequential data.
Here, we’ll train an LSTM (Long Short Term Memory) model using TensorFlow with Keras. The steps to
Pre-Process the text of training data (Text pre-processing involves Normalization, Tokenization,
Tokenizer is imported from keras.preprocessing.text and created, fitting it to the entire training text.
The text is converted to integer sequences using texts_to_sequences() and stored after padding to equal length.
These sequences are numerical/vectorized representations of the text, needed because raw text cannot be fed to the model directly.
The model is built using TensorFlow, including input, LSTM, and dense layers. Dropouts and
hyperparameters are adjusted for accuracy. In inner layers, we use ReLU or LeakyReLU activation
functions to avoid vanishing gradient problems, while in the output layer, we use Softmax or
Here, we have used the same dataset as we used in the case of the BOW approach. A training accuracy of
X = tokenizer.texts_to_sequences(data_cleaned['verified_reviews'].values)
X = pad_sequences(X)
# Model Building
model = Sequential()
model.add(Embedding(500, 120, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(704, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(352, activation='LeakyReLU'))
model.add(Dense(3, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
# Model Training
model.fit(X_train, y_train, epochs = 20, batch_size=32, verbose =1)
# Model Testing
model.evaluate(X_test,y_test)
Transformer-based models are one of the most advanced Natural Language Processing Techniques. They
follow an Encoder-Decoder-based architecture and employ the concepts of self-attention to yield impressive
results. Though one can always build a transformer model from scratch, it is quite tedious a task. Thus, we
can use pre-trained transformer models available on Hugging Face. Hugging Face is an open-source AI
community that offers a multitude of pre-trained models for NLP applications. You can use these models as
Step 1: Installation
pip install transformers
To perform any task using transformers, we first need to import the pipeline function from transformers.
Then, a pipeline object is created by calling that function with the task to be performed as an argument (i.e.
sentiment analysis in our case). We can also specify the model that we need to use to perform the task. Here,
since we have not mentioned the model to be used, the distilbert-base-uncased-finetuned-sst-2-english model
is used by default for sentiment analysis. You can check out the list of available tasks and models on the Hugging Face website.
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["It was the best of times.", "It was the worst of times."]
sentiment_pipeline(data)
Output
There is no single best library for sentiment analysis in Python; the right choice depends on your needs. Here’s a quick comparison:
NLTK: Powerful, versatile, good for multiple NLP tasks, but complex for sentiment analysis.
Pattern: More comprehensive analysis (comparatives, superlatives, fact/opinion), steeper learning curve.
Polyglot: Fast, multilingual support (136+ languages), ideal for multiple languages.
Key Takeaways
Python provides a versatile environment for performing sentiment analysis tasks due to its rich
We explored multiple approaches including Text Blob, VADER, Bag of Words, LSTM, and
The process involves text preprocessing, tokenization, feature extraction, and applying machine
We applied these methods to real-world examples like customer reviews and social media data to
Sentiment analysis helps organizations monitor brand perception, analyze customer feedback, and
With advancements in natural language processing, sentiment analysis in Python continues to evolve,
offering more accurate and sophisticated methods for understanding textual sentiment.
Object Oriented Programming
Object Oriented Programming empowers developers to build modular, maintainable and scalable applications. OOP
is a way of organizing code that uses objects and classes to represent real-world entities and their behavior. In OOP,
an object is a thing that holds specific data in its attributes and can perform certain actions using methods.
Organizes code into classes and objects.
Supports encapsulation to group data and methods together.
Enables inheritance for reusability and hierarchy.
Allows polymorphism for flexible method implementation.
Improves modularity, scalability and maintainability.
Class
A class is a blueprint for creating objects. It defines a set of attributes and
methods that the created objects (instances) can have. Some points on Python class:
Classes are created by keyword class.
Attributes are the variables that belong to a class.
Attributes are always public and can be accessed using the dot (.) operator. Example: Dog.species
Creating a Class
Here, class keyword indicates that we are creating a class followed by name of the class (Dog in this case).
class Dog:
    species = "Canine"  # Class attribute
Objects
An Object is an instance of a Class. It represents a specific implementation of the class and holds its own data. An
object consists of:
State: It is represented by the attributes and reflects the properties of an object.
Behavior: It is represented by the methods of an object and reflects the response of an object to other objects.
Identity: It gives a unique name to an object and enables one object to interact with other objects.
Creating Object
Creating an object in Python involves instantiating a class to create a new instance of that class. This process is also
referred to as object instantiation.
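class Dog:
    species = "Canine"  # Class attribute
    def __init__(self, name, age):
        self.name = name  # Instance attribute
        self.age = age  # Instance attribute

dog1 = Dog("Buddy", 3)
print(dog1.name)
print(dog1.species)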
Output
Buddy
Canine
Explanation:
dog1 = Dog("Buddy", 3): Creates an object of the Dog class with name as "Buddy" and age as 3.
dog1.name: Accesses the instance attribute name of the dog1 object.
dog1.species: Accesses the class attribute species of the dog1 object.
1. Inheritance
Inheritance allows a class (child class) to acquire properties and methods of another class (parent class). It supports
hierarchical classification and promotes code reuse.
2. Polymorphism
Polymorphism in Python means "same operation, different behavior." It allows functions or methods with the same
name to work differently depending on the type of object they are acting upon.
3. Encapsulation
Encapsulation is the bundling of data (attributes) and methods (functions) within a class, restricting access to some
components to control interactions. A class is an example of encapsulation, as it bundles its data (variables)
together with its member functions.
Encapsulation in Python
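A minimal sketch of encapsulation, assuming a made-up BankAccount class. Python does not enforce private attributes, so the leading underscore is only a convention that marks _balance as internal; the read-only property and the deposit method are the intended ways to interact with it.

```python
class BankAccount:
    def __init__(self, owner, balance=0):
        self.owner = owner          # public attribute
        self._balance = balance     # leading underscore: internal by convention

    @property
    def balance(self):
        """Read-only view of the internal balance."""
        return self._balance

    def deposit(self, amount):
        # All changes go through this method, so they can be validated
        if amount <= 0:
            raise ValueError("Deposit must be positive")
        self._balance += amount


account = BankAccount("Buddy", 100)
account.deposit(50)
print(account.balance)  # 150
```

Because balance is exposed as a property without a setter, an assignment such as account.balance = 0 raises an AttributeError.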
4. Data Abstraction
Abstraction hides the internal implementation details while exposing only the necessary functionality. It helps focus
on "what to do" rather than "how to do it."
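One common way to express abstraction in Python is the standard library's abc module. In the sketch below, the Shape and Rectangle classes are hypothetical: Shape states what every shape must provide (an area), while each subclass decides how to compute it.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self):
        """Subclasses decide how the area is computed."""


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


rect = Rectangle(3, 4)
print(rect.area())  # 12
```

Trying to create a Shape directly raises a TypeError, because the abstract method has no implementation.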
Class Properties
Properties are variables that belong to a class. They store data for each object created from the class.
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("Emil", 36)
print(p1.name)
print(p1.age)
Access Properties
class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

car1 = Car("Toyota", "Corolla")  # illustrative values
print(car1.brand)
print(car1.model)
Class Methods
Methods are functions that belong to a class. They define the behavior of objects created from the class.
Create a method in a class:
class Person:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print("Hello, my name is " + self.name)

p1 = Person("Emil")
p1.greet()
Python Inheritance
Inheritance allows us to define a class that inherits all the methods and properties from another class.
Parent class is the class being inherited from, also called base class.
Child class is the class that inherits from another class, also called derived class.
Any class can be a parent class, so the syntax is the same as creating any other class:
Example: Create a class named Person, with firstname and lastname properties, and a printname method:
class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def printname(self):
        print(self.firstname, self.lastname)

# Use the Person class to create an object, and then execute the printname method:
x = Person("John", "Doe")
x.printname()
To create a class that inherits the functionality from another class, send the parent class as a parameter when creating
the child class:
Example: Create a class named Student, which will inherit the properties and methods from the Person class:
class Student(Person):
    pass
Note: Use the pass keyword when you do not want to add any other properties or methods to the class.
Use the Student class to create an object, and then execute the printname method:
x = Student("Mike", "Olsen")
x.printname()
So far we have created a child class that inherits the properties and methods from its parent.
We want to add the __init__() function to the child class (instead of the pass keyword).
Note: The __init__() function is called automatically every time the class is being used to create a new object.
class Student(Person):
    def __init__(self, fname, lname):
        # add properties etc.
When you add the __init__() function, the child class will no longer inherit the parent's __init__() function.
Note: The child's __init__() function overrides the inheritance of the parent's __init__() function
To keep the inheritance of the parent's __init__() function, add a call to the parent's __init__() function:
class Student(Person):
    def __init__(self, fname, lname):
        Person.__init__(self, fname, lname)
Now we have successfully added the __init__() function, and kept the inheritance of the parent class, and we are
ready to add functionality in the __init__() function.
Python also has a super() function, which lets the child class call its parent's methods, including __init__(),
without naming the parent class:
Example:
class Student(Person):
    def __init__(self, fname, lname):
        super().__init__(fname, lname)
By using the super() function, you do not have to write the name of the parent class. Python looks up the method in
the parent for you, so the child keeps the properties that the parent's __init__() sets.
Add Properties
class Student(Person):
    def __init__(self, fname, lname):
        super().__init__(fname, lname)
        self.graduationyear = 2019
In the example above, the year 2019 should be a variable, and passed into the Student class when creating student
objects. To do so, add another parameter in the __init__() function:
Example: Add a year parameter, and pass the correct year when creating objects:
class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year
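Example: Add a method called welcome to the Student class:
class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year

    def welcome(self):
        print("Welcome", self.firstname, self.lastname, "to the class of", self.graduationyear)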
If you add a method in the child class with the same name as a method in the parent class, the child's version
overrides the parent's: calls on child objects use the child's method.
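Overriding can be sketched as follows. This is a minimal illustration, not part of the original notes: the describe() method and the graduationyear attribute name are assumptions introduced for the example, and the methods return strings instead of printing so the result is easy to inspect.

```python
class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def describe(self):
        # Parent version of the (hypothetical) describe() method
        return self.firstname + " " + self.lastname


class Student(Person):
    def __init__(self, fname, lname, year):
        super().__init__(fname, lname)
        self.graduationyear = year  # attribute name assumed for illustration

    def describe(self):
        # Same name as in Person, so this version is used for Student objects.
        # super().describe() still reaches the parent's version when needed.
        return super().describe() + ", class of " + str(self.graduationyear)


parent_text = Person("John", "Doe").describe()
child_text = Student("Mike", "Olsen", 2019).describe()
print(parent_text)
print(child_text)
```

Calling super().describe() inside the override is optional. It is useful when the child wants to extend the parent's behaviour rather than replace it completely.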
Python Polymorphism
The word "polymorphism" means "many forms", and in programming it refers to methods/functions/operators with
the same name that can be executed on many objects or classes.
Function Polymorphism
An example of a Python function that can be used on different objects is the len() function.
String
x = "Hello World!"
print(len(x))
Tuple
mytuple = ("apple", "banana", "cherry")
print(len(mytuple))
Dictionary
For dictionaries len() returns the number of key/value pairs in the dictionary:
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(len(thisdict))
Class Polymorphism
Polymorphism is often used in Class methods, where we can have multiple classes with the same method name.
For example, say we have three classes: Car, Boat, and Plane, and they all have a method called move():
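class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Drive!")

class Boat:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Sail!")

class Plane:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Fly!")

car1 = Car("Ford", "Mustang")         # Create a Car object
boat1 = Boat("Ibiza", "Touring 20")   # Create a Boat object
plane1 = Plane("Boeing", "747")       # Create a Plane object

for x in (car1, boat1, plane1):
    x.move()
Look at the for loop at the end. Because of polymorphism we can execute the same method for all three classes.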
Inheritance Class Polymorphism
What about classes whose child classes have methods with the same name? Can we use polymorphism there?
Yes. If we use the example above and make a parent class called Vehicle, and make Car, Boat, Plane child classes
of Vehicle, the child classes inherit the Vehicle methods, but can override them:
Example: Create a class called Vehicle and make Car, Boat, Plane child classes of Vehicle:
class Vehicle:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        print("Move!")

class Car(Vehicle):
    pass

class Boat(Vehicle):
    def move(self):
        print("Sail!")

class Plane(Vehicle):
    def move(self):
        print("Fly!")
Child classes inherit the properties and methods from the parent class.
In the example above you can see that the Car class is empty, but it inherits brand, model, and move() from Vehicle.
The Boat and Plane classes also inherit brand, model, and move() from Vehicle, but they both override
the move() method.
Because of polymorphism we can execute the same method for all classes.
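The following sketch shows that last point in action. It assumes the Vehicle hierarchy above, with two changes made only for illustration: move() returns its text instead of printing it, and the brand and model values are made up.

```python
class Vehicle:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def move(self):
        return "Move!"


class Car(Vehicle):
    pass  # inherits move() unchanged


class Boat(Vehicle):
    def move(self):
        return "Sail!"


class Plane(Vehicle):
    def move(self):
        return "Fly!"


# Example brand/model values are made up for illustration.
vehicles = [Car("Ford", "Mustang"), Boat("Ibiza", "Touring 20"), Plane("Boeing", "747")]

results = []
for v in vehicles:
    # The same call works for every object; each class decides what it does.
    results.append(v.move())
    print(v.brand, v.model, "->", v.move())
```

Car falls back to the parent's move(), while Boat and Plane use their own versions, even though the loop calls the method in exactly the same way.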
Encapsulation
Encapsulation means keeping data (properties) and methods together in a class, while controlling how the data can
be accessed from outside the class.
This prevents accidental changes to your data and hides the internal details of how your class works.
Private Properties
In Python, you can make properties private by using a double underscore __ prefix:
p1 = Person("Emil", 25)
print(p1.name)
print(p1.__age) # This will cause an error
Note: Private properties cannot be accessed directly from outside the class.
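A self-contained sketch of this behaviour, assuming a Person class with a public name and a private __age as in the snippet above. The try/except block is added here only so the script keeps running after the error.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age  # double underscore: treated as private


p1 = Person("Emil", 25)
print(p1.name)

try:
    print(p1.__age)
    access_failed = False
except AttributeError as err:
    # Outside the class there is no attribute literally called "__age"
    access_failed = True
    print("AttributeError:", err)
```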
Get Private Property Value
Example
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

    def get_age(self):
        return self.__age

p1 = Person("Tobias", 25)
print(p1.get_age())
The setter method can also validate the value before setting it:
Example
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

    def get_age(self):
        return self.__age

    def set_age(self, age):
        if age > 0:
            self.__age = age
        else:
            print("Age must be positive")

p1 = Person("Tobias", 25)
print(p1.get_age())
p1.set_age(26)
print(p1.get_age())
Example
class Student:
    def __init__(self, name):
        self.name = name
        self.__grade = 0

    def set_grade(self, grade):
        self.__grade = grade

    def get_grade(self):
        return self.__grade

    def get_status(self):
        if self.__grade >= 60:
            return "Passed"
        else:
            return "Failed"

student = Student("Emil")
student.set_grade(85)
print(student.get_grade())
print(student.get_status())
Protected Properties
Python also has a convention for protected properties using a single underscore _ prefix:
Example
class Person:
    def __init__(self, name, salary):
        self.name = name
        self._salary = salary # Protected property

p1 = Person("Linus", 50000)
print(p1.name)
print(p1._salary) # Can access, but shouldn't
Note: A single underscore _ is just a convention. It tells other programmers that the property is intended for internal
use, but Python doesn't enforce this restriction.
Private Methods
You can also make methods private using the double underscore prefix:
Example
class Calculator:
    def __init__(self):
        [Link] = 0
calc = Calculator()
[Link](10)
[Link](5)
print([Link])
# calc.__validate(5) # This would cause an error
Note: Just like private properties with double underscores, private methods cannot be called directly from outside
the class. The __validate method can only be used by other methods inside the class.
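The listing above lost several names during extraction, so here is a complete sketch of the same idea. Only the class name Calculator and the private method __validate come from the notes; the add() method and the result attribute are assumptions chosen for illustration.

```python
class Calculator:
    def __init__(self):
        self.result = 0  # attribute name assumed for this sketch

    def __validate(self, num):
        # Private helper: only methods inside the class call it
        return isinstance(num, (int, float))

    def add(self, num):  # method name assumed for this sketch
        if self.__validate(num):
            self.result += num
        else:
            print("Invalid number")


calc = Calculator()
calc.add(10)
calc.add(5)
print(calc.result)

try:
    calc.__validate(5)
    private_call_failed = False
except AttributeError:
    # From outside the class the name __validate cannot be found
    private_call_failed = True
```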
Name Mangling
When you use a double underscore __ prefix, Python automatically renames the attribute internally by adding
_ClassName in front.
For example, __age becomes _Person__age.
Example
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age

p1 = Person("Emil", 30)
print(p1._Person__age)
While you can access private properties using the mangled name, it's not recommended. It defeats the purpose of
encapsulation.
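One way to see the renaming directly is to look at the attribute names stored on the object. This is a small sketch using the built-in vars() function; the attribute_names variable is introduced only for the example.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.__age = age


p1 = Person("Emil", 30)

# The instance dictionary shows the names Python actually stored
attribute_names = sorted(vars(p1))
print(attribute_names)

# The mangled name works, but relying on it bypasses the intended interface
print(p1._Person__age)
```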
File Handling
The key function for working with files in Python is the open() function.
The open() function is most often called with two parameters: filename and mode.
There are four different modes for opening a file:
"r" - Read - Default value. Opens a file for reading, error if the file does not exist
"a" - Append - Opens a file for appending, creates the file if it does not exist
"w" - Write - Opens a file for writing, creates the file if it does not exist and overwrites its content if it does
"x" - Create - Creates the specified file, returns an error if the file exists
In addition, you can specify whether the file should be handled in text or binary mode:
"t" - Text - Default value. Text mode
"b" - Binary - Binary mode (e.g. images)
Syntax
To open a file for reading it is enough to specify the name of the file:
f = open("[Link]")
The code above is the same as:
f = open("[Link]", "rt")
Because "r" for read, and "t" for text are the default values, you do not need to specify them.
Assume we have the following file, located in the same folder as Python:
f = open("[Link]")
print(f.read())
If the file is located in a different location, you will have to specify the file path, like this:
Example
f = open("C:\\Users\\USER\\Desktop\\ML\\[Link]")
print(f.read())
Using the with statement
You can also use the with statement when opening a file:
Example
Using the with keyword:
with open("[Link]") as f:
    print(f.read())
Then you do not have to worry about closing your files, the with statement takes care of that.
Close Files
It is a good practice to always close the file when you are done with it.
If you are not using the with statement, you must write a close statement in order to close the file:
Example
Close the file when you are finished with it:
f = open("[Link]")
print(f.read())
f.close()
By default the read() method returns the whole text, but you can also specify how many characters you want to
return:
Example
Return the 5 first characters of the file:
with open("[Link]") as f:
    print(f.read(5))
Read Lines
You can return one line by using the readline() method:
Example
Read one line of the file:
with open("[Link]") as f:
    print(f.readline())
By calling readline() two times, you can read the first two lines.
By looping through the lines of the file, you can read the whole file, line by line:
Example
Loop through the file line by line:
with open("[Link]") as f:
    for x in f:
        print(x)
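The three ways of reading can be compared side by side. The sketch below is self-contained: it first writes a small throwaway file in a temporary folder, so the file name sample.txt and its three lines are made up for the example and are not the file used in the notes.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on any existing file.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(path) as f:
    first_five = f.read(5)        # the first 5 characters

with open(path) as f:
    line1 = f.readline()          # each call returns the next line
    line2 = f.readline()

with open(path) as f:
    # Lines keep their trailing "\n", so strip it before collecting them
    all_lines = [line.rstrip("\n") for line in f]

print(first_five, line1, line2, all_lines)
```

Because each line read from a file still ends with a newline character, printing lines directly in a loop leaves a blank line between them. Stripping the newline, as done above, avoids that.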
Example
Open the file "[Link]" and append content to the file:
with open("[Link]", "a") as f:
    f.write("Now the file has more content!")
# Open and read the file after appending:
with open("[Link]") as f:
    print(f.read())
To create a new file in Python, use the open() function with one of the following modes:
"x" - Create - will create a file, returns an error if the file exists
"a" - Append - will create a file if the specified file does not exist
"w" - Write - will create a file if the specified file does not exist
Example
Create a new file called "[Link]":
f = open("[Link]", "x")
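A minimal sketch of how the "x" mode behaves when the file already exists. It works in a temporary folder, and the file name new_file.txt is hypothetical. The error raised by Python in this case is FileExistsError.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "new_file.txt")  # hypothetical file name

# First call: the file does not exist yet, so "x" creates it
with open(path, "x") as f:
    f.write("created")

# Second call: the file now exists, so "x" refuses to overwrite it
try:
    with open(path, "x") as f:
        f.write("this is never written")
    second_create_failed = False
except FileExistsError:
    second_create_failed = True

with open(path) as f:
    content = f.read()
print(second_create_failed, content)
```

This makes "x" a safe choice when existing data must never be overwritten, in contrast to "w", which replaces the content of an existing file.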
Delete a File
To delete a file, you must import the os module and call its os.remove() function:
Example
Remove the file "[Link]":
import os
os.remove("[Link]")
Example
Check if file exists, then delete it:
import os
if os.path.exists("[Link]"):
    os.remove("[Link]")
else:
    print("The file does not exist")
Delete Folder
Example
Remove the folder "myfolder":
import os
os.rmdir("myfolder")
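One detail worth knowing is that os.rmdir() only removes empty folders. The sketch below demonstrates this in a temporary folder; the file name scratch.txt is made up for the example.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                  # temporary folder for the demo
path = os.path.join(folder, "scratch.txt")   # hypothetical file name
with open(path, "w") as f:
    f.write("temporary")

# os.rmdir() only removes empty folders, so this first attempt fails
try:
    os.rmdir(folder)
    removed_non_empty = True
except OSError:
    removed_non_empty = False

os.remove(path)      # delete the file first
os.rmdir(folder)     # now the folder is empty and can be removed

print(removed_non_empty, os.path.exists(path), os.path.exists(folder))
```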
Machine Learning with Multiple Linear Regression
Before implementing multiple linear regression, it is essential to ensure that the following assumptions are met:
1. Linearity: The relationship between the dependent variable and independent variables is linear.
2. Independence of Errors: Residuals (errors) are independent of each other. This is often verified using
the Durbin-Watson test.
3. Homoscedasticity: The variance of residuals is constant across all levels of the independent variables. A
residual plot can help verify this.
4. No Multicollinearity: Independent variables are not highly correlated. Variance Inflation Factor (VIF) is
commonly used to detect multicollinearity.
5. Normality of Residuals: Residuals should follow a normal distribution. This can be checked using a Q-Q
plot.
6. Outlier Influence: Outliers or high-leverage points should not disproportionately influence the model.
These assumptions ensure that the regression model is valid and the results are reliable. Failing to meet these
assumptions may lead to biased or misleading results.
In this section, you will learn to use the Multiple Linear Regression model in Python to predict house prices based
on features from the California Housing Dataset. You’ll learn how to preprocess data, fit a regression model, and
evaluate its performance while addressing common challenges like multicollinearity, outliers, and feature selection.
Step 1 - Load the Dataset
You will use the California Housing Dataset, a popular dataset for regression tasks. This dataset contains 8 features
describing census block groups in California, together with the corresponding median house value.
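import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset (scikit-learn downloads it on first use).
housing = fetch_california_housing()
# Convert the dataset's data into a pandas DataFrame, using the feature names as column headers.
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Add the target variable 'MedHouseValue' to the DataFrame, using the dataset's target values.
housing_df['MedHouseValue'] = housing.target
# Display the first few rows of the DataFrame to get an overview of the dataset.
print(housing_df.head())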
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseValue
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23          4.526
Variable Description
MedInc Median income in block
HouseAge Median house age in block
AveRooms Average number of rooms
AveBedrms Average number of bedrooms
Population Block population
AveOccup Average house occupancy
Latitude House block latitude
Longitude House block longitude
Next, check that there are no missing values in the dataset, since they might affect the analysis.
print(housing_df.isnull().sum())
MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
MedHouseValue 0
dtype: int64
Feature Selection
Let’s first create a correlation matrix to understand the dependencies between the variables.
correlation_matrix = housing_df.corr()
print(correlation_matrix['MedHouseValue'])
Output:
MedInc 0.688075
HouseAge 0.105623
AveRooms 0.151948
AveBedrms -0.046701
Population -0.024650
AveOccup -0.023737
Latitude -0.144160
Longitude -0.045967
MedHouseValue 1.000000
You can analyze the correlations above to select the independent variables for the regression model. Each value
measures the strength of the linear relationship between one feature and MedHouseValue, the dependent variable
we are trying to predict.
The three features used in this section correlate with MedHouseValue as follows:
MedInc: This variable has a strong positive correlation (0.688075) with MedHouseValue, indicating that as
median income increases, median house value also tends to increase.
AveRooms: This variable has a weak positive correlation (0.151948) with MedHouseValue, suggesting
that as the average number of rooms per household increases, median house value tends to increase slightly.
AveOccup: This variable has a very weak negative correlation (-0.023737) with MedHouseValue. On its own,
average occupancy has almost no linear relationship with median house value.
With these independent variables, you can build a regression model that predicts median house value from median
income, average number of rooms, and average occupancy. MedInc is the only one of the three with a strong
individual correlation, so expect it to carry most of the predictive power.
You can also plot the correlation matrix in Python, for example as a heatmap, to inspect these relationships visually.
We shall focus on a few key features for simplicity based on the above, such as MedInc (median
income), AveRooms (average rooms per household), and AveOccup (average occupancy per household).
The feature selection step picks specific features from the housing_df data frame for analysis. The selected features
are MedInc, AveRooms, and AveOccup, and their names are stored in the selected_features list.
The DataFrame housing_df is then subset to include only these selected features, and the result is stored in X (a
DataFrame). The target variable MedHouseValue is extracted from housing_df and stored in y (a Series).
Scaling Features
We shall use Standardization to ensure all features are on the same scale, improving model performance and
comparability.
Standardization is a preprocessing technique that scales numerical features to have a mean of 0 and a standard
deviation of 1. This process ensures that all features are on the same scale, which is essential for machine learning
models sensitive to the input features’ scale. By standardizing the features, you can improve model performance and
comparability by reducing the effect of features with large ranges dominating the model.
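The calculation behind standardization is z = (x - mean) / standard deviation, applied to each feature separately. The sketch below works through it with the standard library only. The input values are made up for illustration, and the population standard deviation (dividing by n) is used, which is also what scikit-learn's StandardScaler does.

```python
import statistics

# Made-up median-income values, used only to illustrate the calculation
values = [8.3, 7.2, 5.6, 3.8, 4.0, 3.7]

mean = statistics.fmean(values)
std = statistics.pstdev(values)   # population standard deviation (divide by n)

# z = (x - mean) / std for every value
scaled = [(x - mean) / std for x in values]

scaled_mean = statistics.fmean(scaled)
scaled_std = statistics.pstdev(scaled)
print([round(z, 3) for z in scaled])
print(round(scaled_mean, 6), round(scaled_std, 6))
```

After the transformation the values have mean 0 and standard deviation 1, up to floating-point rounding, regardless of the scale of the original data.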
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
# Print the scaled data
print(X_scaled)
The output represents the scaled values of the features MedInc, AveRooms, and AveOccup after applying the
StandardScaler. The values are now centered around 0 with a standard deviation of 1, ensuring all features are on the
same scale.
The first row [ 2.34476576 0.62855945 -0.04959654] indicates that for the first data point, the scaled MedInc value
is 2.34476576, AveRooms is 0.62855945, and AveOccup is -0.04959654. Similarly, the second row [ 2.33223796
0.32704136 -0.09251223] represents the scaled values for the second data point, and so on.
In the printed output, the scaled values range from approximately -1.14259331 to 2.34476576. Standardized values
are not confined to a fixed interval, but each feature now has mean 0 and standard deviation 1, so the features are
comparable. This is essential for machine learning models that are sensitive to the scale of input features, as it
prevents features with large ranges from dominating the model.
Now that you are done with data preprocessing, let's implement multiple linear regression in Python.
# The model is used to predict the target variable for the test set.
y_pred = model.predict(X_test)
The train_test_split function is used to split the data into training and testing sets. Here, 80% of the data is used for
training and 20% for testing.
The model is evaluated using Mean Squared Error and R-squared. Mean Squared Error (MSE) measures the average
of the squares of the errors or deviations.
R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s
explained by an independent variable or variables in a regression model.
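Both metrics are simple to compute by hand, which helps when interpreting them. The following is a dependency-free sketch: the function names compute_mse and compute_r2 and the toy values are assumptions made for the example, and the numbers it prints are unrelated to the housing model's results.

```python
def compute_mse(y_true, y_pred):
    # Average of the squared differences between actual and predicted values
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def compute_r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values, only to exercise the formulas
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse = compute_mse(y_true, y_pred)
r2 = compute_r2(y_true, y_pred)
print(mse, r2)
```

In practice you would normally use mean_squared_error and r2_score from sklearn.metrics, which compute the same quantities.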
Evaluating the model on the test set gives two key metrics:
Mean Squared Error (MSE): 0.7006855912225249 The MSE measures the average squared difference between
the predicted and actual values of the target variable. A lower MSE indicates better model performance, with 0
meaning perfect predictions. Because the MSE is expressed in squared units of the target, its square root is easier to
read: the square root of 0.7007 is about 0.84, so a typical prediction is off by roughly 0.84 in the units of
MedHouseValue (hundreds of thousands of dollars in this dataset). The model is therefore usable but far from
precise.
R-squared (R2): 0.4652924370503557 R-squared measures the proportion of the variance in the dependent variable
that is predictable from the independent variables. A value of 1 is perfect prediction and 0 means the model does no
better than always predicting the mean; on test data it can even be negative for a very poor model. In this case, the
R-squared value is 0.4652924370503557, indicating that about 46.53% of the variance in the target variable can be
explained by the independent variables used in the model. The model captures a little under half of the variation,
so a large part remains unexplained.
# Residual Plot
import matplotlib.pyplot as plt
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()
Using Statsmodels
The Statsmodels library in Python is a powerful tool for statistical analysis. It provides a wide range of statistical
models and tests, including linear regression, time series analysis, and nonparametric methods.
In the context of multiple linear regression, statsmodels can be used to fit a linear model to the data, and then
perform various statistical tests and analyses on the model. This can be particularly useful for understanding the
relationships between the independent and dependent variables, and for making predictions based on the model.
import statsmodels.api as sm
Output:
==============================================================================
Dep. Variable: MedHouseValue R-squared: 0.485
Model: OLS Adj. R-squared: 0.484
Method: Least Squares F-statistic: 5173.
Date: Fri, 17 Jan 2025 Prob (F-statistic): 0.00
Time: 09:40:54 Log-Likelihood: -20354.
No. Observations: 16512 AIC: 4.072e+04
Df Residuals: 16508 BIC: 4.075e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.0679 0.006 320.074 0.000 2.055 2.081
x1 0.8300 0.007 121.245 0.000 0.817 0.843
x2 -0.1000 0.007 -14.070 0.000 -0.114 -0.086
x3 -0.0397 0.006 -6.855 0.000 -0.051 -0.028
==============================================================================
Omnibus: 3981.290 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11583.284
Skew: 1.260 Prob(JB): 0.00
Kurtosis: 6.239 Cond. No. 1.42
==============================================================================
Model Summary
The model is an Ordinary Least Squares regression model, which is a type of linear regression model. The
dependent variable is MedHouseValue, and the model has an R-squared value of 0.485, indicating that about 48.5%
of the variation in MedHouseValue can be explained by the independent variables. The adjusted R-squared value is
0.484, which is a modified version of R-squared that penalizes the model for including additional independent
variables.
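The penalty can be made concrete with the usual formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below plugs in the values from the summary table. The function name is an assumption, and because the table prints R-squared to only three decimals, the result is approximate.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    # 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the summary table: 16512 observations, 3 predictors.
# R-squared is only printed to three decimals there, so the result is approximate.
adj = adjusted_r_squared(0.485, 16512, 3)
print(round(adj, 4))
```

With more than sixteen thousand observations and only three predictors the penalty is tiny, which is why the two values in the table are almost identical.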
Model Fit
The model was fit using the Least Squares method. The F-statistic is 5173, and the probability of observing an
F-statistic at least this extreme if all coefficients were truly zero is approximately 0. This means the predictors are
jointly statistically significant. It does not by itself say how well the model fits, which is what R-squared describes.
Model Coefficients
The constant term is 2.0679. Because the features were standardized, a value of 0 corresponds to each feature's
mean, so 2.0679 is the predicted MedHouseValue for a block with average MedInc, AveRooms, and AveOccup.
The coefficient for x1 (MedInc) is 0.8300, indicating that for every one-standard-deviation increase in MedInc,
the predicted MedHouseValue increases by approximately 0.83 units, assuming all other independent variables
are held constant.
The coefficient for x2 (AveRooms) is -0.1000, indicating that for every one-standard-deviation increase in
AveRooms, the predicted MedHouseValue decreases by approximately 0.10 units, assuming all other
independent variables are held constant.
The coefficient for x3 (AveOccup) is -0.0397, indicating that for every one-standard-deviation increase in
AveOccup, the predicted MedHouseValue decreases by approximately 0.04 units, assuming all other
independent variables are held constant.
Model Diagnostics
The Omnibus test statistic is 3981.290, indicating that the residuals are not normally distributed.
The Durbin-Watson statistic is 1.983, indicating that there is no significant autocorrelation in the residuals.
The Jarque-Bera test statistic is 11583.284, indicating that the residuals are not normally distributed.
The skewness of the residuals is 1.260, indicating that the residuals are skewed to the right.
The kurtosis of the residuals is 6.239, indicating that the residuals are leptokurtic (i.e., they have a higher
peak and heavier tails than a normal distribution).
The condition number is 1.42, a low value indicating that the predictors are not strongly collinear and that the
coefficient estimates are numerically stable.
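To make the Durbin-Watson statistic less of a black box, here is a sketch of how it is computed: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, and values near 2 indicate no autocorrelation. The function name and the residual sequences are made up for illustration.

```python
def durbin_watson(residuals):
    # Sum of squared differences between consecutive residuals,
    # divided by the sum of squared residuals
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residual sequences for illustration
alternating = [1, -1, 1, -1, 1, -1]      # strong negative autocorrelation
trending = [1, 1, 1, -1, -1, -1]         # positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(dw_alternating, dw_trending)
```

Residuals that flip sign at every step push the statistic towards 4, residuals that stay on one side for long runs push it towards 0, and the value of 1.983 reported above sits close to the neutral value of 2.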
Handling Multicollinearity
Multicollinearity is a common issue in multiple linear regression, where two or more independent variables are
highly correlated with each other. This can lead to unstable and unreliable estimates of the coefficients.
To detect and handle multicollinearity, you can use the Variance Inflation Factor. The VIF measures how much the
variance of an estimated regression coefficient increases if your predictors are correlated. A VIF of 1 means that
there is no correlation between a given predictor and the other predictors. VIF values exceeding 5 or 10 indicate a
problematic amount of collinearity.
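For predictor j, the VIF is 1 / (1 - R_j squared), where R_j squared comes from regressing predictor j on all the other predictors. The sketch below covers only the simplest case of two predictors, where R_j squared equals the squared Pearson correlation between them. The function names and data are assumptions made for the example; for real models, statsmodels provides variance_inflation_factor.

```python
import math


def pearson_r(a, b):
    # Pearson correlation coefficient of two equally long sequences
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    # With only two predictors, R^2 of regressing one on the other equals r^2,
    # so VIF = 1 / (1 - r^2) and is the same for both predictors
    r = pearson_r(a, b)
    return 1 / (1 - r ** 2)


# Made-up predictor columns for illustration
x1 = [1, 2, 3, 4, 5, 6]
nearly_same = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9]   # almost a copy of x1
unrelated = [2, -1, -1, -1, -1, 2]             # uncorrelated with x1

vif_high = vif_two_predictors(x1, nearly_same)
vif_low = vif_two_predictors(x1, unrelated)
print(round(vif_high, 1), round(vif_low, 3))
```

The nearly duplicated column produces a VIF far above the thresholds of 5 or 10, while the uncorrelated column gives a VIF of 1.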
Let's calculate the VIF for each independent variable in our model. If any VIF value is above 5, you should
consider removing the variable from the model.
The Output:
Feature VIF
0 MedInc 1.120166
1 AveRooms 1.119797
2 AveOccup 1.000488
MedInc: The VIF value is 1.120166, indicating a very low correlation with other independent variables.
This suggests that MedInc is not highly correlated with other independent variables in the model.
AveRooms: The VIF value is 1.119797, indicating a very low correlation with other independent variables.
This suggests that AveRooms is not highly correlated with other independent variables in the model.
AveOccup: The VIF value is 1.000488, indicating no correlation with other independent variables. This
suggests that AveOccup is not correlated with other independent variables in the model.
In general, these VIF values are all below 5, indicating that there is no significant multicollinearity between the
independent variables in the model. This suggests that the model is stable and reliable, and that the coefficients of
the independent variables are not significantly affected by multicollinearity.
Cross-Validation Techniques
Cross-validation is a technique used to evaluate the performance of a machine learning model. It is a resampling
procedure used to evaluate a model if we have a limited data sample. The procedure has a single parameter
called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is
often called k-fold cross-validation.
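The splitting step can be sketched without any library. The helper below is a hypothetical illustration rather than scikit-learn's implementation: it divides the sample indices into k consecutive folds without shuffling, and each fold then serves once as the test set while the remaining folds form the training set.

```python
def k_fold_indices(n_samples, k):
    # Split the indices 0..n_samples-1 into k consecutive folds of near-equal size
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)

splits = []
for i, test_idx in enumerate(folds):
    # Every fold is the test set exactly once; the rest is the training set
    train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
    splits.append((train_idx, test_idx))
    print("fold", i, "train:", train_idx, "test:", test_idx)
```

A model is then trained and scored once per split, which is where the five separate R-squared scores in a 5-fold run come from.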
[Link]('Cross-Validation R-squared Scores')
[Link]()
Output:
Cross-Validation Scores: [0.42854821 0.37096545 0.46910866 0.31191043 0.51269138]
Mean CV R^2: 0.41864482644003276
The cross-validation scores indicate how well the model performs on unseen data. The scores range from
0.31191043 to 0.51269138, indicating that the model’s performance varies across different folds. A higher score
indicates better performance.
The mean CV R^2 score is 0.41864482644003276, which suggests that, on average, the model explains about
41.86% of the variance in the target variable. This is a moderate level of explanation, indicating that the model is
somewhat effective in predicting the target variable but may benefit from further improvement or refinement.
These scores can be used to evaluate the model’s generalizability and identify potential areas for improvement.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_scaled, y)
print("Selected Features:", rfe.support_)
Output:
Selected Features: [ True True False]
Based on the output above, the 2 most suitable features are MedInc and AveRooms. This agrees with the model
summary above, in which the dependent variable MedHouseValue depends mostly on MedInc and AveRooms.
Feature            Simple Linear Regression                 Multiple Linear Regression
Model Complexity   Less complex                             More complex
Model Flexibility  Less flexible                            More flexible
Overfitting Risk   Lower                                    Higher
Interpretability   Easier to interpret                      More challenging to interpret
Applicability      Suitable for simple relationships        Suitable for complex relationships with multiple factors
Example            Predicting house prices based on the     Predicting house prices based on the number of
                   number of bedrooms                       bedrooms, square footage, and location
Model Building, Identification and Evaluation in Python
Fernando, a professor who wants to buy a car, uses a Simple Linear Regression model to estimate the price of the car.
The regression model he created predicts price based on the engine size: one dependent variable is predicted using
one independent variable.
The statistical package computed the parameters. The linear equation is estimated as:
Recall that R-squared is the fraction of the variance in the actual values that the model explains, measured against a
baseline that always predicts the mean of the actual values. On the training set this value is between 0 and 1. The
higher it is, the better the model can explain the variance. The R-squared for the model created by Fernando is
0.7503, i.e. 75.03% on the training set. It means that the model can explain about 75% of the variation.
He contemplates:
What if I can feed the model with more inputs? Will it improve the accuracy?
Fernando decides to enhance the model by feeding the model with more input data i.e. more independent variables.
He has now entered into the world of the multivariate regression model.
The Concept:
Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.
Recall that linear implies the following: arranged in or extending along a straight or nearly straight line. Linear
suggests that the relationship between dependent and independent variable can be expressed in a straight line.
The equation of the line is y = mx + c. One dimension is y-axis, another dimension is x-axis. It can be plotted in a
two-dimensional plane. It looks something like this:
y = f(x).
Define y as a function of x. i.e. define the dependent variable as a function of the independent variable.
What if the dependent variable needs to be expressed in terms of more than one independent variable? The
generalized function becomes:
y = f(x, z)
There are three dimensions now: y-axis, x-axis and z-axis. It can be plotted as:
Now we have more than one dimension (x and z). We have an additional dimension. We want to express y as a
combination of x and z.
For a simple regression linear model a straight line expresses y as a function of x. Now we have an additional
dimension (z). What will happen if an additional dimension is added to a line? It becomes a plane.
The plane is the function that expresses y as a function of x and z. Extrapolating the linear regression equation, it
can now be expressed as:
y = m1.x + m2.z + c
y is the dependent variable i.e. the variable that needs to be estimated and predicted.
x is the first independent variable i.e. the variable that is controllable. It is the first input.
m1 is the slope of x. It determines how steeply y changes along the x direction.
z is the second independent variable i.e. the variable that is controllable. It is the second input.
m2 is the slope of z. It determines how steeply y changes along the z direction.
c is the intercept. A constant that determines the value of y when x and z are 0.
This is the genesis of the multivariate linear regression model: more than one input variable is used to estimate the
target. A model with two input variables can be expressed as:
y = β0 + β1.x1 + β2.x2
Let us take it a step further. What if we had three variables as inputs? Human visualization capabilities are limited
here: we can only visualize three dimensions. In the machine learning world, there can be many dimensions. A
model with three input variables can be expressed as:
y = β0 + β1.x1 + β2.x2 + β3.x3
Model Formulation:
Now that there is familiarity with the concept of a multivariate linear regression model let us get back to Fernando.
Fernando reaches out to his friend for more data. He asks him to provide more data on other characteristics of the
cars.
The following were the data points he already had:
Fernando now wants to build a model that predicts the price based on the additional data points.
Estimate price as a function of engine size, horse power, peakRPM, length, width and height.
=> price = f(engine size, horse power, peak RPM, length, width, height)
=> price = β0 + β1.engine size + β2.horse power + β3.peak RPM + β4.length + β5.width + β6.height
Model Building:
Fernando inputs these data into his statistical package. The package computes the parameters. The output is the
following:
The multivariate linear regression model provides the following equation for the price estimation.
price = -85090 + 102.85 * engineSize + 43.79 * horse power + 1.52 * peak RPM - 37.91 * length + 908.12 *
width + 364.33 * height
Model Interpretation:
The interpretation of multivariate model provides the impact of each independent variable on the dependent variable
(target).
Remember, the equation provides an estimation of the average value of price. Each coefficient is interpreted with
all other predictors held constant.
Engine Size: With all other predictors held constant, if the engine size is increased by one unit, the average
price increases by $102.85.
Horse Power: With all other predictors held constant, if the horse power is increased by one unit, the
average price increases by $43.79.
Peak RPM: With all other predictors held constant, if the peak RPM is increased by one unit, the average
price increases by $1.52.
Length: With all other predictors held constant, if the length is increased by one unit, the average price
decreases by $37.91 (length has a -ve coefficient).
Width: With all other predictors held constant, if the width is increased by one unit, the average price
increases by $908.12
Height: With all other predictors held constant, if the height is increased by one unit, the average price
increases by $364.33
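The "all other predictors held constant" reading can be checked numerically. The sketch below wraps the estimated equation in a function and changes one input at a time. The coefficients are copied from the equation above, while the function name and the example car's measurements are hypothetical.

```python
def predict_price(engine_size, horse_power, peak_rpm, length, width, height):
    # Coefficients copied from the estimated equation above
    return (-85090
            + 102.85 * engine_size
            + 43.79 * horse_power
            + 1.52 * peak_rpm
            - 37.91 * length
            + 908.12 * width
            + 364.33 * height)


# Hypothetical car, values invented purely for illustration
car = dict(engine_size=130, horse_power=111, peak_rpm=5000,
           length=168.8, width=64.1, height=48.8)

base_price = predict_price(**car)

# Increase engine size by one unit and hold everything else constant
bigger_engine = dict(car, engine_size=car["engine_size"] + 1)
difference = predict_price(**bigger_engine) - base_price
print(round(base_price, 2), round(difference, 2))
```

The difference between the two predictions equals the engine size coefficient, whatever values the other inputs take, which is exactly what the interpretation of each coefficient states.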
Model Evaluation
The model is built. It is interpreted. Are all the coefficients important? Which ones are more significant? How much
variation does the model explain? The statistical package provides the metrics to evaluate the model. Let us evaluate
the model now.
Recall the discussion on the definition of t-stat, p-value and coefficient of determination. Those concepts apply in
multivariate regression models too. The evaluation of the model is as follows:
coefficients: All slope coefficients are non-zero; every one is positive except length, which is negative.
Taken at face value, this suggests that each variable has some association with the average price.
t-value: Except for length, the t-values of all coefficients are well away from zero. For length, the t-stat is
-0.70. It implies that the length of the car may not have an impact on the average price.
p-value: The p-value is the probability of observing a t-stat at least this extreme purely by chance if the
variable truly had no effect. It is quite low for all of the variables except for length. The p-value for
length is 0.4854, i.e. a probability of 48.54%. This number is quite high.
Recall the discussion of how R-squared helps to explain the variation captured by the model. When more variables
are added to the model, the R-squared never decreases; it stays the same or increases, even if the new variables are
irrelevant. However, there has to be a balance. Adjusted R-squared strives to keep that balance. The adjusted
R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. It
compensates for the addition of variables and increases only if the new term improves the model more than would be
expected by chance.
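The adjustment can be sketched with the standard formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function name and the sample sizes below are hypothetical; the text does not state how many cars are in the data.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: the same R-squared is penalized more heavily
# when it was achieved with more predictors.
one_predictor = adjusted_r_squared(0.80, n_samples=100, n_predictors=1)
six_predictors = adjusted_r_squared(0.80, n_samples=100, n_predictors=6)
print(one_predictor, six_predictors)
```

The second value is lower than the first, which shows why an extra variable has to earn its place in the model.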
Adjusted R-squared: The adjusted R-squared value is 0.811. This implies that the model can explain 81.1% of
the variation seen in the training data. It is better than the previous model (75.03%).
All variables except for the length of the car have an impact on the price.
The length of the car does not have a significant impact on the price.
The model explains 81.1% of the variation in data.
Conclusion:
The engineer has a better model now. However, he is perplexed. He knows that length of the car doesn’t impact the
price.
He wonders:
How can one select the best set of variables for model building? Is there any method to choose the best subsets of
variables?
In the next part of this series, we will discuss variable selection methods.
Complex Real-World Challenges in ML Models
Complex real-world challenges often require complex models in order to make highly accurate predictions.
However, such models tend not to be highly interpretable. In this article, we will look into the relationship
between complexity, accuracy and interpretability.
In the real world, while working on any problem, it is important to understand the trade-off between model accuracy
and model interpretability. Business users want data scientists to build models with higher accuracy, while data
scientists face the problem of explaining to them how these models make predictions.
What is more important: having a model that gives the best accuracy on unseen data, or understanding the
predictions even when the accuracy is poor? Below we have a comparison of traditional models' accuracy vs their
ability to be interpreted.
Accuracy vs Interpretability
The graph shows some of the most widely used machine learning algorithms and how interpretable they are.
Complexity increases in terms of how the machine learning model works underneath. It can be a parametric model
(linear models) or a non-parametric model (K-Nearest Neighbour), a simple decision tree (CART) or an ensemble
model (bagging method: Random Forest, or boosting method: Gradient Boosting Trees). Complex models
mostly give better accuracy in their predictions. However, interpreting them is more difficult.
Model Complexity and Accuracy
The goal of any supervised machine learning algorithm is to achieve low bias and low variance. However, this is
rarely possible in real life, and we have a trade-off between bias and variance.
Linear regression assumes linearity when in reality the relationship may be quite complex. These simplifying
assumptions give high bias (both train and test errors are high) and the model tends to underfit. High bias can be
reduced by using more complex functions or adding more features. As complexity increases, accuracy
increases. At a certain point, the model becomes too complex and tends to overfit the training data, i.e. low bias
but high variance (low training error, high test error). Complex models like deep decision trees tend to overfit.
Because machine learning models tend to overfit, we can use a resampling technique (cross-validation) to estimate
performance on unseen data and to choose a model complexity that generalizes well.
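A minimal sketch of k-fold cross-validation follows, assuming an ordinary least-squares model fitted with NumPy. The function name k_fold_mse and the synthetic data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of an ordinary least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # A leading column of ones lets the fit include an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean((A_test @ beta - y[test_idx]) ** 2))
    return float(np.mean(scores))


# Synthetic data, invented for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=0.1, size=200)
cv_mse = k_fold_mse(X, y, k=5)
print(cv_mse)
```

Every observation is used for testing exactly once, so the averaged score is a more honest estimate of error on unseen data than the training error.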
In use cases when the impact of the prediction is high, understanding “Why” a certain prediction is made is really
important. Knowing the ‘why’ can help you learn more about the problem, the data and the reason why a model
might fail.
Implementation:
The Bike Rental Dataset is available from the UCI Machine Learning Repository:
[Link]
This dataset contains daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in
Washington D.C., along with weather and seasonal information.
Goal: Predict how many bikes will be rented depending on the weather and the day.
Input Variables:
1. Total_count (target): Count of total rental bikes including both casual and registered
2. Yr: Year (0: 2011, 1:2012)
3. Month: Month (1 to 12)
4. Hr: hour (0 to 23)
5. Temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8,
t_max=+39 (only in hourly scale)
6. Atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min),
t_min=-16, t_max=+50 (only in hourly scale)
7. Humidity: Normalized humidity. The values are divided by 100 (max)
8. Windspeed: Normalized wind speed. The values are divided by 67 (max)
9. Holiday: Whether day is holiday or not
10. Weekday: Day of the week
11. Workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
12. Season : Season (1:winter, 2:spring, 3:summer, 4:fall)
13. Weather:
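The min-max formula quoted for Temp and Atemp can be sketched as below. The bounds come from the variable descriptions above; the function names and the example temperature are hypothetical.

```python
def normalize(t, t_min, t_max):
    """Min-max scaling: (t - t_min) / (t_max - t_min)."""
    return (t - t_min) / (t_max - t_min)


def denormalize(value, t_min, t_max):
    """Invert the scaling to recover degrees Celsius."""
    return value * (t_max - t_min) + t_min


TEMP_MIN, TEMP_MAX = -8, 39  # bounds stated for Temp in the list above

scaled = normalize(15.5, TEMP_MIN, TEMP_MAX)
print(scaled)                                   # 0.5
print(denormalize(scaled, TEMP_MIN, TEMP_MAX))  # 15.5
```

Converting back to degrees Celsius is useful when reading model coefficients, because one unit of the normalized feature spans the whole temperature range.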
Features:
Bike rides increase over time
The number of bike rides increases over the period of 2 years from 2011 to 2012.
Correlation matrix
Windspeed and humidity have slightly negative correlation. Temp and atemp carry the same information and hence
are highly positively correlated. So for building the model, we can use either temp or atemp.
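The redundancy between temp and atemp can be checked with a correlation matrix. The sketch below uses synthetic stand-in columns (the real dataset is not loaded here), so the variable values are assumptions made only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)
temp = rng.uniform(0, 1, size=300)
atemp = 0.9 * temp + rng.normal(scale=0.02, size=300)  # nearly a copy of temp
windspeed = rng.uniform(0, 1, size=300)                # unrelated to temp

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef(np.vstack([temp, atemp, windspeed]))
print(np.round(corr, 2))
```

An off-diagonal entry close to 1 (here between the first two variables) signals that one of the pair can be dropped without losing much information.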
Histogram of target: Most observations fall at around 20–30 rides/hr
Preprocessing:
Features like casual and registered are dropped because together they add up to total_count and would leak the
target. Similarly, since atemp carries nearly the same information as temp, one of the two is dropped to reduce
multicollinearity. Categorical features are transformed with one-hot encoding into a format that works better with
regression models.
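One-hot encoding can be sketched by hand as follows. This is a hand-rolled illustration rather than the encoder used in the original notebook, and the function name one_hot and the sample season values are assumptions.

```python
import numpy as np


def one_hot(values):
    """Return (sorted categories, 0/1 matrix with one column per category)."""
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for i, value in enumerate(values):
        encoded[i, column_of[value]] = 1
    return categories, encoded


season = ["winter", "spring", "spring", "fall"]
categories, encoded = one_hot(season)
print(categories)  # ['fall', 'spring', 'winter']
print(encoded)
```

Each row contains exactly one 1. For linear models it is common to drop one of the resulting columns, since the full set sums to 1 and is perfectly collinear with the intercept.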
Model Implementations:
We will be going through models with increasing complexity and see how the interpretability decreases.
Linear regression involving multiple variables is called “multiple linear regression” or “multivariate linear
regression”.
The goal of multiple linear regression (MLR) is to model the linear relationship between the explanatory
(independent) variables and the response (dependent) variable. In essence, multiple regression is the extension of
ordinary least-squares (OLS) regression to more than one explanatory variable.
Regression comes with some assumptions that often do not hold in real-world datasets.
1. Linearity
2. Homoscedasticity (constant variance)
3. Independence
4. Fixed features
5. Absence of multicollinearity
Mean Squared Error: 19592.4703292543
R score: 0.40700134640548247
Mean Absolute Error: 103.67180228987019
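For reference, the three metrics reported above can be computed from their definitions as in this sketch. The function name and the toy numbers are hypothetical and unrelated to the bike rental results.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R-squared computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    # R-squared: 1 minus residual sum of squares over total sum of squares.
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mse": float(mse), "mae": float(mae), "r2": float(r2)}


metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

Because MSE squares the residuals, a handful of large errors inflates it much more than the MAE, which is one reason both are usually reported together.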
Using Cross-validation:
Linear models are easier to interpret: we can look at the coefficient (slope) of each variable to understand its
effect on the prediction, and also at the intercept.
The intercept represents the value of y (target) when all of the features are zero (x=0).
18.01100142944577
The coefficients corresponding to [Link] help us understand the effect of each feature on the target outcome.
This means that an increase in “temp” by one unit increases bike rides by 211.05 units, with all other features held
constant. The same applies to the rest of the features.
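How the intercept and coefficients are read off a fitted linear model can be sketched with NumPy's least-squares solver. The data are synthetic and the "true" values (20, 200, -50) are invented for illustration; they are not the bike rental results.

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, size=500)      # normalized, as in the dataset
humidity = rng.uniform(0, 1, size=500)
rides = 20 + 200 * temp - 50 * humidity + rng.normal(scale=5, size=500)

# Design matrix: a column of ones for the intercept, then one column per feature.
A = np.column_stack([np.ones_like(temp), temp, humidity])
beta, *_ = np.linalg.lstsq(A, rides, rcond=None)
intercept, coef_temp, coef_humidity = beta

print(intercept, coef_temp, coef_humidity)
```

Since temp is normalized to the range 0 to 1, "one unit" of temp corresponds to the full span between the coldest and warmest temperature, which is worth keeping in mind when reading a coefficient such as 211.05.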
Decision Tree Regressor:
Decision trees work by iteratively splitting the data into distinct subsets in a greedy fashion. For regression trees,
the splits are chosen to minimize either the MSE (mean squared error) or the MAE (mean absolute error) within the
resulting subsets.
CART takes a feature and determines which cut-off point minimizes the variance of y for a regression task. The
variance tells us how much the y values in a node are spread around their mean value. At each node, the chosen split
is the feature and cut-off point that give the lowest weighted average variance across the resulting child nodes.
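The cut-off search for a single feature can be sketched as below. This is a simplified illustration of the idea, not the CART implementation of any particular library; the function name and the toy data are assumptions.

```python
import numpy as np


def best_split(x, y):
    """Find the threshold on one feature that minimizes the weighted
    average variance of y in the two child nodes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_score = None, np.inf
    unique = np.unique(x)
    # Candidate thresholds are midpoints between consecutive feature values.
    for low, high in zip(unique[:-1], unique[1:]):
        threshold = (low + high) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data: rides are low in the early hours and high later on.
hour = [1, 2, 3, 8, 9, 10]
rides = [5, 6, 4, 50, 55, 52]
threshold, score = best_split(hour, rides)
print(threshold, score)
```

A full tree repeats this search over every feature at every node and recurses into the two child nodes until a stopping rule is met.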
The decision tree fits the data better than linear regression. The R-squared value is about 0.67.
Using Cross-Validation:
Decision Tree Regressor output
Feature Importance:
The importance of a feature is based on how much the splits that use it reduce the variance, summed over all the
splits in which the feature was used. A feature might be used for more than one split or not at all. We can add up the
contributions of each of the p features and get an interpretation of how much each feature has contributed to the
predictions.
We can see that hr, temp, year, workingday and season_Spring are the features that were used to split the decision
tree.
Decision Tree Regressor — Feature Importance Bar chart
Gradient Boosting Regressor:
Boosting is an ensemble technique in which the predictors are not built independently, but sequentially. Gradient
Boosting uses decision trees as weak models.
Boosting is a method of converting weak learners into strong learners by training many models in a gradual, additive
and sequential manner, with each new model fitted to reduce the loss function (e.g. squared error for regression
problems) of the combined model.
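The additive, sequential idea can be sketched with one-split trees (stumps) on a single feature and squared-error loss, where the negative gradient is simply the residual. This is a teaching sketch under those assumptions, not the algorithm of any specific library; all names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression tree (a weak learner) to the residuals."""
    best = None
    unique = np.unique(x)
    for low, high in zip(unique[:-1], unique[1:]):
        threshold = (low + high) / 2.0
        left = x <= threshold
        left_value, right_value = residuals[left].mean(), residuals[~left].mean()
        sse = np.sum((residuals - np.where(left, left_value, right_value)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a shrunken copy of it to the running prediction."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Synthetic non-linear target, invented for illustration.
x = np.linspace(0, 6, 60)
y = np.sin(x)
base, stumps, fitted = gradient_boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(mse_start, mse_end)
```

Each stump on its own is a poor model, but the sum of many small corrections tracks the curve closely. The same structure also explains why the final model is hard to interpret: the prediction is spread across all of the stumps.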
GBR often has better accuracy than other regression models because of its boosting technique. It is one of the most
used regression algorithms in competitions.
The Gradient Boosting Regressor gives us the best R-squared value of 0.957. However, this model is very difficult
to interpret.
Ensemble models definitely fall into the category of “black box” models, since they are composed of many
potentially complex individual models.
Each tree is trained in sequence to correct the errors of the trees before it, so gaining a full understanding of the
decision process by examining each individual tree is infeasible.
Both the KMO and Bartlett’s test of sphericity are commonly used to verify the feasibility of the data for
Exploratory Factor Analysis (EFA).
The Kaiser-Meyer-Olkin (KMO) test measures sampling adequacy by estimating the proportion of variance in
the items that may be common variance. Values ranging between .80 and 1.00 indicate sampling adequacy
(Cerny & Kaiser, 1977).
Bartlett’s test of sphericity examines whether a correlation matrix is significantly different from the identity
matrix, in which diagonal elements are unities and all off-diagonal elements are zeros (Bartlett, 1950).
Significant results indicate that variables in the correlation matrix are suitable for factor analysis.
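Both checks can be sketched directly from a correlation matrix with NumPy. The Bartlett statistic below uses the usual chi-square approximation, -(n - 1 - (2p + 5) / 6) * ln|R| with p(p - 1) / 2 degrees of freedom, and the overall KMO compares squared correlations with squared partial correlations. The function names and the synthetic one-factor data are assumptions made for illustration.

```python
import numpy as np


def bartlett_sphericity(data):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, log_det = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * log_det
    dof = p * (p - 1) // 2
    return statistic, dof


def kmo(data):
    """Overall KMO: sum of squared correlations divided by the sum of squared
    correlations plus squared partial correlations (off-diagonal only)."""
    corr = np.corrcoef(data, rowvar=False)
    inverse = np.linalg.inv(corr)
    scale = np.sqrt(np.outer(np.diag(inverse), np.diag(inverse)))
    partial = -inverse / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diagonal] ** 2)
    a2 = np.sum(partial[off_diagonal] ** 2)
    return r2 / (r2 + a2)


# Synthetic data: four items driven by one common factor plus noise.
rng = np.random.default_rng(7)
factor = rng.normal(size=(300, 1))
items = 0.8 * factor + 0.6 * rng.normal(size=(300, 4))

statistic, dof = bartlett_sphericity(items)
kmo_value = kmo(items)
print(statistic, dof, kmo_value)
```

The statistic is compared with a chi-square distribution with dof degrees of freedom to obtain a p-value; a large statistic means the correlation matrix is far from the identity matrix.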
The logic behind absolute fit indices is essentially to test how well the model specified by the researcher
reproduces the observed data. Commonly used absolute fit statistics include the χ2 fit statistic, RMSEA and
SRMR.
In contrast, comparative fit indices are based on a different logic, i.e. they assess how well a model
specified by a researcher fits the observed sample data relative to a null model (i.e., a model that is based
on the assumption that all observed variables are not correlated) (Miles & Shevlin, 2007). Popular
comparative model fit indices are the CFI and TLI.
The χ2 measures the discrepancy between the observed and the implied covariance matrices.
The χ2 fit statistic is very popular and frequently reported in both CFA and SEM studies.
However, it is notoriously sensitive to large sample sizes and increased model complexity (i.e. models with
a large number of indicators and degrees of freedom). Therefore, the current practice is to report it mostly
for historical reasons, and it is rarely used to make decisions about the adequacy of model fit.
The RMSEA
The Root Mean Square Error of Approximation (RMSEA) provides information as to how well the model,
with unknown but optimally chosen parameter estimates, would fit the population covariance matrix
(Byrne, 1998).
It is a very commonly used fit statistic.
One of its key advantages is that confidence intervals can be calculated around the RMSEA value.
Values below .060
The SRMR
The Standardized Root Mean Square Residual (SRMR) is the square root of the mean squared difference between
the observed (sample) correlations and those implied by the hypothesized model.
As the SRMR is standardized, its values range between 0 and 1. Commonly, models with values below the .05
threshold are considered to indicate good fit (Byrne, 1998). Also,
values up to .08
Two comparative fit indices commonly reported are the Comparative Fit Index (CFI) and the Tucker Lewis
Index (TLI). The indices are similar; however, note that the CFI is normed while the TLI is not. Therefore,
the CFI’s values range between zero and one, whereas the TLI’s values may fall below zero or be above
one (Hair et al., 2013).
For the CFI and TLI, values above .95 are indicative of good fit (Hu & Bentler, 1999). In practice, CFI and
TLI values from .90 to .95
Note that the TLI is non-normed, so its values can go above 1.00
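The RMSEA, CFI and TLI discussed above can all be computed from the χ2 values and degrees of freedom of the fitted model and of the null model. The sketch below uses the commonly cited formulas (some software uses N rather than N - 1 in the RMSEA denominator); the function names and the input numbers are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation for sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative Fit Index, normed to lie between 0 and 1."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


def tli(chi2_model, df_model, chi2_null, df_null):
    """Tucker-Lewis Index, non-normed (can fall below 0 or exceed 1)."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)


# Hypothetical fit results, invented only to exercise the formulas.
rmsea_value = rmsea(chi2=85.0, df=40, n=400)
cfi_value = cfi(85.0, 40, 1500.0, 55)
tli_value = tli(85.0, 40, 1500.0, 55)
print(round(rmsea_value, 3), round(cfi_value, 3), round(tli_value, 3))
```

If the model χ2 is smaller than its degrees of freedom, the TLI computed this way exceeds 1 while the CFI is capped at 1, which illustrates the normed versus non-normed distinction made above.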
Note:
In addition to the information above, Hoyle (2012) provides an excellent succinct summary of numerous fit
indices. This table includes, for example, information on the indices' theoretical range and their sensitivity to
varying sample size and model complexity. Note that, beyond the indices introduced above, a great number of other
indices exist, as illustrated in Hoyle's table. Yet, the frequency of their use is decreasing for various reasons. For
example, RMR is non-normed and thus hard to interpret. These indices are shown below simply for general
awareness, i.e. the fact that they exist, who developed them and what their statistical properties are.