Python
com/jobs/india-improves-skills-proficiency-in-technology-but-lags-in-data-science-
coursera-report/articleshow/[Link]
Data Science is a branch of computer science that studies how to
store, use, and analyze data in order to derive information from it.
❖ A Python library is a collection of related modules. It contains bundles of code that can be used repeatedly in
different programs.
❖ This means we don't need to write the same code again and again for different programs.
❖ Python libraries are used to create applications and models in a variety of fields, for instance, machine
learning, data science, data visualization, image and data manipulation, and many more.
PYTHON BASIC SYNTAX
INTRODUCTION
COMMENTS…
Python Comments
• Python comments are strings that begin with the # (hash/pound sign).
• They are used to document code and to help other programmers understand it.
• You can use Python comments inline, on independent lines, or on multiple lines to
include larger documentation.
• These comments are statements that are not part of your program.
• For this reason, comment statements are skipped while executing your program.
• Usually we use comments for making brief notes about a chunk of code.
Example
>>> #this is a Python comment.
>>> #print("I will not be executed")
Output:
Different Types of Python Comments
• Example :
Multiple Lines Comments
• Multi-line comments are used to note something down
in more detail or to block out an entire chunk
of code.
• Multi-line comments are written slightly differently.
• Simply put 3 single quotes before and after the part
you want to comment out.
• Example:
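A minimal sketch of the triple-quote technique described above. The variable name total and the values are made up for illustration. Strictly speaking, Python treats the quoted text as a string literal that is evaluated and thrown away, which is why it behaves like a comment.

```python
'''
This is a multi-line comment.
It can span as many lines as needed
and has no effect on the program.
'''
total = 2 + 3

'''
print("this chunk of code is blocked out")
total = 0
'''
print(total)  # prints 5
```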
VARIABLES…
PYTHON VARIABLES
• Variables are nothing but reserved memory locations
to store values.
• This means that when you create a variable you
reserve some space in memory.
• Based on the data type of a variable, the interpreter
allocates memory and decides what can be stored in
the reserved memory.
• Therefore, by assigning different data types to
variables, you can store integers, decimals or
characters in these variables.
Valid and Invalid Identifiers
• Variables are the example of identifiers.
• An Identifier is used to identify the literals used in the program.
Creating Variables
Python has no command for declaring a variable.
A variable is created the moment you first assign a value to it.
Variable Names
•A variable can have a short name (like x and y) or a more descriptive name (age,
carname, total_volume).
Rules for Python variables:
•A variable name must start with a letter or the underscore character
•A variable name cannot start with a number
•A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
•Variable names are case-sensitive (age, Age and AGE are three different variables)
Example
Legal variable names:
myvar = "John"
my_var = "John"
_my_var = "John"
myVar = "John"
MYVAR = "John"
myvar2 = "John"
Example
Illegal variable names:
2myvar = "John"
my-var = "John"
my var = "John"
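The naming rules can also be checked in code. This is a rough sketch, not an official validator: the helper name is_valid_name is made up for illustration. It relies on the built-in str.isidentifier() and keyword.iskeyword(). Note that isidentifier() also accepts non-ASCII letters, so it is slightly more permissive than the A-z rule stated on the slide.

```python
import keyword

def is_valid_name(name):
    # isidentifier() applies the letter/underscore/digit rules;
    # reserved words such as "for" have to be excluded separately
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ["myvar", "_my_var", "myvar2", "2myvar", "my-var", "my var", "for"]:
    print(candidate, is_valid_name(candidate))
```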
Python Variables - Assign Multiple Values
Many Values to Multiple Variables
Example
x, y, z = "Orange", "Banana", "Cherry"
print(x)
print(y)
print(z)
Unpack a Collection
fruits = ["apple", "banana", "cherry"]
x, y, z = fruits
print(x)
print(y)
print(z)
One Value to Multiple Variables
x = y = z = "Orange"
print(x)
print(y)
print(z)
Python - Output Variables
The Python print() function is often used to output variables.
x = "Python is awesome"
print(x)
x = "Python"
y = "is"
z = "awesome"
print(x, y, z)
x = 5
y = 10
print(x + y)
x = 5
y = "John"
print(x + y)  # raises TypeError: + cannot combine a number with a string
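Global variables
Variables created outside of a function are global and can be used inside functions.
x = "awesome"  # assumed value; x must be defined outside the function
def myfunc():
    print("Python is " + x)
myfunc()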
Python Data Types
In programming, data type is an important concept.
Variables can store data of different types, and different types can do different things.
x = 1
y = 35656222554887711
z = -3255522
print(type(x))
print(type(y))
print(type(z))
Float
Float, or "floating point number" is a number, positive or negative, containing one or more decimals.
x = 1.10
y = 1.0
z = -35.59
print(type(x))
print(type(y))
print(type(z))
Complex
Complex numbers are written with a "j" as the imaginary part:
x = 3+5j
y = 5j
z = -5j
print(type(x))
print(type(y))
print(type(z))
Import the random module, and display a random number between 1 and 9:
import random
print(random.randrange(1, 10))
Python Strings
Strings in python are surrounded by either single quotation marks, or double quotation marks.
'hello' is the same as "hello".
You can display a string literal with the print() function:
print("Hello")
print('Hello')
a = '''data science.'''
print(a)
String Length
To get the length of a string, use the len() function.
a = "Hello, World!"
print(len(a))
Check String
To check if a certain phrase or character is present in a string, we can use the keyword in.
To check that it is NOT present, we can use not in.
Use it in an if statement:
txt = "The best things in life are free!"
if "expensive" not in txt:
    print("No, 'expensive' is NOT present.")
Python - Slicing Strings
You can return a range of characters by using the slice syntax.
Specify the start index and the end index, separated by a colon, to return a part of the string.
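A short sketch of the slice syntax described above. The variable name b and the sample string are chosen for illustration. The start index is included and the end index is excluded.

```python
b = "Hello, World!"

print(b[2:5])    # characters at positions 2, 3 and 4: llo
print(b[:5])     # from the start up to position 5: Hello
print(b[7:])     # from position 7 to the end: World!
print(b[-6:-1])  # negative indexes count from the end: World
```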
Split String
a = "Hello, World!"
print(a.split(","))
Python - Format - Strings
As we learned in the Python Variables chapter, we cannot
combine strings and numbers like this:
age = 36
txt = "My name is John, I am " + age  # raises TypeError
print(txt)
But we can combine strings and numbers by using the format() method, which puts its arguments where the {} placeholders are:
age = 36
txt = "My name is John, and I am {}"
print(txt.format(age))
quantity = 3
item = 567
price = 49.95
myorder = "I want {} pieces of item {} for {} dollars."
print(myorder.format(quantity, item, price))
Python Booleans
Boolean Values
Booleans represent one of two values: True or False.
You can evaluate any expression in Python, and get one of two answers, True or False.
When you compare two values, the expression is evaluated and Python returns the Boolean answer:
print(10 > 9)
print(10 == 9)
print(10 < 9)
a = 200
b = 33
if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")
bool("abc")
bool(123)
bool(["apple", "cherry", "banana"])
Python Operators
Operators are used to perform operations on variables and values.
•Arithmetic operators
•Assignment operators
•Comparison operators
•Logical operators
•Identity operators
•Membership operators
•Bitwise operators
Python Arithmetic Operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
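The operator table for this slide is not reproduced here, so the following is a small sketch of the standard arithmetic operators. The values 7 and 2 are arbitrary.

```python
a, b = 7, 2

print(a + b)   # addition: 9
print(a - b)   # subtraction: 5
print(a * b)   # multiplication: 14
print(a / b)   # division: 3.5
print(a % b)   # modulus (remainder): 1
print(a ** b)  # exponentiation: 49
print(a // b)  # floor division: 3
```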
Python Comparison Operators
Comparison operators are used to compare two values:
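As a quick illustration (the values 5 and 3 are arbitrary), each comparison operator returns a Boolean:

```python
x, y = 5, 3

print(x == y)  # equal: False
print(x != y)  # not equal: True
print(x > y)   # greater than: True
print(x < y)   # less than: False
print(x >= y)  # greater than or equal to: True
print(x <= y)  # less than or equal to: False
```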
Python Lists
Lists are used to store multiple items in a single variable.
Lists are one of 4 built-in data types in Python used to store collections of data, the other 3
are Tuple, Set, and Dictionary, all with different qualities and usage.
Lists are created using square brackets:
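A minimal example of creating a list. The name fruits and its items are illustrative.

```python
fruits = ["apple", "banana", "cherry"]

print(fruits)        # ['apple', 'banana', 'cherry']
print(len(fruits))   # 3
print(type(fruits))  # <class 'list'>
```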
Negative Indexing
Negative indexing means start from the end
-1 refers to the last item, -2 refers to the second last item etc.
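For instance, with an illustrative three-item list:

```python
fruits = ["apple", "banana", "cherry"]

last = fruits[-1]         # last item
second_last = fruits[-2]  # second last item
print(last, second_last)  # cherry banana
```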
Range of Indexes
You can specify a range of indexes by specifying where to start and where to end the range.
When specifying a range, the return value will be a new list with the specified items.
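A small sketch of a range of indexes on an illustrative list. The start index is included and the end index is excluded, and the original list is left unchanged.

```python
fruits = ["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]

middle = fruits[2:5]  # items at positions 2, 3 and 4
print(middle)         # ['cherry', 'orange', 'kiwi']
print(len(fruits))    # the original list still has 7 items
```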
To insert a list item at a specified index, use the insert() method. The insert() method inserts an item at the
specified index:
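For example (list contents are illustrative), inserting at index 1 shifts the later items one place to the right:

```python
fruits = ["apple", "banana", "cherry"]

fruits.insert(1, "orange")  # "orange" becomes the second item
print(fruits)               # ['apple', 'orange', 'banana', 'cherry']
```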
Matplotlib is a low level graph plotting library in python that serves as a visualization utility.
Matplotlib is mostly written in python, a few segments are written in C, Objective-C and
plt.plot(xpoints, ypoints)
plt.show()
Markers
You can use the keyword argument marker to emphasize each point with a specified marker:
Shorter Syntax
The line style can be written in a shorter syntax:
linestyle can be written as ls.
dotted can be written as :.
dashed can be written as --.
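A brief sketch combining a marker with the shorter line-style syntax. The data values are arbitrary, and the Agg backend line is an assumption added so the code runs without a display. In an interactive session you would drop that line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# marker='o' draws a circle at each point; ls=':' is short for linestyle='dotted'
lines = plt.plot(ypoints, marker='o', ls=':')
```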
Create Labels for a Plot
With Pyplot, you can use the xlabel() and ylabel() functions to set a label for the x- and y-axis.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.show()
Create a Title for a Plot
With Pyplot, you can use the title() function to set a title for the plot.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.title("Sports Watch Data")  # the title text is an illustrative assumption
plt.show()
Specify Which Grid Lines to Display
You can use the axis parameter in the grid() function to specify which grid lines to display.
Legal values are: 'x', 'y', and 'both'. Default value is 'both'.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.grid(axis = 'x')
plt.show()
Display only grid lines for the y-axis:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.grid(axis = 'y')
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same length, one for the values of the x-axis,
and one for values on the y-axis:
annot – an array of the same shape as data which is used to annotate the heatmap.
A simple scatter plot:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Compare Plots
In the example above, there seems to be a relationship between speed and age, but what if we plot
the observations from another day as well? Will the scatter plot tell us something else?
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
plt.show()
Color Each Dot
You can even set a specific color for each dot by using an array of colors as value for the c argument:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array(["red","green","blue","yellow","pink","black","orange","purple","beige","brown","gray","cyan","magenta"])
plt.scatter(x, y, c=colors)
plt.show()
Color Map
-The Matplotlib module has a number of available colormaps.
-A colormap is like a list of colors, where each color has a value that ranges from 0 to 100.
-This colormap is called 'viridis' and as you can see it ranges from 0, which is a purple color, and
up to 100, which is a yellow color.
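x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array([0, 10, 20, 30, 40, 45, 50, 55, 60, 70, 80, 90, 100])
plt.scatter(x, y, c=colors, cmap='viridis')
plt.colorbar()  # assumed call: shows the value-to-color scale next to the plot
plt.show()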
Size
You can change the size of the dots with the s argument.
Just like colors, make sure the array for sizes has the same length as the arrays for the x- and y-axis:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes)
plt.show()
Alpha
You can adjust the transparency of the dots with the alpha argument.
Just like colors, make sure the array for sizes has the same length as the arrays for the x- and y-axis:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes, alpha=0.5)  # alpha=0.5 is an illustrative value
plt.show()
Combine Color Size and Alpha
You can combine a colormap with different sizes on the dots. This is best visualized if the dots are
transparent:
Create random arrays with 100 values for x-points, y-points, colors and sizes:
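x = np.random.randint(100, size=(100))
y = np.random.randint(100, size=(100))
colors = np.random.randint(100, size=(100))
sizes = 10 * np.random.randint(100, size=(100))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='viridis')  # alpha and cmap values are illustrative
plt.colorbar()
plt.show()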
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram like this:
Create Histogram
-In Matplotlib, we use the hist() function to create histograms.
-The hist() function will use an array of numbers to create a histogram, the array is sent into the
function as an argument.
-For simplicity we use NumPy to randomly generate an array with 250 values, where the values will
concentrate around 170, and the standard deviation is 10.
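A sketch of the data generation step described above, assuming NumPy's random Generator API. The seed and the bin count are arbitrary choices. np.histogram computes the same bin counts that plt.hist(heights) would draw as bars.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the run repeatable

# 250 values concentrated around 170 with a standard deviation of 10
heights = rng.normal(loc=170, scale=10, size=250)

# Number of observations in each of 10 equal-width intervals
counts, edges = np.histogram(heights, bins=10)
print(counts)
print(edges)
```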
With Pyplot, you can use the bar() function to draw bar graphs:
plt.bar(x, y)
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh() function:
plt.barh(x, y)
Bar Color
The bar() and barh() functions take the keyword argument color to set the color of the bars:
Color Names
Bar Width
The bar() function takes the keyword argument width to set the width of the bars:
With Pyplot, you can use the pie() function to draw pie charts:
plt.pie(y)
plt.show()
Labels
Add labels to the pie chart with the labels parameter. The labels parameter must be an array with one label for each wedge:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
The type int64 tells us that Python is storing each value within this column as a 64 bit integer.
Labels
❖ If nothing else is specified, the values are labeled with their index number. The first value has index 0:
print(myvar[0])
Create Labels
With the index argument, you can name your own labels.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])  # "x" and "z" are assumed labels
print(myvar)
When you have created labels, you can access an item by referring to the label.
Return the value of "y":
print(myvar["y"])
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
Pandas DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
Series is like a column, a DataFrame is the whole table.
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)  # load the dictionary into a DataFrame
print(df)
Locate Row
The DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s):
print(df.loc[0])
print(df.loc[[0, 1]])
Named Indexes
With the index argument, you can name your own indexes.
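import pandas as pd
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])  # data is the dictionary defined earlier; labels are assumed
print(df)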
Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).
print(df.loc["day2"])
Pandas Read CSV
-A simple way to store big data sets is to use CSV files (comma-separated values files).
-CSV files contain plain text and are a well-known format that can be read by everyone, including
Pandas.
-Using a CSV file called '[Link]’.
Load the CSV into a DataFrame:
df = pd.read_csv('[Link]')
print(df.to_string())
Print the DataFrame without the to_string() method:
df = pd.read_csv('[Link]')
print(df)
In my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement will return only the headers
and the first and last 5 rows.
You can change the maximum rows number with the same statement.
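The statement itself is not shown on the slide. In pandas, the setting that matches this description is the display option pd.options.display.max_rows, so the sketch below assumes that is the one meant. The new value 9999 is arbitrary.

```python
import pandas as pd

# Current limit; DataFrames with more rows are printed in shortened form
print(pd.options.display.max_rows)

# Raise the limit so that print(df) shows every row of larger tables
pd.options.display.max_rows = 9999
print(pd.options.display.max_rows)
```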
Data: [Link]
If your JSON code is not in a file, but in a Python Dictionary, you can load it into a DataFrame directly:
Pandas - Analyzing DataFrames
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the bottom.
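A small sketch of head() and tail() on a made-up DataFrame. The column names and values are illustrative and are not taken from the course data set. Both methods return 5 rows when no number is given.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": list(range(10, 110, 10)),     # 10, 20, ..., 100
    "Calories": list(range(100, 1100, 100)),  # 100, 200, ..., 1000
})

top = df.head(3)     # headers plus the first 3 rows
bottom = df.tail(2)  # headers plus the last 2 rows
print(top)
print(bottom)
```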
•[Link] cells
•[Link] data
•[Link]
The data set contains some empty cells ("Date" in row 22, and "Calories" in row 18 and 28).
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
import pandas as pd
df = pd.read_csv('[Link]')
new_df = df.dropna()
print(new_df.to_string())
Note: the dropna() method returns a new DataFrame, and will not change the original.
If you want to change the original DataFrame, use the inplace = True argument:
df = pd.read_csv('[Link]')
df.dropna(inplace = True)
print(df.to_string())
Note: dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
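df = pd.read_csv('[Link]')
df.fillna(130, inplace = True)  # assumed call: fills every empty cell with 130
Replace NULL values in the "Calories" column with the number 130:
import pandas as pd
df = pd.read_csv('[Link]')
df["Calories"] = df["Calories"].fillna(130)
Calculate the MEAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('[Link]')
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('[Link]')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
Calculate the MODE, and replace any empty values with it:
df = pd.read_csv('[Link]')
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)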
Convert to date:
import pandas as pd
df = pd.read_csv('[Link]')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
The date in row 26 was fixed, but the empty date in row 22 got a NaT (Not a Time) value, in other words an empty value. One way
to deal with empty values is simply removing the entire row.
Removing Rows
The conversion in the example above gave us a NaT value, which can be handled as a NULL
value, and we can remove the row by using the dropna() method.
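A minimal sketch of that step on a made-up three-row DataFrame (the values are illustrative and not from the course data set). The subset argument limits the check to the "Date" column.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [409.1, 479.0, 340.0],
})

df["Date"] = pd.to_datetime(df["Date"])  # the empty date becomes NaT
df = df.dropna(subset=["Date"])          # drop rows whose Date is NaT
print(df)
```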
-Sometimes you can spot wrong data by looking at the data set, because you have an
-If you take a look at our data set, you can see that in row 7, the duration is 450, but for
-It doesn't have to be wrong, but taking in consideration that this is the data set of
someone's workout sessions, we conclude with the fact that this person did not work out in
450 minutes.
How can we fix wrong values, like the one for "Duration" in row 7?
Replacing Values
-One way to fix wrong values is to replace them with something else.
-In our example, it is most likely a typo, and the value should be "45" instead of "450", and we could
just insert "45" in row 7:
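A sketch of the replacement, using a made-up "Duration" column in which row 7 holds the suspicious value (the other values are illustrative). loc addresses the cell by row label and column name.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30]})

df.loc[7, "Duration"] = 45  # overwrite the value in row 7
print(df.loc[7, "Duration"])
```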
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
By taking a look at our test data set, we can assume that row 11 and 12 are duplicates.
print(df.duplicated())
Removing Duplicates
To remove duplicates, use the drop_duplicates() method
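A short sketch on a made-up DataFrame in which the second row repeats the first (column names and values are illustrative). duplicated() flags the repeats and drop_duplicates() returns a copy without them.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 60, 45], "Pulse": [100, 100, 109]})

flags = df.duplicated()         # True for every row that repeats an earlier row
deduped = df.drop_duplicates()  # new DataFrame; the original is unchanged
print(flags)
print(deduped)
```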
NumPy also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
It is mainly used for multidimensional (2D, 3D) arrays and is an alternative to MATLAB.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray, it provides a lot of supporting functions that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
Why is NumPy Faster Than Lists?
❖ NumPy arrays are stored at one continuous place in memory unlike lists, so processes can access
❖ This is the main reason why NumPy is faster than lists. Also it is optimized to work with latest CPU
architectures.
NumPy Creating Arrays
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # example values (assumed)
print(arr)
A Python array is a data structure that holds a collection of elements that all have the same data type.
It is used to store collections of data. In Python programming, arrays are handled by the "array" module.
If you create arrays using the array module, elements of the array
must be of the same numeric type.
In a list, by contrast, you can insert different types of data, such as integer, floating point,
list, tuple, string, etc.
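The difference can be seen in a few lines. This is a sketch with made-up values. The type code "i" tells the array module to accept signed integers only.

```python
from array import array

numbers = array("i", [1, 2, 3])  # every element must be an integer
numbers.append(4)

try:
    numbers.append(2.5)          # a float does not fit an integer array
except TypeError as err:
    print("rejected:", err)

# A list accepts any mixture of types
mixed = [1, 2.5, "three", (4, 5), [6]]
print(list(numbers), mixed)
```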
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
Create a 0-D array with value 42
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # example values (assumed)
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrix or 2nd order tensors.
Example
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
These are often used to represent a 3rd order tensor.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])  # example values (assumed)
print(arr)
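To check how many dimensions an array has, NumPy arrays provide the ndim attribute. A short sketch with illustrative values for each of the four cases above:

```python
import numpy as np

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

for arr in (a, b, c, d):
    # ndim is the nesting depth; shape gives the length along each dimension
    print(arr.ndim, arr.shape)
```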
Scikit-learn
What is scikit-learn used for?
Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains a lot of
efficient tools for machine learning and statistical modeling including classification, regression, clustering and
dimensionality reduction.
•Simple and efficient tools for data mining and data analysis. It features various classification, regression and clustering
algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
• Output :
Examples
int      long                     float       complex
10       51924361L                0.0         3.14j
100      -0x19323L                15.20       45.j
-786     0122L                    -21.9       9.322e-36j
080      0xDEFABCECBDAECBFBAEl    32.3+e18    .876j
-0490    535633629843L            -90.        -.6545+0J
Example : Output:
[Link]
• Set is an unordered collection of
unique items.
• Set is defined by values
separated by comma inside
braces { }.
• Items in a set are not ordered.
• It is iterable, mutable(can
modify after creation), and has
unique elements.
• In set, the order of the elements
is undefined; it may return the
changed sequence of the
element.
• It can contain various types of
values.
Set – unique values
• We can perform set operations like union, intersection
on two sets.
• Sets have unique values.
• They eliminate duplicates.
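A brief sketch of these properties with made-up values. The | and & operators are the built-in shorthand for union and intersection.

```python
a = {1, 2, 3, 3, 2}  # duplicates are dropped when the set is created
b = {3, 4, 5}

union = a | b         # items that are in either set
intersection = a & b  # items that are in both sets

print(a)
print(union)
print(intersection)
```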
Set - indexing
lower()
✓The lower() method returns the string in lower case:
>>>a = "Hello, World!"
>>>print(a.lower())
>>>print([Link]())
replace()
• The replace() method replaces a string with another string:
>>> print(a.replace("H", "J"))
Some Methods
split()
✓The split() method splits the string into substrings if it finds
instances of the separator:
>>> a = "Hello, World!"
>>> print(a.split(","))
# returns ['Hello', ' World!']
Array
Python Comments
Machine Learning
Data everywhere!
1. Google: processes 24 petabytes of data per day.
of four Eiffel towers. Machine learning helps analyze this data easily and quickly.
Machine learning
Purpose of Machine Learning
Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and extract information
Data
Gain insights into unknown data
• Recommendation system
• Search engines
• Handwriting recognition
• Scene classification
• etc...
Raw Mango vs. Ripen Mango
SUPERVISED LEARNING
Supervised Learning
UNSUPERVISED LEARNING
Unsupervised Learning
TYPES OF SUPERVISED LEARNING
BINARY CLASSIFICATION
MULTICLASS CLASSIFICATION
REGRESSION