Python Module2
Pandas is an open-source Python library used for data manipulation and analysis. It
provides data structures and functions for performing efficient operations on data, and it is
well suited to tabular data such as spreadsheets or SQL tables. It is built on top of the
NumPy library, which makes numerical data easier to manipulate and analyze. Pandas is
widely used in data science because it works well with other important libraries, such
as:
Matplotlib for plotting graphs
SciPy for statistical analysis
Scikit-learn for machine learning algorithms.
It relies on many features provided by the NumPy library.
Here are some of the tasks that we can do using Pandas:
Data Cleaning, Merging and Joining: Clean and combine data from multiple sources,
handling inconsistencies and duplicates.
Handling Missing Data: Manage missing values (NaN) in both floating-point and
non-floating-point data.
Column Insertion and Deletion: Easily add, remove or modify columns in a
DataFrame.
Group By Operations: Use "split-apply-combine" to group and analyze data.
Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with
Pandas.
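A few of the tasks listed above can be sketched in a handful of lines. The table contents and the column names (id, name, amount, amount_with_tax) are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical data from two sources (names and values are illustrative only)
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [20, 35, 15]})

# Merging and joining: keep every customer, even those without orders
merged = customers.merge(orders, on="id", how="left")

# Column insertion and deletion
merged["amount_with_tax"] = merged["amount"] * 1.1
merged = merged.drop(columns=["amount_with_tax"])

# Group-by ("split-apply-combine"): total order amount per customer
totals = merged.groupby("name")["amount"].sum()

print(merged)
print(totals)
```

The customer without any order ends up with NaN in the amount column, which is the kind of missing value the list above refers to.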
Installing Pandas
The first step in working with Pandas is to check whether it is installed on the system. If
it is not, we need to install it using the pip command.
pip install pandas
Importing Pandas
After Pandas has been installed on the system, we need to import the library. It is
imported using:
import pandas as pd
Note: pd is just an alias for Pandas. It’s not required but using it makes the code shorter
when calling methods or properties.
Example
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Labels
If nothing else is specified, the values are labeled with their index number. The first value
has index 0, the second value has index 1, and so on.
This label can be used to access a specified value.
Example
Return the first value of the Series:
print(myvar[0])
Create Labels
With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.
Example
Create a DataFrame from two Series:
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
myvar = pd.DataFrame(data)
print(myvar)
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table
with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result
calories    420
duration     50
Name: 0, dtype: int64
Example
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Result:
   calories  duration
0       420        50
1       380        40
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
import pandas as pd
df = pd.read_csv('[Link]')
print(df)
NumPy Introduction
What is NumPy?
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use
it freely.
NumPy stands for Numerical Python.
Why Use NumPy?
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray; it comes with many supporting functions that
make working with it very easy.
Arrays are very frequently used in data science, where speed and resources are very
important.
Why is NumPy Faster Than Lists?
NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can
access and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
This is the main reason why NumPy is faster than lists. It is also optimized to work with
the latest CPU architectures.
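The speed difference can be checked with a rough timing sketch. The array size and repeat count below are arbitrary assumptions, and the measured times depend on the machine, so no particular speed-up factor is implied:

```python
import timeit

import numpy as np

n = 100_000                      # arbitrary size, chosen for illustration
py_list = list(range(n))
np_arr = np.arange(n)

# Same computation in both styles: double every element
list_result = [v * 2 for v in py_list]
arr_result = np_arr * 2

# Time 20 repetitions of each approach
list_time = timeit.timeit(lambda: [v * 2 for v in py_list], number=20)
arr_time = timeit.timeit(lambda: np_arr * 2, number=20)

print(f"list:    {list_time:.4f} s")
print(f"ndarray: {arr_time:.4f} s")
```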
import numpy as np
Now the NumPy package can be referred to as np instead of numpy.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Checking NumPy Version
Example
import numpy as np
print(np.__version__)
NumPy Creating Arrays
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
type(): This built-in Python function tells us the type of the object passed to it. In the
code above, it shows that arr is of type numpy.ndarray.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:
Example
Use a tuple to create a NumPy array:
import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
Nested arrays are arrays that have arrays as their elements.
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
Create a 0-D array with value 42
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrices or 2nd-order tensors.
NumPy has a whole submodule dedicated to linear algebra operations called numpy.linalg.
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
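The number of dimensions of the arrays described above can be checked with the ndim attribute, and the size along each dimension with shape. A short sketch using the same values as the examples:

```python
import numpy as np

a = np.array(42)                        # 0-D array (scalar)
b = np.array([1, 2, 3, 4, 5])           # 1-D array
c = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array

print(a.ndim)   # 0
print(b.ndim)   # 1
print(c.ndim)   # 2
print(c.shape)  # (2, 3): two rows, three columns
```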
NumPy Array Indexing
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
Example
Get the second element from the following array.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])
Example
Get the third and fourth elements from the following array and add them:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Think of 2-D arrays like a table with rows and columns, where the dimension represents the
row and the index represents the column.
Example
Access the element on the first row, second column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
Example
Access the element on the 2nd row, 5th column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd row: ', arr[1, 4])
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
If you have Python and pip already installed on a system, then installing Matplotlib is
very easy.
import matplotlib
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported
under the plt alias:
import matplotlib.pyplot as plt
Example
Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3,
10] to the plot function.
Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Example
Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints, 'o')
plt.show()
Multiple Points
You can plot as many points as you like; just make sure you have the same number of points
on both axes.
Example
Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to
position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()
If you have a large DataFrame with many rows, Pandas will only return the first 5 rows,
and the last 5 rows:
Example
Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv('[Link]')
print(df)
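The abbreviated display can be reproduced without a CSV file. The sketch below builds a hypothetical 100-row DataFrame (the column names n and square are illustrative only) and contrasts print(df) with head(), tail() and to_string():

```python
import pandas as pd

# Hypothetical DataFrame, long enough to be abbreviated when printed
df = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

print(df)              # abbreviated: first and last rows only
print(df.head())       # first 5 rows
print(df.tail(3))      # last 3 rows
print(df.to_string())  # every row, without abbreviation
```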
Excel Files
Pandas can also read Excel files.
import pandas as pd
df = pd.read_excel('path/to/your_file.xlsx')
CSV Files
Pandas DataFrames can be exported to CSV.
df.to_csv("your_name.csv")
Excel Files
Pandas DataFrames can be exported to Excel.
df.to_excel('path/to/output_file.xlsx', index=False)
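A quick way to confirm that an export worked is to read the file back. The sketch below does this for CSV only, writing to a temporary directory; the file name example.csv is a hypothetical placeholder. Excel export follows the same pattern but needs an extra Excel engine package installed, so it is left out here:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name
    df.to_csv(csv_path, index=False)   # index=False: do not write row labels
    df_back = pd.read_csv(csv_path)    # read the exported file again

print(df_back.equals(df))
```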
DATA CLEANING
Data cleaning in Python is a crucial step in preparing raw data for analysis and machine
learning. The pandas library is the most commonly used tool for this purpose, offering a wide
range of functionality for addressing data quality issues such as:
Empty cells
Data in wrong format
Wrong data
Duplicates
Note: Deletion should be used cautiously to avoid losing valuable data, especially in smaller
datasets.
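Each of the four problem types listed above can be handled with a standard pandas call. The following is a minimal sketch on hypothetical data; the column names and the upper limit of 120 for duration are assumptions made only for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing one example of each problem
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02",
             "not a date", "2024-01-05", "2024-01-06"],
    "pulse": [110, 117, 117, 103, np.nan, 98],
    "duration": [60, 45, 45, 60, 30, 450],
})

clean = (
    raw.drop_duplicates()                  # duplicates
       .dropna(subset=["pulse"])           # empty cells
       .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
       .dropna(subset=["date"])            # data in wrong format (now NaT)
       .assign(duration=lambda d: d["duration"].clip(upper=120))  # wrong data
       .reset_index(drop=True)
)

print(clean)
```

Here rows are dropped for simplicity; with a small dataset, filling the gaps instead may be the better choice.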
Mean/Median/Mode Imputation:
# Fill with mean (for numerical data)
df['A'] = df['A'].fillna(df['A'].mean())
Forward Fill (ffill) / Backward Fill (bfill): Fills missing values with the previous or next
valid observation. Useful for time-series data.
df['B'] = df['B'].ffill()
df['C'] = df['C'].bfill()
Interpolation: Estimates missing values based on the values of surrounding data points.
df_interpolated = df.interpolate(method='linear')
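The difference between these strategies is easiest to see on a small Series with a gap in the middle. The values below are hypothetical:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])   # two missing values in the middle

mean_filled = s.fillna(s.mean())            # mean of 1.0 and 4.0 is 2.5
forward = s.ffill()                         # repeat the previous valid value
backward = s.bfill()                        # use the next valid value
linear = s.interpolate(method="linear")     # evenly spaced between neighbours

print(mean_filled.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(forward.tolist())      # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())     # [1.0, 4.0, 4.0, 4.0]
print(linear.tolist())       # [1.0, 2.0, 3.0, 4.0]
```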
Pandas:
The most widely used library for data manipulation and analysis, offering DataFrames and
Series for structured data.
NumPy:
Essential for numerical operations and array manipulation, often used in conjunction with
Pandas.
DATA ANALYSIS
Data Analysis is the technique of collecting, transforming and organizing data to make
future predictions and informed data-driven decisions. It also helps to find possible
solutions for a business problem.
Output
Matrix a :
[[0 0]
[0 0]]
Matrix b :
[0 0]
Matrix c :
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
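The code that produced the output above is not included here. One way to obtain zero-filled arrays of these shapes, assuming NumPy's zeros() function was intended, is sketched below:

```python
import numpy as np

a = np.zeros((2, 2), dtype=int)   # 2x2 integer matrix of zeros
b = np.zeros(2, dtype=int)        # 1-D integer array of two zeros
c = np.zeros((3, 3))              # 3x3 matrix; default dtype is float

print("Matrix a :\n", a)
print("Matrix b :\n", b)
print("Matrix c :\n", c)
```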
Arithmetic Operations
1. Addition:
import numpy as np
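a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
add_ans = a+b
print(add_ans)
add_ans = np.add(a, b)
print(add_ans)
c = np.array([1, 2, 3, 4])
add_ans = a+b+c
print(add_ans)
# np.add() takes only two input arrays; a third positional argument is the
# output array, so this call stores a + b in c rather than adding c
add_ans = np.add(a, b, c)
print(add_ans)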
Output
[ 7 77 23 130]
[ 7 77 23 130]
[ 8 79 26 134]
[ 7 77 23 130]
2. Subtraction:
import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
sub_ans = a-b
print(sub_ans)
sub_ans = np.subtract(a, b)
print(sub_ans)
Output
[ 3 67 3 70]
[ 3 67 3 70]
3. Multiplication:
import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
mul_ans = a*b
print(mul_ans)
mul_ans = np.multiply(a, b)
print(mul_ans)
Output
[ 10 360 130 3000]
[ 10 360 130 3000]
4. Division:
import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
div_ans = a/b
print(div_ans)
div_ans = np.divide(a, b)
print(div_ans)
Output
[ 2.5 14.4 1.3 3.33333333]
[ 2.5 14.4 1.3 3.33333333]
1. Data Aggregation:
Data aggregation involves combining multiple data points into a single, summary
metric. This is typically achieved using the groupby() method in Pandas, followed by
applying an aggregation function.
groupby():
This method splits the DataFrame into groups based on one or more columns. For example,
grouping sales data by 'Product Category' allows for analyzing sales performance for each
category individually.
Aggregation Functions:
After grouping, various functions can be applied to each group to compute summary
statistics. Common aggregation functions include:
sum(): Calculates the total sum of values within each group.
mean(): Computes the average value within each group.
count(): Counts the number of non-null items in each group.
min() and max(): Find the minimum and maximum values within each group.
std() and var(): Calculate the standard deviation and variance within each group.
agg(): Allows applying multiple aggregation functions simultaneously or using custom
functions.
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
'Value': [10, 15, 12, 8, 20, 11]}
df = pd.DataFrame(data)
# Group by 'Category' and calculate the sum of 'Value'
category_sum = df.groupby('Category')['Value'].sum()
print(category_sum)
# Group by 'Category' and apply multiple aggregations
category_summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
print(category_summary)
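agg() also accepts custom functions, as mentioned in the list of aggregation functions. The sketch below reuses the same data; value_range is a hypothetical helper written for this example, and total and spread are arbitrary names for the result columns:

```python
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
        'Value': [10, 15, 12, 8, 20, 11]}
df = pd.DataFrame(data)

def value_range(values):
    """Hypothetical custom aggregation: largest minus smallest value."""
    return values.max() - values.min()

# Named aggregation: each keyword becomes a column in the result
summary = df.groupby('Category')['Value'].agg(total='sum', spread=value_range)
print(summary)
```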
2. Data Summarization:
Data summarization focuses on presenting key characteristics of a dataset, often in a
descriptive manner.
Descriptive Statistics:
The describe() method in Pandas provides a quick summary of numerical columns, including
count, mean, standard deviation, min, max, and quartiles.
Pivot Tables:
These tables provide a powerful way to summarize data by rearranging and aggregating data
based on multiple dimensions. pd.pivot_table() allows for flexible summarization with
various aggregation functions.
Visualizations:
Summarized data can be effectively presented using visualizations like bar plots (e.g., using
Seaborn), which can automatically calculate and display means or other aggregates across
categorical groups.
Example of Summarization:
import pandas as pd
data = {'Age': [25, 30, 22, 35, 28],
'Salary': [50000, 60000, 45000, 75000, 55000]}
df = pd.DataFrame(data)
# Get descriptive statistics for numerical columns
print(df.describe())
# Create a pivot table (if applicable with more complex data)
# For this simple example, a direct aggregation is more suitable.
# Example with more columns for a pivot table:
data_pivot = {'Region': ['East', 'West', 'East', 'West'],
              'Product': ['A', 'B', 'B', 'A'],
              'Sales': [100, 150, 120, 80]}
df_pivot = pd.DataFrame(data_pivot)
pivot_table = pd.pivot_table(df_pivot, values='Sales', index='Region',
                             columns='Product', aggfunc='sum')
print(pivot_table)
DATA VISUALIZATION
Data visualization provides an organized pictorial representation of data, which
makes it easier to understand, observe, and analyze.
Python provides several libraries for visualizing data, each with its own features and
supported graph types. In this tutorial, we will be discussing four such libraries.
Matplotlib
Seaborn
Bokeh
Plotly
Creating plots and charts with Matplotlib in Python involves using the pyplot module, which
offers a MATLAB-like interface for easy plotting.
1. Installation:
First, ensure Matplotlib is installed. If not, open your terminal or command prompt and run:
pip install matplotlib
2. Basic Line Plot:
To create a simple line plot:
A line chart is used to represent the relationship between two variables, X and Y, plotted
on different axes.
Explanation:
import matplotlib.pyplot as plt: Imports the pyplot module and assigns it the alias plt for
convenience.
plt.plot(x, y): Creates the line plot using the provided x and y data.
plt.xlabel(), plt.ylabel(), plt.title(): Add descriptive labels to the axes and a title to the plot.
plt.show(): Displays the generated plot in a new window.
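The calls explained above can be put together as follows. This is a minimal sketch; the x and y values and the label texts are hypothetical and chosen only for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

lines = plt.plot(x, y)        # plot() returns a list of the lines it drew
plt.xlabel("X values")        # label for the horizontal axis
plt.ylabel("Y values")        # label for the vertical axis
plt.title("Basic Line Plot")  # title shown above the plot
plt.show()
```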
Scatter Plot
Scatter plots are used to observe relationships between variables, using dots to represent
the relationship between them. The scatter() method in the matplotlib library is used to
draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
# reading the database
data = pd.read_csv("[Link]")
# Scatter plot with day against tip
plt.scatter(data['day'], data['tip'])
# Adding Title to the Plot
plt.title("Scatter Plot")
# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()
Bar Chart
A bar plot or bar chart is a graph that represents categories of data with rectangular bars
whose lengths or heights are proportional to the values they represent. It can be
created using the bar() method.
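A minimal bar chart sketch is shown below. The category names, values and label texts are hypothetical and serve only to illustrate the bar() call:

```python
import matplotlib.pyplot as plt

# Hypothetical categories and the value each bar should represent
categories = ["A", "B", "C"]
values = [33, 35, 8]

bars = plt.bar(categories, values)  # one bar per category
plt.title("Bar Chart")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()
```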
Histogram
A histogram is used to represent data grouped into ranges. It is a type of
bar plot where the X-axis represents the bin ranges while the Y-axis gives information
about frequency. The hist() function is used to compute and create a histogram. If we
pass categorical data to hist(), it will automatically compute the frequency of
that data, i.e. how often each value occurred.
import pandas as pd
import matplotlib.pyplot as plt
# reading the database
data = pd.read_csv("[Link]")
# histogram of total_bill
plt.hist(data['total_bill'])
plt.title("Histogram")
# Displaying the plot
plt.show()
Seaborn
Seaborn is a high-level interface built on top of Matplotlib. It provides attractive design
styles and color palettes for producing more appealing graphs.
To install Seaborn, type the command below in the terminal.
pip install seaborn
Seaborn is built on top of Matplotlib, so it can be used together with Matplotlib as
well. Using both together is a very simple process: we just have to
invoke the Seaborn plotting function as normal, and then we can use Matplotlib’s
customization functions.
Note: Seaborn comes loaded with datasets such as tips, iris, etc., but for the sake of this
tutorial we will use Pandas for loading these datasets.
1. Import Seaborn:
The first step is to import the Seaborn library, typically aliased as sns.
import seaborn as sns
import pandas as pd
data = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 6], 'category': ['A', 'B', 'A', 'B', 'A']})
# Or load a built-in dataset:
# tips = sns.load_dataset('tips')
3. Choose a Plot Type:
Seaborn offers a wide range of plot types for various data visualization needs. Some common
examples include: