
Python Module2

The document provides an overview of the Pandas library for data manipulation and analysis in Python, detailing its functionalities, installation, and usage. It also introduces NumPy for array handling and Matplotlib for data visualization, including examples of creating and manipulating data structures like Series and DataFrames. Additionally, it covers how to read CSV files into Pandas for data analysis.

Uploaded by

praveen.24.amba
Copyright
© All Rights Reserved

MBABA313

Data Manipulation and Analysis

Pandas is an open-source Python library used for data manipulation and analysis. It
provides data structures and functions for performing efficient operations on data, and it is
well suited to tabular data such as spreadsheets or SQL tables. Pandas is built on top of the
NumPy library and uses many of its functionalities. It is widely used in data science because
it works well with other important libraries, such as:
• Matplotlib for plotting graphs
• SciPy for statistical analysis
• Scikit-learn for machine learning algorithms
Here are some of the tasks we can do using Pandas:
 Data Cleaning, Merging and Joining: Clean and combine data from multiple sources,
handling inconsistencies and duplicates.
 Handling Missing Data: Manage missing values (NaN) in both floating and non-
floating point data.
 Column Insertion and Deletion: Easily add, remove or modify columns in a
DataFrame.
 Group By Operations: Use "split-apply-combine" to group and analyze data.
 Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with
Pandas.

Installing Pandas
The first step in working with Pandas is to check whether it is installed on the system. If
it is not, we need to install it using the pip command.
pip install pandas

Importing Pandas
After Pandas has been installed on the system, we need to import the library. It is usually
imported under the pd alias, created with the as keyword:
import pandas as pd
Note: pd is just an alias for Pandas. It is not required, but using it makes the code shorter
when calling methods or properties.
Example
import pandas as pd

mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)

Nayana M Ass. Prof. Dept. of MBA Page 1



Checking Pandas Version


The version string is stored in the __version__ attribute.
Example
import pandas as pd
print(pd.__version__)
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example
Create a simple Pandas Series from a list:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Labels
If nothing else is specified, the values are labeled with their index number. First value has
index 0, second value has index 1 etc.
This label can be used to access a specified value.
Example
Return the first value of the Series:

print(myvar[0])
Create Labels
With the index argument, you can name your own labels.

Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
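The difference between access by label and access by position can be sketched as follows. This is a minimal illustration; the variable names are not from the original notes, and the .loc and .iloc accessors are used to make the intent explicit.

```python
import pandas as pd

# Illustrative Series with custom labels
s = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = s.loc["y"]      # look up a value by its label
by_position = s.iloc[0]    # look up a value by its integer position

print(by_label)
print(by_position)
```

Once a Series has custom labels, writing s[0] relies on a positional fallback that recent pandas versions deprecate, so .iloc is the safer choice when a position is meant.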
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.
Example
Create a DataFrame from two columns of data:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)

What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table
with rows and columns.
Example
Create a simple Pandas DataFrame:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result

   calories  duration
0       420        50
1       380        40
2       390        45

Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result

calories 420
duration 50
Name: 0, dtype: int64

Example
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Result:
   calories  duration
0       420        50
1       380        40

Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:

import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result

      calories  duration
day1       420        50
day2       380        40
day3       390        45
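With named indexes, loc accepts the row names directly. The sketch below reuses the data from the example above; the variable names row, cell and first are illustrative.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]               # one row, returned as a Series
cell = df.loc["day2", "calories"]  # a single value: row label, then column label
first = df.iloc[0]                 # first row by position, regardless of its name

print(row)
print(cell)
```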

Load Files Into a DataFrame


If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd
df = pd.read_csv('[Link]')
print(df)

NumPy Introduction
What is NumPy?
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use
it freely.
NumPy stands for Numerical Python.
Why Use NumPy?
In Python we have lists that serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray, it provides a lot of supporting functions that
make working with ndarray very easy.


Arrays are very frequently used in data science, where speed and resources are very
important.
Why is NumPy Faster Than Lists?
NumPy arrays are stored at one continuous place in memory unlike lists, so processes can
access and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
This is the main reason why NumPy is faster than lists. Also it is optimized to work with
latest CPU architectures.
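A rough way to see the difference is to time the same operation on a list and on an ndarray. This is only a sketch: the array size and repeat count are arbitrary choices, and the measured times depend on the machine.

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n)

# Double every element: a Python loop over a list vs. one vectorized NumPy operation
list_result = [x * 2 for x in py_list]
array_result = np_arr * 2

list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=20)
array_time = timeit.timeit(lambda: np_arr * 2, number=20)
print(f"list: {list_time:.4f}s, ndarray: {array_time:.4f}s")

# The array's elements sit in one contiguous block of memory
print(np_arr.flags["C_CONTIGUOUS"])
```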

Installing NumPy in Python


To begin using NumPy, you need to install it first. This can be done through pip command:

pip install numpy

Once installed, import the library with the alias np


import numpy as np
Example
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
NumPy as np

NumPy is usually imported under the np alias.

Create an alias with the as keyword while importing:

import numpy as np
Now the NumPy package can be referred to as np instead of numpy.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Checking NumPy Version

The version string is stored under __version__ attribute.

Example
import numpy as np
print(np.__version__)
NumPy Creating Arrays
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.

Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))

type(): This built-in Python function tells us the type of the object passed to it. In the code
above, it shows that arr is of type numpy.ndarray.

To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:

Example
Use a tuple to create a NumPy array:
import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
Nested arrays are arrays that have arrays as their elements.
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
Create a 0-D array with value 42

import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrices or 2nd-order tensors.
NumPy has a whole submodule dedicated to linear algebra operations on matrices, called numpy.linalg.
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
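The number of dimensions can be checked rather than counted by eye. As a small sketch, the ndim and shape attributes of the three arrays built in this section report their depth and size:

```python
import numpy as np

a = np.array(42)                        # 0-D
b = np.array([1, 2, 3, 4, 5])           # 1-D
c = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D

for arr in (a, b, c):
    # ndim is the number of dimensions, shape is the size along each one
    print(arr.ndim, arr.shape)
```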
NumPy Array Indexing
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.

Example
Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])

Example
Get the second element from the following array.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])

Example
Get third and fourth elements from the following array and add them.

import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Think of 2-D arrays like a table with rows and columns, where the dimension represents the
row and the index represents the column.
Example
Access the element on the first row, second column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])

Example
Access the element on the 2nd row, 5th column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd row: ', arr[1, 4])
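The same row-and-column notation also selects whole rows or columns. The sketch below goes slightly beyond the examples above by using a slice (:) and a negative index, both standard NumPy indexing features; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

second_row = arr[1]       # the whole second row
third_col = arr[:, 2]     # the third column, taken from every row
last = arr[1, -1]         # a negative index counts from the end of the row

print(second_row)
print(third_col)
print(last)
```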
What is Matplotlib?

Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.

• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and we can use it freely.
• Matplotlib is mostly written in Python; a few segments are written in C, Objective-C
and JavaScript for platform compatibility.
Installation of Matplotlib

If you have Python and PIP already installed on a system, then installation of Matplotlib is
very easy.

Install it using this command:

C:\Users\Your Name>pip install matplotlib


Import Matplotlib

Once Matplotlib is installed, import it in your applications by adding the import module statement:

import matplotlib

Checking Matplotlib Version


The version string is stored under __version__ attribute.
Example
import matplotlib
print(matplotlib.__version__)
Matplotlib Pyplot
Pyplot

Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported
under the plt alias:

import matplotlib.pyplot as plt

Now the Pyplot package can be referred to as plt.

Example
Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()

Result: [figure: a straight line from (0, 0) to (6, 250)]

Plotting x and y points

 The plot() function is used to draw points (markers) in a diagram.


 By default, the plot() function draws a line from point to point.
 The function takes parameters for specifying points in the diagram.
 Parameter 1 is an array containing the points on the x-axis.
 Parameter 2 is an array containing the points on the y-axis.

If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3,
10] to the plot function.

Example
Draw a line in a diagram from position (1, 3) to position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()

Plotting Without Line


To plot only the markers, you can use shortcut string notation parameter 'o', which means
'rings'.

Example
Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints, 'o')
plt.show()
Result: [figure: two markers, at (1, 3) and (8, 10)]

Multiple Points
You can plot as many points as you like, just make sure you have the same number of points
on both axes.
Example
Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to
position (8, 10):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()
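Markers and a connecting line can be combined in a single call. The sketch below assumes a non-interactive backend ("Agg") and a hypothetical output file name, points.png, so that it runs without a display; in an interactive session plt.show() would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of opening a window
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

fig, ax = plt.subplots()
# In the format string, 'o' draws a marker at each point and '-' joins them with a line
lines = ax.plot(xpoints, ypoints, "o-")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("points.png")  # hypothetical file name
```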

Different Types of Plots in Matplotlib


Matplotlib offers a wide range of plot types to suit various data visualization needs. Here
are some of the most commonly used types of plots in Matplotlib:
 1. Line Graph
 2. Bar Chart
 3. Histogram
 4. Scatter Plot
 5. Pie Chart
 6. 3D Plot

DATA IMPORT AND EXPORT:


CSV (comma-separated values) files store tabular data as plain text. Pandas can load a CSV
file into a DataFrame, which is a powerful structure for data manipulation and analysis. To
access data from a CSV file, we use the read_csv() function from Pandas, which returns the
data as a DataFrame.
Read CSV Files
• A simple way to store big data sets is to use CSV files (comma-separated files).
• CSV files contain plain text and are a well-known format that can be read by everyone,
including Pandas.
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('[Link]')
print(df.to_string())

Note: use to_string() to print the entire DataFrame.

If you print a large DataFrame (by default, one with more than 60 rows), Pandas displays
only the first 5 rows and the last 5 rows:

Example
Print the DataFrame without the to_string() method:


import pandas as pd
df = pd.read_csv('[Link]')
print(df)
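To try read_csv() without a file on disk, the CSV text can be supplied from memory. This is a minimal sketch: the column names and values are made up for illustration, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a file on disk
csv_text = "Duration,Pulse\n60,110\n45,117\n30,103\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.head(2))                    # first two rows only
print(pd.options.display.max_rows)   # row limit above which print(df) is truncated
```

The head() method is a convenient way to inspect the start of a large DataFrame without printing all of it.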

Excel Files.
Pandas can also read Excel files.
import pandas as pd
df = pd.read_excel('path/to/your_file.xlsx')

Export Pandas dataframe to a CSV file


When working on a Data Science project one of the key tasks is data management which
includes data collection, cleaning and storage. Once our data is cleaned and processed it’s
essential to save it in a structured format for further analysis or sharing.
A CSV (Comma-Separated Values) file is a widely used format for storing tabular data.
Pandas provides the easy-to-use
to_csv() method to export a DataFrame to a CSV file.

Before exporting, let's first create a sample DataFrame using Pandas:

import pandas as pd
scores = {'Name': ['a', 'b', 'c', 'd'],
'Score': [90, 80, 95, 20]}
df = pd.DataFrame(scores)
print(df)

output:
  Name  Score
0    a     90
1    b     80
2    c     95
3    d     20

Now that we have a sample DataFrame, let's export it to a CSV file.

Exporting DataFrame to CSV


1. Basic Export
The simplest way to export a DataFrame to a CSV file is by using the to_csv() function
without any additional parameters. This method creates a CSV file where the DataFrame's
contents are written as-is.

df.to_csv("your_name.csv")

output:
,Name,Score
0,a,90
1,b,80
2,c,95
3,d,20

Customizing the CSV Export

2. Remove Index Column


By default, to_csv() also writes the index column, which holds the row labels of the
DataFrame. If we do not want this extra column in our CSV file, we can remove it by
setting index=False.
df.to_csv('your_name.csv', index = False)

output:
Name,Score
a,90
b,80
c,95
d,20

3. Export only selected columns


In some cases we may not want to export all columns from our DataFrame.
The columns parameter in to_csv() allows us to specify which columns should be included
in the output file.
df.to_csv("your_name.csv", columns = ['Name'])
output:
,Name
0,a
1,b
2,c
3,d

CSV Files.
Pandas DataFrames can be exported to CSV.

df.to_csv('path/to/output_file.csv', index=False) # index=False prevents writing the DataFrame index

Excel Files.
Pandas DataFrames can be exported to Excel.

df.to_excel('path/to/output_file.xlsx', index=False)
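A quick way to check an export is to write it and read it straight back. The sketch below uses an in-memory buffer (io.StringIO) in place of a file path, which is an assumption made so the example leaves no files behind.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Score": [90, 80, 95, 20]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # write without the index column
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the original DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```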

DATA CLEANING

Data cleaning in Python is a crucial step in preparing raw data for analysis and machine
learning. The pandas library is the most commonly used tool for this purpose, offering a wide
range of functionalities to address various data quality issues.

Data cleaning means fixing bad data in your data set.

Bad data could be:

 Empty cells
 Data in wrong format
 Wrong data


 Duplicates

Key Data Cleaning Techniques in Python using Pandas:


 Handling Missing Values:
Handling missing values is a crucial step in data preprocessing using Python, commonly
performed with the Pandas library. Several techniques can be employed, depending on the
nature of the data and the extent of missingness.

1. Identifying Missing Values:


Before handling, missing values must be identified.
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())

2. Handling Missing Values:


Deletion:
o Dropping rows: Removes rows containing any missing values.
df_dropped_rows = df.dropna()
o Dropping columns: Removes columns containing any missing values.
df_dropped_cols = df.dropna(axis=1)
Note: Deletion should be used cautiously to avoid losing valuable data, especially in smaller
datasets.

 Imputation: Filling missing values with estimated values.

• Mean/Median/Mode Imputation:
# The results are assigned back to the column: calling fillna(inplace=True) on a
# selected column is deprecated in recent pandas versions
# Fill with mean (for numerical data)
df['A'] = df['A'].fillna(df['A'].mean())

# Fill with median (for numerical data, robust to outliers)
df['B'] = df['B'].fillna(df['B'].median())

# Fill with mode (for categorical data)
df['C'] = df['C'].fillna(df['C'].mode()[0])

• Forward Fill (ffill) / Backward Fill (bfill): Fills missing values with the previous or next
valid observation. Useful for time-series data.
# ffill() and bfill() replace the deprecated fillna(method=...) form
df['B'] = df['B'].ffill()
df['C'] = df['C'].bfill()

• Interpolation: Estimates missing values based on the values of surrounding data points.
df_interpolated = df.interpolate(method='linear')

• Constant Value Imputation: Fills missing values with a specified constant.
df['C'] = df['C'].fillna('Unknown')
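The techniques above can be combined on the sample DataFrame from the identification step. This is a sketch under one assumption: the choice of method per column is arbitrary and only meant to show each technique once.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4],
                   "B": [5, np.nan, 7, 8],
                   "C": [9, 10, 11, np.nan]})

missing_before = int(df.isnull().sum().sum())   # total number of NaN cells

filled = df.copy()                               # keep the original untouched
filled["A"] = filled["A"].fillna(filled["A"].mean())     # mean of 1, 2 and 4
filled["B"] = filled["B"].interpolate(method="linear")   # midpoint of 5 and 7
filled["C"] = filled["C"].ffill()                        # repeat the previous value

missing_after = int(filled.isnull().sum().sum())
print(missing_before, missing_after)
print(filled)
```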
DATA TRANSFORMATION
Data transformation in Python involves converting raw data into a more suitable format or
structure for analysis, modeling, or reporting. This process is crucial for cleaning, preparing,
and enhancing data quality, making it compatible with various analytical tools and
algorithms.
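As a minimal sketch of what such a transformation can look like in Pandas, the example below filters, renames, joins and aggregates two small tables. The tables, column names and threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: sales records and a lookup table of regions
sales = pd.DataFrame({"id": [1, 2, 3],
                      "amt": [100, 250, 80],
                      "region_id": [10, 20, 10]})
regions = pd.DataFrame({"region_id": [10, 20], "region": ["East", "West"]})

big = sales[sales["amt"] > 90]                    # filter rows with boolean indexing
big = big.rename(columns={"amt": "amount"})       # rename a column
merged = pd.merge(big, regions, on="region_id")   # join with the lookup table
merged["amount_k"] = merged["amount"].apply(lambda v: v / 1000)  # apply a function
totals = merged.groupby("region")["amount"].sum() # aggregate per region

print(merged)
print(totals)
```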

Key Libraries and Techniques for Data Transformation in Python:


Pandas:

The most widely used library for data manipulation and analysis, offering DataFrames and
Series for structured data.

• Selecting and Filtering: df.loc[], df.iloc[], df.query(), boolean indexing.
• Adding/Dropping/Renaming Columns: df['new_col'] = ..., df.drop(), df.rename().
• Handling Missing Values: df.fillna(), df.dropna().
• Grouping and Aggregation: df.groupby(), agg().
• Merging and Joining: pd.merge(), df.join(), pd.concat().
• Applying Functions: df.apply(), df.map(), df.applymap().
• Reshaping Data: df.pivot_table(), df.melt().

NumPy:
Essential for numerical operations and array manipulation, often used in conjunction with
Pandas.

• Mathematical Operations: element-wise functions applied to whole arrays at once.
• Array Reshaping: np.reshape().

DATA ANALYSIS
Data Analysis is the technique of collecting, transforming and organizing data to make
future predictions and informed data-driven decisions. It also helps to find possible
solutions for a business problem.

Analyzing Numerical Data with NumPy


NumPy is an array processing package in Python and provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python.
A NumPy array is a table of elements, usually numbers, all of the same type, indexed by a
tuple of non-negative integers. In NumPy, the number of dimensions of the array is called the
rank of the array. A tuple of integers giving the size of the array along each dimension is
known as the shape of the array.

Creating NumPy Array


NumPy arrays can be created in multiple ways with various ranks. It can also be created
with the use of different data types like lists, tuples, etc.

Create Array using np.zeros(shape, dtype = None, order = 'C')

import numpy as np
a = np.zeros([2, 2], dtype = int)
print("\nMatrix a : \n", a)
b = np.zeros(2, dtype = int)
print("Matrix b : \n", b)
c = np.zeros([3, 3])
print("\nMatrix c : \n", c)
Output
Matrix a :
[[0 0]
[0 0]]
Matrix b :
[0 0]

Matrix c :
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

Operations on Numpy Arrays

Arithmetic Operations
1. Addition:

import numpy as np
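a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
add_ans = a+b
print(add_ans)
add_ans = np.add(a, b)
print(add_ans)
c = np.array([1, 2, 3, 4])
add_ans = a+b+c
print(add_ans)
# np.add() takes two operands; a third positional argument is the output array,
# so this call computes a+b and stores the result in c
add_ans = np.add(a, b, c)
print(add_ans)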

Output
[ 7 77 23 130]
[ 7 77 23 130]
[ 8 79 26 134]
[ 7 77 23 130]
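The last line of the output equals a+b rather than a+b+c because np.add() adds exactly two arrays. A small sketch of two ways to add three arrays, and of the out parameter written explicitly, follows; the variable names total, same and out are illustrative.

```python
import numpy as np

a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
c = np.array([1, 2, 3, 4])

total = np.add(np.add(a, b), c)     # nest the calls to add three arrays
same = np.sum([a, b, c], axis=0)    # or sum the stacked arrays along the first axis

out = np.empty_like(a)
np.add(a, b, out=out)               # out= names the array that receives a + b

print(total)
print(same)
print(out)
```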
2. Subtraction:

import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
sub_ans = a-b
print(sub_ans)
sub_ans = np.subtract(a, b)
print(sub_ans)
Output
[ 3 67 3 70]
[ 3 67 3 70]

3. Multiplication:

import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
mul_ans = a*b
print(mul_ans)
mul_ans = np.multiply(a, b)
print(mul_ans)

Output
[ 10 360 130 3000]
[ 10 360 130 3000]

4. Division:

import numpy as np
a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])
div_ans = a/b
print(div_ans)
div_ans = np.divide(a, b)
print(div_ans)
Output
[ 2.5 14.4 1.3 3.33333333]
[ 2.5 14.4 1.3 3.33333333]

1. Data Aggregation:
Data aggregation involves combining multiple data points into a single, summary
metric. This is typically achieved using the groupby() method in Pandas, followed by
applying an aggregation function.
 groupby():
This method splits the DataFrame into groups based on one or more columns. For example,
grouping sales data by 'Product Category' allows for analyzing sales performance for each
category individually.

 Aggregation Functions:
After grouping, various functions can be applied to each group to compute summary
statistics. Common aggregation functions include:
 sum(): Calculates the total sum of values within each group.
 mean(): Computes the average value within each group.
 count(): Counts the number of non-null items in each group.
 min() and max(): Find the minimum and maximum values within each group.
 std() and var(): Calculate the standard deviation and variance within each group.
 agg(): Allows applying multiple aggregation functions simultaneously or using custom
functions.

import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A'],
'Value': [10, 15, 12, 8, 20, 11]}
df = pd.DataFrame(data)
# Group by 'Category' and calculate the sum of 'Value'
category_sum = df.groupby('Category')['Value'].sum()
print(category_sum)
# Group by 'Category' and apply multiple aggregations
category_summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
print(category_summary)
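When the summary columns need descriptive names, agg() also accepts keyword arguments of the form new_name=(column, function). The sketch below applies this to the same data; the output column names total, average and n are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "A", "C", "B", "A"],
                   "Value": [10, 15, 12, 8, 20, 11]})

# Each keyword names an output column and pairs a source column with a function
summary = df.groupby("Category").agg(
    total=("Value", "sum"),
    average=("Value", "mean"),
    n=("Value", "count"),
)
print(summary)
```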

2. Data Summarization:
Data summarization focuses on presenting key characteristics of a dataset, often in a
descriptive manner.
 Descriptive Statistics:
The describe() method in Pandas provides a quick summary of numerical columns, including
count, mean, standard deviation, min, max, and quartiles.
 Pivot Tables:
These tables provide a powerful way to summarize data by rearranging and aggregating data
based on multiple dimensions. pd.pivot_table() allows for flexible summarization with
various aggregation functions.
 Visualizations:
Summarized data can be effectively presented using visualizations like bar plots (e.g., using
Seaborn), which can automatically calculate and display means or other aggregates across
categorical groups.
Example of Summarization:

import pandas as pd
data = {'Age': [25, 30, 22, 35, 28],
'Salary': [50000, 60000, 45000, 75000, 55000]}
df = pd.DataFrame(data)
# Get descriptive statistics for numerical columns
print(df.describe())
# Create a pivot table (if applicable with more complex data)
# For this simple example, a direct aggregation is more suitable.
# Example with more columns for a pivot table:
data_pivot = {'Region': ['East', 'West', 'East', 'West'], 'Product': ['A', 'B', 'B', 'A'],
'Sales': [100, 150, 120, 80]}
df_pivot = pd.DataFrame(data_pivot)
pivot_table = pd.pivot_table(df_pivot, values='Sales', index='Region', columns='Product',
aggfunc='sum')
print(pivot_table)
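A pivot table is closely related to the groupby() aggregation described earlier. As a sketch using the same sales data, grouping by both columns and then moving one level into the columns with unstack() gives the same numbers:

```python
import pandas as pd

df_pivot = pd.DataFrame({"Region": ["East", "West", "East", "West"],
                         "Product": ["A", "B", "B", "A"],
                         "Sales": [100, 150, 120, 80]})

pivot = pd.pivot_table(df_pivot, values="Sales", index="Region",
                       columns="Product", aggfunc="sum")

# Same summary built by hand: group by both keys, then spread Product into columns
by_group = df_pivot.groupby(["Region", "Product"])["Sales"].sum().unstack()

stats = df_pivot["Sales"].describe()   # count, mean, std, min, quartiles, max
print(pivot)
print(stats["mean"])
```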

DATA VISUALIZATION
Data visualization provides an organized pictorial representation of data, which
makes it easier to understand, observe and analyse.
Python provides various libraries for visualizing data, each with different features and
support for various types of graphs. In this
tutorial, we will be discussing four such libraries.
 Matplotlib
 Seaborn
 Bokeh
 Plotly

Creating plots and charts with Matplotlib in Python involves using the pyplot module, which
offers a MATLAB-like interface for easy plotting.

1. Installation:
First, ensure Matplotlib is installed. If not, open your terminal or command prompt and run:
pip install matplotlib
2. Basic Line Plot:
To create a simple line plot:
A line chart represents the relationship between two variables, X and Y, plotted on different axes.

import matplotlib.pyplot as plt

# Data for the plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 2]
# Create the plot
plt.plot(x, y)
# Add labels and a title
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
plt.title("Simple Line Plot")
# Display the plot
plt.show()

Explanation:
• import matplotlib.pyplot as plt: Imports the pyplot module and assigns it the alias plt for
convenience.
• plt.plot(x, y): Creates the line plot using the provided x and y data.
• plt.xlabel(), plt.ylabel(), plt.title(): Add descriptive labels to the axes and a title to the plot.
• plt.show(): Displays the generated plot in a new window.

Scatter Plot

Scatter plots are used to observe relationships between variables, using dots to represent
individual observations. The scatter() function in Matplotlib's pyplot module is used to
draw a scatter plot.

import pandas as pd
import matplotlib.pyplot as plt
# reading the database
data = pd.read_csv("[Link]")
# Scatter plot with day against tip
plt.scatter(data['day'], data['tip'])
# Adding Title to the Plot
plt.title("Scatter Plot")
# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()

Matplotlib supports various chart types:

• Bar Chart: Use plt.bar(x, height).
• Scatter Plot: Use plt.scatter(x, y).
• Pie Chart: Use plt.pie(sizes, labels=labels, autopct='%1.1f%%').
• Histogram: Use plt.hist(data, bins=bins).

Bar Chart
A bar plot or bar chart is a graph that represents categories of data with rectangular bars
whose lengths or heights are proportional to the values they represent. It can be
created using the bar() function.

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']
values = [10, 25, 15, 30]
plt.bar(categories, values, color='skyblue')
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.show()

Histogram

A histogram represents data grouped into ranges (bins). It is a type of
bar plot where the X-axis represents the bin ranges while the Y-axis gives information
about frequency. The hist() function is used to compute and create a histogram. If we
pass categorical data to a histogram, it automatically computes the frequency of
that data, i.e. how often each value occurred.

import pandas as pd
import matplotlib.pyplot as plt
# reading the database
data = pd.read_csv("[Link]")
# histogram of total_bills
plt.hist(data['total_bill'])
plt.title("Histogram")
# Display the plot
plt.show()
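How values fall into bins is easier to see with explicit bin edges. The sketch below uses made-up sample values, a non-interactive backend ("Agg") and a hypothetical output file name so that it runs without a display; hist() returns the counts per bin alongside the bin edges.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of opening a window
import matplotlib.pyplot as plt
import numpy as np

values = np.array([3, 7, 8, 12, 13, 14, 18, 21, 22, 29])  # made-up sample data
bins = [0, 10, 20, 30]                                     # edges of three bins

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=bins)  # counts[i] = values between edges[i] and edges[i+1]
ax.set_xlabel("value")
ax.set_ylabel("frequency")
ax.set_title("Histogram")
fig.savefig("histogram.png")  # hypothetical file name

print(counts)
```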

Seaborn
Seaborn is a high-level interface built on top of Matplotlib. It provides attractive design
styles and color palettes for making more appealing graphs.
To install Seaborn, type the command below in the terminal.
pip install seaborn
Because Seaborn is built on top of Matplotlib, the two can be used together. We simply
invoke the Seaborn plotting function as normal, and then use Matplotlib's
customization functions.
Note: Seaborn comes loaded with datasets such as tips, iris, etc., but for the sake of this
tutorial we will use Pandas for loading these datasets.
1. Import Seaborn:
The first step is to import the Seaborn library, typically aliased as sns.

import seaborn as sns
import matplotlib.pyplot as plt # Often used alongside for further customization
2. Load Data:
You'll need a dataset to visualize. This can be a Pandas DataFrame or a similar data
structure. Seaborn also includes some built-in datasets for examples.

import pandas as pd
data = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 6], 'category': ['A', 'B', 'A', 'B', 'A']})
# Or load a built-in dataset:
# tips = sns.load_dataset('tips')
3. Choose a Plot Type:
Seaborn offers a wide range of plot types for various data visualization needs. Some common
examples include:

• Relational Plots: sns.scatterplot(), sns.lineplot(), sns.relplot()
• Distribution Plots: sns.histplot(), sns.kdeplot(), sns.displot()
• Categorical Plots: sns.boxplot(), sns.violinplot(), sns.barplot(), sns.countplot(), sns.swarmplot()
• Matrix Plots: sns.heatmap(), sns.clustermap()
• Regression Plots: sns.regplot(), sns.lmplot()

4. Create the Plot:


Call the desired Seaborn function, passing in your data and specifying the variables for the x
and y axes, and potentially other aesthetic mappings like hue (for color-coding based on a
categorical variable), size, or style.
Example: Scatter Plot

sns.scatterplot(x='x', y='y', data=data, hue='category')
plt.title('Scatter Plot of X vs Y by Category')
plt.show()
Example: Histogram
sns.histplot(data=data, x='y', kde=True) # kde=True adds a Kernel Density Estimate
plt.title('Distribution of Y')
plt.show()
5. Customize (Optional):
Seaborn plots are highly customizable. You can adjust colors, styles, labels, titles, and more
using both Seaborn's parameters and Matplotlib's functions.
• Styling: sns.set_style(), sns.set_palette()
• Labels and Titles: plt.xlabel(), plt.ylabel(), plt.title()
• Figure Size: plt.figure(figsize=(width, height))
6. Display the Plot:
Finally, use plt.show() to display your generated plot.
