Python Chapter 7
NumPy
- pip install numpy
Compiled by: Rakesh Shrestha (sthrakeshrestha)
Introduction to NumPy
• NumPy offers several ways to create arrays, providing flexibility for various use
cases.
• From Python Lists: You can convert a Python list directly into a NumPy array using
the np.array() function. This is a convenient approach when the data is already
in a list format.
• From Scratch: You can create an array directly by passing the elements within
square brackets [] to np.array(). This method is useful for defining small arrays or
arrays with specific values.
• With Specific Data Types: NumPy allows us to define the data type of the
elements in the array using the dtype argument of the np.array() function.
This helps keep memory usage efficient and operations optimized for the specific
data types we're working with.
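The three creation styles above can be sketched as follows. This is a minimal illustration; the variable names and sample values are hypothetical.

```python
import numpy as np

# From an existing Python list (hypothetical sample data)
scores = [85, 90, 78]
arr_from_list = np.array(scores)

# Directly from a literal list of elements
arr_literal = np.array([1.5, 2.5, 3.5])

# With an explicit data type: int8 uses one byte per element
arr_small = np.array([1, 2, 3], dtype=np.int8)

print(arr_from_list)
print(arr_literal.dtype)   # float64
print(arr_small.itemsize)  # 1
```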
• np.arange(start, stop, step=1, dtype=None): Generates an array with evenly spaced values
within a given range (exclusive of stop).
• start: Starting value (inclusive).
• stop: Ending value (exclusive).
• step (optional): Interval between elements (defaults to 1).
• dtype (optional): The data type of the elements.
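A short sketch of how the parameters above interact. The variable names are illustrative only.

```python
import numpy as np

evens = np.arange(0, 10, 2)                # stop value 10 is excluded
halves = np.arange(0, 2, 0.5)              # a fractional step is allowed
as_float = np.arange(3, dtype=np.float64)  # start defaults to 0 when omitted

print(evens.tolist())     # [0, 2, 4, 6, 8]
print(halves.tolist())    # [0.0, 0.5, 1.0, 1.5]
print(as_float.tolist())  # [0.0, 1.0, 2.0]
```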
• Selecting appropriate data types for NumPy arrays is crucial for efficient
memory usage, optimized calculations, and accurate results.
• Common NumPy data types:
• Numeric Data Types:
• Boolean Data Type:
• String Data Type:
• Other Data Types:
• User-Defined Data Types (UDTs):
• NumPy arrays come equipped with various attributes that provide valuable insights into their
structure, data types, and memory usage.
• Key Attributes:
• ndim: Represents the number of dimensions (axes) in the array.
• A 1D array has ndim = 1.
• A 2D array (matrix) has ndim = 2.
• Higher-dimensional arrays have correspondingly higher ndim values.
• shape: This is a tuple that specifies the size of the array along each dimension.
• For a 2D array with 3 rows and 4 columns, shape would be (3, 4).
• size: Represents the total number of elements in the array.
• It's the product of elements in the shape tuple.
• dtype: This attribute reveals the data type of the elements in the array (e.g., float64, int32, str).
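The attributes above can be inspected on a small example. np.zeros is used here only as a convenient way to build a 3 x 4 array.

```python
import numpy as np

matrix = np.zeros((3, 4))  # 3 rows, 4 columns, filled with 0.0

print(matrix.ndim)   # 2
print(matrix.shape)  # (3, 4)
print(matrix.size)   # 12, i.e. 3 * 4
print(matrix.dtype)  # float64
```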
• Accessing Elements:
• You can access individual elements using square brackets [] and their
corresponding zero-based indices.
• For a 1D array: arr[index] (e.g., arr[2] accesses the third element).
• For a 2D array: arr[row_index, column_index] (e.g., arr[1, 2] accesses the element
in the second row, third column).
• This lets you retrieve specific values from an array.
• Slicing can be used to select specific rows, columns, or even portions of the array
along any dimension.
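A minimal sketch of element access and slicing on 1D and 2D arrays. The arrays and names are hypothetical sample data.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

third = arr[2]           # third element (indices start at 0)
cell = grid[1, 2]        # second row, third column

first_row = grid[0, :]   # every column of row 0
second_col = grid[:, 1]  # every row of column 1
block = grid[0:2, 1:3]   # rows 0-1, columns 1-2 (end index excluded)

print(third, cell)          # 30 7
print(first_row.tolist())   # [1, 2, 3, 4]
print(second_col.tolist())  # [2, 6, 10]
print(block.tolist())       # [[2, 3], [6, 7]]
```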
• Copy:
• A copy is a new array with its own data in memory.
• Modifications to the copy do not affect the original array.
• Created with .copy()
• View:
• A view is a new array object that refers to the original array's data.
• Changes made to a view are reflected in the original array.
• Created with .view()
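The difference can be checked directly. In this sketch the names are illustrative, and np.shares_memory is used only to confirm which arrays share data.

```python
import numpy as np

original = np.array([1, 2, 3])

independent = original.copy()  # own data buffer
independent[0] = 99            # does not touch the original

alias = original.view()        # shares the original's data buffer
alias[1] = -5                  # also changes original[1]

print(original.tolist())                        # [1, -5, 3]
print(np.shares_memory(original, alias))        # True
print(np.shares_memory(original, independent))  # False
```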
• Advanced Iteration:
• Using np.nditer:
• Provides an efficient and flexible multi-dimensional iterator.
• Allows for advanced control, such as broadcasting, iteration order, and
multi-dimensional indexing.
• Suitable for more complex iteration needs.
• Using ndarray.flat:
• Returns a 1D iterator over the array.
• Simple and easy to use.
• Does not offer advanced control over iteration.
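A small sketch comparing the two iterators on a 2 x 2 array. The array and variable names are hypothetical.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# np.nditer: the iteration order can be controlled
row_major = [int(v) for v in np.nditer(grid)]
col_major = [int(v) for v in np.nditer(grid, order="F")]

# np.nditer with the multi_index flag exposes the position of each element
it = np.nditer(grid, flags=["multi_index"])
positions = []
for value in it:
    positions.append((it.multi_index, int(value)))

# ndarray.flat: a simple 1D iterator in row-major order
flat_values = [int(v) for v in grid.flat]

print(row_major)    # [1, 2, 3, 4]
print(col_major)    # [1, 3, 2, 4]
print(flat_values)  # [1, 2, 3, 4]
```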
• Searching:
• np.where() for finding elements based on a condition
• It returns the indices where the condition is True.
• np.searchsorted() for finding insertion indices
• If you want to find where to insert new elements to maintain sorted order, use
np.searchsorted().
• This is helpful for tasks like merging sorted arrays or finding intervals.
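Both search functions can be sketched on small hypothetical arrays. Note that np.searchsorted assumes its first argument is already sorted.

```python
import numpy as np

values = np.array([4, 9, 1, 7, 9])
# np.where returns a tuple with one index array per dimension
above_five = np.where(values > 5)[0]

sorted_vals = np.array([10, 20, 30, 40])
single = np.searchsorted(sorted_vals, 25)            # where 25 would be inserted
several = np.searchsorted(sorted_vals, [5, 20, 45])  # several values at once

print(above_five.tolist())  # [1, 3, 4]
print(int(single))          # 2
print(several.tolist())     # [0, 1, 4]
```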
Statistical Calculations
• Dispersion:
• np.std() computes the standard deviation, measuring how spread out the data is.
• np.var() calculates the variance, which is the square of the standard deviation.
• Range:
• np.min() and np.max() find the smallest and largest values in the array.
• Percentiles:
• np.percentile() calculates specific percentiles of the data, indicating the value below
which a given percentage of observations fall.
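The dispersion, range, and percentile functions can be tried on a small hypothetical dataset. By default np.std and np.var compute the population statistics; passing ddof=1 gives the sample versions.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample data

std = np.std(data)                 # population standard deviation
var = np.var(data)                 # population variance (std squared)
sample_std = np.std(data, ddof=1)  # sample standard deviation

smallest, largest = np.min(data), np.max(data)
median = np.percentile(data, 50)
q1, q3 = np.percentile(data, [25, 75])

print(std, var)           # 2.0 4.0
print(smallest, largest)  # 2 9
print(median, q1, q3)     # 4.5 4.0 5.5
```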
Pandas
- pip install pandas
• Pandas is an open-source data analysis and manipulation tool, built on top of NumPy.
• The name "Pandas" is derived from "Panel Data," a term for multidimensional
structured data sets, and "Python Data Analysis."
• Created by Wes McKinney in 2008.
• It provides data structures and functions needed to work with structured data
seamlessly.
• Pandas is essential for data science and machine learning tasks due to its powerful
data manipulation capabilities.
• Pandas is well-suited for working with tabular data, such as spreadsheets or SQL
tables.
• It is built on top of the NumPy library which means that a lot of the structures of
NumPy are used or replicated in Pandas.
• Data set cleaning, merging, and joining.
• Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
• Columns can be inserted and deleted from DataFrame and higher-dimensional
objects.
• Powerful group by functionality for performing split-apply-combine operations
on data sets.
• Data Visualization.
• Hashable Objects:
• A hashable object can be used as a dictionary key because its hash value remains
constant during its lifetime.
• Pandas Series
• A one-dimensional labeled array capable of holding data of any type (integer, string,
float, Python objects, etc.).
• Axis labels are collectively called the index.
• Similar to a column in an Excel sheet.
• Labels need not be unique but must be of a hashable type.
• Features
• Supports both integer and label-based indexing.
• Provides various methods for operations involving the index.
• Creating a Series
• Loaded from existing storage (SQL database, CSV file, Excel file).
• Can be created from lists, dictionaries, scalar values, etc.
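A minimal sketch of creating a Series from a list, a dictionary, and a scalar, then reading values by label and by position. All labels and values are hypothetical.

```python
import pandas as pd

from_list = pd.Series([10, 20, 30], index=["a", "b", "c"])
from_dict = pd.Series({"x": 1.5, "y": 2.5})  # dictionary keys become the index
from_scalar = pd.Series(7, index=["p", "q"])  # the scalar is repeated per label

by_label = from_list["b"]          # label-based access
by_position = from_list.iloc[0]    # integer-position access

print(by_label, by_position)       # 20 10
print(list(from_dict.index))       # ['x', 'y']
print(from_scalar.tolist())        # [7, 7]
```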
• Pandas DataFrame
• A two-dimensional data structure with labeled axes (rows and columns).
• Size-mutable, potentially heterogeneous tabular data structure.
• Components
• Data: The actual data stored in the DataFrame.
• Rows: Represent the index (labels) for the rows.
• Columns: Represent the labels for the columns.
• Creating DataFrame
• Loaded from existing storage (SQL database, CSV file, Excel file).
• Can be created from lists, dictionaries, a list of dictionaries, etc.
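A short sketch of building a DataFrame from a dictionary and from a list of dictionaries. The column names, row labels, and values are hypothetical.

```python
import pandas as pd

# From a dictionary of columns, with explicit row labels
df = pd.DataFrame(
    {"Name": ["Asha", "Bikash", "Chandra"], "Age": [25, 32, 41]},
    index=["r1", "r2", "r3"],
)

# From a list of dictionaries (one dictionary per row)
records = pd.DataFrame([{"Name": "Asha", "Age": 25}, {"Name": "Bikash", "Age": 32}])

# Size-mutable: a new column can be added in place
df["City"] = ["Kathmandu", "Pokhara", "Lalitpur"]

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Name', 'Age', 'City']
print(list(df.index))    # ['r1', 'r2', 'r3']
```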
• Column Selection
• Select columns by their name.
• Example: df['column_name']
• Row Selection
• Using df.loc[] to retrieve rows by label.
• Using df.iloc[] to retrieve rows by integer location.
• .loc[]
• .loc[] is a label-based data selection method.
• It is used to select rows and columns based on their labels.
• When using .loc[], the start and end labels in slicing are both inclusive.
• .iloc[]
• .iloc[] is an integer-location based indexing method.
• It is used to select rows and columns based on their integer positions.
• When using .iloc[], the end index in slicing is exclusive.
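The inclusive and exclusive slicing rules can be seen side by side. This sketch uses a hypothetical DataFrame with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Asha", "Bikash", "Chandra", "Dipa"], "Age": [25, 32, 41, 29]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c"]   # labels a, b, c: the end label is included
by_position = df.iloc[0:2]   # positions 0 and 1: the end position is excluded

cell_by_label = df.loc["b", "Age"]
cell_by_position = df.iloc[1, 1]  # same cell, addressed by position

print(len(by_label), len(by_position))  # 3 2
print(cell_by_label, cell_by_position)  # 32 32
```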
.loc[]                                       | .iloc[]
• Label-based                                | • Integer-location based
• Uses row and column labels                 | • Uses row and column integer positions
• Inclusive of both start and end labels     | • Exclusive of end position
• Raises KeyError for invalid labels         | • Raises IndexError for invalid positions
• When labels are known and meaningful       | • When positions are known or for general slicing
• df.loc[0, 'Name'], df.loc[:, 'Age']        | • df.iloc[0, 1], df.iloc[:, 1]
Types of Subsetting
• Index-Based Subsetting
• Using indices or positions to select elements.
• This is common in lists and arrays.
• Example: my_list[2:5]
• Label-Based Subsetting
• Using labels to select elements, rows, or columns.
• This is common in Pandas DataFrames.
• Example: df.loc['row_label', 'column_label']
• Conditional Subsetting
• Selecting elements based on a condition or criteria.
• This is commonly used in DataFrames with boolean indexing.
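The three kinds of subsetting can be sketched together. The list, DataFrame, and threshold below are hypothetical.

```python
import pandas as pd

# Index-based subsetting on a plain list
my_list = [0, 1, 2, 3, 4, 5, 6]
index_based = my_list[2:5]

df = pd.DataFrame(
    {"Name": ["Asha", "Bikash", "Chandra", "Dipa"], "Age": [25, 32, 41, 29]},
    index=["a", "b", "c", "d"],
)

# Label-based subsetting
label_based = df.loc["b", "Age"]

# Conditional subsetting with a boolean mask
older_than_30 = df[df["Age"] > 30]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses
combined = df[(df["Age"] > 26) & (df["Age"] < 40)]

print(index_based)                    # [2, 3, 4]
print(label_based)                    # 32
print(older_than_30["Name"].tolist()) # ['Bikash', 'Chandra']
print(combined["Name"].tolist())      # ['Bikash', 'Dipa']
```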
Matplotlib
- pip install matplotlib
Matplotlib: Introduction
• Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy.
• Powerful Python library for creating visualizations
• Cross-platform compatibility
• It provides an object-oriented API for embedding plots into applications built with
general-purpose GUI toolkits such as Tkinter and wxPython.
• Commonly used for data visualization in scientific and analytic contexts.
• Basic structure of a Matplotlib program
• Importing necessary libraries (matplotlib.pyplot, numpy)
• Creating plots
• Displaying plots
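• Importing necessary libraries (matplotlib.pyplot, numpy)
import matplotlib.pyplot as plt
import numpy as np
• Creating plots
x = np.linspace(0, 10, 100)  # 100 evenly spaced points from 0 to 10
y = np.sin(x)  # any NumPy function works here; np.sin is an assumed example
plt.plot(x, y)
• Displaying plots
plt.show()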
Marker
• Symbols used to represent data points
• Customizing data point appearance
• Different types of markers
• 'o': circle
• 's': square
• '^': triangle (pointing up)
• 'd': thin diamond ('D' for a full diamond)
• '*': star
• '+': plus
• 'x': cross
• How to customize markers
• Color, size, edge color, marker style
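A minimal sketch of customizing markers through keyword arguments of plot(). The data points and the chosen colors are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot(
    [1, 2, 3, 4], [2, 4, 3, 5],    # hypothetical data points
    marker="o",                    # marker style
    markersize=10,                 # marker size in points
    markerfacecolor="tab:orange",  # fill color
    markeredgecolor="black",       # edge color
    linestyle="none",              # draw markers only, no connecting line
)
```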
Lines
• Line plots are perfect for showing how data changes over time or to represent
relationships between variables.
• The matplotlib.lines module contains the Line2D class, which can draw lines with a
variety of line styles, markers, and colors.
• Visualizing trends and relationships
• Creating a basic line plot
• Customizing line styles, colors, and markers
• Solid Line: '-' or 'solid'
• Dashed Line: '--' or 'dashed'
• Dotted Line: ':' or 'dotted'
• Dash-Dot Line: '-.' or 'dashdot'
• Custom
• Loosely Dotted: (0, (1, 10))
• Densely Dashed: (0, (5, 1))
• Dash-Dot-Dotted: (0, (3, 5, 1, 5, 1, 5))
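The named and custom styles above are all passed through the linestyle argument. This sketch draws one vertically shifted sine curve per style; the data and offsets are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
# Four named styles followed by one custom (offset, (on, off)) dash pattern
styles = ["-", "--", ":", "-.", (0, (5, 1))]

fig, ax = plt.subplots()
for offset, style in enumerate(styles):
    # Shift each curve upward so the styles do not overlap
    ax.plot(x, np.sin(x) + offset, linestyle=style, label=str(style))
ax.legend()
```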
Color
• Color is a powerful tool for making your visualizations stand out.
• Enhancing visual appeal with color
• Color palettes and schemes
• Applying color to plots
• Matplotlib offers various ways to control colors in the plots.
• Use colormaps like 'viridis', 'plasma', 'inferno', and 'magma' for scientific
visualizations.
• Consider color blindness when choosing colors.
• In almost all places in matplotlib where a color can be specified by the user it can
be provided as:
• an RGB or RGBA tuple of float values in [0, 1] (e.g., (0.1, 0.2, 0.5) or (0.1, 0.2, 0.5,
0.3))
• a hex RGB or RGBA string (e.g., '#0F0F0F' or '#0F0F0F0F')
• a string representation of a float value in [0, 1] inclusive for gray level (e.g., '0.5')
• one of {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}
• an X11/CSS4 color name
• a name from the xkcd color survey prefixed with 'xkcd:' (e.g., 'xkcd:sky blue')
• one of {'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9'}
• one of {'tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown',
'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan'}, which are the Tableau colors from the
'T10' categorical palette (the default color cycle).
• Hex strings and color names (X11/CSS4, xkcd, and Tableau) are case-insensitive;
'CN' colors are case-sensitive.
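One way to check that these formats are all accepted is to convert them with matplotlib.colors.to_rgba, which returns an (R, G, B, A) tuple of floats. The list of specifications below is only a sample; 'steelblue' is used as an example X11/CSS4 name, and the 'C0' comparison assumes the default color cycle.

```python
from matplotlib import colors as mcolors

specs = [
    (0.1, 0.2, 0.5),       # RGB tuple
    (0.1, 0.2, 0.5, 0.3),  # RGBA tuple
    "#0F0F0F",             # hex RGB string
    "0.5",                 # gray level as a string
    "b",                   # single-letter color
    "steelblue",           # X11/CSS4 name
    "xkcd:sky blue",       # xkcd color survey name
    "C0",                  # first color of the current color cycle
    "tab:blue",            # Tableau color
]

for spec in specs:
    print(spec, "->", mcolors.to_rgba(spec))

gray = mcolors.to_rgba("0.5")
cycle_first = mcolors.to_rgba("C0")
tab_blue = mcolors.to_rgba("tab:blue")
```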
• Colormap:
• A colormap is a range of colors used to map scalar data to colors in
visualizations.
• Colormaps enhance the visualization of data by representing numerical values
as colors
• Commonly used in heatmaps, surface plots, and other types of plots to show
variations and patterns in data.
• Classes of Colormap
• Matplotlib provides various colormaps classified into different categories.
• Main Classes:
• Sequential: For data that progresses from low to high values.
• Diverging: For data that diverges from a central value.
• Qualitative: For categorical data without any specific order.
• Cyclic: For data that repeats periodically.
• Choosing the Right Colormap
• Factors to Consider:
• Nature of Data: Sequential, diverging, qualitative, or cyclic.
• Data Distribution: Whether the data is continuous or categorical.
• Perceptual Uniformity: Colormaps like viridis are designed to be perceptually
uniform.
• Color Vision Deficiency: Choose colormaps that are accessible to colorblind users,
like cividis.
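A short sketch of applying a colormap to scalar data and of sampling colors from one. The 3 x 4 data grid is hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(12).reshape(3, 4)  # hypothetical scalar data, low to high

fig, ax = plt.subplots()
image = ax.imshow(data, cmap="viridis")  # sequential, perceptually uniform
fig.colorbar(image, ax=ax)               # legend mapping values to colors

# A colormap maps numbers in [0, 1] to RGBA tuples
cividis = plt.get_cmap("cividis")
low_color, high_color = cividis(0.0), cividis(1.0)
```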
Label
• Labels provide descriptions for axes and titles for plots.
• Enhances readability and understanding of plots
• Basic Labeling Functions
• xlabel(): Sets the label for the x-axis
• ylabel(): Sets the label for the y-axis
• title(): Sets the title for the plot
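The three labeling functions act on the current axes. The data and label text in this sketch are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # hypothetical sample data

plt.xlabel("Time (s)")           # x-axis label
plt.ylabel("Distance (m)")       # y-axis label
plt.title("Distance over time")  # plot title
```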
Grid Lines
Subplots
• Subplots
• Subplots allow multiple plots in a single figure.
• Use plt.subplot(rows, cols, index) to create subplots.
• Useful for comparing different datasets or visualizations side by side
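A minimal sketch placing two plots side by side in a 1 x 2 layout, with grid lines switched on for the first one via plt.grid(True). The plotted functions are arbitrary examples; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)  # 1 row, 2 columns, first position
plt.plot(x, np.sin(x))
plt.title("sin(x)")
plt.grid(True)        # show grid lines on this subplot

plt.subplot(1, 2, 2)  # 1 row, 2 columns, second position
plt.plot(x, np.cos(x))
plt.title("cos(x)")

plt.tight_layout()    # avoid overlapping titles and tick labels
fig = plt.gcf()
```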
Scatter plots
• A scatter plot is a type of plot or mathematical diagram using Cartesian
coordinates to display values for typically two variables for a set of data.
• If the points are coded (color/shape/size), one additional variable can be
displayed.
• Matplotlib, a popular plotting library in Python, provides a convenient way to
create scatter plots.
• Scatter plots display individual data points.
• Use plt.scatter(x, y) to create scatter plots.
• Customize with parameters like color, size, marker, and alpha.
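A sketch of a scatter plot where point size and color encode extra variables. The random data, seed, and scaling factor are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
x = rng.random(50)
y = rng.random(50)
sizes = 200 * rng.random(50)    # a third variable shown as point size

plt.figure()
points = plt.scatter(
    x, y,
    s=sizes,         # marker sizes
    c=x + y,         # a fourth variable shown as color
    cmap="viridis",  # colormap used for c
    marker="o",
    alpha=0.6,       # partial transparency helps with overlapping points
)
plt.colorbar(points)
```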
Bar Graph
• A bar graph (or bar chart) in Matplotlib is a visual representation of data where
categories are displayed along one axis and their corresponding values are
represented by rectangular bars.
• The length or height of each bar is proportional to the value it represents.
• plt.bar(categories, values)
Customizing the Bar Graph
• We can customize the appearance of the bar graph in various ways:
• Color: Change the color of the bars.
• Width: Adjust the width of the bars.
• Orientation: Create horizontal bar graphs.
• Annotations: Add text annotations to the bars.
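A minimal sketch of a customized bar graph with color, width, and text annotations. The categories and values are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]  # hypothetical categories
values = [5, 8, 3]

fig, ax = plt.subplots()
bars = ax.bar(categories, values, color="tab:green", width=0.5)

# Annotate each bar with its value, centered just above the bar
for bar, value in zip(bars, values):
    ax.text(
        bar.get_x() + bar.get_width() / 2,  # horizontal center of the bar
        value,                              # top of the bar
        str(value),
        ha="center",
        va="bottom",
    )
```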
• Horizontal Bar Graph
• To create a horizontal bar graph, use the barh() function.
• Adding Error Bars
• Error bars can be added to show the variability of data
• This can help indicate the precision of the measurements or the uncertainty in
the data.
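Both ideas can be sketched in one figure: barh() for the horizontal layout, and the yerr argument of bar() for error bars. The values and error sizes are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]  # hypothetical categories
values = [5, 8, 3]
errors = [0.5, 1.0, 0.3]      # hypothetical uncertainty per category

fig, (ax_h, ax_err) = plt.subplots(1, 2, figsize=(8, 3))

# Horizontal bars: categories on the y-axis, bar length shows the value
hbars = ax_h.barh(categories, values)

# Vertical bars with error bars; capsize adds small caps to the error lines
vbars = ax_err.bar(categories, values, yerr=errors, capsize=4)
```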
Histogram
• A histogram represents the distribution of numerical data by grouping data
points into bins (intervals) and displaying the frequency of data points in each
bin.
• It’s particularly useful for understanding the underlying distribution of data.
• To create a histogram in Matplotlib, the function plt.hist() is used.
• Key Parameters: plt.hist(data, bins=30, color='skyblue', edgecolor='black')
• x: The input data, typically a list or array of numbers.
• bins: The number of bins (intervals) or the bin edges.
• range: The lower and upper range of the bins.
• density: If True, the histogram is normalized to form a probability density.
• cumulative: If True, each bin will contain the cumulative frequency up to and
including the current bin.
• color: The color of the bars.
• edgecolor: The color of the bin edges.
• alpha: The transparency level of the bars.
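A sketch showing several of these parameters on randomly generated data. The distribution, seed, and sample size are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the example is repeatable
data = rng.normal(loc=0, scale=1, size=1000)

fig, (ax_count, ax_density) = plt.subplots(1, 2, figsize=(8, 3))

# Raw counts per bin
counts, bin_edges, _ = ax_count.hist(
    data, bins=30, color="skyblue", edgecolor="black", alpha=0.8
)

# density=True rescales the bars so that their total area is 1
density, density_edges, _ = ax_density.hist(data, bins=30, density=True)
total_area = np.sum(density * np.diff(density_edges))
```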
Pie chart
• A pie chart is a circular statistical graphic divided into slices to illustrate
numerical proportions.
• In Matplotlib, the pie() function is used to create pie charts.
• Each slice of the pie chart represents a category's contribution to the whole,
making it easy to compare parts of the dataset.
• Components of a Pie Chart
• Slices: Each slice represents a category's contribution.
• Labels: Text labels to identify each slice.
• Colors: Different colors for each slice.
• Explode: A feature to offset a slice from the pie for emphasis.
• Autopct: A feature to display the percentage value of each slice.
• Startangle: The starting angle for the pie chart.
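The components above map directly onto arguments of pie(). The labels and sizes in this sketch are hypothetical; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt

labels = ["Python", "Java", "C++", "Other"]  # hypothetical categories
sizes = [40, 30, 20, 10]                     # hypothetical shares
explode = (0.1, 0, 0, 0)                     # pull the first slice outward

fig, ax = plt.subplots()
ax.pie(
    sizes,
    labels=labels,
    explode=explode,
    autopct="%1.1f%%",  # show each slice's percentage with one decimal
    startangle=90,      # start drawing from the top of the circle
)
ax.axis("equal")        # keep the pie circular

slice_texts = [t.get_text() for t in ax.texts]
```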
Box plot
• A box plot in Matplotlib is a graphical representation used to summarize the
distribution of a dataset.
• It displays the dataset’s minimum, first quartile (Q1), median, third quartile (Q3), and
maximum values.
• To create a box plot in Matplotlib, the boxplot() function is used.
• Box: The main part of the plot, which extends from Q1 to Q3.
• Median Line: A line inside the box that represents the median (middle value) of the
dataset.
• Whiskers: Lines extending from the box to the smallest and largest values within 1.5
times the interquartile range (IQR) from Q1 and Q3, respectively.
• Outliers: Data points outside the whiskers, often marked as individual points.
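The quartiles and the 1.5 x IQR rule can be computed by hand and compared with what boxplot() draws. The dataset below is hypothetical and includes one deliberately extreme value; call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # 30 is an extreme value

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # whiskers do not extend below this
upper_fence = q3 + 1.5 * iqr  # whiskers do not extend above this

fig, ax = plt.subplots()
parts = ax.boxplot(data)

# Points beyond the fences are drawn individually as outliers ("fliers")
outliers = parts["fliers"][0].get_ydata()

print(q1, median, q3)            # 3.25 5.5 7.75
print(lower_fence, upper_fence)  # -3.5 14.5
print(list(outliers))
```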