NumPy and Pandas Data Manipulation Guide

This document provides an overview of data manipulation tools and software, focusing on NumPy and Pandas. It covers installation procedures, basic operations, and functionalities of NumPy's ndarray, including indexing, slicing, and reshaping. Additionally, it highlights the capabilities of Pandas for data handling and analysis.

Uploaded by

hariharan67889

UNIT – IV

DATA MANIPULATION TOOLS & SOFTWARES: Numpy: Installation - Ndarray - Basic Operations -
Indexing, Slicing, and Iterating - Shape Manipulation - Array Manipulation - Structured Arrays -
Reading and Writing Array Data on Files. Pandas: The pandas Library: An Introduction - Installation -
Introduction to pandas Data Structures - Operations between Data Structures - Function Application
and Mapping - Sorting and Ranking - Correlation and Covariance - "Not a Number" Data -
Hierarchical Indexing and Leveling – Reading and Writing Data: CSV or Text File - HTML Files –
Microsoft Excel Files.

1. INSTALLATION :
What is NumPy?
NumPy (Numerical Python) is a powerful library for numerical computations, especially
with arrays and matrices. It is foundational for scientific computing in Python.

Step-by-Step Installation Instructions

1. Check if NumPy is already installed

Before installing, check if NumPy is already installed on your system:

import numpy
print(numpy.__version__)

If you see a version number, NumPy is already installed. If you get a ModuleNotFoundError,
proceed with installation.

2. Install Using pip (Python Package Installer)

Basic Installation

Open your terminal or command prompt and run:

pip install numpy

If you're using Python 3 specifically:

pip3 install numpy

This downloads and installs the latest stable version from PyPI.

To verify installation:

After installation, run the following in a Python shell or script:

import numpy as np
print(np.__version__)

3. Using pip in a Virtual Environment (Recommended)

To avoid package conflicts, it's best to use a virtual environment:

# Create a virtual environment

python -m venv venv

# Activate the environment

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install numpy in the virtual environment

pip install numpy

4. Install with conda (if using Anaconda or Miniconda)

If you’re using Anaconda/Miniconda:

conda install numpy

conda installs precompiled binaries and manages NumPy's non-Python dependencies (such as the linear algebra libraries) inside the environment.

5. Installing a Specific Version of NumPy

pip install numpy==1.24.4

Replace 1.24.4 with the version you want. You can see available versions here:
[Link]

6. Installing from Source (Advanced)

1. Clone the repository:

git clone [Link]

cd numpy

2. Install build dependencies:

pip install -r [Link]

3. Build and install:

pip install .

Use this only if you need to develop or test the latest source version.
Troubleshooting Installation Issues
Problem                        Solution
Permission denied              Use the --user flag: pip install --user numpy
pip not recognized             Use python -m pip install numpy
Conflicts with other packages  Use a virtual environment, or run pip install --upgrade --force-reinstall numpy
No internet / offline install  Download the .whl file from PyPI and install it: pip install numpy-*.whl

Verifying Installation (Optional)

Run a basic test:

import numpy as np

# Create an array
a = np.array([1, 2, 3])
print("Array:", a)

# Perform operation
print("Mean:", np.mean(a))

Resources
• NumPy Official Site: [Link]
• Installation Docs: [Link]
• PyPI: [Link]
• Source Code: [Link]

2. NDARRAY:

What is ndarray?
In NumPy, every array is an instance of the numpy.ndarray class. It can be 1D, 2D, or multi-dimensional.

Importing NumPy
import numpy as np
Creating an ndarray
1. From a Python List
arr = np.array([1, 2, 3, 4])
print(arr)
print(type(arr)) # <class 'numpy.ndarray'>

2. From a List of Lists (2D Array)

matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)

3. Using NumPy Functions

zeros = np.zeros((3, 3)) # 3x3 array of zeros
ones = np.ones((2, 2)) # 2x2 array of ones
rand = np.random.rand(2, 3) # 2x3 array with random values

Attributes of ndarray
a = np.array([[1, 2, 3], [4, 5, 6]])

print("Array:\n", a)
print("Shape:", a.shape) # (2, 3)
print("Dimensions:", a.ndim) # 2
print("Size:", a.size) # 6
print("Data type:", a.dtype) # int64 (varies by system)

Operations on ndarray
1. Arithmetic Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]

2. Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 3]])

print("Matrix Multiplication:\n", A @ B) # or np.dot(A, B)

Indexing & Slicing

a = np.array([[10, 20, 30], [40, 50, 60]])

print(a[0, 1]) # 20
print(a[:, 1]) # [20 50]
print(a[1, :]) # [40 50 60]

Reshaping & Flattening

a = np.array([[1, 2], [3, 4], [5, 6]])

reshaped = a.reshape((2, 3)) # Change shape to 2x3

flattened = a.flatten() # Convert to 1D

print("Reshaped:\n", reshaped)
print("Flattened:\n", flattened)

Example Use Case

# Calculate mean and standard deviation of random data
data = np.random.normal(0, 1, size=(1000,))
print("Mean:", np.mean(data))
print("Std Dev:", np.std(data))

Summary
Feature Description
ndarray Core array type in NumPy
shape Tuple showing array dimensions
dtype Data type of array elements
ndim Number of dimensions
size Total number of elements
reshape() Changes the shape of the array
flatten() Converts array to 1D
@ or dot() Matrix multiplication

3. BASIC OPERATIONS

Basic Operations with ndarray

NumPy provides vectorized operations: an operation is applied element-wise to whole arrays and runs much faster than an equivalent loop over plain Python lists.

Let’s go step by step:

1. Element-wise Arithmetic Operations
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("Addition:", a + b) # [5 7 9]
print("Subtraction:", a - b) # [-3 -3 -3]
print("Multiplication:", a * b) # [ 4 10 18]
print("Division:", a / b) # [0.25 0.4 0.5 ]
print("Power:", a ** 2) # [1 4 9]

Operations are broadcasted if shapes are compatible (explained below).

2. Scalar Operations

You can perform operations with a scalar (number) on the entire array:

print("Add scalar:", a + 10) # [11 12 13]

print("Multiply scalar:", a * 3) # [3 6 9]

3. Comparison Operations

Returns a boolean array:

print("Equal:", a == b) # [False False False]

print("Greater than:", b > a) # [ True True True]

4. Aggregate Functions (Statistics)

Apply mathematical operations across the array:

arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Sum:", np.sum(arr)) # 21
print("Min:", np.min(arr)) # 1
print("Max:", np.max(arr)) # 6
print("Mean:", np.mean(arr)) # 3.5
print("Standard Deviation:", np.std(arr)) # ~1.7078
print("Sum along axis 0 (columns):", np.sum(arr, axis=0)) # [5 7 9]
print("Sum along axis 1 (rows):", np.sum(arr, axis=1)) # [ 6 15]

5. Dot Product / Matrix Multiplication

x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 0], [1, 3]])

# Matrix multiplication
print("Matrix product:\n", np.dot(x, y)) # or x @ y

6. Transpose of a Matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print("Original:\n", matrix)
print("Transposed:\n", matrix.T)

7. Broadcasting

Broadcasting allows NumPy to perform operations on arrays of different shapes:

a = np.array([1, 2, 3])
b = 10
print("Broadcasted add:", a + b) # [11 12 13]

matrix = np.array([[1, 2, 3], [4, 5, 6]])

vector = np.array([10, 20, 30])

print("Add vector to matrix:\n", matrix + vector)

# Adds [10,20,30] to each row
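
The shape rule behind broadcasting can be checked directly. The following is a minimal sketch, assuming NumPy 1.20 or later (for np.broadcast_shapes); the array names are illustrative and not part of the original notes.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
vector = np.array([10, 20, 30])             # shape (3,)
column = np.array([[100], [200]])           # shape (2, 1)

# Shapes are compared from the trailing dimension backwards; two
# dimensions are compatible when they are equal or one of them is 1.
shape_mv = np.broadcast_shapes(matrix.shape, vector.shape)
shape_mc = np.broadcast_shapes(matrix.shape, column.shape)

row_added = matrix + vector   # vector is added to every row
col_added = matrix + column   # column is added to every column

# Incompatible shapes raise ValueError instead of silently misaligning.
try:
    matrix + np.array([1, 2])   # (2, 3) vs (2,): trailing 3 != 2
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

A length-2 vector cannot be added to a (2, 3) matrix as-is; reshaping it to (2, 1) makes the shapes compatible.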

8. Logical Operations
arr = np.array([1, 2, 3, 4, 5])

# Find where condition is true

print("Elements > 2:", arr[arr > 2]) # [3 4 5]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
print("Even numbers > 1:", arr[(arr > 1) & (arr % 2 == 0)]) # [2 4]

9. Copying vs Viewing Arrays

original = np.array([1, 2, 3])
view = original.view() # Shares data
copy = original.copy() # Creates new array

original[0] = 100

print("Original:", original) # [100 2 3]

print("View:", view) # [100 2 3] - affected
print("Copy:", copy) # [1 2 3] - not affected
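
Whether an indexing operation produced a view or a copy can be checked rather than guessed. A small sketch, using np.shares_memory and the .base attribute; the variable names are illustrative.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5, 6])

sliced = original[1:4]          # basic slicing returns a view
picked = original[[1, 2, 3]]    # integer-array indexing returns a copy

# np.shares_memory reports whether two arrays overlap in memory.
slice_is_view = np.shares_memory(original, sliced)
picked_is_view = np.shares_memory(original, picked)

# .base points at the array that owns the data (None for an owner).
sliced_base_is_original = sliced.base is original

sliced[0] = 99   # writes through to original[1]
```

After the last line, original[1] is 99 while picked still holds the old value.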

10. Reshaping Arrays

a = np.array([1, 2, 3, 4, 5, 6])

reshaped = a.reshape((2, 3))

print("Reshaped:\n", reshaped)

flattened = reshaped.flatten()
print("Flattened:", flattened)

Summary Table
Operation Type Example Description
Arithmetic a + b, a * 2 Element-wise operations
Comparison a > b, a == 3 Returns boolean arrays
Aggregation np.sum(a), np.mean(a) Statistical operations
Dot Product np.dot(a, b) or a @ b Matrix multiplication
Transpose a.T Transpose rows/columns
Reshape a.reshape((r, c)) Change array shape
Flatten a.flatten() Convert to 1D
Broadcasting a + b (different shapes) Auto-expands array dimensions
Indexing/Masking a[a > 2] Filter values by condition

4. INDEXING, SLICING, AND ITERATING

1. Indexing
Indexing allows you to access individual elements in a NumPy array.

1D Array Indexing
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[0]) # First element → 10

print(arr[-1]) # Last element → 50

2D Array Indexing
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2d[0, 0]) # Row 0, Column 0 → 1

print(arr2d[1, 2]) # Row 1, Column 2 → 6

3D Array Indexing
arr3d = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])

print(arr3d[1, 0, 1]) # Output: 6

2. Slicing
Slicing lets you extract subarrays using the syntax:

array[start:stop:step]

1D Slicing
arr = np.array([10, 20, 30, 40, 50])

print(arr[1:4]) # [20 30 40]

print(arr[:3]) # [10 20 30]
print(arr[::2]) # [10 30 50]

2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2d[0:2, 1:]) # Rows 0-1, Cols 1-end

# Output:
# [[2 3]
# [5 6]]

Advanced 2D Example
print(arr2d[:, 1]) # All rows, column 1 → [2 5 8]
print(arr2d[1, :]) # Row 1, all columns → [4 5 6]

3. Iterating
You can iterate over arrays using for loops. But NumPy also supports vectorized operations
which are faster and preferred.

Iterate Over 1D Array

arr = np.array([10, 20, 30])

for x in arr:
    print(x)

Iterate Over 2D Array (Row-wise)

arr2d = np.array([[1, 2], [3, 4]])

for row in arr2d:
    print("Row:", row)

Iterate Over Each Element

Use .flat or np.nditer() to access all elements regardless of dimensions:

for x in arr2d.flat:
    print(x)

Or:

for x in np.nditer(arr2d):
    print(x)
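
When the position of each element is needed as well as its value, np.ndenumerate is one option. The sketch below also shows an explicit loop next to the equivalent vectorized expression; the sum-of-squares task is only an illustration.

```python
import numpy as np

arr2d = np.array([[1, 2], [3, 4]])

# np.ndenumerate yields ((row, col), value) pairs in row-major order.
pairs = [(idx, int(val)) for idx, val in np.ndenumerate(arr2d)]

# The same computation as an explicit loop and as a vectorized expression.
loop_total = 0
for x in arr2d.flat:
    loop_total += int(x) ** 2
vectorized_total = int((arr2d ** 2).sum())
```

Both totals are equal; the vectorized form avoids the Python-level loop.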

Boolean Indexing (Masking)

Select elements using conditions.

arr = np.array([10, 20, 30, 40])

print(arr[arr > 25]) # [30 40]

Example in 2D:
arr2d = np.array([[5, 10], [15, 20]])
mask = arr2d > 10

print(mask)
# [[False False]
# [ True True]]

print(arr2d[mask]) # [15 20]

Advanced Indexing
Indexing with arrays of indices
arr = np.array([10, 20, 30, 40])

indices = [0, 2]
print(arr[indices]) # [10 30]

2D example:
arr2d = np.array([[10, 20], [30, 40], [50, 60]])

rows = np.array([0, 2])

cols = np.array([1, 0])

print(arr2d[rows, cols]) # [20 50]

Summary Table
Feature Example Description
Indexing 1D arr[2] Access 3rd element
Indexing 2D arr[1, 2] Access element at row 1, column 2
Slicing 1D arr[1:4] Elements from index 1 to 3
Slicing 2D arr[0:2, 1:] Slice rows 0-1, cols 1-end
Iterating for x in arr Loop through elements
Flat iteration for x in arr.flat Flatten and iterate
Boolean indexing arr[arr > 10] Filter using a condition
Fancy indexing arr[[0, 2]] Index using another array

5. SHAPE MANIPULATION

1. What is Shape?
The shape of a NumPy array is a tuple that indicates the number of elements along each axis
(dimension).

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape) # (2, 3) → 2 rows, 3 columns

2. Reshape
Changes the shape of the array without changing its data.

a = np.array([1, 2, 3, 4, 5, 6])

reshaped = a.reshape((2, 3)) # 2 rows, 3 columns

print(reshaped)

Notes:

• The total number of elements must remain the same.

• You can use -1 to let NumPy automatically calculate one dimension:

a.reshape((-1, 2)) # NumPy figures out rows → (3, 2)
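
A short sketch of how the -1 placeholder behaves, including the error raised when the element count cannot be divided evenly. The array used here is illustrative.

```python
import numpy as np

a = np.arange(6)                     # [0 1 2 3 4 5]

rows_inferred = a.reshape((-1, 2))   # 6 / 2 = 3 rows
cols_inferred = a.reshape((2, -1))   # 6 / 2 = 3 columns

# A shape that cannot hold exactly a.size elements is rejected.
try:
    a.reshape((4, -1))               # 6 is not divisible by 4
    reshape_failed = False
except ValueError:
    reshape_failed = True
```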

3. Flatten
Converts a multi-dimensional array into a 1D array.
a = np.array([[1, 2], [3, 4]])

flat = a.flatten() # returns a copy

print(flat) # [1 2 3 4]

Difference from ravel():

r = a.ravel() # returns a *view* if possible

Use flatten() if you want a copy, or ravel() for a faster view (no memory duplication).
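
The practical difference shows up when the flattened result is modified. A minimal sketch with illustrative names; it also shows a case where ravel() has to copy.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

flat_copy = a.flatten()   # always a copy
flat_view = a.ravel()     # a view here, because `a` is contiguous

flat_copy[0] = -1         # does not touch `a`
flat_view[1] = 20         # writes through to a[0, 1]

# ravel() has to copy when the data is not contiguous in the
# requested order, for example after a transpose.
transposed_ravel = a.T.ravel()
shares = np.shares_memory(a, transposed_ravel)
```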

4. Transpose
Swaps rows and columns.

a = np.array([[1, 2, 3], [4, 5, 6]])

print("Original:\n", a)
print("Transposed:\n", a.T)

Transposing Higher-Dimensional Arrays

a = np.arange(24).reshape((2, 3, 4)) # Shape: (2 blocks, 3 rows, 4 cols)
print(a.transpose(1, 0, 2).shape) # (3, 2, 4)

5. Expand or Reduce Dimensions

Add a new dimension using np.newaxis or reshape
a = np.array([1, 2, 3])

print(a.shape) # (3,)

a_col = a[:, np.newaxis]

print(a_col.shape) # (3, 1)

a_row = a[np.newaxis, :]
print(a_row.shape) # (1, 3)

np.expand_dims()

a = np.array([1, 2, 3])
b = np.expand_dims(a, axis=0) # (1, 3)
c = np.expand_dims(a, axis=1) # (3, 1)

np.squeeze(): remove dimensions of size 1

a = np.array([[[1, 2, 3]]]) # shape (1, 1, 3)
print(np.squeeze(a).shape) # (3,)

6. Resize vs Reshape
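reshape() returns a new array object (a view of the same data when possible) and leaves the original's shape unchanged:
a = np.array([1, 2, 3, 4])
reshaped = a.reshape((2, 2)) # New array object; a keeps shape (4,)

resize() modifies the array in-place:

a.resize((2, 2)) # Changes the original array
print(a)

Use with caution: when the new shape holds fewer elements, a.resize() drops the trailing data; when it holds more, the extra entries are filled with zeros.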
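
The method a.resize() and the function np.resize() behave differently when the element count changes. A minimal sketch; refcheck=False is passed here only so the example also runs in interactive sessions, and the values are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# ndarray.resize works in place; refcheck=False skips the reference
# check that can fail in interactive sessions and debuggers.
a.resize((2, 3), refcheck=False)      # enlarging pads with zeros

# np.resize returns a new array and repeats the input to fill it.
b = np.resize(np.array([1, 2, 3, 4]), (2, 3))

c = np.array([1, 2, 3, 4])
c.resize((3,), refcheck=False)        # shrinking drops trailing elements
```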

Shape Manipulation Summary

Operation Method Description
Get shape a.shape Returns shape tuple
Reshape a.reshape((r, c)) Change shape, returns new array
Flatten a.flatten() Convert to 1D, returns copy
Ravel a.ravel() Flatten view (faster, may not copy)
Transpose a.T Swap axes for 2D
Axis transpose a.transpose((1, 0, 2)) Custom axis reordering
Add dimension a[:, np.newaxis] Expand 1D → 2D
Expand dims np.expand_dims(a, 1) Add axis at position
Squeeze np.squeeze(a) Remove axes with size 1
Resize (in-place) a.resize((r, c)) Resizes existing array (modifies data)

Example: Full Workflow

a = np.arange(12) # [0 1 2 ... 11]
b = a.reshape((3, 4)) # 3x4 array
c = b.T # Transpose to 4x3
d = c.flatten() # Flatten back to 1D
e = d.reshape((2, 6)) # Reshape to 2x6
print(e)

6. ARRAY MANIPULATION
NumPy Array Manipulation
Manipulating arrays is essential when working with real-world datasets, reshaping models, or
transforming data formats. Let’s break it down:

1. Joining Arrays (Concatenation)

Using np.concatenate()
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

joined = np.concatenate((a, b))

print(joined) # [1 2 3 4 5 6]

2D Arrays — specify axis

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Join along axis 0 (rows)

joined = np.concatenate((a, b), axis=0)
print(joined)

np.vstack() and np.hstack()

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.vstack((a, b)))
# Vertical Stack (2 rows)

print(np.hstack((a, b)))
# Horizontal Stack (1 row)

2. Splitting Arrays
np.split(): split into equal parts
a = np.array([1, 2, 3, 4, 5, 6])
split = np.split(a, 3)
print(split) # [array([1, 2]), array([3, 4]), array([5, 6])]

np.array_split(): can split unevenly

a = np.array([1, 2, 3, 4, 5])
split = np.array_split(a, 3)
print(split) # Uneven: [array([1, 2]), array([3, 4]), array([5])]

2D Split (horizontal/vertical)
a = np.array([[1, 2, 3], [4, 5, 6]])

print(np.hsplit(a, 3)) # Split columns

print(np.vsplit(a, 2)) # Split rows

3. Adding Elements
NumPy arrays are fixed-size, so adding/removing means creating a new array.

np.append()

a = np.array([1, 2, 3])
b = np.append(a, [4, 5])
print(b) # [1 2 3 4 5]

Appending to 2D Arrays
a = np.array([[1, 2], [3, 4]])

# Append new row

print(np.append(a, [[5, 6]], axis=0))

# Append new column

print(np.append(a, [[5], [6]], axis=1))

The shape must match for rows/columns when appending.
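
The sketch below shows what happens when the shapes do not match, what np.append does without an axis, and one common alternative for adding many rows. The list-then-convert pattern is a suggestion, not something the notes prescribe.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# A 1D list has the wrong number of dimensions for axis=0.
try:
    np.append(a, [5, 6], axis=0)
    append_failed = False
except ValueError:
    append_failed = True

# Without axis, both inputs are flattened first.
flattened = np.append(a, [5, 6])

# Each np.append call copies the whole array, so for many additions it
# is usually cheaper to collect rows in a list and convert once.
rows = [[1, 2], [3, 4]]
rows.append([5, 6])
built_once = np.array(rows)
```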

4. Deleting Elements
np.delete()

a = np.array([10, 20, 30, 40])

print(np.delete(a, 2)) # Delete element at index 2 → [10 20 40]

Deleting from 2D Arrays

a = np.array([[1, 2], [3, 4], [5, 6]])

# Delete row at index 1

print(np.delete(a, 1, axis=0))

# Delete column at index 0

print(np.delete(a, 0, axis=1))
5. Inserting Elements
np.insert()

a = np.array([1, 2, 3])

print(np.insert(a, 1, [10])) # [ 1 10  2  3]

Insert into 2D Arrays

a = np.array([[1, 2], [3, 4]])

# Insert row at index 1

print(np.insert(a, 1, [9, 9], axis=0))

# Insert column at index 0

print(np.insert(a, 0, [7, 8], axis=1))

6. Stacking Arrays with a New Axis

np.stack(): joins arrays along a new axis
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

stacked = np.stack((a, b)) # Shape: (2, 3)

stacked_axis1 = np.stack((a, b), axis=1) # Shape: (3, 2)

print(stacked)
print(stacked_axis1)

7. Changing Shape for Manipulation

a = np.array([1, 2, 3, 4, 5, 6])
reshaped = a.reshape((2, 3)) # Needed for 2D manipulation

Use shape manipulation methods (reshape, flatten, etc.) to prepare arrays for advanced
manipulations.

Summary Table
Operation Function Description
Join arrays concatenate, hstack, vstack, stack Combine arrays along axis
Split arrays split, array_split, hsplit, vsplit Break arrays into parts
Add elements append, insert Add elements (returns new array)
Remove elements delete Remove elements by index
Reshape reshape, ravel, flatten Change shape, convert to 1D
Stack along axis stack Join with a new axis

Example Use Case

# Combine two datasets, remove a column, and reshape
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6]])

combined = np.vstack((data1, data2)) # Stack rows

cleaned = np.delete(combined, 1, axis=1) # Remove 2nd column
reshaped = cleaned.reshape((3, 1)) # Reshape to single column
print(reshaped)

7. STRUCTURED ARRAYS

1. What is a Structured Array?

A structured array is an ndarray where each element is a record, and each record can
contain multiple named fields of different types.

2. Creating a Structured Array

Example: A record with name, age, and weight
import numpy as np

# Define a structured data type

person_dtype = np.dtype([
    ('name', 'U10'), # Unicode string of max length 10
    ('age', 'i4'), # 4-byte (int32) integer
    ('weight', 'f4') # 4-byte (float32) float
])

# Create the structured array

people = np.array([
    ('Alice', 25, 55.0),
    ('Bob', 30, 85.5),
    ('Eve', 22, 48.0)
], dtype=person_dtype)

print(people)

Output:
[('Alice', 25, 55. ) ('Bob', 30, 85.5) ('Eve', 22, 48. )]
3. Accessing Fields in Structured Arrays
Access a specific field (column):
print(people['name']) # ['Alice' 'Bob' 'Eve']
print(people['age']) # [25 30 22]

Access a specific row (record):

print(people[1]) # ('Bob', 30, 85.5)

Access a specific value:

print(people[0]['name']) # 'Alice'
print(people[2]['weight']) # 48.0

4. Structured Dtype Formats

Type Code Meaning
'i4' int32 4-byte signed int
'f8' float64 8-byte float
'U20' Unicode string of max 20 characters
'S10' ASCII byte string of max 10 chars

You can also use Python-style types:

np.dtype([('age', int), ('weight', float)])

5. Filtering and Sorting Data

Filtering (Boolean Indexing):
# Get people older than 24
print(people[people['age'] > 24])

Sorting by field:
# Sort by weight
sorted_people = np.sort(people, order='weight')
print(sorted_people)

6. Nested Structured Arrays

You can nest structured types:
dtype = np.dtype([
    ('name', 'U10'),
    ('metrics', [('age', 'i4'), ('weight', 'f4')])
])

a = np.array([
    ('Alice', (25, 55.0)),
    ('Bob', (30, 85.5))
], dtype=dtype)

print(a['metrics']['age']) # [25 30]

7. Converting to and from Structured Arrays

From a list of tuples → structured
data = [('John', 21, 60.5), ('Lucy', 30, 70.0)]

structured = np.array(data, dtype=person_dtype)

Structured → list of records (tuples)

records = people.tolist()
print(records)
# [('Alice', 25, 55.0), ('Bob', 30, 85.5), ('Eve', 22, 48.0)]

8. Saving/Loading Structured Arrays

# Save to binary file
np.save('[Link]', people)

# Load back
loaded = np.load('[Link]')
print(loaded)

9. Summary Table
Feature Method / Syntax Description
Create np.array(..., dtype=...) Define fields and types
Access field arr['name'] Get specific column
Access record arr[i] Get specific row
Filter arr[arr['age'] > 25] Filter with condition
Sort np.sort(arr, order='field') Sort by field
Nested fields dtype=[('a', [('b', int), ('c', float)])] Complex structured data
Save/Load np.save() / np.load() Save and retrieve from disk

Example Use Case

# Find average weight of people older than 25
older = people[people['age'] > 25]
avg_weight = np.mean(older['weight'])

print("Average weight of age > 25:", avg_weight)

8. READING AND WRITING ARRAY DATA ON FILES

NumPy supports saving and loading arrays in various formats:

• Binary files: .npy (single array), .npz (multiple arrays)

• Text files: .txt, .csv, etc.

1. Binary Files with .npy and .npz

Save an array to a .npy file
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

np.save('[Link]', arr)

Load an array from a .npy file

loaded = np.load('[Link]')
print(loaded) # [1 2 3 4 5]

Save multiple arrays to a .npz file

a = np.arange(5)
b = np.linspace(0, 1, 5)

np.savez('[Link]', arr1=a, arr2=b)

Load .npz file

data = np.load('[Link]')
print(data['arr1']) # [0 1 2 3 4]
print(data['arr2']) # [0. 0.25 0.5 0.75 1. ]
2. Text Files: .txt, .csv
Save to text file with np.savetxt()
a = np.array([[1, 2, 3], [4, 5, 6]])

np.savetxt('[Link]', a)

Add a custom delimiter (e.g., CSV):

np.savetxt('[Link]', a, delimiter=',')

Load from text file with np.loadtxt()

b = np.loadtxt('[Link]')
print(b)

Load with delimiter (e.g., CSV):

b = np.loadtxt('[Link]', delimiter=',')

3. Using genfromtxt() for Missing Data

genfromtxt() is more robust than loadtxt() — supports missing values, headers, etc.

data = np.genfromtxt('[Link]', delimiter=',', skip_header=1)

print(data)
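
To see how genfromtxt() treats a missing field, the sketch below reads from an in-memory string instead of a file. The CSV content is hypothetical, and filling_values is one optional way to substitute a default.

```python
import io
import numpy as np

# Hypothetical CSV content; the second data row has an empty field.
csv_text = "a,b,c\n1,2,3\n4,,6\n"

# Missing fields become nan by default.
with_nan = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# filling_values substitutes a chosen value for missing fields.
filled = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", skip_header=1, filling_values=0
)
```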

4. Save/Load Structured Arrays

Save to .npy
dtype = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5)], dtype=dtype)

np.save('[Link]', people)

Load it back
loaded = np.load('[Link]')
print(loaded['name']) # ['Alice' 'Bob']

5. Example: Full Round-Trip

a = np.random.rand(3, 3)
# Save to binary
np.save('my_array.npy', a)

# Save to CSV
np.savetxt('my_array.csv', a, delimiter=',')

# Load both back

bin_loaded = np.load('my_array.npy')
csv_loaded = np.loadtxt('my_array.csv', delimiter=',')

print("Binary:\n", bin_loaded)
print("CSV:\n", csv_loaded)

Comparison: Binary vs Text Formats

Format Function Pros Cons
.npy np.save() Fast, preserves dtype Not human-readable
.npz np.savez() Store multiple arrays Binary, not editable
.txt/.csv np.savetxt() Human-readable, editable Slower, type info lost
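
The "type info lost" entry can be seen in a round trip. A minimal sketch that writes to a temporary directory; the file names are hypothetical.

```python
import os
import tempfile
import numpy as np

ints = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "ints.npy")   # hypothetical file names
    csv_path = os.path.join(tmp, "ints.csv")

    np.save(npy_path, ints)
    np.savetxt(csv_path, ints, delimiter=",", fmt="%d")

    from_npy = np.load(npy_path)                     # dtype restored from the file
    from_csv = np.loadtxt(csv_path, delimiter=",")   # parsed as float64 by default
    from_csv_int = np.loadtxt(csv_path, delimiter=",", dtype=np.int32)
```

The text file stores only digits, so the dtype has to be supplied again when loading if float64 is not wanted.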

Summary Table
Task Function
Save to .npy np.save(filename, array)
Load from .npy np.load(filename)
Save multiple arrays np.savez(filename, arr1=..., arr2=...)
Load .npz np.load(filename)['arr1']
Save to text np.savetxt(filename, array)
Load from text np.loadtxt(filename)
Load with missing data np.genfromtxt(filename)

9. PANDAS: The pandas Library: An Introduction

What is Pandas?
Pandas is a high-level Python library built on NumPy. It provides easy-to-use data structures
and functions for:

• Reading/writing data (CSV, Excel, SQL, JSON, etc.)

• Cleaning, transforming, reshaping, and analyzing data
• Handling missing data
• Time series operations
• Statistical summaries and group operations
The name "Pandas" comes from "Panel Data" — an econometrics term for
multidimensional structured data.

1. Installing Pandas
pip install pandas

2. Core Data Structures

1. Series — 1D Labeled Array
import pandas as pd

data = pd.Series([10, 20, 30, 40])

print(data)

Output:

0 10
1 20
2 30
3 40
dtype: int64

• Like a NumPy array but with labels (index)

• Access with data[0] or data.loc[0] (by label), or data.iloc[0] (by position)

2. DataFrame — 2D Labeled Table (Rows & Columns)

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85.5, 90.0, 88.5]
}

df = pd.DataFrame(data)
print(df)

Output:

Name Age Score

0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5
3. Reading Data from Files
Read from CSV
df = pd.read_csv('[Link]')

Read from Excel

df = pd.read_excel('[Link]')

Read from JSON

df = pd.read_json('[Link]')

You can also read from SQL, HTML, clipboard, and many more formats.

4. Writing Data to Files

df.to_csv('[Link]', index=False)
df.to_excel('[Link]', index=False)
df.to_json('[Link]')

5. Basic Operations on DataFrames

View Data
df.head() # First 5 rows
df.tail(3) # Last 3 rows
df.info() # Structure and non-null count
df.describe() # Summary statistics

Access Columns and Rows

df['Name'] # Access a column
df[['Name', 'Score']] # Multiple columns

df.loc[1] # Access row by label/index

df.iloc[1] # Access row by position

6. Basic Data Analysis

Filtering Rows
df[df['Age'] > 25]
Sorting
df.sort_values(by='Score', ascending=False)

Adding/Modifying Columns
df['Passed'] = df['Score'] >= 60

Dropping Columns/Rows
df.drop(columns=['Age'], inplace=True)
df.drop(index=[0], inplace=True)

7. Grouping and Aggregation

df.groupby('Passed')['Score'].mean()
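
A self-contained sketch of the grouping step, using a small hypothetical table so the result can be checked by hand. The agg() call is an optional extension that computes several statistics at once.

```python
import pandas as pd

# Hypothetical scores; the pass mark of 60 follows the earlier example.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 40],
})
df['Passed'] = df['Score'] >= 60

# One value per group: the mean score of failed and passed students.
mean_by_group = df.groupby('Passed')['Score'].mean()

# Several statistics per group in one call.
summary = df.groupby('Passed')['Score'].agg(['count', 'mean', 'max'])
```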

8. Handling Missing Data

df.isnull() # Detect missing values
df.dropna() # Remove missing rows
df.fillna(0) # Replace with a value

9. Time Series Support

dates = pd.date_range('2023-01-01', periods=5)
ts = pd.Series([1, 2, 3, 4, 5], index=dates)
print(ts)

10. Interoperability with NumPy

You can convert between NumPy arrays and Pandas:

# DataFrame to NumPy
arr = df.values

# Series to NumPy
s = df['Score'].to_numpy()

# NumPy to DataFrame
import numpy as np
a = np.array([[1, 2], [3, 4]])
df2 = pd.DataFrame(a, columns=['A', 'B'])

Summary
Feature Tool/Method
Create Series pd.Series([...])
Create DataFrame pd.DataFrame({...})
Read CSV/Excel/JSON read_csv(), read_excel(), etc.
Write to file to_csv(), to_excel(), etc.
View data head(), info(), describe()
Filter, Sort df[df['col'] > x], sort_values()
Group & Aggregate groupby(), mean(), etc.
Handle missing data dropna(), fillna()
Time series pd.date_range(), Series(dates)

Example: Simple Analysis

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 60],
    'Passed': [True, True, True, False]
})

print("Average score of passed students:")

print(df[df['Passed']]['Score'].mean())

10. INSTALLATION

1. Using pip (Recommended for most users)

For Standard Python (via terminal/command prompt):
pip install pandas

Upgrade pandas to the latest version:

pip install --upgrade pandas

Make sure pip is pointing to your correct Python environment (python -m pip
install pandas is safer in virtual environments).

2. Using conda (For Anaconda/Miniconda users)

conda install pandas

This installs the version compatible with your current conda environment.
3. Verify Installation
After installing, open Python (or a Jupyter Notebook) and run:

import pandas as pd
print(pd.__version__)

If no errors appear and the version prints, you’re good to go.

4. Optional: Install with Jupyter Support

If you’re working in a Jupyter Notebook environment and want to ensure everything works:

pip install pandas jupyter

Or for conda:

conda install pandas notebook

5. Installing in Virtual Environment (Recommended for Clean Projects)
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows

pip install pandas

Troubleshooting Tips
Problem                         Solution
pip not recognized              Use python -m pip install pandas
Conflicts with other libraries  Use a virtual environment or conda environment
Slow install or timeout         Use a mirror: pip install -i [Link] pandas
Installation Summary
Tool Command Use Case
pip pip install pandas Regular Python users
conda conda install pandas Anaconda/Miniconda users
upgrade pip install --upgrade pandas Get latest version
verify import pandas Check successful install

11. INTRODUCTION TO PANDAS DATA STRUCTURES

Introduction to pandas Data Structures

Pandas primarily offers two main data structures for handling data:

1. Series

What is a Series?

• A 1-dimensional labeled array.

• Can hold any data type (integers, floats, strings, Python objects, etc.).
• Each element has an index label associated with it (like row labels).

Creating a Series
import pandas as pd

# Simple list to Series

s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0 10
1 20
2 30
3 40
dtype: int64

Custom Indexing in Series

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

Output:

a 10
b 20
c 30
dtype: int64

Accessing Data in Series

print(s['b']) # 20
print(s.iloc[1]) # 20 (access by integer position; s[1] on a string-labeled Series is deprecated in recent pandas)

2. DataFrame

What is a DataFrame?

• A 2-dimensional labeled data structure (like a table or spreadsheet).

• Consists of rows and columns.
• Each column is essentially a Series.
• Columns can be different data types (e.g., one column int, another float, another
string).

Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85.5, 90.0, 88.5]
}

df = pd.DataFrame(data)
print(df)

Output:

Name Age Score

0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5

Accessing Data in DataFrame

print(df['Name']) # Access a column (returns a Series)
print(df.loc[1]) # Access row with index 1
print(df.iloc[1, 2]) # Access element at row 1, column 2 (90.0)

Quick Comparison
Feature Series DataFrame
Dimensions 1D 2D
Structure Labeled array Table with rows and columns
Data types Homogeneous (usually) Columns can have mixed types
Indexing Single index Row and column indexing

Why Use These?

• Series: Great for single columns or time series data.
• DataFrame: Ideal for structured, tabular data with multiple columns.

12. OPERATIONS BETWEEN DATA STRUCTURES

Operations Between pandas Data Structures

1. Operations on Series
Series behave much like NumPy arrays but with aligned indexing:

Example:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

print(s1 + s2)

Output:

a NaN # 'a' only in s1, so result is NaN

b 6.0 # 2 + 4
c 8.0 # 3 + 5
d NaN # 'd' only in s2, so result is NaN
dtype: float64
Key: Operations align on index labels, not just positions. Missing labels produce NaN.

2. Operations on DataFrames
DataFrames also align on both row and column labels.

Example:
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'B': [7, 8, 9],
    'C': [10, 11, 12]
}, index=['b', 'c', 'd'])

print(df1 + df2)

Output:

A B C
a NaN NaN NaN
b NaN 12.0 NaN
c NaN 14.0 NaN
d NaN NaN NaN

• Rows and columns that don’t match become NaN.

• For overlapping rows and columns, values are added.

3. Broadcasting Between DataFrames and Series

Pandas supports broadcasting when you perform operations between a DataFrame and a
Series.

Example 1: Add a Series along the row index using axis=0

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

s = pd.Series([10, 20, 30], index=[0, 1, 2])

print(df.add(s, axis=0))
Output:

A B
0 11 14
1 22 25
2 33 36

• Here, the index of s is matched against the DataFrame's row index, so each value of s is added across one row. This needs the method form with axis=0; a plain df + s would try to match the index of s against the column labels A and B and produce only NaN.

Example 2: Add a Series along the columns (the default behavior of +)

s = pd.Series([10, 20], index=['A', 'B'])

print(df + s)

Output:

A B
0 11 24
1 12 25
2 13 26

• The index of s matches the DataFrame's column labels, so s is added to every row.
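
The two alignment directions can be compared side by side. A minimal sketch; the row labels x, y, z are hypothetical and chosen so that rows and columns are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

by_column = pd.Series([10, 20], index=['A', 'B'])            # labels match columns
by_row = pd.Series([10, 20, 30], index=['x', 'y', 'z'])      # labels match the row index

# The + operator matches the Series index against the column labels.
default_sum = df + by_column

# To match against the row index, use the method form with axis=0.
row_sum = df.add(by_row, axis=0)

# With +, a row-indexed Series finds no matching column labels,
# so every cell of the result is NaN.
misaligned = df + by_row
```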

4. Arithmetic with fill values

You can specify a fill_value to use instead of NaN during operations.

print(df1.add(df2, fill_value=0))

This treats a value that is missing on one side as 0 before adding. Cells that are missing in both DataFrames are still NaN in the result.
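
A small sketch of what fill_value changes and what it does not. The Series reuse the labels from the earlier example; the two small DataFrames are hypothetical.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

plain = s1 + s2                      # NaN where a label is missing on one side
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['b', 'c'])

# Cells missing from both frames stay NaN even with fill_value.
both = df1.add(df2, fill_value=0)
```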

5. Comparison Operations
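The comparison operators (>, <, ==, ...) do not align labels: they raise a ValueError unless both objects have identical row and column labels. The method forms (.gt(), .lt(), .eq(), ...) align on labels first.

print(df1.gt(df2))

Returns a DataFrame of booleans over the union of both indexes and columns; positions where either side is missing compare as False.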

6. Summary
Operation Type Behavior Example
Series + Series Index-aligned arithmetic s1 + s2
DataFrame + DataFrame Aligns both rows and columns df1 + df2
DataFrame + Series (default) Matches the Series index to column labels and adds it to every row df + s (where s is indexed by column labels)
DataFrame + Series (axis=0) Matches the Series index to row labels and adds it to every column df.add(s, axis=0) (where s is indexed by row labels)
Operations with fill_value Use a default value for labels missing on one side df1.add(df2, fill_value=0)

13. Function Application and Mapping

Pandas: Function Application and Mapping

Pandas provides several ways to apply functions element-wise, row-wise, or column-wise on
Series and DataFrames.

1. Applying Functions on a Series

a) Using .apply() with a function

import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Define a function
def square(x):
    return x ** 2

# Apply function element-wise

s_squared = s.apply(square)
print(s_squared)

Output:

0 1
1 4
2 9
3 16
dtype: int64

b) Using lambda functions

s_doubled = s.apply(lambda x: x * 2)
print(s_doubled)

c) Using vectorized operations (faster!)

For many common functions, just use vectorized operators instead of .apply():

s_plus_one = s + 1
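To make the comparison concrete, the sketch below computes the same result with .apply() and with a vectorized expression. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4])

# .apply() calls the Python function once per element
via_apply = values.apply(lambda x: x ** 2 + 1)

# The vectorized form evaluates the whole Series in one expression
via_vector = values ** 2 + 1

# NumPy universal functions also accept a Series and return a Series
roots = np.sqrt(values)

print(via_apply.tolist() == via_vector.tolist())
print(roots)
```

Both approaches give identical values; the vectorized form avoids a Python-level call per element, which is why it is usually preferred on large data.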

2. Applying Functions on a DataFrame

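a) Applying a function to each element (applymap)

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Square each element in DataFrame
# (applymap is deprecated since pandas 2.1; newer code uses df.map with the same arguments)

df_squared = df.applymap(lambda x: x ** 2)
print(df_squared)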

b) Applying a function along rows or columns (apply)

• axis=0 — apply function to each column (default)

• axis=1 — apply function to each row

# Sum of each column

col_sum = df.apply(sum, axis=0)
print(col_sum)

# Sum of each row

row_sum = df.apply(sum, axis=1)
print(row_sum)

c) Example: Applying a custom function to rows

def range_func(row):
    return row.max() - row.min()

row_range = df.apply(range_func, axis=1)

print(row_range)

3. Mapping Values in a Series

a) Using .map() with a dictionary

You can map values of a Series to new values using a dictionary.

s = pd.Series(['cat', 'dog', 'bird', 'cat'])

mapping = {'cat': 'meow', 'dog': 'woof'}

s_mapped = s.map(mapping)
print(s_mapped)

Output:

0 meow
1 woof
2 NaN
3 meow
dtype: object

Note: Values not in the mapping dictionary become NaN.

b) Using .map() with a function

s_mapped = s.map(lambda x: x.upper())
print(s_mapped)

4. Replacing Values Using .replace()

Similar to .map(), but values that are not listed are left unchanged instead of becoming NaN:

s_replaced = s.replace({'cat': 'feline', 'dog': 'canine'})

print(s_replaced)
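The difference between .map() and .replace() for unlisted values can be sketched as follows. The Series and dictionary reuse the animal example; the fillna() fallback is an addition made for illustration.

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'bird', 'cat'])
sounds = {'cat': 'meow', 'dog': 'woof'}

# map(): 'bird' has no dictionary entry, so it becomes NaN
mapped = animals.map(sounds)

# map() followed by fillna(): fall back to the original value where no entry exists
kept = animals.map(sounds).fillna(animals)

# replace(): unlisted values are simply left as they are
replaced = animals.replace(sounds)

print(mapped)
print(kept)
print(replaced)
```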

5. Vectorized String Methods with .str

For string data, you can apply vectorized string functions:

s = pd.Series(['apple', 'banana', 'cherry'])

print(s.str.upper())        # Uppercase every string
print(s.str.contains('a'))  # True where the string contains 'a'

Summary Table
Method Applies To Functionality
.apply() Series, DataFrame Apply function element-wise or along axis
.applymap() DataFrame only Apply function element-wise on DataFrame
.map() Series Map values using dict or function
.replace() Series, DataFrame Replace specified values
.str Series (strings) Vectorized string operations

14. SORTING AND RANKING

Pandas: Sorting and Ranking

1. Sorting Data
a) Sorting a Series
import pandas as pd

s = pd.Series([4, 2, 8, 1])

# Sort values ascending

print(s.sort_values())

# Sort values descending

print(s.sort_values(ascending=False))

# Sort by index
print(s.sort_index())

b) Sorting a DataFrame

• Sort by one or multiple columns

• Sort rows based on column values

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85.5, 90.0, 88.5]
})

# Sort by Age ascending

print(df.sort_values(by='Age'))

# Sort by Score descending

print(df.sort_values(by='Score', ascending=False))

# Sort by multiple columns: Age ascending, Score descending

print(df.sort_values(by=['Age', 'Score'], ascending=[True, False]))
c) Sorting in-place

To modify the DataFrame without creating a new copy:

df.sort_values(by='Age', inplace=True)

2. Ranking Data
Ranking assigns ranks to data, handling ties as needed.

a) Ranking in a Series
s = pd.Series([100, 200, 100, 300])

print(s.rank()) # Average rank for ties (default)

print(s.rank(method='min')) # Use minimum rank for ties
print(s.rank(method='max')) # Use maximum rank for ties
print(s.rank(method='dense')) # Like 'min', but ranks increase by exactly 1 between groups

Output of s.rank():

0 1.5
1 3.0
2 1.5
3 4.0
dtype: float64
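To see how the tie-breaking methods differ on the same data, the sketch below places them side by side. The 'first' method (ties ranked in order of appearance) is not mentioned above and is included here as an extra.

```python
import pandas as pd

scores = pd.Series([100, 200, 100, 300])

# One column per tie-breaking method
ranks = pd.DataFrame({
    'average': scores.rank(),
    'min': scores.rank(method='min'),
    'max': scores.rank(method='max'),
    'dense': scores.rank(method='dense'),
    'first': scores.rank(method='first'),
})

print(ranks)
```

The two tied values of 100 occupy positions 1 and 2, so they receive 1.5 (average), 1 (min), 2 (max), 1 (dense), or 1 and 2 (first).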

b) Ranking in a DataFrame

Rank along rows or columns.

df = pd.DataFrame({
    'Math': [90, 80, 90],
    'English': [70, 90, 80]
})

# Rank each column

print(df.rank())

# Rank each row

print(df.rank(axis=1))

c) Ranking with ascending/descending order

print(s.rank(ascending=False)) # Largest value gets rank 1

Summary Table
Operation      Description                          Example
sort_values()  Sort by values in Series/DataFrame   df.sort_values(by='Age')
sort_index()   Sort by index                        s.sort_index()
rank()         Assign ranks, handling ties          s.rank(method='min')
inplace param  Modify original object               df.sort_values(by='Age', inplace=True)

15. CORRELATION AND COVARIANCE

1. Covariance
What is Covariance?

• Measures how two variables vary together.

• Positive covariance → variables increase or decrease together.
• Negative covariance → one variable increases when the other decreases.
• Zero covariance → no linear relationship.

Calculate Covariance in pandas

import pandas as pd

data = {
'X': [1, 2, 3, 4, 5],
'Y': [2, 4, 6, 8, 10],
'Z': [5, 4, 3, 2, 1]
}

df = pd.DataFrame(data)

# Covariance matrix of DataFrame columns

cov_matrix = df.cov()
print(cov_matrix)

Output:

X Y Z
X 2.5 5.0 -2.5
Y 5.0 10.0 -5.0
Z -2.5 -5.0 2.5

• Diagonal elements = variances.

• Off-diagonal = covariance between variables.
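As a check on the matrix above, the sample covariance of X and Y can be computed by hand. This is a minimal sketch assuming the usual sample formula with an (n - 1) denominator, which is what pandas uses by default.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])

# Sample covariance: sum of products of deviations from the mean, divided by (n - 1)
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Variance is the covariance of a variable with itself
manual_var_x = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

print(manual_cov)       # matches the X/Y entry of the covariance matrix
print(manual_var_x)     # matches the X/X diagonal entry
```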
2. Correlation
What is Correlation?

• Measures the strength and direction of linear relationship between two variables.
• Values range from -1 to +1:
o +1 = perfect positive linear correlation
o -1 = perfect negative linear correlation
o 0 = no linear correlation
• More interpretable than covariance as it is normalized.

Calculate Correlation in pandas

corr_matrix = df.corr()
print(corr_matrix)

Output:

X Y Z
X 1.000000 1.000000 -1.000000
Y 1.000000 1.000000 -1.000000
Z -1.000000 -1.000000 1.000000

• X and Y are perfectly positively correlated.

• X and Z are perfectly negatively correlated.

Correlation Methods

By default, df.corr() uses Pearson correlation.

Supported methods:

• 'pearson' (default)
• 'kendall' (Kendall Tau)
• 'spearman' (Spearman rank)

Example:

df.corr(method='spearman')
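The relationship between these quantities can be sketched with a small hypothetical frame (columns u and v are assumptions for this example). Pearson correlation is covariance divided by the product of the standard deviations, and Spearman correlation is Pearson correlation computed on ranks.

```python
import pandas as pd

# v grows with u, but not linearly
frame = pd.DataFrame({'u': [1, 2, 3, 4, 5], 'v': [1, 4, 9, 16, 25]})

pearson = frame.corr().loc['u', 'v']
pearson_manual = frame['u'].cov(frame['v']) / (frame['u'].std() * frame['v'].std())

spearman = frame.corr(method='spearman').loc['u', 'v']
spearman_manual = frame.rank().corr().loc['u', 'v']   # Pearson on the ranks

print(pearson, pearson_manual)
print(spearman, spearman_manual)
```

Because v always increases when u increases, the rank-based Spearman value is 1, while the Pearson value is slightly below 1 since the relationship is not a straight line.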

3. Correlation/Covariance between two Series

x = df['X']
y = df['Y']

print(x.corr(y)) # Correlation between x and y
print(x.cov(y))  # Covariance between x and y

Summary
Function          Purpose                                  Example
DataFrame.cov()   Covariance matrix                        df.cov()
DataFrame.corr()  Correlation matrix (Pearson by default)  df.corr()
Series.corr()     Correlation between two Series           x.corr(y)
Series.cov()      Covariance between two Series            x.cov(y)

16. "Not a Number" Data

Pandas: Handling Not a Number (NaN) Data

1. What is NaN?
• NaN stands for "Not a Number".
• It’s the standard missing data marker in pandas (and NumPy).
• Used to represent missing or undefined values in numeric arrays.
• Can also appear in object/string columns.

2. Detecting NaN Values

a) Using isna() or isnull()

Both methods are equivalent and return a boolean mask.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})
print(df.isna())

Output:

A B
0 False True
1 False False
2 True False
3 False False

b) Check if any NaN in entire DataFrame

print(df.isna().any()) # Per column
print(df.isna().any().any()) # Overall

3. Handling NaN Data

a) Removing rows or columns with NaN (dropna())

# Drop rows with any NaN
cleaned_rows = df.dropna()

# Drop columns with any NaN

cleaned_cols = df.dropna(axis=1)

print(cleaned_rows)
print(cleaned_cols)

b) Filling NaN values (fillna())

Replace NaN with a specified value:

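df_filled = df.fillna(0)
print(df_filled)

You can also forward-fill or backward-fill:

df_ffill = df.ffill() # Forward fill: carry the last valid value forward

df_bfill = df.bfill() # Backward fill: use the next valid value

(Older code writes df.fillna(method='ffill'); that form is deprecated since pandas 2.1.)

c) Filling with a different value per column:

df.fillna({'A': 0, 'B': 99})
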
4. Replacing NaN with Interpolation
Useful for time-series or numeric data:

df_interpolated = df.interpolate()
print(df_interpolated)

5. Checking Individual Values for NaN

print(pd.isna(df['A'][2]))  # True: the value is NaN
print(pd.notna(df['A'][1])) # True: the value is 2.0

6. Why NaN is Important

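• NaNs propagate through element-wise calculations (for example, 1 + NaN is NaN), so a missing input cannot silently produce a misleading value.
• Many pandas functions have parameters to ignore or handle NaNs gracefully (for example, sum() and mean() skip NaN by default through skipna=True).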
• Essential to clean or impute missing data for accurate analysis.
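The points above can be illustrated with a short sketch. The Series vals is a made-up example.

```python
import numpy as np
import pandas as pd

vals = pd.Series([1.0, np.nan, 3.0])

shifted = vals + 10                      # element-wise arithmetic: NaN stays NaN
total = vals.sum()                       # reductions skip NaN by default
strict_total = vals.sum(skipna=False)    # ask pandas not to skip: result is NaN
valid_count = vals.count()               # count() counts non-missing values only

print(shifted)
print(total, strict_total, valid_count)
```

Whether to skip, drop, or fill missing values depends on the analysis; the defaults shown here only decide what happens when no choice is made.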

Summary Table
Method            Purpose                             Example
isna()/isnull()   Detect NaN values                   df.isna()
dropna()          Drop rows/columns with NaNs         df.dropna(), df.dropna(axis=1)
fillna()          Fill NaNs with a specified value    df.fillna(0)
ffill()/bfill()   Fill NaNs from neighboring values   df.ffill(), df.bfill()
interpolate()     Fill NaNs via interpolation         df.interpolate()
notna()           Detect non-NaN values               df.notna()

17. HIERARCHICAL INDEXING AND LEVELING

Pandas: Hierarchical Indexing (MultiIndex) and Leveling

1. What is Hierarchical Indexing?

• It allows multiple index levels on rows and/or columns.
• You can think of it as nested indexing.
• Enables working with higher-dimensional data (3D+) in a 2D table.
• Makes grouping and slicing complex datasets easier.

2. Creating a MultiIndex (Hierarchical Index)

a) From arrays:
import pandas as pd

arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]

index = pd.MultiIndex.from_arrays(arrays, names=['letter', 'number'])

s = pd.Series([10, 20, 30, 40], index=index)

print(s)

Output:

letter number
A 1 10
2 20
B 1 30
2 40
dtype: int64

b) From a DataFrame:
df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
})

df = df.set_index(['City', 'Year'])
print(df)

3. Accessing Data in MultiIndex

a) Using .loc[] with tuples:

print(s.loc[('A', 2)]) # Output: 20

b) Slicing across levels:

print(s.loc['A']) # All data for 'A'
print(s.loc[('A', slice(1, 2))]) # Rows for 'A' with number from 1 to 2 (inclusive)
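Selecting on the inner level, or moving a level into columns, can be sketched as below. The methods xs() and unstack() are not covered in the notes above and are shown here as additional, commonly used options.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=['letter', 'number'],
)
data = pd.Series([10, 20, 30, 40], index=idx)

outer = data.loc['A']                 # select on the outer level; index is now 1, 2
inner = data.xs(1, level='number')    # select on the inner level; index is now A, B
wide = data.unstack('number')         # move the 'number' level into columns

print(outer)
print(inner)
print(wide)
```

After unstack(), the data is an ordinary 2D table with letters as rows and numbers as columns.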

4. Index Level Operations

a) Getting levels and names

print(s.index.levels) # Unique values at each level
print(s.index.names) # Names of each level

b) Resetting index levels

df_reset = df.reset_index()
print(df_reset)

c) Swapping levels
s_swapped = s.swaplevel('letter', 'number')
print(s_swapped)

d) Sorting by index levels

s_sorted = s.sort_index(level='number')
print(s_sorted)

5. Aggregation on MultiIndex DataFrames

You can perform aggregation grouped by levels:

df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
}).set_index(['City', 'Year'])

print(df.groupby(level='City').sum())

Summary Table
Operation             Description                          Example
Create MultiIndex     Using arrays or set_index            pd.MultiIndex.from_arrays()
Access with .loc      Access data using tuples or slices   s.loc[('A', 1)]
Reset index           Flatten MultiIndex back to columns   df.reset_index()
Swap index levels     Swap order of index levels           s.swaplevel()
Sort by index levels  Sort by specified index level        s.sort_index(level='number')
Group by index level  Aggregate data based on level        df.groupby(level='City').sum()

18. READING AND WRITING DATA: CSV OR TEXT FILE

Pandas: Reading and Writing CSV or Text Files

1. Reading CSV or Text Files

a) pd.read_csv()

• The most common function to read CSV files (also works with many text files).
• Automatically parses CSV into a DataFrame.

import pandas as pd

# Basic CSV read

df = pd.read_csv('data.csv')

print(df.head())

b) Common Parameters of read_csv

Parameter           Description                              Example
filepath_or_buffer  Path to file or URL                      'data.csv'
sep                 Delimiter (default is comma ,)           sep='\t' for tab-separated
header              Row number(s) to use as column names     header=0 (default), None
names               List of column names to use              names=['A', 'B', 'C']
index_col           Column(s) to set as index                index_col=0
usecols             Return a subset of columns               usecols=['A', 'C']
dtype               Data type for columns                    dtype={'A': int, 'B': float}
na_values           Additional strings to recognize as NaN   na_values=['NA', 'missing']
parse_dates         Parse columns as dates                   parse_dates=['date_column']
skiprows            Number of rows or list of rows to skip   skiprows=1
nrows               Number of rows to read                   nrows=100

Example: Reading a tab-separated file with no header

df = pd.read_csv('data.txt', sep='\t', header=None, names=['A', 'B', 'C'])
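Several of these parameters can be tried without a file on disk by reading from an in-memory buffer. The sketch below assumes a small tab-separated text with made-up columns id, date, and amount.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; 'missing' marks an absent amount
raw = (
    "id\tdate\tamount\n"
    "1\t2024-01-05\t10.5\n"
    "2\t2024-01-06\tmissing\n"
    "3\t2024-01-07\t7.0\n"
)

frame = pd.read_csv(
    io.StringIO(raw),        # any file-like object works in place of a path
    sep='\t',                # tab-separated fields
    index_col='id',          # use the id column as the row index
    parse_dates=['date'],    # convert the date column to datetime
    na_values=['missing'],   # treat the string 'missing' as NaN
)

print(frame)
print(frame.dtypes)
```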

2. Writing Data to CSV or Text Files

a) df.to_csv()

Saves DataFrame to CSV file.

df.to_csv('output.csv', index=False) # index=False to avoid writing row numbers

b) Common Parameters of to_csv

Parameter Description Example

path_or_buf File path or object 'output.csv'
sep Field delimiter sep='\t'
index Write row names (index) index=False
header Write column names header=True
columns Specify columns to write columns=['A', 'B']
mode Write mode, e.g., append ('a') mode='a'
na_rep Representation for missing data na_rep='NA'
compression Compression mode (e.g., 'gzip', 'bz2') compression='gzip'

Example: Writing DataFrame to a tab-separated file without index

df.to_csv('output.tsv', sep='\t', index=False)

3. Reading Large Files in Chunks
You can read large files in chunks to save memory:

chunk_iter = pd.read_csv('large_file.csv', chunksize=1000)

for chunk in chunk_iter:
    print(chunk.head())
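A typical use of chunked reading is to accumulate a result without holding the whole file in memory. The sketch below uses a tiny in-memory CSV with a made-up column named value so that it runs without a real file.

```python
import io
import pandas as pd

# Ten rows with the values 1..10 in a single column
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

total = 0
n_chunks = 0

# Each chunk is an ordinary DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    n_chunks += 1

print(total, n_chunks)
```

Ten rows read four at a time arrive as three chunks (4, 4, and 2 rows).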

Summary Table
Function       Purpose                 Basic Usage
pd.read_csv()  Read CSV/text file      pd.read_csv('data.csv')
df.to_csv()    Write DataFrame to CSV  df.to_csv('output.csv', index=False)
chunksize      Read file in chunks     pd.read_csv('large_file.csv', chunksize=1000)

19. HTML FILES

Pandas: Reading and Writing HTML Files

1. Reading HTML Tables

Pandas can read tables embedded in HTML files or web pages using the pd.read_html()
function.

a) Read tables from a local or web HTML file

import pandas as pd

# From a URL
url = '[Link]

tables = pd.read_html(url)

print(f"Number of tables found: {len(tables)}")

# Access the first table

df = tables[0]
print(df.head())

b) Reading from a local HTML file
tables = pd.read_html('local_file.html')
df = tables[0] # First table in the HTML file

c) Parameters of read_html

Parameter  Description                                                      Example
io         URL, local file path, or file-like object (e.g. io.StringIO)     'local_file.html'
match      String or regex to match table content                           match='GDP' to find tables containing "GDP"
flavor     Parser engine: 'lxml', 'bs4', or 'html5lib'                      flavor='bs4' (by default lxml is tried first)
header     Row to use as header                                             header=0
skiprows   Rows to skip                                                     skiprows=1
attrs      Dict of HTML attributes to match                                 attrs={'class': 'wikitable'}
encoding   Character encoding                                               encoding='utf-8'

2. Writing DataFrames to HTML

a) Save a DataFrame as an HTML table

df.to_html('output.html')

This saves the DataFrame as a basic HTML table.

b) Customizing HTML output

• You can customize the table by specifying parameters:

df.to_html('output.html', index=False, border=0, classes='table table-striped')

• This removes the index column, sets border to 0, and adds CSS classes (good for
Bootstrap styling).
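When no file path is given, to_html() returns the markup as a string, which makes the effect of these parameters easy to inspect. The frame below is a made-up example.

```python
import pandas as pd

table = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85.5, 90.0]})

# No path argument: the HTML is returned as a string instead of being written to disk
html = table.to_html(index=False, border=0, classes='table table-striped')

print(html)
```

The returned string can be embedded in a larger page or written to a file later.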

3. Example: Reading and Writing HTML Table

# Read tables from Wikipedia
tables =
pd.read_html('[Link]
_(United_Nations)')
# Extract first table
df = tables[0]

# Save to local HTML

df.to_html('countries_population.html', index=False)

Notes:
• Reading HTML tables requires a parser library: lxml (tried first by default), or beautifulsoup4 together with html5lib as the fallback. You can install them via:

pip install lxml beautifulsoup4 html5lib

• read_html returns a list of DataFrames since one HTML page can have multiple
tables.

20. MICROSOFT EXCEL FILES

Pandas: Reading and Writing Microsoft Excel Files

1. Reading Excel Files

a) Basic usage with pd.read_excel()

import pandas as pd

# Read Excel file (.xls or .xlsx)

df = pd.read_excel('data.xlsx')

print(df.head())

b) Reading specific sheets

• By default, reads the first sheet.

# Read a specific sheet by name

df_sheet = pd.read_excel('data.xlsx', sheet_name='Sheet2')

# Read a sheet by index (0-based)

df_sheet = pd.read_excel('data.xlsx', sheet_name=0)

c) Reading multiple sheets at once

dfs = pd.read_excel('data.xlsx', sheet_name=None) # Reads all sheets

# dfs is a dictionary with sheet names as keys and DataFrames as values

print(dfs.keys())

d) Common parameters

Parameter Description Example

sheet_name Sheet name, index, list of names/indexes, or None 'Sheet1', [0, 2], None
header Row number to use as column names header=0 (default)
names List of column names to use names=['A', 'B', 'C']
usecols Columns to read usecols='A:C' or [0,2]
skiprows Rows to skip skiprows=2
nrows Number of rows to read nrows=100
dtype Data types for columns dtype={'A': int}

2. Writing to Excel Files

a) Basic write with to_excel()

df.to_excel('output.xlsx', index=False) # index=False to skip row numbers

b) Writing multiple DataFrames to different sheets

with pd.ExcelWriter('output_multi.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

c) Important parameters for to_excel()

Parameter     Description                                 Example
excel_writer  File path or ExcelWriter object             'output.xlsx'
sheet_name    Sheet name                                  sheet_name='Sheet1'
index         Write row index                             index=False
header        Write column headers                        header=True
startrow      Upper left cell row to start writing        startrow=2
startcol      Upper left cell column to start writing     startcol=1
engine        Engine to use ('openpyxl', 'xlsxwriter')    engine='xlsxwriter'

3. Requirements
• To work with Excel files, pandas uses external libraries:
o openpyxl for .xlsx files
o xlrd for .xls files (Note: recent versions of xlrd dropped support for .xlsx)
o xlsxwriter for writing .xlsx files (optional; adds formatting and chart features)

Install them via pip if needed:

pip install openpyxl xlrd xlsxwriter

4. Example: Read, modify, and save Excel file

df = pd.read_excel('sales.xlsx', sheet_name='2023')

# Add a new column

df['Total'] = df['Quantity'] * df['Price']

# Save to new Excel file

df.to_excel('sales_updated.xlsx', index=False)

Summary Table
Function         Purpose                        Basic Usage
pd.read_excel()  Read Excel file                pd.read_excel('data.xlsx')
df.to_excel()    Write DataFrame to Excel file  df.to_excel('output.xlsx')
ExcelWriter      Write multiple sheets          with pd.ExcelWriter('output_multi.xlsx') as writer:
