
NumPy and Pandas Data Manipulation Guide

This document provides an overview of data manipulation tools and software, focusing on NumPy and Pandas. It covers installation procedures, basic operations, and functionalities of NumPy's ndarray, including indexing, slicing, and reshaping. Additionally, it highlights the capabilities of Pandas for data handling and analysis.

Uploaded by

hariharan67889
Copyright
© All Rights Reserved

UNIT – IV

DATA MANIPULATION TOOLS & SOFTWARES: Numpy: Installation - Ndarray - Basic Operations -
Indexing, Slicing, and Iterating - Shape Manipulation - Array Manipulation - Structured Arrays -
Reading and Writing Array Data on Files. Pandas: The pandas Library: An Introduction - Installation -
Introduction to pandas Data Structures - Operations between Data Structures - Function Application
and Mapping - Sorting and Ranking - Correlation and Covariance - “Not a Number” Data -
Hierarchical Indexing and Leveling – Reading and Writing Data: CSV or Text File - HTML Files –
Microsoft Excel Files.

1. INSTALLATION :
What is NumPy?
NumPy (Numerical Python) is a powerful library for numerical computations, especially
with arrays and matrices. It is foundational for scientific computing in Python.

Step-by-Step Installation Instructions


1. Check if NumPy is already installed

Before installing, check if NumPy is already installed on your system:

import numpy
print(numpy.__version__)

If you see a version number, NumPy is already installed. If you get a ModuleNotFoundError,
proceed with installation.

2. Install Using pip (Python Package Installer)


Basic Installation

Open your terminal or command prompt and run:

pip install numpy

If you're using Python 3 specifically:

pip3 install numpy

This downloads and installs the latest stable version from PyPI.

To verify installation:

After installation, go to Python shell or script:

import numpy as np
print(np.__version__)

3. Using pip in a Virtual Environment (Recommended)


To avoid package conflicts, it's best to use a virtual environment:

# Create a virtual environment


python -m venv venv

# Activate the environment


# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install numpy in the virtual environment


pip install numpy

4. Install with conda (if using Anaconda or Miniconda)


If you’re using Anaconda/Miniconda:

conda install numpy

This is often faster because conda installs precompiled binaries optimized for your system.

5. Installing a Specific Version of NumPy


pip install numpy==1.24.4

Replace 1.24.4 with the version you want. You can see available versions here:
[Link]

6. Installing from Source (Advanced)


1. Clone the repository:

git clone [Link]


cd numpy

2. Install build dependencies:

pip install -r [Link]

3. Build and install:

pip install .

Use this only if you need to develop or test the latest source version.
Troubleshooting Installation Issues
Problem                        Solution
Permission denied              Use the --user flag: pip install --user numpy
pip not recognized             Use python -m pip install numpy
Conflicts with other packages  Use a virtual environment, or pip install --upgrade --force-reinstall numpy
No internet / offline install  Download the .whl file from PyPI and install it: pip install numpy-*.whl

Verifying Installation (Optional)


Run a basic test:

import numpy as np

# Create an array
a = np.array([1, 2, 3])
print("Array:", a)

# Perform operation
print("Mean:", np.mean(a))

Resources
• NumPy Official Site: [Link]
• Installation Docs: [Link]
• PyPI: [Link]
• Source Code: [Link]

2. NDARRAY:

What is ndarray?
In NumPy, every array is an instance of the numpy.ndarray class. It can be 1D, 2D, or multi-dimensional.

Importing NumPy
import numpy as np
Creating an ndarray
1. From a Python List
arr = np.array([1, 2, 3, 4])
print(arr)
print(type(arr)) # <class 'numpy.ndarray'>

2. From a List of Lists (2D Array)


matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)

3. Using NumPy Functions


zeros = np.zeros((3, 3)) # 3x3 array of zeros
ones = np.ones((2, 2)) # 2x2 array of ones
rand = np.random.rand(2, 3) # 2x3 array with random values

Attributes of ndarray
a = np.array([[1, 2, 3], [4, 5, 6]])

print("Array:\n", a)
print("Shape:", a.shape) # (2, 3)
print("Dimensions:", a.ndim) # 2
print("Size:", a.size) # 6
print("Data type:", a.dtype) # int64 (varies by system)

Operations on ndarray
1. Arithmetic Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]

2. Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 3]])

print("Matrix Multiplication:\n", A @ B) # or np.dot(A, B)

Indexing & Slicing


a = np.array([[10, 20, 30], [40, 50, 60]])

print(a[0, 1]) # 20
print(a[:, 1]) # [20 50]
print(a[1, :]) # [40 50 60]

Reshaping & Flattening


a = np.array([[1, 2], [3, 4], [5, 6]])

reshaped = a.reshape((2, 3)) # Change shape to 2x3


flattened = a.flatten() # Convert to 1D

print("Reshaped:\n", reshaped)
print("Flattened:\n", flattened)

Example Use Case


# Calculate mean and standard deviation of random data
data = np.random.normal(0, 1, size=(1000,))
print("Mean:", np.mean(data))
print("Std Dev:", np.std(data))

Summary
Feature Description
ndarray Core array type in NumPy
shape Tuple showing array dimensions
dtype Data type of array elements
ndim Number of dimensions
size Total number of elements
reshape() Changes the shape of the array
flatten() Converts array to 1D
@ or dot() Matrix multiplication

3. BASIC OPERATIONS

Basic Operations with ndarray


NumPy provides vectorized operations: arithmetic is applied element-wise to whole arrays in
compiled code, which is typically much faster than looping over plain Python lists.
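To make the contrast concrete, here is a minimal sketch that doubles the same values once with a Python list comprehension and once with a single NumPy expression. The names values, doubled_list, and doubled_arr are illustrative and not part of the original notes.

```python
import numpy as np

values = [1.5, 2.0, 3.5, 4.0]

# Plain Python: an explicit loop (here a list comprehension) over the elements
doubled_list = [v * 2 for v in values]

# NumPy: one vectorized expression applied to the whole array
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
print(np.array_equal(doubled_arr, np.array(doubled_list)))
```

Both approaches produce the same numbers; the vectorized form avoids the per-element Python loop.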

Let’s go step by step:


1. Element-wise Arithmetic Operations
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("Addition:", a + b) # [5 7 9]
print("Subtraction:", a - b) # [-3 -3 -3]
print("Multiplication:", a * b) # [ 4 10 18]
print("Division:", a / b) # [0.25 0.4 0.5 ]
print("Power:", a ** 2) # [1 4 9]

Operations are broadcast if shapes are compatible (explained below).

2. Scalar Operations

You can perform operations with a scalar (number) on the entire array:

print("Add scalar:", a + 10) # [11 12 13]


print("Multiply scalar:", a * 3) # [3 6 9]

3. Comparison Operations

Returns a boolean array:

print("Equal:", a == b) # [False False False]


print("Greater than:", b > a) # [ True True True]

4. Aggregate Functions (Statistics)

Apply mathematical operations across the array:

arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Sum:", np.sum(arr)) # 21
print("Min:", np.min(arr)) # 1
print("Max:", np.max(arr)) # 6
print("Mean:", np.mean(arr)) # 3.5
print("Standard Deviation:", np.std(arr)) # ~1.7078
print("Sum along axis 0 (columns):", np.sum(arr, axis=0)) # [5 7 9]
print("Sum along axis 1 (rows):", np.sum(arr, axis=1)) # [6 15]

5. Dot Product / Matrix Multiplication


x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 0], [1, 3]])

# Matrix multiplication
print("Matrix product:\n", np.dot(x, y)) # or x @ y

6. Transpose of a Matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print("Original:\n", matrix)
print("Transposed:\n", matrix.T)

7. Broadcasting

Broadcasting allows NumPy to perform operations on arrays of different shapes:

a = np.array([1, 2, 3])
b = 10
print("Broadcasted add:", a + b) # [11 12 13]

matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])

print("Add vector to matrix:\n", matrix + vector)


# Adds [10,20,30] to each row
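As a further illustration of the shape rules, the sketch below broadcasts a row vector and a column vector against the same matrix, and shows that incompatible shapes raise an error. The variable names (row, col, by_row, by_col) are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
col = np.array([[100], [200]])              # shape (2, 1)

by_row = matrix + row   # row is stretched across both rows
by_col = matrix + col   # col is stretched across all three columns

print(by_row)
print(by_col)

# Shapes (2, 3) and (2,) do not line up on the trailing axis
try:
    matrix + np.array([1, 2])
except ValueError as exc:
    print("Broadcast failed:", exc)
```

Shapes are compared from the trailing axis backwards; each pair of sizes must be equal or one of them must be 1.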

8. Logical Operations
arr = np.array([1, 2, 3, 4, 5])

# Find where condition is true


print("Elements > 2:", arr[arr > 2]) # [3 4 5]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
print("Even numbers > 1:", arr[(arr > 1) & (arr % 2 == 0)]) # [2 4]

9. Copying vs Viewing Arrays


original = np.array([1, 2, 3])
view = original.view() # Shares data
copy = original.copy() # Creates new array

original[0] = 100

print("Original:", original) # [100 2 3]


print("View:", view) # [100 2 3] - affected
print("Copy:", copy) # [1 2 3] - not affected
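The same distinction shows up in everyday indexing. A minimal sketch, assuming only standard NumPy behavior: basic slices are views, while indexing with a list of integers produces a copy. np.shares_memory is used here to check which is which; the names sliced and picked are illustrative.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
sliced = original[1:4]          # basic slicing returns a view
picked = original[[1, 2, 3]]    # integer-array indexing returns a copy

print(np.shares_memory(original, sliced))   # True
print(np.shares_memory(original, picked))   # False

sliced[0] = 99                  # writes through to original
print(original)                 # [ 1 99  3  4  5]
print(picked)                   # [2 3 4]
```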

10. Reshaping Arrays


a = np.array([1, 2, 3, 4, 5, 6])

reshaped = a.reshape((2, 3))


print("Reshaped:\n", reshaped)

flattened = reshaped.flatten()
print("Flattened:", flattened)

Summary Table
Operation Type Example Description
Arithmetic a + b, a * 2 Element-wise operations
Comparison a > b, a == 3 Returns boolean arrays
Aggregation [Link](a), [Link](a) Statistical operations
Dot Product np.dot(a, b) or a @ b Matrix multiplication
Transpose a.T Transpose rows/columns
Reshape a.reshape((r, c)) Change array shape
Flatten a.flatten() Convert to 1D
Broadcasting a + b (different shapes) Auto-expands array dimensions
Indexing/Masking a[a > 2] Filter values by condition

4. INDEXING, SLICING, AND ITERATING

1. Indexing
Indexing allows you to access individual elements in a NumPy array.

1D Array Indexing
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[0]) # First element → 10


print(arr[-1]) # Last element → 50

2D Array Indexing
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2d[0, 0]) # Row 0, Column 0 → 1


print(arr2d[1, 2]) # Row 1, Column 2 → 6

3D Array Indexing
arr3d = np.array([
[[1, 2], [3, 4]],
[[5, 6], [7, 8]]
])

print(arr3d[1, 0, 1]) # Output: 6

2. Slicing
Slicing lets you extract subarrays using the syntax:

array[start:stop:step]

1D Slicing
arr = np.array([10, 20, 30, 40, 50])

print(arr[1:4]) # [20 30 40]


print(arr[:3]) # [10 20 30]
print(arr[::2]) # [10 30 50]

2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr2d[0:2, 1:]) # Rows 0-1, Cols 1-end


# Output:
# [[2 3]
# [5 6]]

Advanced 2D Example
print(arr2d[:, 1]) # All rows, column 1 → [2 5 8]
print(arr2d[1, :]) # Row 1, all columns → [4 5 6]

3. Iterating
You can iterate over arrays using for loops. But NumPy also supports vectorized operations
which are faster and preferred.

Iterate Over 1D Array


arr = np.array([10, 20, 30])

for x in arr:
    print(x)

Iterate Over 2D Array (Row-wise)


arr2d = np.array([[1, 2], [3, 4]])

for row in arr2d:
    print("Row:", row)

Iterate Over Each Element

Use .flat or np.nditer() to access all elements regardless of dimensions:

for x in arr2d.flat:
    print(x)

Or:

for x in np.nditer(arr2d):
    print(x)

Boolean Indexing (Masking)


Select elements using conditions.

arr = np.array([10, 20, 30, 40])

print(arr[arr > 25]) # [30 40]

Example in 2D:
arr2d = np.array([[5, 10], [15, 20]])
mask = arr2d > 10

print(mask)
# [[False False]
# [ True True]]

print(arr2d[mask]) # [15 20]

Advanced Indexing
Indexing with arrays of indices
arr = np.array([10, 20, 30, 40])

indices = [0, 2]
print(arr[indices]) # [10 30]

2D example:
arr2d = np.array([[10, 20], [30, 40], [50, 60]])

rows = np.array([0, 2])
cols = np.array([1, 0])

print(arr2d[rows, cols]) # [20 50]


Summary Table
Feature Example Description
Indexing 1D arr[2] Access 3rd element
Indexing 2D arr[1, 2] Access element at row 1, column 2
Slicing 1D arr[1:4] Elements from index 1 to 3
Slicing 2D arr[0:2, 1:] Slice rows 0-1, cols 1-end
Iterating for x in arr Loop through elements
Flat iteration for x in arr.flat Flatten and iterate
Boolean indexing arr[arr > 10] Filter using a condition
Fancy indexing arr[[0, 2]] Index using another array

5. SHAPE MANIPULATION

1. What is Shape?
The shape of a NumPy array is a tuple that indicates the number of elements along each axis
(dimension).

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape) # (2, 3) → 2 rows, 3 columns

2. Reshape
Changes the shape of the array without changing its data.

a = np.array([1, 2, 3, 4, 5, 6])

reshaped = a.reshape((2, 3)) # 2 rows, 3 columns


print(reshaped)

Notes:

• The total number of elements must remain the same.


• You can use -1 to let NumPy automatically calculate one dimension:

a.reshape((-1, 2)) # NumPy figures out rows → (3, 2)

3. Flatten
Converts a multi-dimensional array into a 1D array.
a = np.array([[1, 2], [3, 4]])

flat = a.flatten() # returns a *copy*


print(flat) # [1 2 3 4]

Difference from ravel():


r = a.ravel() # returns a *view* if possible

Use flatten() if you want a copy, or ravel() for a faster view (no memory duplication).
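The copy-versus-view difference can be checked directly. The sketch below is an illustration using np.shares_memory; the names flat, rav, and rav_t are chosen here and do not come from the notes.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

flat = a.flatten()   # always a copy
rav = a.ravel()      # a view here, because a is contiguous in memory

print(np.shares_memory(a, flat))   # False
print(np.shares_memory(a, rav))    # True

rav[0] = 100         # writes through to a
print(a[0, 0])       # 100
print(flat[0])       # 1 (the copy is unaffected)

# When a view is not possible (e.g. on a transposed array), ravel() copies
rav_t = a.T.ravel()
print(np.shares_memory(a, rav_t))  # False
```

Because ravel() only returns a view "if possible", code that must not modify the source array is safer with flatten() or an explicit copy().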

4. Transpose
Swaps rows and columns.

a = np.array([[1, 2, 3], [4, 5, 6]])

print("Original:\n", a)
print("Transposed:\n", a.T)

Transposing Higher-Dimensional Arrays


a = np.arange(24).reshape((2, 3, 4)) # Shape: (2 blocks, 3 rows, 4 cols)
print(a.transpose(1, 0, 2).shape) # (3, 2, 4)

5. Expand or Reduce Dimensions


Add a new dimension using np.newaxis or reshape
a = np.array([1, 2, 3])

print(a.shape) # (3,)

a_col = a[:, np.newaxis]


print(a_col.shape) # (3, 1)

a_row = a[np.newaxis, :]
print(a_row.shape) # (1, 3)

np.expand_dims()

a = np.array([1, 2, 3])
b = np.expand_dims(a, axis=0) # (1, 3)
c = np.expand_dims(a, axis=1) # (3, 1)

np.squeeze(): Remove dimensions of size 1


a = np.array([[[1, 2, 3]]]) # shape (1, 1, 3)
print(np.squeeze(a).shape) # (3,)

6. Resize vs Reshape
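reshape() returns a new array object (a view of the same data where possible) and leaves the original's shape unchanged:
a = np.array([1, 2, 3, 4])
reshaped = a.reshape((2, 2)) # New array object; a keeps shape (4,)

resize() modifies the array in-place:


a.resize((2, 2)) # Changes the original array
print(a)

Use with caution: if the new size differs from the old one, a.resize() discards trailing elements when shrinking and pads with zeros when growing.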
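A short check of what happens on a size mismatch, as a sketch rather than a full treatment: the in-place method ndarray.resize pads new slots with zeros, while the function np.resize returns a new array padded with repeated copies of the input. Passing refcheck=False is a choice made for this example so that it also runs in interactive shells that keep extra references to the array.

```python
import numpy as np

# In-place method: growing the array fills the new slots with zeros
a = np.array([1, 2, 3, 4])
a.resize((2, 3), refcheck=False)
print(a)
# [[1 2 3]
#  [4 0 0]]

# Function form: returns a new array and repeats the input to fill it
b = np.array([1, 2, 3, 4])
c = np.resize(b, (2, 3))
print(c)
# [[1 2 3]
#  [4 1 2]]
print(b)   # [1 2 3 4] (unchanged)
```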

Shape Manipulation Summary


Operation Method Description
Get shape a.shape Returns shape tuple
Reshape a.reshape((r, c)) Change shape, returns new array
Flatten a.flatten() Convert to 1D, returns copy
Ravel a.ravel() Flatten view (faster, may not copy)
Transpose a.T Swap axes for 2D
Axis transpose a.transpose((1, 0, 2)) Custom axis reordering
Add dimension a[:, np.newaxis] Expand 1D → 2D
Expand dims np.expand_dims(a, 1) Add axis at position
Squeeze np.squeeze(a) Remove axes with size 1
Resize (in-place) a.resize((r, c)) Resizes existing array (modifies data)

Example: Full Workflow


a = np.arange(12) # [0 1 2 ... 11]
b = a.reshape((3, 4)) # 3x4 array
c = b.T # Transpose to 4x3
d = c.flatten() # Flatten back to 1D
e = d.reshape((2, 6)) # Reshape to 2x6
print(e)

6. ARRAY MANIPULATION
NumPy Array Manipulation
Manipulating arrays is essential when working with real-world datasets, reshaping models, or
transforming data formats. Let’s break it down:

1. Joining Arrays (Concatenation)


Using np.concatenate()
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

joined = np.concatenate((a, b))


print(joined) # [1 2 3 4 5 6]

2D Arrays — specify axis


a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Join along axis 0 (rows)
joined = np.concatenate((a, b), axis=0)
print(joined)

np.vstack() and np.hstack()


a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Vertical stack: 2 rows, shape (2, 3)
print(np.vstack((a, b)))

# Horizontal stack: 1 row, shape (6,)
print(np.hstack((a, b)))

2. Splitting Arrays
np.split(): evenly split
a = np.array([1, 2, 3, 4, 5, 6])
split = np.split(a, 3)
print(split) # [array([1, 2]), array([3, 4]), array([5, 6])]

np.array_split() — can split unevenly


a = np.array([1, 2, 3, 4, 5])
split = np.array_split(a, 3)
print(split) # Uneven: [array([1, 2]), array([3, 4]), array([5])]
2D Split (horizontal/vertical)
a = np.array([[1, 2, 3], [4, 5, 6]])

print(np.hsplit(a, 3)) # Split columns
print(np.vsplit(a, 2)) # Split rows

3. Adding Elements
NumPy arrays are fixed-size, so adding/removing means creating a new array.

np.append()

a = np.array([1, 2, 3])
b = np.append(a, [4, 5])
print(b) # [1 2 3 4 5]

Appending to 2D Arrays
a = np.array([[1, 2], [3, 4]])

# Append new row
print(np.append(a, [[5, 6]], axis=0))

# Append new column
print(np.append(a, [[5], [6]], axis=1))

The shape must match for rows/columns when appending.
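Because every np.append call allocates a new array, the sketch below shows two practical consequences: the original array is left untouched, and when many pieces need joining it is usually simpler to collect them in a list and call np.concatenate once. It also shows that omitting axis flattens the inputs. The variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, [4, 5])
print(a)   # [1 2 3] (unchanged)
print(b)   # [1 2 3 4 5]

# Collect pieces in a Python list, then join once
chunks = [np.array([i, i + 1]) for i in range(0, 6, 2)]
combined = np.concatenate(chunks)
print(combined)   # [0 1 2 3 4 5]

# Without axis, np.append flattens both inputs first
m = np.array([[1, 2], [3, 4]])
flat_result = np.append(m, [[5, 6]])
print(flat_result)   # [1 2 3 4 5 6]
```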

4. Deleting Elements
np.delete()

a = np.array([10, 20, 30, 40])

print(np.delete(a, 2)) # Delete element at index 2 → [10 20 40]

Deleting from 2D Arrays


a = np.array([[1, 2], [3, 4], [5, 6]])

# Delete row at index 1
print(np.delete(a, 1, axis=0))

# Delete column at index 0
print(np.delete(a, 0, axis=1))
5. Inserting Elements
np.insert()

a = np.array([1, 2, 3])

print(np.insert(a, 1, [10])) # [1 10 2 3]

Insert into 2D Arrays


a = np.array([[1, 2], [3, 4]])

# Insert row at index 1
print(np.insert(a, 1, [9, 9], axis=0))

# Insert column at index 0
print(np.insert(a, 0, [7, 8], axis=1))

6. Stacking Arrays with a New Axis


np.stack(): joins arrays along a new axis
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

stacked = np.stack((a, b)) # Shape: (2, 3)
stacked_axis1 = np.stack((a, b), axis=1) # Shape: (3, 2)

print(stacked)
print(stacked_axis1)

7. Changing Shape for Manipulation


a = np.array([1, 2, 3, 4, 5, 6])
reshaped = a.reshape((2, 3)) # Needed for 2D manipulation

Use shape manipulation methods (reshape, flatten, etc.) to prepare arrays for advanced
manipulations.

Summary Table
Operation Function Description
Join arrays concatenate, hstack, vstack, stack Combine arrays along axis
Split arrays split, array_split, hsplit, vsplit Break arrays into parts
Add elements append, insert Add elements (returns new array)
Remove elements delete Remove elements by index
Reshape reshape, ravel, flatten Change shape, convert to 1D
Stack along axis stack Join with a new axis

Example Use Case


# Combine two datasets, remove a column, and reshape
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6]])

combined = np.vstack((data1, data2)) # Stack rows
cleaned = np.delete(combined, 1, axis=1) # Remove 2nd column
reshaped = cleaned.reshape((3, 1)) # Reshape to single column
print(reshaped)

7. STRUCTURED ARRAYS

1. What is a Structured Array?


A structured array is an ndarray where each element is a record, and each record can
contain multiple named fields of different types.

2. Creating a Structured Array


Example: A record with name, age, and weight
import numpy as np

# Define a structured data type


person_dtype = np.dtype([
('name', 'U10'), # Unicode string of max length 10
('age', 'i4'), # 4-byte (int32) integer
('weight', 'f4') # 4-byte (float32) float
])

# Create the structured array


people = np.array([
('Alice', 25, 55.0),
('Bob', 30, 85.5),
('Eve', 22, 48.0)
], dtype=person_dtype)

print(people)

Output:
[('Alice', 25, 55. ) ('Bob', 30, 85.5) ('Eve', 22, 48. )]
3. Accessing Fields in Structured Arrays
Access a specific field (column):
print(people['name']) # ['Alice' 'Bob' 'Eve']
print(people['age']) # [25 30 22]

Access a specific row (record):


print(people[1]) # ('Bob', 30, 85.5)

Access a specific value:


print(people[0]['name']) # 'Alice'
print(people[2]['weight']) # 48.0

4. Structured Dtype Formats


Type Code Meaning
'i4' int32 4-byte signed int
'f8' float64 8-byte float
'U20' Unicode string of max 20 characters
'S10' ASCII byte string of max 10 chars

You can also use Python-style types:

np.dtype([('age', int), ('weight', float)])

5. Filtering and Sorting Data


Filtering (Boolean Indexing):
# Get people older than 24
print(people[people['age'] > 24])

Sorting by field:
# Sort by weight
sorted_people = np.sort(people, order='weight')
print(sorted_people)

6. Nested Structured Arrays


You can nest structured types:
dtype = np.dtype([
('name', 'U10'),
('metrics', [('age', 'i4'), ('weight', 'f4')])
])

a = np.array([
('Alice', (25, 55.0)),
('Bob', (30, 85.5))
], dtype=dtype)

print(a['metrics']['age']) # [25 30]

7. Converting to and from Structured Arrays


From a list of tuples → structured
data = [('John', 21, 60.5), ('Lucy', 30, 70.0)]

structured = np.array(data, dtype=person_dtype)

Structured → regular list of records


records = people.tolist()
print(records)
# [('Alice', 25, 55.0), ('Bob', 30, 85.5), ('Eve', 22, 48.0)]

8. Saving/Loading Structured Arrays


# Save to binary file
np.save('[Link]', people)

# Load back
loaded = np.load('[Link]')
print(loaded)

9. Summary Table
Feature         Method / Syntax                             Description
Create          np.array(..., dtype=...)                    Define fields and types
Access field    arr['name']                                 Get specific column
Access record   arr[i]                                      Get specific row
Filter          arr[arr['age'] > 25]                        Filter with condition
Sort            np.sort(arr, order='field')                 Sort by field
Nested fields   dtype=[('a', [('b', int), ('c', float)])]   Complex structured data
Save/Load       np.save() / np.load()                       Save and retrieve from disk

Example Use Case


# Find average weight of people older than 25
older = people[people['age'] > 25]
avg_weight = np.mean(older['weight'])

print("Average weight of age > 25:", avg_weight)

8. READING AND WRITING ARRAY DATA ON FILES

NumPy supports saving and loading arrays in various formats:

• Binary files: .npy (single array), .npz (multiple arrays)


• Text files: .txt, .csv, etc.

1. Binary Files with .npy and .npz


Save an array to a .npy file
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

np.save('[Link]', arr)

Load an array from a .npy file


loaded = np.load('[Link]')
print(loaded) # [1 2 3 4 5]

Save multiple arrays to a .npz file


a = np.arange(5)
b = np.linspace(0, 1, 5)

np.savez('[Link]', arr1=a, arr2=b)

Load .npz file


data = np.load('[Link]')
print(data['arr1']) # [0 1 2 3 4]
print(data['arr2']) # [0. 0.25 0.5 0.75 1. ]
2. Text Files: .txt, .csv
Save to text file with np.savetxt()
a = np.array([[1, 2, 3], [4, 5, 6]])

np.savetxt('[Link]', a)

Add a custom delimiter (e.g., CSV):

np.savetxt('[Link]', a, delimiter=',')

Load from text file with np.loadtxt()


b = np.loadtxt('[Link]')
print(b)

Load with delimiter (e.g., CSV):

b = np.loadtxt('[Link]', delimiter=',')

3. Using genfromtxt() for Missing Data


genfromtxt() is more robust than loadtxt() — supports missing values, headers, etc.

data = np.genfromtxt('[Link]', delimiter=',', skip_header=1)


print(data)
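To see how a missing field is handled, the sketch below feeds genfromtxt a small in-memory CSV instead of a file on disk. The CSV text and the use of io.StringIO are assumptions made for this example so that it runs without any data file; filling_values is an existing genfromtxt parameter for substituting a default.

```python
import io
import numpy as np

# Hypothetical CSV content with one empty field in the second data row
csv_text = "a,b,c\n1,2,3\n4,,6\n"

# The empty field is read as nan
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(data)

# filling_values substitutes a chosen default for missing fields
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       skip_header=1, filling_values=0)
print(filled)
```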

4. Save/Load Structured Arrays


Save to .npy
dtype = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5)], dtype=dtype)

np.save('[Link]', people)

Load it back
loaded = np.load('[Link]')
print(loaded['name']) # ['Alice' 'Bob']

5. Example: Full Round-Trip


a = np.random.rand(3, 3)
# Save to binary
np.save('my_array.npy', a)

# Save to CSV
np.savetxt('my_array.csv', a, delimiter=',')

# Load both back
bin_loaded = np.load('my_array.npy')
csv_loaded = np.loadtxt('my_array.csv', delimiter=',')

print("Binary:\n", bin_loaded)
print("CSV:\n", csv_loaded)

Comparison: Binary vs Text Formats


Format Function Pros Cons
.npy np.save() Fast, preserves dtype Not human-readable
.npz np.savez() Store multiple arrays Binary, not editable
.txt/.csv np.savetxt() Human-readable, editable Slower, type info lost
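The "preserves dtype" versus "type info lost" distinction can be checked with a small round trip. This sketch writes to in-memory buffers (io.BytesIO and io.StringIO) rather than files, purely so that it runs without touching the disk; that choice is an assumption of the example, not something the notes prescribe.

```python
import io
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.int32)

# Binary round trip: the dtype is stored with the data
buf = io.BytesIO()
np.save(buf, a)
buf.seek(0)
from_binary = np.load(buf)

# Text round trip: values are written as text and read back as floats
text = io.StringIO()
np.savetxt(text, a, delimiter=",")
text.seek(0)
from_text = np.loadtxt(text, delimiter=",")

print(from_binary.dtype)   # int32
print(from_text.dtype)     # float64
```

If the original type matters after a text round trip, it can be requested explicitly with the dtype argument of np.loadtxt.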

Summary Table
Task Function
Save to .npy np.save(filename, array)
Load from .npy np.load(filename)
Save multiple arrays np.savez(filename, arr1=..., arr2=...)
Load .npz np.load(filename)['arr1']
Save to text np.savetxt(filename, array)
Load from text np.loadtxt(filename)
Load with missing data np.genfromtxt(filename)

9. PANDAS: The pandas Library: An Introduction

What is Pandas?
Pandas is a high-level Python library built on NumPy. It provides easy-to-use data structures
and functions for:

• Reading/writing data (CSV, Excel, SQL, JSON, etc.)


• Cleaning, transforming, reshaping, and analyzing data
• Handling missing data
• Time series operations
• Statistical summaries and group operations
The name "Pandas" comes from "Panel Data" — an econometrics term for
multidimensional structured data.

1. Installing Pandas
pip install pandas

2. Core Data Structures


1. Series — 1D Labeled Array
import pandas as pd

data = pd.Series([10, 20, 30, 40])


print(data)

Output:

0 10
1 20
2 30
3 40
dtype: int64

• Like a NumPy array but with labels (index)


• Access with data[0] or data.loc[0] or data.iloc[0]

2. DataFrame — 2D Labeled Table (Rows & Columns)


import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'Score': [85.5, 90.0, 88.5]
}

df = pd.DataFrame(data)
print(df)

Output:

Name Age Score


0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5
3. Reading Data from Files
Read from CSV
df = pd.read_csv('[Link]')

Read from Excel


df = pd.read_excel('[Link]')

Read from JSON


df = pd.read_json('[Link]')

You can also read from SQL, HTML, clipboard, and many more formats.

4. Writing Data to Files


df.to_csv('[Link]', index=False)
df.to_excel('[Link]', index=False)
df.to_json('[Link]')

5. Basic Operations on DataFrames


View Data
df.head() # First 5 rows
df.tail(3) # Last 3 rows
df.info() # Structure and non-null count
df.describe() # Summary statistics

Access Columns and Rows


df['Name'] # Access a column
df[['Name', 'Score']] # Multiple columns

df.loc[1] # Access row by label/index
df.iloc[1] # Access row by position

6. Basic Data Analysis


Filtering Rows
df[df['Age'] > 25]
Sorting
df.sort_values(by='Score', ascending=False)

Adding/Modifying Columns
df['Passed'] = df['Score'] >= 60

Dropping Columns/Rows
df.drop(columns=['Age'], inplace=True)
df.drop(index=[0], inplace=True)

7. Grouping and Aggregation


df.groupby('Passed')['Score'].mean()

8. Handling Missing Data


df.isnull() # Detect missing values
df.dropna() # Remove missing rows (returns a new DataFrame)
df.fillna(0) # Replace with a value (returns a new DataFrame)

9. Time Series Support


dates = pd.date_range('2023-01-01', periods=5)
ts = pd.Series([1, 2, 3, 4, 5], index=dates)
print(ts)

10. Interoperability with NumPy


You can convert between NumPy arrays and Pandas:

# DataFrame to NumPy
arr = df.values

# Series to NumPy
s = df['Score'].to_numpy()

# NumPy to DataFrame
import numpy as np
a = np.array([[1, 2], [3, 4]])
df2 = pd.DataFrame(a, columns=['A', 'B'])

Summary
Feature Tool/Method
Create Series pd.Series([...])
Create DataFrame pd.DataFrame({...})
Read CSV/Excel/JSON read_csv(), read_excel(), etc.
Write to file to_csv(), to_excel(), etc.
View data head(), info(), describe()
Filter, Sort df[df['col'] > x], sort_values()
Group & Aggregate groupby(), mean(), etc.
Handle missing data dropna(), fillna()
Time series pd.date_range(), Series(dates)

Example: Simple Analysis


df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 92, 78, 60],
'Passed': [True, True, True, False]
})

print("Average score of passed students:")


print(df[df['Passed']]['Score'].mean())

10. INSTALLATION

1. Using pip (Recommended for most users)


For Standard Python (via terminal/command prompt):
pip install pandas

Upgrade pandas to the latest version:


pip install --upgrade pandas

Make sure pip is pointing to your correct Python environment (python -m pip
install pandas is safer in virtual environments).

2. Using conda (For Anaconda/Miniconda users)


conda install pandas

This installs the version compatible with your current conda environment.
3. Verify Installation
After installing, open Python (or a Jupyter Notebook) and run:

import pandas as pd
print(pd.__version__)

If no errors appear and the version prints, you’re good to go.

4. Optional: Install with Jupyter Support


If you’re working in a Jupyter Notebook environment and want to ensure everything works:

pip install pandas jupyter

Or for conda:

conda install pandas notebook

5. Installing in Virtual Environment (Recommended for Clean Projects)
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows

pip install pandas

Troubleshooting Tips
Problem                         Solution
pip not recognized              Use python -m pip install pandas
Conflicts with other libraries  Use a virtual environment or conda environment
Slow install or timeout         Use a mirror: pip install -i [Link] pandas
Installation Summary
Tool Command Use Case
pip pip install pandas Regular Python users
conda conda install pandas Anaconda/Miniconda users
upgrade pip install --upgrade pandas Get latest version
verify import pandas Check successful install

11. INTRODUCTION TO PANDAS DATA STRUCTURES

Introduction to pandas Data Structures


Pandas primarily offers two main data structures for handling data:

1. Series

What is a Series?

• A 1-dimensional labeled array.


• Can hold any data type (integers, floats, strings, Python objects, etc.).
• Each element has an index label associated with it (like row labels).

Creating a Series
import pandas as pd

# Simple list to Series


s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0 10
1 20
2 30
3 40
dtype: int64

Custom Indexing in Series


s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

Output:

a 10
b 20
c 30
dtype: int64

Accessing Data in Series


print(s['b']) # 20
print(s.iloc[1]) # 20 (access by integer position; s[1] on a labeled index is deprecated in recent pandas)

2. DataFrame

What is a DataFrame?

• A 2-dimensional labeled data structure (like a table or spreadsheet).


• Consists of rows and columns.
• Each column is essentially a Series.
• Columns can be different data types (e.g., one column int, another float, another
string).

Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'Score': [85.5, 90.0, 88.5]
}

df = pd.DataFrame(data)
print(df)

Output:

Name Age Score


0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5

Accessing Data in DataFrame


print(df['Name']) # Access a column (returns a Series)
print(df.loc[1]) # Access row with index 1
print(df.iloc[1, 2]) # Access element at row 1, column 2 (90.0)

Quick Comparison
Feature Series DataFrame
Dimensions 1D 2D
Structure Labeled array Table with rows and columns
Data types Homogeneous (usually) Columns can have mixed types
Indexing Single index Row and column indexing

Why Use These?


• Series: Great for single columns or time series data.
• DataFrame: Ideal for structured, tabular data with multiple columns.

12. OPERATIONS BETWEEN DATA STRUCTURES

Operations Between pandas Data Structures

1. Operations on Series
Series behave much like NumPy arrays but with aligned indexing:

Example:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

print(s1 + s2)

Output:

a NaN # 'a' only in s1, so result is NaN


b 6.0 # 2 + 4
c 8.0 # 3 + 5
d NaN # 'd' only in s2, so result is NaN
dtype: float64
Key: Operations align on index labels, not just positions. Missing labels produce NaN.
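When NaN for unmatched labels is not wanted, the method form of the operator accepts a fill value. A minimal sketch using Series.add with fill_value=0, so that a label present on only one side is treated as 0 on the other; the name total is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 5, 6], index=["b", "c", "d"])

# Labels missing from one Series are treated as 0 before adding
total = s1.add(s2, fill_value=0)
print(total)
# a    1.0
# b    6.0
# c    8.0
# d    6.0
# dtype: float64
```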

2. Operations on DataFrames
DataFrames also align on both row and column labels.

Example:
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({
'B': [7, 8, 9],
'C': [10, 11, 12]
}, index=['b', 'c', 'd'])

print(df1 + df2)

Output:

A B C
a NaN NaN NaN
b NaN 12.0 NaN
c NaN 14.0 NaN
d NaN NaN NaN

• Rows and columns that don’t match become NaN.


• For overlapping rows and columns, values are added.

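3. Broadcasting Between DataFrames and Series


Pandas supports broadcasting when you perform operations between a DataFrame and a
Series. With the + operator, the Series index is matched against the DataFrame's column
labels; to match against the row index instead, use the .add() method with axis=0.

Example 1: Add a Series along the row index using axis=0


df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})

s = pd.Series([10, 20, 30], index=[0, 1, 2])

print(df.add(s, axis=0))
Output:

A B
0 11 14
1 22 25
2 33 36

• Here, each value of s is added to every column of the row whose index label it matches.
• Writing df + s here would instead match the index of s (0, 1, 2) against the column labels (A, B), find no overlap, and produce only NaN.

Example 2: Add a Series along the column labels (default behavior)


s = pd.Series([10, 20], index=['A', 'B'])

print(df + s)

Output:

A B
0 11 24
1 12 25
2 13 26

• The index of s matches the DataFrame's columns, so s is broadcast to every row.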
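The two alignment directions can be compared side by side. The sketch below is illustrative (the names per_row, per_col, by_rows, by_cols, and mismatched are chosen here): DataFrame.add with axis=0 matches the Series index against the row index, the + operator matches it against the column labels, and a Series whose index matches no column yields only NaN.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
per_row = pd.Series([10, 20, 30], index=[0, 1, 2])
per_col = pd.Series([10, 20], index=["A", "B"])

by_rows = df.add(per_row, axis=0)   # match Series index to row index
by_cols = df + per_col              # match Series index to column labels

print(by_rows)
print(by_cols)

# per_row's labels (0, 1, 2) match no column, so every cell is NaN
mismatched = df + per_row
print(mismatched.isna().all().all())   # True
```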

4. Arithmetic with fill values


You can specify a fill_value to use instead of NaN during operations.

print(df1.add(df2, fill_value=0))

This treats missing values as 0 instead of NaN, making the operation more intuitive.

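5. Comparison Operations
The comparison operators (>, <, ==) require both DataFrames (or both Series) to carry identical
labels and raise a ValueError otherwise. To compare objects with different labels, use the
flexible methods such as .gt(), .lt(), and .eq(), which align on labels first.

print(df1.gt(df2))

Returns a DataFrame of booleans aligned by index and columns; with .gt(), positions missing from either side come out as False.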

6. Summary
Operation Type               Behavior                                     Example
Series + Series              Index-aligned arithmetic                     s1 + s2
DataFrame + DataFrame        Aligns both rows and columns                 df1 + df2
DataFrame + Series (rows)    Matches Series index to row index (axis=0)   df.add(s, axis=0)
DataFrame + Series (cols)    Matches Series index to columns (default)    df + s
Operations with fill_value   Uses a default value for missing labels      df1.add(df2, fill_value=0)

13. Function Application and Mapping

Pandas: Function Application and Mapping


Pandas provides several ways to apply functions element-wise, row-wise, or column-wise on
Series and DataFrames.

1. Applying Functions on a Series

a) Using .apply() with a function


import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Define a function
def square(x):
    return x ** 2

# Apply function element-wise
s_squared = s.apply(square)
print(s_squared)

Output:

0 1
1 4
2 9
3 16
dtype: int64

b) Using lambda functions


s_doubled = s.apply(lambda x: x * 2)
print(s_doubled)

c) Using vectorized operations (faster!)

For many common functions, just use vectorized operators instead of .apply():

s_plus_one = s + 1
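To make the comparison concrete, here is a small sketch (variable names are illustrative) showing that a vectorized expression gives the same values as .apply() without calling a Python function once per element:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

squared_apply = s.apply(lambda x: x ** 2)   # calls the function once per element
squared_vector = s ** 2                     # one operation on the whole Series

print(squared_vector.tolist())
print(squared_apply.tolist())
```

On a four-element Series the difference is not noticeable; it starts to matter on large data.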

2. Applying Functions on a DataFrame

a) Applying a function to each element (applymap)


df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Square each element in DataFrame
# (pandas 2.1 and later rename this method to DataFrame.map and deprecate applymap)
df_squared = df.applymap(lambda x: x ** 2)
print(df_squared)

b) Applying a function along rows or columns (apply)

• axis=0: apply the function to each column (default)
• axis=1: apply the function to each row

# Sum of each column
col_sum = df.apply(sum, axis=0)
print(col_sum)

# Sum of each row
row_sum = df.apply(sum, axis=1)
print(row_sum)

c) Example: Applying a custom function to rows

def range_func(row):
    return row.max() - row.min()

row_range = df.apply(range_func, axis=1)
print(row_range)

3. Mapping Values in a Series


a) Using .map() with a dictionary

You can map values of a Series to new values using a dictionary.

s = pd.Series(['cat', 'dog', 'bird', 'cat'])

mapping = {'cat': 'meow', 'dog': 'woof'}

s_mapped = s.map(mapping)
print(s_mapped)

Output:

0 meow
1 woof
2 NaN
3 meow
dtype: object

Note: Values not in the mapping dictionary become NaN.

b) Using .map() with a function


s_mapped = s.map(lambda x: x.upper())   # for example, upper-case each value
print(s_mapped)

4. Replacing Values Using .replace()


Similar to .map(), but values that are not listed are left unchanged instead of becoming NaN:

s_replaced = s.replace({'cat': 'feline', 'dog': 'canine'})
print(s_replaced)
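The difference between the two methods shows up on values that have no entry in the dictionary. A minimal sketch (mapped, mapped_keep and replaced are illustrative names):

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird', 'cat'])
mapping = {'cat': 'feline', 'dog': 'canine'}

mapped = s.map(mapping)                   # 'bird' is not a key, so it becomes NaN
mapped_keep = s.map(mapping).fillna(s)    # fall back to the original value
replaced = s.replace(mapping)             # unlisted values are left as they are

print(mapped.tolist())
print(mapped_keep.tolist())
print(replaced.tolist())
```

Use .map() when every value is expected to have a translation and a NaN should flag the ones that do not; use .replace() when only some values need changing.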

5. Vectorized String Methods with .str


For string data, you can apply vectorized string functions:

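s = pd.Series(['apple', 'banana', 'cherry'])

print(s.str.upper())         # for example, upper-case every string
print(s.str.contains('a'))   # for example, True where the string contains 'a'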

Summary Table
Method Applies To Functionality
.apply() Series, DataFrame Apply function element-wise or along axis
.applymap() DataFrame only Apply function element-wise on DataFrame
.map() Series Map values using dict or function
.replace() Series, DataFrame Replace specified values
.str Series (strings) Vectorized string operations

14. SORTING AND RANKING

Pandas: Sorting and Ranking

1. Sorting Data
a) Sorting a Series
import pandas as pd

s = pd.Series([4, 2, 8, 1])

# Sort values ascending
print(s.sort_values())

# Sort values descending
print(s.sort_values(ascending=False))

# Sort by index
print(s.sort_index())

b) Sorting a DataFrame

• Sort by one or multiple columns


• Sort rows based on column values

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85.5, 90.0, 88.5]
})

# Sort by Age ascending
print(df.sort_values(by='Age'))

# Sort by Score descending
print(df.sort_values(by='Score', ascending=False))

# Sort by multiple columns: Age ascending, Score descending
print(df.sort_values(by=['Age', 'Score'], ascending=[True, False]))

c) Sorting in-place

To modify the DataFrame without creating a new copy:

df.sort_values(by='Age', inplace=True)
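One detail worth seeing in code: with inplace=True the method changes the DataFrame and returns None, so its result should not be assigned. The sketch below uses a shortened version of the DataFrame above; result and by_age_desc are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22]})

# inplace=True changes df itself and returns None.
result = df.sort_values(by='Age', inplace=True)
print(result)

# Without inplace, a new sorted DataFrame is returned; ignore_index=True
# renumbers the rows 0, 1, 2 instead of keeping the old labels.
by_age_desc = df.sort_values(by='Age', ascending=False, ignore_index=True)
print(by_age_desc)
```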

2. Ranking Data
Ranking assigns ranks to data, handling ties as needed.

a) Ranking in a Series
s = pd.Series([100, 200, 100, 300])

print(s.rank())                 # Average rank for ties (default)
print(s.rank(method='min'))     # Use minimum rank for ties
print(s.rank(method='max'))     # Use maximum rank for ties
print(s.rank(method='dense'))   # Like 'min', but ranks increase by 1 between groups

Output of the first call (default method='average'):

0 1.5
1 3.0
2 1.5
3 4.0
dtype: float64
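To compare the tie-handling methods side by side, the ranks can be collected into one table. This is only a sketch; the column names are chosen for illustration.

```python
import pandas as pd

s = pd.Series([100, 200, 100, 300])

# The two 100s share positions 1 and 2, so the methods differ only on them
# (and 'dense' also closes the gap that the tie leaves behind).
ranks = pd.DataFrame({
    'value': s,
    'average': s.rank(),
    'min': s.rank(method='min'),
    'max': s.rank(method='max'),
    'dense': s.rank(method='dense'),
})
print(ranks)
```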

b) Ranking in a DataFrame

Rank along rows or columns.

df = pd.DataFrame({
    'Math': [90, 80, 90],
    'English': [70, 90, 80]
})

# Rank each column
print(df.rank())

# Rank each row
print(df.rank(axis=1))

c) Ranking with ascending/descending order

print(df.rank(ascending=False))   # Highest value gets rank 1

Summary Table
Operation        Description                           Example
sort_values()    Sort by values in Series/DataFrame    df.sort_values(by='Age')
sort_index()     Sort by index                         s.sort_index()
rank()           Assign ranks, handling ties           s.rank(method='min')
inplace param    Modify original object                df.sort_values(by='Age', inplace=True)

15. CORRELATION AND COVARIANCE

1. Covariance
What is Covariance?

• Measures how two variables vary together.


• Positive covariance → variables increase or decrease together.
• Negative covariance → one variable increases when the other decreases.
• Zero covariance → no linear relationship.

Calculate Covariance in pandas


import pandas as pd

data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 4, 6, 8, 10],
    'Z': [5, 4, 3, 2, 1]
}

df = pd.DataFrame(data)

# Covariance matrix of DataFrame columns
cov_matrix = df.cov()
print(cov_matrix)

Output:

X Y Z
X 2.5 5.0 -2.5
Y 5.0 10.0 -5.0
Z -2.5 -5.0 2.5

• Diagonal elements = variances.


• Off-diagonal = covariance between variables.
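The entries of the matrix can be checked by hand. The sketch below recomputes the X/Y covariance as the sum of products of deviations from the mean, divided by n - 1 (the sample covariance, which is what pandas reports by default). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]})

# Deviations of each value from its column mean.
x_dev = df['X'] - df['X'].mean()
y_dev = df['Y'] - df['Y'].mean()

# Sample covariance: sum of products of deviations over (n - 1).
manual_cov = (x_dev * y_dev).sum() / (len(df) - 1)
pandas_cov = df['X'].cov(df['Y'])

print(manual_cov, pandas_cov)
```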
2. Correlation
What is Correlation?

• Measures the strength and direction of linear relationship between two variables.
• Values range from -1 to +1:
o +1 = perfect positive linear correlation
o -1 = perfect negative linear correlation
o 0 = no linear correlation
• More interpretable than covariance as it is normalized.

Calculate Correlation in pandas


corr_matrix = df.corr()
print(corr_matrix)

Output:

X Y Z
X 1.000000 1.000000 -1.000000
Y 1.000000 1.000000 -1.000000
Z -1.000000 -1.000000 1.000000

• X and Y are perfectly positively correlated.


• X and Z are perfectly negatively correlated.

Correlation Methods

By default, df.corr() uses Pearson correlation.

Other methods:

• 'pearson' (default)
• 'kendall' (Kendall Tau)
• 'spearman' (Spearman rank)

Example:

df.corr(method='spearman')
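The choice of method matters when a relationship is monotonic but not linear. As an illustrative sketch with made-up data (y is the cube of x), Pearson falls below 1 while Spearman, which works on ranks, is exactly 1:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3          # monotonic but not linear

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']

# Spearman is the Pearson correlation of the ranks.
rank_pearson = df['x'].rank().corr(df['y'].rank())

print(round(pearson, 3), round(spearman, 3), round(rank_pearson, 3))
```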

3. Correlation/Covariance between two Series


x = df['X']
y = df['Y']

print(x.corr(y))   # Correlation between x and y
print(x.cov(y))    # Covariance between x and y

Summary
Function           Purpose                                    Example
DataFrame.cov()    Covariance matrix                          df.cov()
DataFrame.corr()   Correlation matrix (Pearson by default)    df.corr()
Series.corr()      Correlation between two Series             x.corr(y)
Series.cov()       Covariance between two Series              x.cov(y)

16. "Not a Number" Data

Pandas: Handling Not a Number (NaN) Data

1. What is NaN?
• NaN stands for "Not a Number".
• It’s the standard missing data marker in pandas (and NumPy).
• Used to represent missing or undefined values in numeric arrays.
• Can also appear in object/string columns.

2. Detecting NaN Values

a) Using isna() or isnull()

Both methods are equivalent and return a boolean mask.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})
print(df.isna())

Output:

A B
0 False True
1 False False
2 True False
3 False False

b) Check if any NaN in entire DataFrame


print(df.isna().any())         # Per column
print(df.isna().any().any())   # Overall

3. Handling NaN Data

a) Removing rows or columns with NaN (dropna())


# Drop rows with any NaN
cleaned_rows = df.dropna()

# Drop columns with any NaN
cleaned_cols = df.dropna(axis=1)

print(cleaned_rows)
print(cleaned_cols)

b) Filling NaN values (fillna())

Replace NaN with a specified value:

df_filled = df.fillna(0)
print(df_filled)

You can also forward-fill or backward-fill. pandas 2.1 and later deprecate
fillna(method='ffill') and fillna(method='bfill') in favour of the dedicated methods:

df_ffill = df.ffill()   # Forward fill
df_bfill = df.bfill()   # Backward fill

c) Filling with a different value per column:

df.fillna({'A': 0, 'B': 99})

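A short sketch of what forward and backward filling produce on the same data as the df above (forward and backward are illustrative names). Note that a forward fill cannot repair a NaN in the first row, because there is no earlier value to copy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]})

forward = df.ffill()    # copy the last valid value downwards
backward = df.bfill()   # copy the next valid value upwards

print(forward)
print(backward)
```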
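4. Replacing NaN with Interpolation
Useful for time-series or numeric data. The default method is linear, which fills a gap from
the values on either side of it. A NaN at the very start of a column (such as B[0] here) has
no earlier value to interpolate from and stays NaN.

df_interpolated = df.interpolate()
print(df_interpolated)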

5. Checking for NaN in Series/DataFrame


print(pd.isna(df['A'][2]))    # True
print(pd.notna(df['A'][1]))   # True

6. Why NaN is Important


• NaNs propagate in calculations, preventing misleading results.
• Many pandas functions have parameters to ignore or handle NaNs gracefully.
• Essential to clean or impute missing data for accurate analysis.
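The first two points can be seen in a few lines. This is a minimal sketch with made-up values: element-wise arithmetic keeps the NaN, while reductions such as sum() and mean() skip it unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

shifted = s + 1                      # NaN + 1 is still NaN
total_skip = s.sum()                 # NaN is skipped by default
total_strict = s.sum(skipna=False)   # NaN propagates into the result
mean_skip = s.mean()                 # mean of the two valid values

print(shifted.tolist())
print(total_skip, total_strict, mean_skip)
```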

Summary Table
Method               Purpose                                        Example
isna()/isnull()      Detect NaN values                              df.isna()
dropna()             Drop rows/columns with NaNs                    df.dropna(), df.dropna(axis=1)
fillna(), ffill()    Fill NaNs with a value or a neighbouring entry df.fillna(0), df.ffill()
interpolate()        Fill NaNs via interpolation                    df.interpolate()
notna()              Detect non-NaN values                          df.notna()

17. HIERARCHICAL INDEXING AND LEVELING

Pandas: Hierarchical Indexing (MultiIndex) and Leveling

1. What is Hierarchical Indexing?


• It allows multiple index levels on rows and/or columns.
• You can think of it as nested indexing.
• Enables working with higher-dimensional data (3D+) in a 2D table.
• Makes grouping and slicing complex datasets easier.

2. Creating a MultiIndex (Hierarchical Index)

a) From arrays:
import pandas as pd

arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]

index = pd.MultiIndex.from_arrays(arrays, names=['letter', 'number'])

s = pd.Series([10, 20, 30, 40], index=index)
print(s)

Output:

letter  number
A       1         10
        2         20
B       1         30
        2         40
dtype: int64

b) From a DataFrame:
df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
})

df = df.set_index(['City', 'Year'])
print(df)

3. Accessing Data in MultiIndex

a) Using .loc[] with tuples:

print(s.loc[('A', 2)])   # Output: 20

b) Slicing across levels:

print(s.loc['A'])                  # All data for 'A'
print(s.loc[('A', slice(1, 2))])   # Rows for 'A' with number from 1 to 2 (inclusive)
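.loc selects from the outer level inwards. To select on an inner level alone, one option is the xs() (cross-section) method. A minimal sketch, rebuilding the same Series s as above; first_of_each is an illustrative name:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=['letter', 'number'],
)
s = pd.Series([10, 20, 30, 40], index=index)

# Select number == 1 under every letter; the 'number' level is dropped.
first_of_each = s.xs(1, level='number')
print(first_of_each)
```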

4. Index Level Operations

a) Getting levels and labels


print(s.index.levels)   # List of unique values at each level
print(s.index.names)    # Names of each level

b) Resetting index levels


df_reset = df.reset_index()
print(df_reset)

c) Swapping levels
s_swapped = s.swaplevel('letter', 'number')
print(s_swapped)

d) Sorting by index levels


s_sorted = s.sort_index(level='number')
print(s_sorted)

5. Aggregation on MultiIndex DataFrames


You can perform aggregation grouped by levels:

df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
}).set_index(['City', 'Year'])

print(df.groupby(level='City').sum())
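As a sketch of what the grouped result looks like, and of how an index level can be moved into the columns with unstack(), using the same data (city_totals and wide are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95],
}).set_index(['City', 'Year'])

# Total per city (group keys are sorted, so LA comes before NY).
city_totals = df.groupby(level='City').sum()

# Move the inner level 'Year' into the columns: one row per city.
wide = df['Value'].unstack('Year')

print(city_totals)
print(wide)
```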

Summary Table
Operation               Description                           Example
Create MultiIndex       Using arrays or set_index             pd.MultiIndex.from_arrays()
Access with .loc        Access data using tuples or slices    s.loc[('A', 1)]
Reset index             Flatten MultiIndex back to columns    df.reset_index()
Swap index levels       Swap order of index levels            s.swaplevel()
Sort by index levels    Sort by specified index level         s.sort_index(level='number')
Group by index level    Aggregate data based on level         df.groupby(level='City').sum()

18. READING AND WRITING DATA: CSV OR TEXT FILE

Pandas: Reading and Writing CSV or Text Files

1. Reading CSV or Text Files

a) pd.read_csv()

• The most common function to read CSV files (also works with many text files).
• Automatically parses CSV into a DataFrame.

import pandas as pd

# Basic CSV read


df = pd.read_csv('[Link]')

print(df.head())   # Preview the first rows

b) Common Parameters of read_csv

Parameter            Description                                 Example
filepath_or_buffer   Path to file or URL                         '[Link]'
sep                  Delimiter (default is comma ,)              sep='\t' for tab-separated
header               Row number(s) to use as column names        header=0 (default), None
names                List of column names to use                 names=['A', 'B', 'C']
index_col            Column(s) to set as index                   index_col=0
usecols              Return a subset of columns                  usecols=['A', 'C']
dtype                Data type for columns                       dtype={'A': int, 'B': float}
na_values            Additional strings to recognize as NaN      na_values=['NA', 'missing']
parse_dates          Parse columns as dates                      parse_dates=['date_column']
skiprows             Number of rows or list of rows to skip      skiprows=1
nrows                Number of rows to read                      nrows=100

Example: Reading a tab-separated file with no header


df = pd.read_csv('[Link]', sep='\t', header=None, names=['A', 'B', 'C'])
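The same parameters can be tried without a file on disk by reading from an in-memory text buffer. The sketch below is illustrative: raw_text is made-up data, and io.StringIO stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical tab-separated text with no header row; 'missing' marks a gap.
raw_text = "1\t2.5\tx\n2\tmissing\ty\n3\t4.5\tz\n"

df = pd.read_csv(
    io.StringIO(raw_text),   # any file-like object works in place of a path
    sep='\t',
    header=None,             # the first line is data, not column names
    names=['A', 'B', 'C'],   # so the column names are supplied here
    na_values=['missing'],   # treat this string as NaN
)
print(df)
print(df.dtypes)
```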

2. Writing Data to CSV or Text Files

a) df.to_csv()

Saves DataFrame to CSV file.

df.to_csv('[Link]', index=False)   # index=False to avoid writing row numbers

b) Common Parameters of to_csv

Parameter Description Example


path_or_buf File path or object '[Link]'
sep Field delimiter sep='\t'
index Write row names (index) index=False
header Write column names header=True
columns Specify columns to write columns=['A', 'B']
mode Write mode, e.g., append ('a') mode='a'
na_rep Representation for missing data na_rep='NA'
compression Compression mode (e.g., 'gzip', 'bz2') compression='gzip'

Example: Writing DataFrame to a tab-separated file without index


df.to_csv('[Link]', sep='\t', index=False)
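To inspect what to_csv would write, the path can be left out, in which case the method returns the text. A minimal sketch with made-up data, combining sep, index and na_rep from the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, 3.5]})

# With no path, to_csv returns the text instead of writing a file.
text = df.to_csv(sep='\t', index=False, na_rep='NA')
print(text)
```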
3. Reading Large Files in Chunks
You can read large files in chunks to save memory:

chunk_iter = pd.read_csv('large_file.csv', chunksize=1000)

for chunk in chunk_iter:
    print(chunk.head())   # Each chunk is a DataFrame of up to 1000 rows
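A typical pattern is to keep only a running result while looping over the chunks. The sketch below is illustrative: a ten-row in-memory CSV stands in for the large file, and the chunk size is set to 4 so that the last chunk is shorter.

```python
import io
import pandas as pd

# Made-up data: a single column 'value' holding 1..10.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

running_total = 0
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))           # rows in this chunk
    running_total += chunk['value'].sum()    # only the total is kept in memory

print(chunk_sizes, running_total)
```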

Summary Table
Function Purpose Basic Usage
pd.read_csv() Read CSV/text file pd.read_csv('[Link]')
df.to_csv() Write DataFrame to CSV df.to_csv('[Link]', index=False)
chunksize Read file in chunks pd.read_csv('[Link]', chunksize=1000)

19. HTML FILES

Pandas: Reading and Writing HTML Files

1. Reading HTML Tables


Pandas can read tables embedded in HTML files or web pages using the pd.read_html()
function.

a) Read tables from a web page


import pandas as pd

# From a URL
url = '[Link]

tables = pd.read_html(url)

print(f"Number of tables found: {len(tables)}")

# Access the first table
df = tables[0]
print(df.head())
b) Reading from a local HTML file
tables = pd.read_html('local_file.html')
df = tables[0] # First table in the HTML file

c) Parameters of read_html

Parameter Description Example


io URL or local file path or string '[Link] '[Link]'
match String or regex to match table content 'GDP' to find tables with GDP keyword
flavor Parser engine: 'lxml', 'bs4' or 'html5lib' 'lxml' (tried first by default, then bs4 with html5lib)
header Row to use as header header=0
skiprows Rows to skip skiprows=1
attrs Dict of HTML attributes to match attrs = {'class': 'wikitable'}
encoding Character encoding 'utf-8'

2. Writing DataFrames to HTML

a) Save a DataFrame as an HTML table


df.to_html('[Link]')

This saves the DataFrame as a basic HTML table.

b) Customizing HTML output

• You can customize the table by specifying parameters:

df.to_html('[Link]', index=False, border=0, classes='table table-striped')

• This removes the index column, sets border to 0, and adds CSS classes (good for
Bootstrap styling).
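To see the generated markup without writing a file, the path can be omitted. This is a small sketch with made-up data; html is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['A', 'B'], 'Population': [100, 200]})  # made-up data

# With no file path, to_html returns the markup as a string.
html = df.to_html(index=False, border=0, classes='table table-striped')
print(html)
```

The classes given here are added to the table's class attribute, so a stylesheet such as Bootstrap can pick them up.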

3. Example: Reading and Writing HTML Table


# Read tables from Wikipedia
tables = pd.read_html('[Link]_(United_Nations)')

# Extract first table
df = tables[0]

# Save to local HTML
df.to_html('countries_population.html', index=False)

Notes:
• Reading HTML tables requires a parser library: lxml, or beautifulsoup4 together with
html5lib. You can install them via:

pip install lxml beautifulsoup4 html5lib

• read_html returns a list of DataFrames since one HTML page can have multiple
tables.

20. MICROSOFT EXCEL FILES

Pandas: Reading and Writing Microsoft Excel Files

1. Reading Excel Files

a) Basic usage with pd.read_excel()


import pandas as pd

# Read Excel file (.xls or .xlsx)


df = pd.read_excel('[Link]')

print(df.head())   # Preview the first rows

b) Reading specific sheets

• By default, reads the first sheet.

# Read a specific sheet by name


df_sheet = pd.read_excel('[Link]', sheet_name='Sheet2')

# Read a sheet by index (0-based)


df_sheet = pd.read_excel('[Link]', sheet_name=0)
c) Reading multiple sheets at once
dfs = pd.read_excel('[Link]', sheet_name=None) # Reads all sheets

# dfs is a dictionary with sheet names as keys and DataFrames as values


print(dfs.keys())

d) Common parameters

Parameter Description Example


sheet_name Sheet name, index, list of names/indexes, or None 'Sheet1', [0, 2], None
header Row number to use as column names header=0 (default)
names List of column names to use names=['A', 'B', 'C']
usecols Columns to read usecols='A:C' or [0,2]
skiprows Rows to skip skiprows=2
nrows Number of rows to read nrows=100
dtype Data types for columns dtype={'A': int}

2. Writing to Excel Files

a) Basic write with to_excel()


df.to_excel('[Link]', index=False) # index=False to skip row numbers

b) Writing multiple DataFrames to different sheets


with pd.ExcelWriter('output_multi.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')
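The round trip can be sketched without touching the disk by writing to an in-memory buffer and reading every sheet back with sheet_name=None. This is an illustration under assumptions: the data is made up, openpyxl is assumed as the .xlsx engine, and the block skips the Excel step if that package is not installed.

```python
import importlib.util
import io

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})   # made-up data
df2 = pd.DataFrame({'B': [3, 4]})

sheets = {}
# Only run the Excel step if openpyxl (assumed engine for .xlsx) is available.
if importlib.util.find_spec('openpyxl') is not None:
    buffer = io.BytesIO()            # in-memory stand-in for a file on disk
    with pd.ExcelWriter(buffer, engine='openpyxl') as writer:
        df1.to_excel(writer, sheet_name='Sheet1', index=False)
        df2.to_excel(writer, sheet_name='Sheet2', index=False)

    buffer.seek(0)                   # rewind before reading back
    sheets = pd.read_excel(buffer, sheet_name=None, engine='openpyxl')
    print(list(sheets))              # sheet names, in workbook order
```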

c) Important parameters for to_excel()

Parameter Description Example


excel_writer File path or ExcelWriter object '[Link]'
sheet_name Sheet name 'Sheet1'
index Write row index index=False
header Write column headers header=True
startrow Upper left cell row to start writing startrow=2
startcol Upper left cell column to start writing startcol=1
engine Engine to use ('openpyxl', 'xlsxwriter') 'xlsxwriter'

3. Requirements
• To work with Excel files, pandas uses external libraries:
o openpyxl for .xlsx files
o xlrd for .xls files (Note: recent versions of xlrd dropped support for .xlsx)
o xlsxwriter for writing (optional; supports extra features such as formatting and charts)

Install them via pip if needed:

pip install openpyxl xlrd xlsxwriter

4. Example: Read, modify, and save Excel file


df = pd.read_excel('[Link]', sheet_name='2023')

# Add a new column


df['Total'] = df['Quantity'] * df['Price']

# Save to new Excel file


df.to_excel('sales_updated.xlsx', index=False)

Summary Table
Function Purpose Basic Usage
pd.read_excel() Read Excel file pd.read_excel('[Link]')
df.to_excel() Write DataFrame to Excel file df.to_excel('[Link]')
ExcelWriter Write multiple sheets with pd.ExcelWriter('output_multi.xlsx') as writer:
