UNIT – IV
DATA MANIPULATION TOOLS & SOFTWARES: Numpy: Installation - Ndarray - Basic Operations -
Indexing, Slicing, and Iterating - Shape Manipulation - Array Manipulation - Structured Arrays -
Reading and Writing Array Data on Files. Pandas: The pandas Library: An Introduction - Installation -
Introduction to pandas Data Structures - Operations between Data Structures - Function Application
and Mapping - Sorting and Ranking - Correlation and Covariance - "Not a Number" Data -
Hierarchical Indexing and Leveling – Reading and Writing Data: CSV or Text File - HTML Files –
Microsoft Excel Files.
1. INSTALLATION :
What is NumPy?
NumPy (Numerical Python) is a powerful library for numerical computations, especially
with arrays and matrices. It is foundational for scientific computing in Python.
Step-by-Step Installation Instructions
1. Check if NumPy is already installed
Before installing, check if NumPy is already installed on your system:
import numpy
print(numpy.__version__)
If you see a version number, NumPy is already installed. If you get a ModuleNotFoundError,
proceed with installation.
2. Install Using pip (Python Package Installer)
Basic Installation
Open your terminal or command prompt and run:
pip install numpy
If you're using Python 3 specifically:
pip3 install numpy
This downloads and installs the latest stable version from PyPI.
To verify the installation, run the following in a Python shell or script:
import numpy as np
print(np.__version__)
3. Using pip in a Virtual Environment (Recommended)
To avoid package conflicts, it's best to use a virtual environment:
# Create a virtual environment
python -m venv venv
# Activate the environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install numpy in the virtual environment
pip install numpy
4. Install with conda (if using Anaconda or Miniconda)
If you’re using Anaconda/Miniconda:
conda install numpy
conda installs precompiled binaries and manages NumPy's non-Python dependencies inside the conda environment.
5. Installing a Specific Version of NumPy
pip install numpy==1.24.4
Replace 1.24.4 with the version you want. You can see available versions here:
[Link]
6. Installing from Source (Advanced)
1. Clone the repository:
git clone [Link]
cd numpy
2. Install build dependencies:
pip install -r [Link]
3. Build and install:
pip install .
Use this only if you need to develop or test the latest source version.
Troubleshooting Installation Issues
Problem                         Solution
Permission denied               Use the --user flag: pip install --user numpy
pip not recognized              Use python -m pip install numpy
Conflicts with other packages   Use a virtual environment, or pip install --upgrade --force-reinstall numpy
No internet / offline install   Download the .whl file from PyPI and install it: pip install numpy-*.whl
Verifying Installation (Optional)
Run a basic test:
import numpy as np
# Create an array
a = np.array([1, 2, 3])
print("Array:", a)
# Perform operation
print("Mean:", np.mean(a))
Resources
• NumPy Official Site: [Link]
• Installation Docs: [Link]
• PyPI: [Link]
• Source Code: [Link]
2. NDARRAY:
What is ndarray?
In NumPy, every array is an instance of the numpy.ndarray class. It can be 1D, 2D, or
multi-dimensional.
Importing NumPy
import numpy as np
Creating an ndarray
1. From a Python List
arr = np.array([1, 2, 3, 4])
print(arr)
print(type(arr)) # <class 'numpy.ndarray'>
2. From a List of Lists (2D Array)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)
3. Using NumPy Functions
zeros = np.zeros((3, 3)) # 3x3 array of zeros
ones = np.ones((2, 2)) # 2x2 array of ones
rand = np.random.rand(2, 3) # 2x3 array with random values
Attributes of ndarray
a = np.array([[1, 2, 3], [4, 5, 6]])
print("Array:\n", a)
print("Shape:", a.shape) # (2, 3)
print("Dimensions:", a.ndim) # 2
print("Size:", a.size) # 6
print("Data type:", a.dtype) # int64 (varies by system)
Operations on ndarray
1. Arithmetic Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5 7 9]
print(a * b) # [ 4 10 18]
print(a ** 2) # [1 4 9]
2. Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 3]])
print("Matrix Multiplication:\n", A @ B) # or np.dot(A, B)
Indexing & Slicing
a = np.array([[10, 20, 30], [40, 50, 60]])
print(a[0, 1]) # 20
print(a[:, 1]) # [20 50]
print(a[1, :]) # [40 50 60]
Reshaping & Flattening
a = np.array([[1, 2], [3, 4], [5, 6]])
reshaped = a.reshape((2, 3)) # Change shape to 2x3
flattened = a.flatten() # Convert to 1D
print("Reshaped:\n", reshaped)
print("Flattened:\n", flattened)
Example Use Case
# Calculate mean and standard deviation of random data
data = np.random.normal(0, 1, size=(1000,))
print("Mean:", np.mean(data))
print("Std Dev:", np.std(data))
Summary
Feature Description
ndarray Core array type in NumPy
shape Tuple showing array dimensions
dtype Data type of array elements
ndim Number of dimensions
size Total number of elements
reshape() Changes the shape of the array
flatten() Converts array to 1D
@ or dot() Matrix multiplication
3. BASIC OPERATIONS
Basic Operations with ndarray
NumPy provides vectorized operations: arithmetic is applied element-wise across whole arrays in
compiled code, which is typically much faster than looping over plain Python lists.
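As a rough illustration of the speed claim, the sketch below squares 100,000 integers once with a Python list comprehension and once with a vectorized NumPy expression. The array size, the repeat count, and the helper names are arbitrary choices for this example, and the timings will vary by machine.

```python
import timeit
import numpy as np

values = list(range(100_000))
# int64 is requested explicitly so the squares cannot overflow on any platform
arr = np.arange(100_000, dtype=np.int64)

def square_with_loop():
    # Pure-Python loop over a list
    return [v * v for v in values]

def square_vectorized():
    # One vectorized expression over the whole array
    return arr * arr

loop_result = square_with_loop()
vec_result = square_vectorized()
print("Same results:", loop_result == vec_result.tolist())

print("Loop seconds:", timeit.timeit(square_with_loop, number=20))
print("Vectorized seconds:", timeit.timeit(square_vectorized, number=20))
```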
Let’s go step by step:
1. Element-wise Arithmetic Operations
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b) # [5 7 9]
print("Subtraction:", a - b) # [-3 -3 -3]
print("Multiplication:", a * b) # [ 4 10 18]
print("Division:", a / b) # [0.25 0.4 0.5 ]
print("Power:", a ** 2) # [1 4 9]
If the shapes differ but are compatible, the operands are broadcast to a common shape (explained below).
2. Scalar Operations
You can perform operations with a scalar (number) on the entire array:
print("Add scalar:", a + 10) # [11 12 13]
print("Multiply scalar:", a * 3) # [3 6 9]
3. Comparison Operations
Returns a boolean array:
print("Equal:", a == b) # [False False False]
print("Greater than:", b > a) # [ True True True]
4. Aggregate Functions (Statistics)
Apply mathematical operations across the array:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Sum:", np.sum(arr)) # 21
print("Min:", np.min(arr)) # 1
print("Max:", np.max(arr)) # 6
print("Mean:", np.mean(arr)) # 3.5
print("Standard Deviation:", np.std(arr)) # ~1.7078
print("Sum along axis 0 (columns):", np.sum(arr, axis=0)) # [5 7 9]
print("Sum along axis 1 (rows):", np.sum(arr, axis=1)) # [6 15]
5. Dot Product / Matrix Multiplication
x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 0], [1, 3]])
# Matrix multiplication
print("Matrix product:\n", np.dot(x, y)) # or x @ y
6. Transpose of a Matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print("Original:\n", matrix)
print("Transposed:\n", matrix.T)
7. Broadcasting
Broadcasting allows NumPy to perform operations on arrays of different shapes:
a = np.array([1, 2, 3])
b = 10
print("Broadcasted add:", a + b) # [11 12 13]
matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])
print("Add vector to matrix:\n", matrix + vector)
# Adds [10,20,30] to each row
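Broadcasting compares shapes from the trailing dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. The sketch below (variable names are illustrative only) shows a (3, 1) column combined with a (3,) row, and an incompatible pair being rejected.

```python
import numpy as np

column = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30])         # shape (3,)

# (3, 1) and (3,) broadcast to a common shape of (3, 3)
result = column + row
print(result.shape)
print(result)

# Trailing dimensions 3 and 2 are neither equal nor 1, so this fails
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print("Incompatible shapes rejected:", broadcast_failed)
```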
8. Logical Operations
arr = np.array([1, 2, 3, 4, 5])
# Find where condition is true
print("Elements > 2:", arr[arr > 2]) # [3 4 5]
# Use any boolean expression as the mask (combine several with & and |, each in parentheses)
print("Even numbers:", arr[(arr % 2 == 0)]) # [2 4]
9. Copying vs Viewing Arrays
original = np.array([1, 2, 3])
view = original.view() # Shares data
copy = original.copy() # Creates new array
original[0] = 100
print("Original:", original) # [100 2 3]
print("View:", view) # [100 2 3] - affected
print("Copy:", copy) # [1 2 3] - not affected
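Views also arise implicitly: basic slicing returns a view, while indexing with a list of positions returns a copy. A small sketch using np.shares_memory to tell the two apart (the variable names are made up for this example):

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
sliced = original[1:4]         # basic slicing returns a view
picked = original[[1, 2, 3]]   # indexing with a list returns a copy

print(np.shares_memory(original, sliced))
print(np.shares_memory(original, picked))

sliced[0] = 99                 # writes through to original[1]
print(original)
print(picked)                  # the copy keeps the old value
```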
10. Reshaping Arrays
a = np.array([1, 2, 3, 4, 5, 6])
reshaped = a.reshape((2, 3))
print("Reshaped:\n", reshaped)
flattened = reshaped.flatten()
print("Flattened:", flattened)
Summary Table
Operation Type     Example                    Description
Arithmetic         a + b, a * 2               Element-wise operations
Comparison         a > b, a == 3              Returns boolean arrays
Aggregation        np.sum(a), np.mean(a)      Statistical operations
Dot Product        np.dot(a, b) or a @ b      Matrix multiplication
Transpose          a.T                        Transpose rows/columns
Reshape            a.reshape((r, c))          Change array shape
Flatten            a.flatten()                Convert to 1D
Broadcasting       a + b (different shapes)   Auto-expands array dimensions
Indexing/Masking   a[a > 2]                   Filter values by condition
4. INDEXING, SLICING, AND ITERATING
1. Indexing
Indexing allows you to access individual elements in a NumPy array.
1D Array Indexing
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # First element → 10
print(arr[-1]) # Last element → 50
2D Array Indexing
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d[0, 0]) # Row 0, Column 0 → 1
print(arr2d[1, 2]) # Row 1, Column 2 → 6
3D Array Indexing
arr3d = np.array([
[[1, 2], [3, 4]],
[[5, 6], [7, 8]]
])
print(arr3d[1, 0, 1]) # Output: 6
2. Slicing
Slicing lets you extract subarrays using the syntax:
array[start:stop:step]
1D Slicing
arr = np.array([10, 20, 30, 40, 50])
print(arr[1:4]) # [20 30 40]
print(arr[:3]) # [10 20 30]
print(arr[::2]) # [10 30 50]
2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[0:2, 1:]) # Rows 0-1, Cols 1-end
# Output:
# [[2 3]
# [5 6]]
Advanced 2D Example
print(arr2d[:, 1]) # All rows, column 1 → [2 5 8]
print(arr2d[1, :]) # Row 1, all columns → [4 5 6]
3. Iterating
You can iterate over arrays using for loops. But NumPy also supports vectorized operations
which are faster and preferred.
Iterate Over 1D Array
arr = np.array([10, 20, 30])
for x in arr:
    print(x)
Iterate Over 2D Array (Row-wise)
arr2d = np.array([[1, 2], [3, 4]])
for row in arr2d:
    print("Row:", row)
Iterate Over Each Element
Use .flat or np.nditer() to access all elements regardless of dimensions:
for x in arr2d.flat:
    print(x)
Or:
for x in np.nditer(arr2d):
    print(x)
Boolean Indexing (Masking)
Select elements using conditions. With a mask of the same shape as the array, the result is a 1D array of the matching elements.
arr = np.array([10, 20, 30, 40])
print(arr[arr > 25]) # [30 40]
Example in 2D:
arr2d = np.array([[5, 10], [15, 20]])
mask = arr2d > 10
print(mask)
# [[False False]
# [ True True]]
print(arr2d[mask]) # [15 20]
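Several conditions can be combined into one mask with the element-wise operators & (and), | (or) and ~ (not); each comparison needs its own parentheses. A minimal sketch with made-up data:

```python
import numpy as np

arr = np.array([5, 12, 18, 25, 30])

in_range = (arr > 10) & (arr < 30)    # element-wise AND
outside = (arr <= 10) | (arr >= 30)   # element-wise OR
not_in_range = ~in_range              # element-wise NOT

print(arr[in_range])
print(arr[outside])

# A mask can also select the elements to overwrite
capped = arr.copy()
capped[capped > 20] = 20
print(capped)
```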
Advanced Indexing (Fancy Indexing)
Indexing with arrays of indices
arr = np.array([10, 20, 30, 40])
indices = [0, 2]
print(arr[indices]) # [10 30]
2D example:
arr2d = np.array([[10, 20], [30, 40], [50, 60]])
rows = np.array([0, 2])
cols = np.array([1, 0])
print(arr2d[rows, cols]) # [20 50]
Summary Table
Feature            Example             Description
Indexing 1D        arr[2]              Access 3rd element
Indexing 2D        arr[1, 2]           Access element at row 1, column 2
Slicing 1D         arr[1:4]            Elements from index 1 to 3
Slicing 2D         arr[0:2, 1:]        Slice rows 0-1, cols 1-end
Iterating          for x in arr        Loop through elements
Flat iteration     for x in arr.flat   Iterate over every element in order
Boolean indexing   arr[arr > 10]       Filter using a condition
Fancy indexing     arr[[0, 2]]         Index using another array
5. SHAPE MANIPULATION
1. What is Shape?
The shape of a NumPy array is a tuple that indicates the number of elements along each axis
(dimension).
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape) # (2, 3) → 2 rows, 3 columns
2. Reshape
Changes the shape of the array without changing its data.
a = np.array([1, 2, 3, 4, 5, 6])
reshaped = a.reshape((2, 3)) # 2 rows, 3 columns
print(reshaped)
Notes:
• The total number of elements must remain the same.
• You can use -1 to let NumPy automatically calculate one dimension:
a.reshape((-1, 2)) # NumPy figures out rows → (3, 2)
3. Flatten
Converts a multi-dimensional array into a 1D array.
a = np.array([[1, 2], [3, 4]])
flat = a.flatten() # returns a *copy*
print(flat) # [1 2 3 4]
Difference from ravel():
r = a.ravel() # returns a *view* if possible
Use flatten() if you want an independent copy, or ravel() to avoid duplicating memory whenever a view is possible.
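The practical consequence shows up when the flattened result is modified. In this sketch (variable names are illustrative) a write to the flatten() result leaves the source untouched, while a write to the ravel() result changes the source because the array is contiguous and ravel() can return a view.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
flat_copy = a.flatten()
flat_view = a.ravel()   # a is contiguous here, so this is a view

flat_copy[0] = -1       # does not touch a
flat_view[1] = -2       # writes through to a[0, 1]

print(a)
print(np.shares_memory(a, flat_copy), np.shares_memory(a, flat_view))
```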
4. Transpose
Swaps rows and columns.
a = np.array([[1, 2, 3], [4, 5, 6]])
print("Original:\n", a)
print("Transposed:\n", a.T)
Transposing Higher-Dimensional Arrays
a = np.arange(24).reshape((2, 3, 4)) # Shape: (2 blocks, 3 rows, 4 cols)
print(a.transpose(1, 0, 2).shape) # (3, 2, 4)
5. Expand or Reduce Dimensions
Add a new dimension using np.newaxis or reshape
a = np.array([1, 2, 3])
print(a.shape) # (3,)
a_col = a[:, np.newaxis]
print(a_col.shape) # (3, 1)
a_row = a[np.newaxis, :]
print(a_row.shape) # (1, 3)
np.expand_dims()
a = np.array([1, 2, 3])
b = np.expand_dims(a, axis=0) # (1, 3)
c = np.expand_dims(a, axis=1) # (3, 1)
np.squeeze(): Remove dimensions of size 1
a = np.array([[[1, 2, 3]]]) # shape (1, 1, 3)
print(np.squeeze(a).shape) # (3,)
6. Resize vs Reshape
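reshape() returns a new array object (a view of the same data when possible) and leaves the original's shape unchanged:
a = np.array([1, 2, 3, 4])
reshaped = a.reshape((2, 2)) # a itself is still 1D
resize() modifies the array in-place:
a.resize((2, 2)) # Changes the original array
print(a)
Use with caution: when the new size differs, a.resize() truncates data when shrinking and pads with zeros when growing, and by default it raises a ValueError if other arrays still reference the same data.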
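A minimal sketch of what happens when the element count changes. It contrasts the in-place method a.resize() with the separate function np.resize(), which returns a new array. Passing refcheck=False is a choice made for this example so it also runs in interactive shells that hold extra references to the array.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
a.resize((2, 3), refcheck=False)   # in place; the two new slots are zero-filled
print(a)

b = np.array([1, 2, 3, 4])
c = np.resize(b, (2, 3))           # new array; repeats b's values to fill
print(c)
print(b.shape)                     # b is unchanged
```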
Shape Manipulation Summary
Operation Method Description
Get shape [Link] Returns shape tuple
Reshape [Link]((r, c)) Change shape, returns new array
Flatten [Link]() Convert to 1D, returns copy
Ravel [Link]() Flatten view (faster, may not copy)
Transpose a.T Swap axes for 2D
Axis transpose [Link]((1, 0, 2)) Custom axis reordering
Add dimension a[:, [Link]] Expand 1D → 2D
Expand dims np.expand_dims(a, 1) Add axis at position
Squeeze [Link](a) Remove axes with size 1
Resize (in-place) [Link]((r, c)) Resizes existing array (modifies data)
Example: Full Workflow
a = np.arange(12) # [0 1 2 ... 11]
b = a.reshape((3, 4)) # 3x4 array
c = b.T # Transpose to 4x3
d = c.flatten() # Flatten back to 1D
e = d.reshape((2, 6)) # Reshape to 2x6
print(e)
6. ARRAY MANIPULATION
NumPy Array Manipulation
Manipulating arrays is essential when working with real-world datasets, reshaping models, or
transforming data formats. Let’s break it down:
1. Joining Arrays (Concatenation)
Using np.concatenate()
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
joined = np.concatenate((a, b))
print(joined) # [1 2 3 4 5 6]
2D Arrays — specify axis
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
# Join along axis 0 (rows)
joined = np.concatenate((a, b), axis=0)
print(joined)
np.vstack() and np.hstack()
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack((a, b)))
# Vertical Stack (2 rows)
print(np.hstack((a, b)))
# Horizontal Stack (1 row)
2. Splitting Arrays
np.split(): evenly split
a = np.array([1, 2, 3, 4, 5, 6])
split = np.split(a, 3)
print(split) # [array([1, 2]), array([3, 4]), array([5, 6])]
np.array_split(): can split unevenly
a = np.array([1, 2, 3, 4, 5])
split = np.array_split(a, 3)
print(split) # Uneven: [array([1, 2]), array([3, 4]), array([5])]
2D Split (horizontal/vertical)
a = np.array([[1, 2, 3], [4, 5, 6]])
print(np.hsplit(a, 3)) # Split columns
print(np.vsplit(a, 2)) # Split rows
3. Adding Elements
NumPy arrays are fixed-size, so adding/removing means creating a new array.
np.append()
a = np.array([1, 2, 3])
b = np.append(a, [4, 5])
print(b) # [1 2 3 4 5]
Appending to 2D Arrays
a = np.array([[1, 2], [3, 4]])
# Append new row
print(np.append(a, [[5, 6]], axis=0))
# Append new column
print(np.append(a, [[5], [6]], axis=1))
The shape must match for rows/columns when appending.
4. Deleting Elements
np.delete()
a = np.array([10, 20, 30, 40])
print(np.delete(a, 2)) # Delete element at index 2 → [10 20 40]
Deleting from 2D Arrays
a = np.array([[1, 2], [3, 4], [5, 6]])
# Delete row at index 1
print(np.delete(a, 1, axis=0))
# Delete column at index 0
print(np.delete(a, 0, axis=1))
5. Inserting Elements
np.insert()
a = np.array([1, 2, 3])
print(np.insert(a, 1, [10])) # [1 10 2 3]
Insert into 2D Arrays
a = np.array([[1, 2], [3, 4]])
# Insert row at index 1
print(np.insert(a, 1, [9, 9], axis=0))
# Insert column at index 0
print(np.insert(a, 0, [7, 8], axis=1))
6. Stacking Arrays with a New Axis
np.stack(): joins arrays along a new axis
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
stacked = np.stack((a, b)) # Shape: (2, 3)
stacked_axis1 = np.stack((a, b), axis=1) # Shape: (3, 2)
print(stacked)
print(stacked_axis1)
7. Changing Shape for Manipulation
a = np.array([1, 2, 3, 4, 5, 6])
reshaped = a.reshape((2, 3)) # Needed for 2D manipulation
Use shape manipulation methods (reshape, flatten, etc.) to prepare arrays for advanced
manipulations.
Summary Table
Operation          Function                             Description
Join arrays        concatenate, hstack, vstack, stack   Combine arrays along axis
Split arrays       split, array_split, hsplit, vsplit   Break arrays into parts
Add elements       append, insert                       Add elements (returns new array)
Remove elements    delete                               Remove elements by index
Reshape            reshape, ravel, flatten              Change shape, convert to 1D
Stack along axis   stack                                Join with a new axis
Example Use Case
# Combine two datasets, remove a column, and reshape
data1 = np.array([[1, 2], [3, 4]])
data2 = np.array([[5, 6]])
combined = np.vstack((data1, data2)) # Stack rows
cleaned = np.delete(combined, 1, axis=1) # Remove 2nd column
reshaped = cleaned.reshape((3, 1)) # Reshape to single column
print(reshaped)
7. STRUCTURED ARRAYS
1. What is a Structured Array?
A structured array is an ndarray where each element is a record, and each record can
contain multiple named fields of different types.
2. Creating a Structured Array
Example: A record with name, age, and weight
import numpy as np
# Define a structured data type
person_dtype = np.dtype([
('name', 'U10'), # Unicode string of max length 10
('age', 'i4'), # 4-byte (int32) integer
('weight', 'f4') # 4-byte (float32) float
])
# Create the structured array
people = np.array([
('Alice', 25, 55.0),
('Bob', 30, 85.5),
('Eve', 22, 48.0)
], dtype=person_dtype)
print(people)
Output:
[('Alice', 25, 55. ) ('Bob', 30, 85.5) ('Eve', 22, 48. )]
3. Accessing Fields in Structured Arrays
Access a specific field (column):
print(people['name']) # ['Alice' 'Bob' 'Eve']
print(people['age']) # [25 30 22]
Access a specific row (record):
print(people[1]) # ('Bob', 30, 85.5)
Access a specific value:
print(people[0]['name']) # 'Alice'
print(people[2]['weight']) # 48.0
4. Structured Dtype Formats
Type Code   Meaning
'i4'        int32: 4-byte signed int
'f8'        float64: 8-byte float
'U20'       Unicode string of max 20 characters
'S10'       ASCII byte string of max 10 chars
You can also use Python-style types:
np.dtype([('age', int), ('weight', float)])
5. Filtering and Sorting Data
Filtering (Boolean Indexing):
# Get people older than 24
print(people[people['age'] > 24])
Sorting by field:
# Sort by weight
sorted_people = np.sort(people, order='weight')
print(sorted_people)
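The order argument also accepts a list of field names, which sorts by the first field and breaks ties with the next. The sketch below rebuilds a small people array so it runs on its own. Eve's age is changed to 25 here (an assumption for this example only) so that a tie exists. Descending order is obtained by reversing an argsort on a single field.

```python
import numpy as np

person_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
people = np.array([
    ('Alice', 25, 55.0),
    ('Bob', 30, 85.5),
    ('Eve', 25, 48.0),   # age altered for this example to create a tie
], dtype=person_dtype)

# Sort by age, breaking ties by weight
by_age_then_weight = np.sort(people, order=['age', 'weight'])
print(by_age_then_weight['name'])

# Descending order by weight: argsort the field, then reverse
heaviest_first = people[np.argsort(people['weight'])[::-1]]
print(heaviest_first['name'])
```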
6. Nested Structured Arrays
You can nest structured types:
dtype = np.dtype([
('name', 'U10'),
('metrics', [('age', 'i4'), ('weight', 'f4')])
])
a = np.array([
('Alice', (25, 55.0)),
('Bob', (30, 85.5))
], dtype=dtype)
print(a['metrics']['age']) # [25 30]
7. Converting to and from Structured Arrays
From a list of tuples → structured
data = [('John', 21, 60.5), ('Lucy', 30, 70.0)]
structured = np.array(data, dtype=person_dtype)
Structured → regular list of tuples
records = people.tolist()
print(records)
# [('Alice', 25, 55.0), ('Bob', 30, 85.5), ('Eve', 22, 48.0)]
8. Saving/Loading Structured Arrays
# Save to binary file
np.save('[Link]', people)
# Load back
loaded = np.load('[Link]')
print(loaded)
9. Summary Table
Feature         Method / Syntax                             Description
Create          np.array(..., dtype=...)                    Define fields and types
Access field    arr['name']                                 Get specific column
Access record   arr[i]                                      Get specific row
Filter          arr[arr['age'] > 25]                        Filter with condition
Sort            np.sort(arr, order='field')                 Sort by field
Nested fields   dtype=[('a', [('b', int), ('c', float)])]   Complex structured data
Save/Load       np.save() / np.load()                       Save and retrieve from disk
Example Use Case
# Find average weight of people older than 25
older = people[people['age'] > 25]
avg_weight = np.mean(older['weight'])
print("Average weight of age > 25:", avg_weight)
8. READING AND WRITING ARRAY DATA ON FILES
NumPy supports saving and loading arrays in various formats:
• Binary files: .npy (single array), .npz (multiple arrays)
• Text files: .txt, .csv, etc.
1. Binary Files with .npy and .npz
Save an array to a .npy file
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
np.save('[Link]', arr)
Load an array from a .npy file
loaded = np.load('[Link]')
print(loaded) # [1 2 3 4 5]
Save multiple arrays to a .npz file
a = np.arange(5)
b = np.linspace(0, 1, 5)
np.savez('[Link]', arr1=a, arr2=b)
Load .npz file
data = np.load('[Link]')
print(data['arr1']) # [0 1 2 3 4]
print(data['arr2']) # [0. 0.25 0.5 0.75 1. ]
2. Text Files: .txt, .csv
Save to text file with np.savetxt()
a = np.array([[1, 2, 3], [4, 5, 6]])
np.savetxt('[Link]', a)
Add a custom delimiter (e.g., CSV):
np.savetxt('[Link]', a, delimiter=',')
Load from text file with np.loadtxt()
b = np.loadtxt('[Link]')
print(b)
Load with delimiter (e.g., CSV):
b = np.loadtxt('[Link]', delimiter=',')
3. Using genfromtxt() for Missing Data
genfromtxt() is more robust than loadtxt() — supports missing values, headers, etc.
data = np.genfromtxt('[Link]', delimiter=',', skip_header=1)
print(data)
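To see how missing values are handled without needing a file on disk, this sketch feeds genfromtxt() an in-memory CSV with one empty cell. The CSV text is invented for the example. By default the gap becomes nan, and filling_values substitutes another value.

```python
import io
import numpy as np

# Hypothetical CSV content: header row, then one row with an empty middle cell
csv_text = "a,b,c\n1,2,3\n4,,6\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(data)     # the empty cell is read as nan

filled = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                       skip_header=1, filling_values=0)
print(filled)   # the empty cell is replaced by 0
```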
4. Save/Load Structured Arrays
Save to .npy
dtype = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
people = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5)], dtype=dtype)
np.save('[Link]', people)
Load it back
loaded = np.load('[Link]')
print(loaded['name']) # ['Alice' 'Bob']
5. Example: Full Round-Trip
a = np.random.rand(3, 3)
# Save to binary
np.save('my_array.npy', a)
# Save to CSV
np.savetxt('my_array.csv', a, delimiter=',')
# Load both back
bin_loaded = np.load('my_array.npy')
csv_loaded = np.loadtxt('my_array.csv', delimiter=',')
print("Binary:\n", bin_loaded)
print("CSV:\n", csv_loaded)
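A variation on the round trip that checks the loaded data programmatically and cleans up after itself. This is only a sketch: the temporary directory and the file names a.npy, a.csv and both.npz are choices made for this example.

```python
import os
import tempfile
import numpy as np

a = np.random.rand(3, 3)
ints = np.arange(5)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "a.npy")
    csv_path = os.path.join(tmp, "a.csv")
    npz_path = os.path.join(tmp, "both.npz")

    np.save(npy_path, a)
    np.savetxt(csv_path, a, delimiter=',')
    np.savez(npz_path, a=a, ints=ints)

    from_npy = np.load(npy_path)
    from_csv = np.loadtxt(csv_path, delimiter=',')
    # The .npz archive is closed again when the with-block ends
    with np.load(npz_path) as archive:
        ints_back = archive['ints']

print(np.array_equal(a, from_npy))     # binary round trip is exact
print(np.allclose(a, from_csv))        # text round trip goes through decimal strings
print(ints_back.dtype == ints.dtype)   # dtype survives in the binary formats
```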
Comparison: Binary vs Text Formats
Format      Function       Pros                       Cons
.npy        np.save()      Fast, preserves dtype      Not human-readable
.npz        np.savez()     Store multiple arrays      Binary, not editable
.txt/.csv   np.savetxt()   Human-readable, editable   Slower, type info lost
Summary Table
Task Function
Save to .npy             np.save(filename, array)
Load from .npy           np.load(filename)
Save multiple arrays     np.savez(filename, arr1=..., arr2=...)
Load .npz                np.load(filename)['arr1']
Save to text             np.savetxt(filename, array)
Load from text           np.loadtxt(filename)
Load with missing data   np.genfromtxt(filename)
9. PANDAS: The pandas Library: An Introduction
What is Pandas?
Pandas is a high-level Python library built on NumPy. It provides easy-to-use data structures
and functions for:
• Reading/writing data (CSV, Excel, SQL, JSON, etc.)
• Cleaning, transforming, reshaping, and analyzing data
• Handling missing data
• Time series operations
• Statistical summaries and group operations
The name "Pandas" comes from "Panel Data" — an econometrics term for
multidimensional structured data.
1. Installing Pandas
pip install pandas
2. Core Data Structures
1. Series — 1D Labeled Array
import pandas as pd
data = pd.Series([10, 20, 30, 40])
print(data)
Output:
0 10
1 20
2 30
3 40
dtype: int64
• Like a NumPy array but with labels (index)
• Access with data[0] or data.loc[0] or data.iloc[0]
2. DataFrame — 2D Labeled Table (Rows & Columns)
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'Score': [85.5, 90.0, 88.5]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Score
0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5
3. Reading Data from Files
Read from CSV
df = pd.read_csv('[Link]')
Read from Excel
df = pd.read_excel('[Link]')
Read from JSON
df = pd.read_json('[Link]')
You can also read from SQL, HTML, clipboard, and many more formats.
4. Writing Data to Files
df.to_csv('[Link]', index=False)
df.to_excel('[Link]', index=False)
df.to_json('[Link]')
5. Basic Operations on DataFrames
View Data
df.head() # First 5 rows
df.tail(3) # Last 3 rows
df.info() # Structure and non-null count
df.describe() # Summary statistics
Access Columns and Rows
df['Name'] # Access a column
df[['Name', 'Score']] # Multiple columns
df.loc[1] # Access row by index label
df.iloc[1] # Access row by position
6. Basic Data Analysis
Filtering Rows
df[df['Age'] > 25]
Sorting
df.sort_values(by='Score', ascending=False)
Adding/Modifying Columns
df['Passed'] = df['Score'] >= 60
Dropping Columns/Rows
df.drop(columns=['Age'], inplace=True)
df.drop(index=[0], inplace=True)
7. Grouping and Aggregation
df.groupby('Passed')['Score'].mean()
8. Handling Missing Data
df.isnull() # Detect missing values
df.dropna() # Remove missing rows
df.fillna(0) # Replace with a value
9. Time Series Support
dates = pd.date_range('2023-01-01', periods=5)
ts = pd.Series([1, 2, 3, 4, 5], index=dates)
print(ts)
10. Interoperability with NumPy
You can convert between NumPy arrays and Pandas:
# DataFrame to NumPy
arr = df.values
# Series to NumPy
s = df['Score'].to_numpy()
# NumPy to DataFrame
import numpy as np
a = np.array([[1, 2], [3, 4]])
df2 = pd.DataFrame(a, columns=['A', 'B'])
Summary
Feature Tool/Method
Create Series pd.Series([...])
Create DataFrame pd.DataFrame({...})
Read CSV/Excel/JSON read_csv(), read_excel(), etc.
Write to file to_csv(), to_excel(), etc.
View data head(), info(), describe()
Filter, Sort df[df['col'] > x], sort_values()
Group & Aggregate groupby(), mean(), etc.
Handle missing data dropna(), fillna()
Time series pd.date_range(), Series(dates)
Example: Simple Analysis
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 92, 78, 60],
'Passed': [True, True, True, False]
})
print("Average score of passed students:")
print(df[df['Passed']]['Score'].mean())
10. INSTALLATION
1. Using pip (Recommended for most users)
For Standard Python (via terminal/command prompt):
pip install pandas
Upgrade pandas to the latest version:
pip install --upgrade pandas
Make sure pip is pointing to your correct Python environment (python -m pip
install pandas is safer in virtual environments).
2. Using conda (For Anaconda/Miniconda users)
conda install pandas
This installs the version compatible with your current conda environment.
3. Verify Installation
After installing, open Python (or a Jupyter Notebook) and run:
import pandas as pd
print(pd.__version__)
If no errors appear and the version prints, you’re good to go.
4. Optional: Install with Jupyter Support
If you’re working in a Jupyter Notebook environment and want to ensure everything works:
pip install pandas jupyter
Or for conda:
conda install pandas notebook
5. Installing in Virtual Environment (Recommended for Clean Projects)
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows
pip install pandas
Troubleshooting Tips
Problem                          Solution
pip not recognized               Use python -m pip install pandas
Conflicts with other libraries   Use a virtual environment or conda environment
Slow install or timeout          Use a mirror: pip install -i [Link] pandas
Installation Summary
Tool Command Use Case
pip pip install pandas Regular Python users
conda conda install pandas Anaconda/Miniconda users
upgrade pip install --upgrade pandas Get latest version
verify import pandas Check successful install
11. INTRODUCTION TO PANDAS DATA STRUCTURES
Introduction to pandas Data Structures
Pandas primarily offers two main data structures for handling data:
1. Series
What is a Series?
• A 1-dimensional labeled array.
• Can hold any data type (integers, floats, strings, Python objects, etc.).
• Each element has an index label associated with it (like row labels).
Creating a Series
import pandas as pd
# Simple list to Series
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Custom Indexing in Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
Output:
a 10
b 20
c 30
dtype: int64
Accessing Data in Series
print(s['b']) # 20 (access by label)
print(s.iloc[1]) # 20 (access by integer position; plain s[1] on a label-indexed Series is deprecated in recent pandas)
2. DataFrame
What is a DataFrame?
• A 2-dimensional labeled data structure (like a table or spreadsheet).
• Consists of rows and columns.
• Each column is essentially a Series.
• Columns can be different data types (e.g., one column int, another float, another
string).
Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'Score': [85.5, 90.0, 88.5]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Score
0 Alice 25 85.5
1 Bob 30 90.0
2 Charlie 22 88.5
Accessing Data in DataFrame
print(df['Name']) # Access a column (returns a Series)
print(df.loc[1]) # Access row with index label 1
print(df.iloc[1, 2]) # Access element at row 1, column 2 (90.0)
Quick Comparison
Feature Series DataFrame
Dimensions 1D 2D
Structure Labeled array Table with rows and columns
Data types Homogeneous (usually) Columns can have mixed types
Indexing Single index Row and column indexing
Why Use These?
• Series: Great for single columns or time series data.
• DataFrame: Ideal for structured, tabular data with multiple columns.
12. OPERATIONS BETWEEN DATA STRUCTURES
Operations Between pandas Data Structures
1. Operations on Series
Series behave much like NumPy arrays but with aligned indexing:
Example:
import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)
Output:
a NaN # 'a' only in s1, so result is NaN
b 6.0 # 2 + 4
c 8.0 # 3 + 5
d NaN # 'd' only in s2, so result is NaN
dtype: float64
Key: Operations align on index labels, not just positions. Missing labels produce NaN.
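When NaN for unmatched labels is not wanted, two common options are to treat the missing side as a default value or to keep only the labels both Series share. A minimal sketch reusing the same two Series (the result names are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Treat a label missing from one side as 0 before adding
total = s1.add(s2, fill_value=0)
print(total)

# Keep only the labels present in both Series
common = (s1 + s2).dropna()
print(common)
```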
2. Operations on DataFrames
DataFrames also align on both row and column labels.
Example:
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({
'B': [7, 8, 9],
'C': [10, 11, 12]
}, index=['b', 'c', 'd'])
print(df1 + df2)
Output:
A B C
a NaN NaN NaN
b NaN 12.0 NaN
c NaN 14.0 NaN
d NaN NaN NaN
• Rows and columns that don’t match become NaN.
• For overlapping rows and columns, values are added.
3. Broadcasting Between DataFrames and Series
Pandas supports broadcasting when you perform operations between a DataFrame and a
Series.
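Example 1: Add a Series down the rows (axis=0)
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
s = pd.Series([10, 20, 30], index=[0, 1, 2])
print(df.add(s, axis=0))
Output:
A B
0 11 14
1 22 25
2 33 36
• Here, axis=0 matches the index of s against the DataFrame's row index, so each value of s is added across its row. A plain df + s would instead match the index of s against the column labels and produce only NaN in this case.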
Example 2: Add a Series across the columns (the default, axis=1)
s = pd.Series([10, 20], index=['A', 'B'])
print(df + s)
Output:
A B
0 11 24
1 12 25
2 13 26
• The Series s matches DataFrame columns, so it broadcasts column-wise to each row.
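The two alignment directions can be put side by side. In this sketch (names are illustrative) df.add(..., axis=0) matches a Series against the row index, the + operator matches a Series against the column labels, and subtracting df.mean() uses the column-wise default to center each column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

per_row = pd.Series([10, 20, 30], index=[0, 1, 2])   # labels match the row index
per_col = pd.Series([10, 20], index=['A', 'B'])      # labels match the columns

added_down_rows = df.add(per_row, axis=0)   # align on the row index
added_across = df + per_col                 # default: align on the columns
centered = df - df.mean()                   # subtract each column's mean

print(added_down_rows)
print(added_across)
print(centered)
```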
4. Arithmetic with fill values
You can specify a fill_value to use instead of NaN during operations.
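print(df1.add(df2, fill_value=0))
This treats a value that is missing on one side as 0 instead of producing NaN. Cells missing from both DataFrames are still NaN.
5. Comparison Operations
The comparison operators (>, ==, and so on) require both objects to carry identical labels, so df1 > df2 raises a ValueError here. The flexible methods (gt, lt, eq, ...) align on labels first:
print(df1.gt(df2))
Returns a DataFrame of booleans over the union of index and columns; cells missing on either side compare as False.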
6. Summary
Operation Type                   Behavior                                     Example
Series + Series                  Index-aligned arithmetic                     s1 + s2
DataFrame + DataFrame            Aligns both rows and columns                 df1 + df2
DataFrame + Series (by column)   Matches Series index to columns (default)    df + s (s indexed by column labels)
DataFrame + Series (by row)      Matches Series index to row index (axis=0)   df.add(s, axis=0) (s indexed by row labels)
Operations with fill_value       Use a default value for missing labels       df1.add(df2, fill_value=0)
13. Function Application and Mapping
Pandas: Function Application and Mapping
Pandas provides several ways to apply functions element-wise, row-wise, or column-wise on
Series and DataFrames.
1. Applying Functions on a Series
a) Using .apply() with a function
import pandas as pd
s = pd.Series([1, 2, 3, 4])
# Define a function
def square(x):
    return x ** 2
# Apply function element-wise
s_squared = s.apply(square)
print(s_squared)
Output:
0 1
1 4
2 9
3 16
dtype: int64
b) Using lambda functions
s_doubled = s.apply(lambda x: x * 2)
print(s_doubled)
c) Using vectorized operations (faster!)
For many common functions, just use vectorized operators instead of .apply():
s_plus_one = s + 1
2. Applying Functions on a DataFrame
a) Applying a function to each element (applymap)
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Square each element in DataFrame (pandas 2.1+ deprecates applymap in favor of df.map)
df_squared = df.applymap(lambda x: x ** 2)
print(df_squared)
b) Applying a function along rows or columns (apply)
• axis=0 — apply function to each column (default)
• axis=1 — apply function to each row
# Sum of each column
col_sum = df.apply(sum, axis=0)
print(col_sum)
# Sum of each row
row_sum = df.apply(sum, axis=1)
print(row_sum)
c) Example: Applying a custom function to rows
def range_func(row):
    return row.max() - row.min()
row_range = df.apply(range_func, axis=1)
print(row_range)
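Row-wise apply calls the Python function once per row, so where a built-in reduction exists the vectorized form is usually preferable. A sketch of the same row range computed both ways (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# One Python call per row
with_apply = df.apply(lambda row: row.max() - row.min(), axis=1)

# Same result using built-in reductions along axis=1
vectorized = df.max(axis=1) - df.min(axis=1)

print(vectorized)
print(with_apply.equals(vectorized))
```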
3. Mapping Values in a Series
a) Using .map() with a dictionary
You can map values of a Series to new values using a dictionary.
s = pd.Series(['cat', 'dog', 'bird', 'cat'])
mapping = {'cat': 'meow', 'dog': 'woof'}
s_mapped = s.map(mapping)
print(s_mapped)
Output:
0 meow
1 woof
2 NaN
3 meow
dtype: object
Note: Values not in the mapping dictionary become NaN.
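If unmapped values should not turn into NaN, two simple options are to fall back to the original value or to supply a default. A minimal sketch; the 'unknown' default is an arbitrary choice for this example.

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird', 'cat'])
mapping = {'cat': 'meow', 'dog': 'woof'}

# Option 1: keep the original value wherever the mapping has no entry
keep_original = s.map(mapping).fillna(s)

# Option 2: use a default for anything not in the mapping
with_default = s.map(lambda x: mapping.get(x, 'unknown'))

print(keep_original.tolist())
print(with_default.tolist())
```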
b) Using .map() with a function
s_mapped = s.map(lambda x: x.upper())  # for example, upper-case each value
print(s_mapped)
4. Replacing Values Using .replace()
Similar to .map() with a dictionary, but values that are not listed are left unchanged instead of becoming NaN:
s_replaced = s.replace({'cat': 'feline', 'dog': 'canine'})
print(s_replaced)
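The difference between .map() and .replace() for unlisted values can be sketched as follows. The fillna(s) idiom is an added suggestion for keeping the original value after .map():

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'bird', 'cat'])
mapping = {'cat': 'meow', 'dog': 'woof'}

# .map(): 'bird' is not in the dictionary, so it becomes NaN
mapped = s.map(mapping)

# .replace(): 'bird' is not in the dictionary, so it is left as it is
replaced = s.replace(mapping)

# .map() followed by fillna(s) restores the original value where the lookup failed
kept = s.map(mapping).fillna(s)

print(mapped)
print(replaced)
print(kept)
```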
5. Vectorized String Methods with .str
For string data, you can apply vectorized string functions:
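s = pd.Series(['apple', 'banana', 'cherry'])
print(s.str.upper())        # for example, upper-case every string
print(s.str.contains('a'))  # for example, test each string for the substring 'a'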
Summary Table
Method         Applies To           Functionality
.apply()       Series, DataFrame    Apply function element-wise or along axis
.applymap()    DataFrame only       Apply function element-wise on DataFrame
.map()         Series               Map values using dict or function
.replace()     Series, DataFrame    Replace specified values
.str           Series (strings)     Vectorized string operations
14. SORTING AND RANKING
Pandas: Sorting and Ranking
1. Sorting Data
a) Sorting a Series
import pandas as pd
s = pd.Series([4, 2, 8, 1])
# Sort values ascending
print(s.sort_values())
# Sort values descending
print(s.sort_values(ascending=False))
# Sort by index
print(s.sort_index())
b) Sorting a DataFrame
• Sort by one or multiple columns
• Sort rows based on column values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [85.5, 90.0, 88.5]
})
# Sort by Age ascending
print(df.sort_values(by='Age'))
# Sort by Score descending
print(df.sort_values(by='Score', ascending=False))
# Sort by multiple columns: Age ascending, Score descending
print(df.sort_values(by=['Age', 'Score'], ascending=[True, False]))
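The second sort key only matters when the first key has ties. The sketch below uses a hypothetical variation of the data (an extra row, 'Dana', and two people aged 25) to make that visible:

```python
import pandas as pd

# Hypothetical data with a tie on Age, for illustration only
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 25, 22],
    'Score': [85.5, 90.0, 88.5, 70.0]
})

# Age ascending; within the same Age, Score descending
ordered = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(ordered)

# ignore_index=True renumbers the rows 0..n-1 after sorting
renumbered = df.sort_values(by='Age', ignore_index=True)
print(renumbered)
```

Note that sort_values() keeps the original row labels unless ignore_index=True is passed.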
c) Sorting in-place
To modify the DataFrame without creating a new copy:
df.sort_values(by='Age', inplace=True)
2. Ranking Data
Ranking assigns ranks to data, handling ties as needed.
a) Ranking in a Series
s = pd.Series([100, 200, 100, 300])
print(s.rank())                 # Average rank for ties (default)
print(s.rank(method='min'))     # Use minimum rank for ties
print(s.rank(method='max'))     # Use maximum rank for ties
print(s.rank(method='dense'))   # Like 'min', but ranks increase by 1 between groups
Output of the first call, s.rank():
0 1.5
1 3.0
2 1.5
3 4.0
dtype: float64
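To compare the tie-handling methods side by side, the ranks can be collected into one table. This is a sketch; method='first' is an additional pandas option not covered in the notes above, which ranks ties in order of appearance:

```python
import pandas as pd

s = pd.Series([100, 200, 100, 300])

# One column per tie-handling method
ranks = pd.DataFrame({
    'average': s.rank(),
    'min': s.rank(method='min'),
    'max': s.rank(method='max'),
    'dense': s.rank(method='dense'),
    'first': s.rank(method='first'),
})
print(ranks)
```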
b) Ranking in a DataFrame
Rank along rows or columns.
df = pd.DataFrame({
    'Math': [90, 80, 90],
    'English': [70, 90, 80]
})
# Rank each column
print(df.rank())
# Rank each row
print(df.rank(axis=1))
c) Ranking with ascending/descending order
print(s.rank(ascending=False))  # The largest value gets rank 1
Summary Table
Operation        Description                           Example
sort_values()    Sort by values in Series/DataFrame    df.sort_values(by='Age')
sort_index()     Sort by index                         s.sort_index()
rank()           Assign ranks, handling ties           s.rank(method='min')
inplace param    Modify original object                df.sort_values(by='Age', inplace=True)
15. CORRELATION AND COVARIANCE
1. Covariance
What is Covariance?
• Measures how two variables vary together.
• Positive covariance → variables increase or decrease together.
• Negative covariance → one variable increases when the other decreases.
• Zero covariance → no linear relationship.
Calculate Covariance in pandas
import pandas as pd
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 4, 6, 8, 10],
    'Z': [5, 4, 3, 2, 1]
}
df = pd.DataFrame(data)
# Covariance matrix of DataFrame columns
cov_matrix = df.cov()
print(cov_matrix)
Output:
X Y Z
X 2.5 5.0 -2.5
Y 5.0 10.0 -5.0
Z -2.5 -5.0 2.5
• Diagonal elements = variances (sample variances, dividing by n - 1).
• Off-diagonal = covariance between variables.
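The entries of the matrix can be checked by hand. The sketch below recomputes cov(X, Y) from the definition of sample covariance (sum of products of deviations, divided by n - 1) and compares it with pandas:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]})

# Deviations of each column from its own mean
x_dev = df['X'] - df['X'].mean()
y_dev = df['Y'] - df['Y'].mean()

# Sample covariance: sum of products of deviations, divided by n - 1
manual_cov = (x_dev * y_dev).sum() / (len(df) - 1)

print(manual_cov)
print(df.cov().loc['X', 'Y'])
print(df['X'].var())  # matches the X diagonal entry of the covariance matrix
```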
2. Correlation
What is Correlation?
• Measures the strength and direction of linear relationship between two variables.
• Values range from -1 to +1:
o +1 = perfect positive linear correlation
o -1 = perfect negative linear correlation
o 0 = no linear correlation
• More interpretable than covariance as it is normalized.
Calculate Correlation in pandas
corr_matrix = df.corr()
print(corr_matrix)
Output:
X Y Z
X 1.000000 1.000000 -1.000000
Y 1.000000 1.000000 -1.000000
Z -1.000000 -1.000000 1.000000
• X and Y are perfectly positively correlated.
• X and Z are perfectly negatively correlated.
Correlation Methods
By default, df.corr() uses Pearson correlation.
Other methods:
• 'pearson' (default)
• 'kendall' (Kendall Tau)
• 'spearman' (Spearman rank)
Example:
df.corr(method='spearman')
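The choice of method matters when a relationship is monotonic but not linear. The sketch below uses hypothetical data (y = x cubed) and also checks that Pearson correlation equals covariance divided by the product of the standard deviations:

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

pearson = df.corr().loc['x', 'y']                    # linear association
spearman = df.corr(method='spearman').loc['x', 'y']  # rank-based association

# Pearson correlation is the covariance normalized by both standard deviations
manual = df['x'].cov(df['y']) / (df['x'].std() * df['y'].std())

print(pearson)
print(spearman)
print(manual)
```

Spearman reports a perfect monotonic relationship here, while Pearson is below 1 because the points do not lie on a straight line.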
3. Correlation/Covariance between two Series
x = df['X']
y = df['Y']
print(x.corr(y))
print(x.cov(y))
Summary
Function            Purpose                                    Example
DataFrame.cov()     Covariance matrix                          df.cov()
DataFrame.corr()    Correlation matrix (Pearson by default)    df.corr()
Series.corr()       Correlation between two Series             x.corr(y)
Series.cov()        Covariance between two Series              x.cov(y)
16. "Not a Number" Data
Pandas: Handling Not a Number (NaN) Data
1. What is NaN?
• NaN stands for "Not a Number".
• It’s the standard missing data marker in pandas (and NumPy).
• Used to represent missing or undefined values in numeric arrays.
• Can also appear in object/string columns.
2. Detecting NaN Values
a) Using isna() or isnull()
Both methods are equivalent and return a boolean mask.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})
print(df.isna())
Output:
A B
0 False True
1 False False
2 True False
3 False False
b) Check if any NaN in entire DataFrame
print(df.isna().any())        # Per column
print(df.isna().any().any())  # Overall
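Beyond a yes/no check, the boolean mask can be summed to count missing values or used to select the affected rows. A short sketch on the same frame (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})

# True counts as 1, so summing the mask counts the NaNs
nan_per_column = df.isna().sum()
total_nan = int(df.isna().sum().sum())

# Rows that contain at least one NaN
rows_with_nan = df[df.isna().any(axis=1)]

print(nan_per_column)
print(total_nan)
print(rows_with_nan)
```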
3. Handling NaN Data
a) Removing rows or columns with NaN (dropna())
# Drop rows with any NaN
cleaned_rows = df.dropna()
# Drop columns with any NaN
cleaned_cols = df.dropna(axis=1)
print(cleaned_rows)
print(cleaned_cols)
b) Filling NaN values (fillna())
Replace NaN with a specified value:
df_filled = df.fillna(0)
print(df_filled)
You can also forward-fill or backward-fill:
df_ffill = df.ffill()  # Forward fill (fillna(method='ffill') is deprecated in pandas 2.1+)
df_bfill = df.bfill()  # Backward fill (fillna(method='bfill') is deprecated in pandas 2.1+)
c) Filling with a different value per column:
df.fillna({'A': 0, 'B': 99})
4. Replacing NaN with Interpolation
Useful for time-series or numeric data:
df_interpolated = df.interpolate()
print(df_interpolated)
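The fill strategies behave differently at the edges of a column. The sketch below applies forward fill, backward fill and default (linear) interpolation to the same frame; note which of them leave the leading NaN in column B untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})

ffilled = df.ffill()             # copy the previous valid value downwards
bfilled = df.bfill()             # copy the next valid value upwards
interpolated = df.interpolate()  # linear interpolation between neighbours

print(ffilled)
print(bfilled)
print(interpolated)
```

Forward fill and default interpolation have no earlier value to work from, so the first entry of B stays NaN; backward fill replaces it with 2.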
5. Checking for NaN in Series/DataFrame
print(pd.isna(df['A'][2]))   # True
print(pd.notna(df['A'][1]))  # True
6. Why NaN is Important
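• NaNs propagate through element-wise arithmetic (for example, 1 + NaN is NaN), so a missing input is never silently treated as a number.
• Many pandas functions have parameters to ignore or handle NaNs; for example, sum() and mean() skip them by default (skipna=True).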
• Essential to clean or impute missing data for accurate analysis.
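A small sketch of these behaviours, using a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

shifted = s + 1                       # element-wise: the NaN stays NaN
total = s.sum()                       # reductions skip NaN by default
strict_total = s.sum(skipna=False)    # NaN as soon as one value is missing
average = s.mean()                    # mean of the two valid values
valid_count = s.count()               # number of non-NaN values

print(shifted)
print(total, strict_total, average, valid_count)
print(np.nan == np.nan)               # NaN never compares equal, use isna() instead
```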
Summary Table
Method             Purpose                             Example
isna()/isnull()    Detect NaN values                   df.isna()
dropna()           Drop rows/columns with NaNs         df.dropna(), df.dropna(axis=1)
fillna()           Fill NaNs with a specified value    df.fillna(0)
ffill()/bfill()    Fill NaNs from neighboring rows     df.ffill(), df.bfill()
interpolate()      Fill NaNs via interpolation         df.interpolate()
notna()            Detect non-NaN values               df.notna()
17. HIERARCHICAL INDEXING AND LEVELING
Pandas: Hierarchical Indexing (MultiIndex) and Leveling
1. What is Hierarchical Indexing?
• It allows multiple index levels on rows and/or columns.
• You can think of it as nested indexing.
• Enables working with higher-dimensional data (3D+) in a 2D table.
• Makes grouping and slicing complex datasets easier.
2. Creating a MultiIndex (Hierarchical Index)
a) From arrays:
import pandas as pd
arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=['letter', 'number'])
s = pd.Series([10, 20, 30, 40], index=index)
print(s)
Output:
letter  number
A       1         10
        2         20
B       1         30
        2         40
dtype: int64
b) From a DataFrame:
df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
})
df = df.set_index(['City', 'Year'])
print(df)
3. Accessing Data in MultiIndex
a) Using .loc[] with tuples:
print(s.loc[('A', 2)]) # Output: 20
b) Slicing across levels:
print(s.loc['A']) # All data for 'A'
print(s.loc[('A', slice(1, 2))]) # Rows for 'A' with number from 1 to 2 (inclusive)
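A self-contained sketch of these lookups on the Series from section 2(a). The .xs() call is an addition to the notes; it selects on an inner level, which plain .loc['A'] cannot do directly:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=['letter', 'number'],
)
s = pd.Series([10, 20, 30, 40], index=index)

# Full key: one tuple with a label for every level returns a scalar
single = s.loc[('A', 2)]

# Partial key on the outer level: the inner level remains as the index
outer = s.loc['A']

# Cross-section on the inner level: all letters where number == 1
inner = s.xs(1, level='number')

print(single)
print(outer)
print(inner)
```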
4. Index Level Operations
a) Getting levels and names
print(s.index.levels) # Unique values at each level
print(s.index.names) # Names of each level
b) Resetting index levels
df_reset = df.reset_index()
print(df_reset)
c) Swapping levels
s_swapped = s.swaplevel('letter', 'number')
print(s_swapped)
d) Sorting by index levels
s_sorted = s.sort_index(level='number')
print(s_sorted)
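Swapping levels changes only the order of the index levels, not the order of the rows, so it is commonly followed by a sort. A minimal sketch on the same Series:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=['letter', 'number'],
)
s = pd.Series([10, 20, 30, 40], index=index)

# 'number' becomes the outer level; the row order is unchanged
swapped = s.swaplevel('letter', 'number')

# Sorting groups the rows by the new outer level
sorted_swapped = swapped.sort_index()

print(list(swapped.index.names))
print(sorted_swapped)
```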
5. Aggregation on MultiIndex DataFrames
You can perform aggregation grouped by levels:
df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
}).set_index(['City', 'Year'])
print(df.groupby(level='City').sum())
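Either index level can serve as the grouping key. The sketch below groups the same data once by 'City' and once by 'Year'; the mean per year is an added illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'LA'],
    'Year': [2020, 2021, 2020, 2021],
    'Value': [100, 110, 90, 95]
}).set_index(['City', 'Year'])

# Total per city (collapses the Year level)
by_city = df.groupby(level='City').sum()

# Mean per year (collapses the City level)
by_year = df.groupby(level='Year')['Value'].mean()

print(by_city)
print(by_year)
```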
Summary Table
Operation               Description                           Example
Create MultiIndex       Using arrays or set_index             pd.MultiIndex.from_arrays()
Access with .loc        Access data using tuples or slices    s.loc[('A', 1)]
Reset index             Flatten MultiIndex back to columns    df.reset_index()
Swap index levels       Swap order of index levels            s.swaplevel()
Sort by index levels    Sort by specified index level         s.sort_index(level='number')
Group by index level    Aggregate data based on level         df.groupby(level='City').sum()
18. READING AND WRITING DATA: CSV OR TEXT FILE
Pandas: Reading and Writing CSV or Text Files
1. Reading CSV or Text Files
a) pd.read_csv()
• The most common function to read CSV files (also works with many text files).
• Automatically parses CSV into a DataFrame.
import pandas as pd
# Basic CSV read
df = pd.read_csv('[Link]')
print(df.head())
b) Common Parameters of read_csv
Parameter             Description                               Example
filepath_or_buffer    Path to file or URL                       '[Link]'
sep                   Delimiter (default is comma ,)            sep='\t' for tab-separated
header                Row number(s) to use as column names      header=0 (default), None
names                 List of column names to use               names=['A', 'B', 'C']
index_col             Column(s) to set as index                 index_col=0
usecols               Return a subset of columns                usecols=['A', 'C']
dtype                 Data type for columns                     dtype={'A': int, 'B': float}
na_values             Additional strings to recognize as NaN    na_values=['NA', 'missing']
parse_dates           Parse columns as dates                    parse_dates=['date_column']
skiprows              Number of rows or list of rows to skip    skiprows=1
nrows                 Number of rows to read                    nrows=100
Example: Reading a tab-separated file with no header
df = pd.read_csv('[Link]', sep='\t', header=None, names=['A', 'B', 'C'])
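Several of these parameters can be tried without a file on disk. The sketch below is an assumption-laden stand-in: it feeds read_csv an in-memory io.StringIO buffer, and the text and column names are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical tab-separated text with no header row
raw = "2024-01-01\t1.5\tNA\n2024-01-02\tmissing\t7\n"

df = pd.read_csv(
    io.StringIO(raw),         # any file path or URL could be used here instead
    sep='\t',                 # tab-separated fields
    header=None,              # the data has no header row
    names=['date', 'A', 'B'], # so column names are supplied explicitly
    na_values=['missing'],    # treated as NaN, in addition to defaults such as 'NA'
    parse_dates=['date'],     # convert this column to datetime
)

print(df)
print(df.dtypes)
```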
2. Writing Data to CSV or Text Files
a) df.to_csv()
Saves DataFrame to CSV file.
# index=False avoids writing the row numbers
df.to_csv('[Link]', index=False)
b) Common Parameters of to_csv
Parameter Description Example
path_or_buf File path or object '[Link]'
sep Field delimiter sep='\t'
index Write row names (index) index=False
header Write column names header=True
columns Specify columns to write columns=['A', 'B']
mode Write mode, e.g., append ('a') mode='a'
na_rep Representation for missing data na_rep='NA'
compression Compression mode (e.g., 'gzip', 'bz2') compression='gzip'
Example: Writing DataFrame to a tab-separated file without index
df.to_csv('[Link]', sep='\t', index=False)
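To see exactly what to_csv() produces, the output can be written to an in-memory buffer instead of a file. The data below is hypothetical, and io.StringIO is used only so the sketch needs no file on disk:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({'A': [1.0, np.nan], 'B': ['x', 'y']})

buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False, na_rep='NA')

text = buffer.getvalue()
print(text)
```

The header row is written first, the index is omitted, and the missing value appears as the na_rep string.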
3. Reading Large Files in Chunks
You can read large files in chunks to save memory:
chunk_iter = pd.read_csv('large_file.csv', chunksize=1000)
for chunk in chunk_iter:
    print(chunk.head())
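A typical use of chunks is to accumulate a result without ever holding the whole file in memory. The sketch below uses a small in-memory text as a stand-in for a large file; the column name 'value' is hypothetical:

```python
import io
import pandas as pd

# Stand-in for a large file: a header plus the numbers 0..9, one per line
raw = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most 4 rows
    total += chunk['value'].sum()
    n_chunks += 1

print(total, n_chunks)
```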
Summary Table
Function Purpose Basic Usage
pd.read_csv() Read CSV/text file pd.read_csv('[Link]')
df.to_csv() Write DataFrame to CSV df.to_csv('[Link]', index=False)
chunksize Read file in chunks pd.read_csv('[Link]', chunksize=1000)
19. HTML FILES
Pandas: Reading and Writing HTML Files
1. Reading HTML Tables
Pandas can read tables embedded in HTML files or web pages using the pd.read_html()
function.
a) Read tables from a web page
import pandas as pd
# From a URL
url = '[Link]
tables = pd.read_html(url)
print(f"Number of tables found: {len(tables)}")
# Access the first table
df = tables[0]
print(df.head())
b) Reading from a local HTML file
tables = pd.read_html('local_file.html')
df = tables[0] # First table in the HTML file
c) Parameters of read_html
Parameter    Description                                    Example
io           URL or local file path or string               '[Link] '[Link]'
match        String or regex to match table content         'GDP' to find tables with GDP keyword
flavor       Parser engine: 'lxml', 'bs4' or 'html5lib'     default tries 'lxml', then falls back to 'bs4' with html5lib
header       Row to use as header                           header=0
skiprows     Rows to skip                                   skiprows=1
attrs        Dict of HTML attributes to match               attrs = {'class': 'wikitable'}
encoding     Character encoding                             'utf-8'
2. Writing DataFrames to HTML
a) Save a DataFrame as an HTML table
df.to_html('[Link]')
This saves the DataFrame as a basic HTML table.
b) Customizing HTML output
• You can customize the table by specifying parameters:
df.to_html('[Link]', index=False, border=0, classes='table table-striped')
• This removes the index column, sets border to 0, and adds CSS classes (good for
Bootstrap styling).
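When no file path is given, to_html() returns the markup as a string, which makes the effect of these parameters easy to inspect. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# No path given, so the HTML is returned as a string
html = df.to_html(index=False, border=0, classes='table table-striped')
print(html)
```

The classes are added to the table element, so a stylesheet such as Bootstrap can style the result once the markup is embedded in a page.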
3. Example: Reading and Writing HTML Table
# Read tables from Wikipedia
tables = pd.read_html('[Link]_(United_Nations)')
# Extract first table
df = tables[0]
# Save to local HTML
df.to_html('countries_population.html', index=False)
Notes:
• Reading HTML tables requires a parser library: lxml, or beautifulsoup4 together with
html5lib. You can install them via:
pip install lxml beautifulsoup4 html5lib
• read_html returns a list of DataFrames since one HTML page can have multiple
tables.
20. MICROSOFT EXCEL FILES
Pandas: Reading and Writing Microsoft Excel Files
1. Reading Excel Files
a) Basic usage with pd.read_excel()
import pandas as pd
# Read Excel file (.xls or .xlsx)
df = pd.read_excel('[Link]')
print(df.head())
b) Reading specific sheets
• By default, reads the first sheet.
# Read a specific sheet by name
df_sheet = pd.read_excel('[Link]', sheet_name='Sheet2')
# Read a sheet by index (0-based)
df_sheet = pd.read_excel('[Link]', sheet_name=0)
c) Reading multiple sheets at once
dfs = pd.read_excel('[Link]', sheet_name=None) # Reads all sheets
# dfs is a dictionary with sheet names as keys and DataFrames as values
print(dfs.keys())
d) Common parameters
Parameter Description Example
sheet_name Sheet name, index, list of names/indexes, or None 'Sheet1', [0, 2], None
header Row number to use as column names header=0 (default)
names List of column names to use names=['A', 'B', 'C']
usecols Columns to read usecols='A:C' or [0,2]
skiprows Rows to skip skiprows=2
nrows Number of rows to read nrows=100
dtype Data types for columns dtype={'A': int}
2. Writing to Excel Files
a) Basic write with to_excel()
df.to_excel('[Link]', index=False) # index=False to skip row numbers
b) Writing multiple DataFrames to different sheets
# df1 and df2 are existing DataFrames
with pd.ExcelWriter('output_multi.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')
c) Important parameters for to_excel()
Parameter       Description                                 Example
excel_writer    File path or ExcelWriter object             '[Link]'
sheet_name      Sheet name                                  'Sheet1'
index           Write row index                             index=False
header          Write column headers                        header=True
startrow        Upper left cell row to start writing        startrow=2
startcol        Upper left cell column to start writing     startcol=1
engine          Engine to use ('openpyxl', 'xlsxwriter')    'xlsxwriter'
3. Requirements
• To work with Excel files, pandas uses external libraries:
o openpyxl for .xlsx files
o xlrd for .xls files (Note: recent versions of xlrd dropped support for .xlsx)
o xlsxwriter for writing (optional; adds formatting and chart features)
Install them via pip if needed:
pip install openpyxl xlrd xlsxwriter
4. Example: Read, modify, and save Excel file
df = pd.read_excel('[Link]', sheet_name='2023')
# Add a new column
df['Total'] = df['Quantity'] * df['Price']
# Save to new Excel file
df.to_excel('sales_updated.xlsx', index=False)
Summary Table
Function           Purpose                          Basic Usage
pd.read_excel()    Read Excel file                  pd.read_excel('[Link]')
df.to_excel()      Write DataFrame to Excel file    df.to_excel('[Link]')
pd.ExcelWriter     Write multiple sheets            with pd.ExcelWriter(path) as w: