Python & AI/ML
Comprehensive Study Guide
NumPy · Pandas · Matplotlib
From Basics to Advanced — AI/ML Focused
Covers: Core Python · NumPy · Pandas · Matplotlib
Data Structures · OOP · Generators · Decorators
Vectorization · Feature Engineering · Visualization
2024 Edition | Beginner → Advanced | Interview-Ready
Table of Contents
1. Python — Core to Advanced
1.1 Basics: Syntax, Variables, Data Types, Operators
1.2 Control Flow: Loops & Conditionals
1.3 Functions (all variants incl. Decorators)
1.4 Data Structures (List, Tuple, Set, Dict)
1.5 Object-Oriented Programming
1.6 File & Exception Handling
1.7 Modules, Packages, Virtual Environments
1.8 Iterators & Generators
1.9 Functional Programming
1.10 Memory Management & Performance
1.11 Python in the AI/ML Ecosystem
2. NumPy — Basic to Advanced
2.1 ndarray Creation, Indexing, Slicing
2.2 Broadcasting & Vectorization
2.3 Mathematical & Linear Algebra Operations
2.4 Random Module
2.5 Advanced Indexing & Masking
2.6 Memory Efficiency & Performance
2.7 Role of NumPy in ML
3. Pandas — Basic to Advanced
3.1 Series & DataFrame
3.2 Data Loading (CSV, Excel, JSON)
3.3 Data Cleaning & Missing Values
3.4 Filtering, Selection, GroupBy
3.5 Merge, Join, Concat
3.6 Time Series & Feature Engineering
3.7 Performance Tips & ML Pipelines
4. Matplotlib — Basic to Advanced
4.1 Line, Bar, Scatter, Histogram
4.2 Customization & Subplots
4.3 Advanced Visualizations
4.4 Saving/Exporting & ML Use Cases
5. End-to-End Mini Project
5.1 Data → Preprocessing → Visualization → Insights
6. Reference Tables & Interview Guide
6.1 Complexity Cheat Sheet
6.2 Common Pitfalls
6.3 Top Interview Questions & Answers
Section 1 — Python: Core to Advanced
1.1 Basics: Syntax, Variables, Data Types, Operators
Python is an interpreted, high-level, dynamically-typed language. Its clean syntax makes it the dominant
language in data science and AI/ML.
Variables & Dynamic Typing
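Python is dynamically typed: names carry no declared type, and the type of the object a name refers to
is checked at runtime. (Duck typing is the related idiom of relying on what an object can do rather than
on its class.) Variable names are case-sensitive and should follow the snake_case convention.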
# Variable assignment — no type declaration needed
x = 10 # int
name = 'Alice' # str
pi = 3.14159 # float
flag = True # bool
data = None # NoneType
# Multiple assignment
a, b, c = 1, 2, 3
x = y = z = 0
# Type checking
print(type(x)) # <class 'int'>
print(isinstance(x, int)) # True
# Type conversion
s = str(42) # '42'
n = int('100') # 100
f = float('3.14') # 3.14
Core Data Types
Type Example Mutable? AI/ML Use
int x = 42 No Index, count, label
float x = 3.14 No Weights, probabilities
complex x = 2+3j No Signal processing, FFT
str x = 'hello' No Text, feature names
bool x = True No Masks, flags
list x = [1,2,3] Yes Dataset samples
tuple x = (1,2) No Fixed configs, coords
dict x = {'a':1} Yes Hyperparameters, configs
set x = {1,2,3} Yes Unique classes, vocab
NoneType x = None No Missing values
Operators
# Arithmetic
5 + 3, 5 - 3, 5 * 3, 5 / 3 # 8, 2, 15, 1.666...
5 // 3 # 1 (floor division — common in ML indexing)
5 % 3 # 2 (modulo)
2 ** 10 # 1024 (power — used in learning rate schedules)
# Comparison → returns bool
x == y, x != y, x > y, x >= y
# Logical
True and False # False
True or False # True
not True # False
# Bitwise (used in masks, GPU kernels)
0b1010 & 0b1100 # 0b1000 = 8
0b1010 | 0b1100 # 0b1110 = 14
1 << 3 # 8 (left shift)
# Identity & membership
x is None # identity check (prefer over == None)
3 in [1, 2, 3] # True
5 not in [1,2,3] # True
Note: In AI/ML code, use 'is None' (identity) rather than '== None' to check for a missing value. NumPy
arrays and pandas objects overload ==, so the comparison runs element-wise, and using its result in an
if statement can raise an ambiguous truth-value error.
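The difference can be sketched without installing NumPy. The class below is a hypothetical stand-in (not a real library type) whose == behaves element-wise, the way array types do; it only illustrates why the identity check is the safer habit.

```python
class ElementwiseVector:
    """Toy stand-in for an array type whose == compares element by element."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Like an array, return one result per element instead of a single bool.
        return [v == other for v in self.values]


vec = ElementwiseVector([1.0, None, 3.0])

equality_result = (vec == None)   # noqa: E711 (deliberately the discouraged form)
identity_result = (vec is None)   # always a plain bool

print(equality_result)  # [False, True, False]
print(identity_result)  # False
```

The identity check cannot be overloaded, so it always answers the question "is this name bound to None?" with a single bool.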
1.2 Control Flow: Loops & Conditionals
# if / elif / else
score = 0.87
if score >= 0.90:
    print('Excellent')
elif score >= 0.75:
    print('Good') # ← this runs
else:
    print('Needs improvement')
# Ternary expression
label = 'positive' if score > 0.5 else 'negative'
# for loop with range
for epoch in range(1, 6):
    print(f'Epoch {epoch}')
# enumerate: index + value (very common in ML training loops)
dataset = [0.1, 0.4, 0.7]
for idx, val in enumerate(dataset):
    print(idx, val)
# zip: iterate multiple iterables in parallel
preds = [0, 1, 1]
labels = [0, 1, 0]
for p, l in zip(preds, labels):
    print('pred:', p, 'label:', l)
# while loop
loss = 1.0
while loss > 0.01:
    loss *= 0.9
# List comprehension (Pythonic, faster than map+lambda for simple ops)
squares = [x**2 for x in range(10)]
evens = [x for x in range(20) if x % 2 == 0]
# Dict comprehension
word_len = {w: len(w) for w in ['hello', 'world', 'ai']}
# break / continue / pass
for x in range(100):
    if x == 5: break # early exit
    if x % 2: continue # skip odd
    pass # no-op placeholder
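Tip: List comprehensions are usually faster than the equivalent for-loop with append() in CPython, mainly
because they avoid looking up and calling the list's append method on every iteration. The size of the
gain depends on the Python version and on how much work the loop body does.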
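A quick way to check the claim on your own machine is a small timeit comparison. This is a sketch, not a rigorous benchmark: the function names are made up for the example, and the measured ratio will vary between Python versions and hardware.

```python
import timeit


def squares_loop(n):
    """Build the list with an explicit loop and append()."""
    out = []
    for x in range(n):
        out.append(x * x)
    return out


def squares_comprehension(n):
    """Build the same list with a comprehension."""
    return [x * x for x in range(n)]


n = 10_000
loop_time = timeit.timeit(lambda: squares_loop(n), number=50)
comp_time = timeit.timeit(lambda: squares_comprehension(n), number=50)

# Both produce identical results; only the timing differs.
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```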
1.3 Functions
Definition, Scope, Default & Keyword Arguments
# Basic function
def greet(name, greeting='Hello'): # 'greeting' has a default
    return f'{greeting}, {name}!'
greet('Alice') # Hello, Alice!
greet('Bob', 'Hi') # Hi, Bob!
greet(greeting='Hey', name='Carol') # keyword args: order-free
# *args: variable positional arguments (tuple inside)
def sum_all(*args):
    return sum(args)
sum_all(1, 2, 3, 4) # 10
# **kwargs: variable keyword arguments (dict inside)
def build_model(**kwargs):
    for k, v in kwargs.items():
        print(f'{k}: {v}')
build_model(layers=3, lr=0.001, dropout=0.2)
# Combined signature: order matters (pos, *args, kw-only, **kwargs)
def train(data, *callbacks, epochs=10, **hyperparams):
    pass
Lambda Functions
# Lambda = anonymous single-expression function
square = lambda x: x ** 2
square(5) # 25
# Very useful with sorted(), map(), filter()
students = [('Alice', 88), ('Bob', 95), ('Carol', 72)]
sorted_students = sorted(students, key=lambda s: s[1], reverse=True)
# [('Bob', 95), ('Alice', 88), ('Carol', 72)]
# In pandas: apply a lambda to a column
# df['score_norm'] = df['score'].apply(lambda x: (x - df['score'].mean()) / df['score'].std())
Recursion
# Factorial
def factorial(n):
    if n <= 1: return 1 # base case
    return n * factorial(n - 1) # recursive case
# Tree traversal (common in decision trees); the attribute names are illustrative
def dfs(node):
    if node is None: return
    print(node.value)
    dfs(node.left)
    dfs(node.right)
# Python default recursion limit = 1000
import sys
sys.setrecursionlimit(10000) # raise with care: very deep recursion can still crash the interpreter
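When a tree can be deeper than the recursion limit, an explicit stack is a common alternative to raising the limit. The sketch below assumes a hypothetical binary Node with value, left and right attributes; it is one way to write the traversal, not the only one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary tree node used only for this example."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def dfs_iterative(root):
    """Pre-order traversal using an explicit stack instead of recursion."""
    visited = []
    stack = [root] if root is not None else []
    while stack:
        node = stack.pop()
        visited.append(node.value)
        # Push right first so that the left child is processed first.
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)
    return visited


# A degenerate, list-like tree that is deeper than the default recursion limit.
root = Node(0)
current = root
for i in range(1, 5000):
    current.right = Node(i)
    current = current.right

order = dfs_iterative(root)
print(len(order), order[:3])  # 5000 [0, 1, 2]
```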
Decorators — With Real-World Use Cases
A decorator is a function that wraps another function to extend its behaviour without modifying it.
Decorators appear throughout the Python ML and serving stack (Flask route registration, TensorFlow, PyTorch).
import time, functools
# ── Basic decorator ──
def timer(func):
    @functools.wraps(func) # preserves __name__, __doc__
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f'{func.__name__} ran in {end-start:.4f}s')
        return result
    return wrapper
@timer
def train_model(epochs):
    time.sleep(0.1)
train_model(10) # prints e.g. 'train_model ran in 0.1002s'
# ── Logger decorator ──
def logger(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__} with {args}, {kwargs}')
        return func(*args, **kwargs)
    return wrapper
# ── Decorator with arguments (factory pattern) ──
def retry(max_attempts=3):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1: raise
                    print(f'Attempt {attempt+1} failed: {e}')
        return wrapper
    return decorator
@retry(max_attempts=5)
def fetch_data(url): ...
# ── Stacking decorators ──
# Applied bottom-up: logger wraps predict, then timer wraps the logged function.
# On a call, timer's wrapper is entered first and logger's wrapper runs inside it.
@timer
@logger
def predict(x): return x * 2
# ── Class-based decorator ──
class Memoize:
    def __init__(self, func):
        self.func = func
        self.cache = {}
    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]
@Memoize
def fib(n):
    if n < 2: return n
    return fib(n-1) + fib(n-2)
Tip: Real-world ML decorators: @tf.function (graph compilation in TensorFlow), @torch.no_grad() (disable
grad tracking during inference), @app.route() in Flask or @app.get() in FastAPI (endpoint registration),
@property (getters in model classes).
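The order in which stacked decorators run is easy to get wrong, so it helps to record it. The sketch below uses a hypothetical tag() decorator factory (not part of any library) that logs when each wrapper is entered and left.

```python
import functools

calls = []


def tag(name):
    """Decorator factory that records when its wrapper is entered and left."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(f"enter {name}")
            result = func(*args, **kwargs)
            calls.append(f"exit {name}")
            return result
        return wrapper
    return decorator


@tag("outer")   # applied second, so its wrapper is entered first
@tag("inner")   # applied first, so it sits closest to the function
def predict(x):
    calls.append("predict")
    return x * 2


value = predict(21)
print(value)  # 42
print(calls)  # ['enter outer', 'enter inner', 'predict', 'exit inner', 'exit outer']
```

Decorators are applied from the bottom up, but at call time the topmost wrapper is the first to run.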
1.4 Data Structures
List — Ordered, Mutable, Allows Duplicates
lst = [1, 2, 3, 4, 5]
# Indexing & slicing
lst[0] # 1 (first)
lst[-1] # 5 (last)
lst[1:4] # [2,3,4]
lst[::2] # [1,3,5] (step=2)
lst[::-1] # [5,4,3,2,1] (reverse)
# Common methods
lst.append(6) # [1,2,3,4,5,6]
lst.extend([7,8]) # extend from iterable
lst.insert(0, 0) # insert at index
lst.remove(3) # remove first occurrence
lst.pop() # remove & return last
lst.pop(0) # remove & return index 0
lst.sort() # in-place sort O(n log n)
sorted(lst) # new sorted list
lst.reverse() # in-place reverse
lst.index(4) # find index
lst.count(2) # count occurrences
# List of lists (2D — used for matrices before NumPy)
matrix = [[1,2,3],[4,5,6],[7,8,9]]
matrix[1][2] # 6
# Flatten with comprehension
flat = [x for row in matrix for x in row]
Operation        Syntax           Complexity      Notes
Append           lst.append(x)    O(1) amortized  Fast; common in loops
Insert at i      lst.insert(i,x)  O(n)            Shifts elements
Access by index  lst[i]           O(1)            Direct memory offset
Search           x in lst         O(n)            Use set for O(1) lookup
Delete by index  del lst[i]       O(n)            Shifts elements
Sort             lst.sort()       O(n log n)      Timsort (stable)
Slice copy       lst[a:b]         O(b-a)          Creates new list
Tuple — Ordered, Immutable
t = (1, 2, 3)
t = 1, 2, 3 # parentheses optional
single = (42,) # single-element needs trailing comma!
# Useful as dict keys (lists can't be dict keys)
coords = {(0,0): 'origin', (1,0): 'right'}
# Named tuple — self-documenting, used in dataset pipelines
from collections import namedtuple
Sample = namedtuple('Sample', ['features', 'label'])
s = Sample(features=[0.1, 0.9], label=1)
print(s.label) # 1
# Tuple unpacking
x, y, z = (10, 20, 30)
first, *rest = (1, 2, 3, 4, 5) # rest = [2,3,4,5]
Dictionary — Key-Value, Ordered (Python 3.7+), Mutable
d = {'lr': 0.001, 'batch': 32, 'epochs': 100}
# Access & defaults
d['lr'] # 0.001 (KeyError if the key is missing)
d.get('dropout', 0.0) # 0.0 (safe default)
# Modification
d['optimizer'] = 'adam'
d.update({'lr': 0.0001, 'weight_decay': 1e-5})
# Iteration
for key in d: pass # iterate keys
for val in d.values(): pass
for k, v in d.items(): pass # most common
# Dict comprehension
squared = {x: x**2 for x in range(5)}
# Merge dicts (Python 3.9+); on duplicate keys the right-hand dict wins
d1, d2 = {'lr': 0.001}, {'lr': 0.01, 'batch': 64}
merged = d1 | d2 # {'lr': 0.01, 'batch': 64}
# defaultdict: avoids KeyError
from collections import defaultdict
text = 'to be or not to be' # sample input
word_count = defaultdict(int)
for word in text.split():
    word_count[word] += 1
# Counter — instant frequency count
from collections import Counter
c = Counter(['cat','dog','cat','bird','cat'])
c.most_common(2) # [('cat',3), ('dog',1)]
# OrderedDict (legacy — dicts are ordered since 3.7)
from collections import OrderedDict
Set — Unordered, Unique Elements
s = {1, 2, 3, 4}
s.add(5) # {1,2,3,4,5}
s.discard(6) # no error if missing (vs remove which raises)
# Set operations (useful for NLP / class analysis)
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
a | b # union {1,2,3,4,5,6}
a & b # intersect {3,4}
a - b # difference {1,2}
a ^ b # symmetric diff {1,2,5,6}
# O(1) membership test — much faster than list for large sets
vocab = set(words) # build once (words: any iterable of tokens)
'hello' in vocab # O(1)
Interview: Why is set lookup O(1)? Sets are implemented as hash tables: the hash of an element selects
a bucket directly, so no linear scan is needed. The O(1) cost is an average; many colliding hashes
degrade it towards O(n).
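The idea can be sketched with a toy hash set. This is a simplified illustration using separate chaining with a fixed number of buckets; CPython's real set uses open addressing and resizes itself, and the class name here is made up.

```python
class TinyHashSet:
    """Toy hash set: the hash picks a bucket, only that bucket is scanned."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, item):
        # hash() maps the item to an integer; the modulo maps it to a bucket.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:      # scan one small bucket, not all items
            bucket.append(item)

    def __contains__(self, item):
        return item in self._bucket(item)


vocab = TinyHashSet()
for word in ["hello", "world", "ai", "hello"]:
    vocab.add(word)

print("hello" in vocab)  # True
print("ml" in vocab)     # False
```

With enough buckets each one holds only a handful of items, which is why lookup time stays roughly constant as the set grows.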
1.5 Object-Oriented Programming
# ── Class definition ──
class NeuralNetwork:
    # Class attribute (shared across all instances)
    framework = 'PyTorch'
    def __init__(self, layers, lr=0.001): # constructor
        self.layers = layers # instance attribute
        self.lr = lr
        self._weights = None # '_' = private by convention
        self.__bias = 0.0 # '__' = name-mangled
    def __repr__(self): # developer-facing string
        return f'NeuralNetwork(layers={self.layers}, lr={self.lr})'
    def __str__(self): # user-facing string
        return f'{self.framework} model with {len(self.layers)} layers'
    def forward(self, x):
        return x # placeholder
    @classmethod
    def from_config(cls, config: dict):
        return cls(config['layers'], config['lr'])
    @staticmethod
    def relu(x): # no self/cls: utility method
        return max(0, x)
    @property
    def num_params(self):
        return sum(l['units'] for l in self.layers)
# ── Inheritance ──
class CNN(NeuralNetwork): # CNN inherits NeuralNetwork
    def __init__(self, layers, lr, kernel_size=3):
        super().__init__(layers, lr) # call parent __init__
        self.kernel_size = kernel_size
    def forward(self, x): # method override (polymorphism)
        print(f'CNN forward with kernel {self.kernel_size}')
        return super().forward(x)
# ── Multiple inheritance ──
class Trainable:
    def fit(self, X, y): pass
class Serializable:
    def save(self, path): pass
class Model(NeuralNetwork, Trainable, Serializable):
    pass # MRO (Method Resolution Order) defines lookup order
print(Model.__mro__) # shows resolution chain
# ── Dunder (magic) methods ──
class Dataset:
    def __init__(self, data):
        self.data = data
    def __len__(self): return len(self.data)
    def __getitem__(self, i): return self.data[i]
    def __iter__(self): return iter(self.data)
    def __contains__(self, x): return x in self.data
ds = Dataset([1,2,3,4,5])
len(ds) # 5
ds[2] # 3
for x in ds: pass
3 in ds # True
OOP Pillar     Concept                    Python Mechanism               ML Example
Encapsulation  Hide internal state        _attr, __attr, @property       Model weights hidden from user
Inheritance    Reuse parent behaviour     class Child(Parent):           CNNLayer extends Layer
Polymorphism   Same interface, diff impl  Method override / duck typing  model.forward() for CNN/RNN
Abstraction    Hide complexity            ABC, abstract methods          sklearn BaseEstimator
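The abstraction and polymorphism rows can be illustrated with the standard abc module. The class names below (BaseModel, Doubler, Thresholder) are hypothetical and exist only for this sketch; real libraries define their own base classes.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical base class: subclasses must supply predict()."""

    @abstractmethod
    def predict(self, x):
        """Return a prediction for a single input."""

    def predict_many(self, xs):
        # Shared behaviour written once against the abstract interface.
        return [self.predict(x) for x in xs]


class Doubler(BaseModel):
    def predict(self, x):
        return 2 * x


class Thresholder(BaseModel):
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def predict(self, x):
        return int(x > self.threshold)


# Polymorphism: the same call works for every concrete subclass.
for model in (Doubler(), Thresholder()):
    print(type(model).__name__, model.predict_many([0.2, 0.9]))

# Abstraction: the incomplete base class cannot be instantiated.
try:
    BaseModel()
except TypeError:
    print("BaseModel cannot be instantiated")
```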
1.6 File Handling & Exception Handling
# ── File I/O ──
# Always use a context manager (with): it closes the file automatically
with open('[Link]', 'r', encoding='utf-8') as f:
    content = f.read() # whole file as string
with open('[Link]', 'r') as f:
    lines = f.readlines() # list of lines
with open('[Link]', 'w') as f:
    f.write('Hello, AI!\n')
with open('[Link]', 'a') as f: # append mode
    f.write('epoch 10 done\n')
# ── pathlib (modern, preferred) ──
from pathlib import Path
p = Path('data/models')
p.mkdir(parents=True, exist_ok=True)
for f in p.glob('*.pt'): # glob pattern matching
    print(f.name, f.stem)
# ── Exception handling ──
try:
    model = load_model('[Link]')
    preds = model.predict(X)
except FileNotFoundError as e:
    print(f'Model not found: {e}')
except ValueError as e:
    print(f'Bad input shape: {e}')
except Exception as e: # catch-all (use sparingly)
    print(f'Unexpected error: {e}')
    raise # re-raise to not swallow
else:
    print('Prediction succeeded') # runs if no exception
finally:
    print('Cleanup') # always runs
# Custom exceptions
class DataValidationError(ValueError):
    def __init__(self, msg, column=None):
        super().__init__(msg)
        self.column = column
raise DataValidationError('NaN found', column='age')
1.7 Modules, Packages & Virtual Environments
# Import styles
import numpy # standard: use numpy.array()
import numpy as np # aliased: use np.array()
from numpy import array # direct import
from numpy import * # wildcard — AVOID (pollutes namespace)
# __name__ guard — code only runs when file executed directly
if __name__ == '__main__':
    main()
# Package structure
# mypackage/
# ├── __init__.py (makes it a package)
# ├── [Link]
# └── models/
#     ├── __init__.py
#     └── [Link]
# Virtual environments
# python -m venv venv # create
# source venv/bin/activate # activate (Linux/Mac)
# venv\Scripts\activate # activate (Windows)
# pip install numpy pandas # install packages
# pip freeze > requirements.txt # save deps
# pip install -r requirements.txt # restore
# Useful standard library modules for ML
import os, sys, json, csv, re
from datetime import datetime
from collections import defaultdict, Counter, deque
from itertools import chain, product, combinations
import functools, operator
import multiprocessing, concurrent.futures
import logging, warnings
import pickle # serialization (standard library)
import joblib # serialization (third-party: pip install joblib)
1.8 Iterators & Generators
# ── Iterator protocol ──
class Range:
    def __init__(self, n): self.n = n; self.i = 0
    def __iter__(self): return self
    def __next__(self):
        if self.i >= self.n: raise StopIteration
        val = self.i; self.i += 1; return val
# ── Generator function (yield) ──
def data_loader(file_path, batch_size=32):
    '''Yields mini-batches: memory efficient for large datasets'''
    batch = []
    with open(file_path) as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch: yield batch # last partial batch
for batch in data_loader('[Link]'):
    train_step(batch) # never loads full file into memory!
# ── Generator expression (lazy list comprehension) ──
gen = (x**2 for x in range(1_000_000)) # nothing is computed yet
next(gen) # compute on demand
# ── Send values INTO a generator ──
def accumulator():
    total = 0
    while True:
        val = yield total
        if val is None: break
        total += val
acc = accumulator()
next(acc) # prime the generator
acc.send(10) # 10
acc.send(20) # 30
# ── itertools (memory-efficient combinatorics) ──
from itertools import islice, chain, product, cycle
first_5 = list(islice(gen, 5)) # take the next 5 items from the generator
all_data = chain(train, val, test) # chain iterables lazily (train, val, test: any iterables)
Tip: The same lazy, batch-at-a-time pattern is what PyTorch's DataLoader and TensorFlow's tf.data pipeline
expose: datasets larger than RAM are processed by streaming batches on demand.
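The streaming idea does not depend on files. A minimal sketch of a generic batching generator, built on itertools.islice, is shown below; the function name batched is chosen for this example and the input stream is a stand-in for a file or database cursor.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return          # input exhausted
        yield batch


stream = (i * i for i in range(10))   # lazy source, nothing computed yet
batches = list(batched(stream, 4))
print(batches)  # [[0, 1, 4, 9], [16, 25, 36, 49], [64, 81]]
```

Only one batch is held in memory at a time, and the last batch may be shorter than batch_size.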
1.9 Functional Programming
from functools import reduce, partial, lru_cache
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# map — apply function to every element (lazy, returns iterator)
squares = list(map(lambda x: x**2, nums)) # [1,4,9,16,...]
# filter — keep elements where function returns True
evens = list(filter(lambda x: x % 2 == 0, nums)) # [2,4,6,8,10]
# reduce — fold list into single value
product = reduce(lambda a, b: a * b, nums) # 3628800
# In practice, prefer comprehensions or NumPy over map/filter
squares_comp = [x**2 for x in nums] # more readable
# partial — fix some arguments (creates new function)
def power(base, exp): return base ** exp
square = partial(power, exp=2)
cube = partial(power, exp=3)
square(4) # 16
# lru_cache — memoize expensive function calls
@lru_cache(maxsize=None) # maxsize=None = unbounded
def expensive_feature(param):
    # simulate slow feature extraction
    return param ** 3 + param ** 2
# functools.wraps (already covered in decorators)
# zip, enumerate, any, all — built-in functional tools
any(x > 5 for x in nums) # True (short-circuits)
all(x > 0 for x in nums) # True
sum(x**2 for x in nums) # 385 (generator expression)
1.10 Memory Management & Performance
import sys, tracemalloc
# Memory sizes
print(sys.getsizeof([])) # 56 bytes (empty list, 64-bit CPython)
print(sys.getsizeof(range(1000))) # 48 bytes (range is lazy!)
# ── Python memory model ──
# Everything is an object. CPython caches the small integers -5 to 256.
a = 256; b = 256; print(a is b) # True (same cached object)
a = int('257'); b = int('257'); print(a is b) # False in CPython (two separate objects)
# This is an implementation detail: compare numbers with ==, never with 'is'.
# Reference counting + cyclic GC
import gc
gc.collect() # force garbage collection
# ── __slots__: eliminates the per-instance __dict__ ──
class Point: # without __slots__: ~184 bytes per instance
    __slots__ = ['x', 'y'] # with __slots__: ~56 bytes (figures vary by Python version)
    def __init__(self, x, y): self.x = x; self.y = y
# ── Profiling ──
# cProfile (function-level)
import cProfile
cProfile.run('train_loop()')
# timeit: micro-benchmarks
import timeit
t = timeit.timeit('[x**2 for x in range(1000)]', number=1000)
# tracemalloc: memory tracking
tracemalloc.start()
# ... run code ...
snap = tracemalloc.take_snapshot()
for stat in snap.statistics('lineno')[:5]: print(stat)
# ■■ Performance tips ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
# 1. Use NumPy arrays instead of Python lists for numbers
# 2. Use generators for large data (don't load all at once)
# 3. Avoid global variables in hot loops (LOAD_GLOBAL is slow)
# 4. Use local variable aliases in tight loops
# 5. String concatenation: use ''.join(lst) not += in loop
# 6. Use set/dict for O(1) lookups instead of list O(n)
# 7. Use collections.deque for O(1) popleft instead of list.pop(0)
# 8. Use multiprocessing for CPU-bound, asyncio for I/O-bound tasks
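The effect of __slots__ can be observed directly. The sketch below compares a regular class with a slotted one; both class names are made up for the example, and the byte counts are printed rather than stated because they differ between Python versions.

```python
import sys


class PointDict:
    """Regular class: attributes live in a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Slotted class: attributes live in fixed slots, with no __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = PointDict(1, 2)
q = PointSlots(1, 2)

print(hasattr(p, "__dict__"))  # True
print(hasattr(q, "__dict__"))  # False

rejected = False
try:
    q.z = 3                    # 'z' is not listed in __slots__
except AttributeError:
    rejected = True
print(rejected)                # True

# Sizes of the instances themselves (the regular instance's dict is extra).
print(sys.getsizeof(p), sys.getsizeof(q))
```

The trade-off is flexibility: a slotted instance cannot gain new attributes at runtime, which is usually acceptable for small record-like objects created in large numbers.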
1.11 Python in the AI/ML Ecosystem
Library/Framework Purpose Key Abstractions
NumPy Numerical computing ndarray, broadcasting, ufuncs
Pandas Data manipulation DataFrame, Series, GroupBy
Matplotlib/Seaborn Visualization Figure, Axes, FacetGrid
Scikit-learn Classical ML fit/transform/predict API
PyTorch Deep learning Tensor, Module, autograd
TensorFlow/Keras Deep learning Tensor, Layer, Model
Hugging Face NLP/Transformers Tokenizer, Trainer, Pipeline
OpenCV Computer vision Mat, imread, VideoCapture
NLTK/spaCy NLP preprocessing Token, Doc, Span
XGBoost/LightGBM Gradient boosting DMatrix, Booster
MLflow/W&B Experiment tracking Run, Artifact, Metric
FastAPI ML model serving APIRouter, Pydantic model
Section 2 — NumPy: Basic to Advanced
2.1 ndarray Creation, Indexing & Slicing
NumPy's ndarray is the fundamental data structure for numerical computing in Python. The major ML
frameworks either build on it directly (pandas, scikit-learn) or convert to and from it (PyTorch, TensorFlow).
import numpy as np
# ── Creation ──
a = np.array([1, 2, 3, 4, 5]) # 1D from list
b = np.array([[1,2,3],[4,5,6]]) # 2D (shape 2,3)
c = np.array([1.0, 2, 3], dtype=np.float32) # explicit dtype
np.zeros((3,4)) # zeros matrix
np.ones((2,3,4)) # ones tensor (3D)
np.eye(5) # identity matrix
np.full((3,3), 7) # filled with 7
np.arange(0, 10, 0.5) # like range but float-capable
np.linspace(0, 1, 100) # 100 evenly spaced points in [0,1]
np.logspace(0, 3, 10) # 10 log-spaced points in [1, 1000]
np.empty((2,2)) # uninitialized (fast allocation)
# ── Array attributes ──
a = np.array([[1,2,3],[4,5,6]])
a.shape # (2, 3)
a.ndim # 2
a.dtype # int64 (the platform's default integer type)
a.size # 6 (total elements)
a.nbytes # 48 (memory in bytes)
a.itemsize # 8 (bytes per element)
# ── Indexing & Slicing ──
a[0] # first row: [1,2,3]
a[1, 2] # row 1, col 2: 6
a[:, 1] # all rows, col 1: [2,5]
a[0, :] # row 0 (same as a[0])
a[0:2, 1:3] # sub-matrix: [[2,3],[5,6]]
a[::1, ::-1] # reverse columns
# ── Reshaping ──
a.reshape(3, 2) # new shape (a view of the same data when possible)
a.reshape(-1, 1) # column vector (rows inferred)
a.flatten() # 1D copy
a.ravel() # 1D view when possible (avoids the copy flatten always makes)
a.T # transpose (view, not copy)
np.expand_dims(a, axis=0) # add dimension: (1,2,3)
np.squeeze(a) # remove size-1 dimensions
a[np.newaxis, :] # same as expand_dims
2.2 Broadcasting & Vectorization
Broadcasting is NumPy's mechanism for performing element-wise operations on arrays of different shapes
without copying data. It is the key to efficient ML computations.
# ── Broadcasting rules ──
# 1. If arrays differ in ndim, prepend 1s to the smaller shape
# 2. Arrays with size 1 along a dim are stretched to match
# 3. Shapes must match or one must be 1
a = np.array([[1],[2],[3]]) # shape (3,1)
b = np.array([10, 20, 30]) # shape (3,)
# b treated as (1,3) → broadcast to (3,3)
a + b
# [[11,21,31],
# [12,22,32],
# [13,23,33]]
# ── Common ML broadcasting patterns ──
# Feature standardization (as in batch norm): (batch, features) - (1, features)
X = np.random.randn(1000, 128) # 1000 samples, 128 features
mean = X.mean(axis=0) # (128,)
std = X.std(axis=0) # (128,)
X_norm = (X - mean) / std # broadcasts (1000,128) - (128,)
# Bias addition in neural net layers
W = np.random.randn(128, 64) # weight matrix
b = np.zeros(64) # bias vector (64,)
Z = X @ W + b # (1000,64) + (64,) → broadcast (1000,64)
# ── Vectorization vs Python loop ──
# SLOW: pure Python
result = [x**2 + 2*x for x in range(1000000)]
# FAST: NumPy vectorized (typically 50-200x faster)
x = np.arange(1000000)
result = x**2 + 2*x
# Universal functions (ufuncs): applied element-wise in C
np.exp(x), np.log(x), np.sqrt(x), np.abs(x)
np.sin(x), np.cos(x), np.tanh(x)
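np.maximum(x, 0) # ReLU!
np.clip(x, 0, 1) # Sigmoid-output clipping
Tip: np.maximum(x, 0) is the ReLU activation. np.clip(logits, -500, 500) guards against overflow in exp(),
although subtracting the row maximum is the more common way to stabilise softmax. Vectorize everything
that runs in a training loop.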
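The three broadcasting rules can be written out as a small pure-Python function that computes the result shape. This is a sketch of the rules as stated in this section, not NumPy's own implementation, and the function name is chosen for the example.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk the dimensions from the right; missing leading dims count as 1 (rule 1).
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)        # equal, or b is stretched (rule 2)
        elif dim_a == 1:
            result.append(dim_b)        # a is stretched (rule 2)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")  # rule 3
    return tuple(reversed(result))


print(broadcast_shape((3, 1), (3,)))         # (3, 3)
print(broadcast_shape((1000, 128), (128,)))  # (1000, 128)
print(broadcast_shape((1000, 64), (64,)))    # (1000, 64)
```

The three calls mirror the examples in this section: the (3,1) + (3,) sum, the feature standardization, and the bias addition.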
2.3 Mathematical & Linear Algebra Operations
# ── Aggregate operations ──
a = np.array([[1,2,3],[4,5,6]])
np.sum(a) # 21 (all elements)
np.sum(a, axis=0) # [5,7,9] (sum per column)
np.sum(a, axis=1) # [6,15] (sum per row)
np.mean(a, axis=0)
np.std(a, axis=0)
np.min(a), np.max(a)
np.argmin(a, axis=1) # index of the min in each row
np.argmax(a, axis=1) # index of the max in each row: the predicted class when rows hold class scores
np.cumsum(a) # cumulative sum
# ── Linear algebra (np.linalg) ──
A = np.random.randn(4, 4)
np.dot(A, A.T) # matrix multiplication (A·Aᵀ)
A @ A.T # same, cleaner syntax (Python 3.5+)
np.linalg.det(A) # determinant
np.linalg.inv(A) # inverse
np.linalg.norm(A) # Frobenius norm
np.linalg.norm(A, axis=1) # row-wise L2 norm
# Eigendecomposition (PCA, graph algorithms); eigh is the variant for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(A @ A.T)
# SVD (dimensionality reduction, matrix factorization)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Solve linear system Ax = b
b = np.random.randn(4)
x = np.linalg.solve(A, b)
# QR decomposition
Q, R = np.linalg.qr(A)
# Least squares (linear regression!)
# min ||Xw - y||^2
X = np.random.randn(100, 4)
y = np.random.randn(100)
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
2.4 Random Module
rng = np.random.default_rng(seed=42) # recommended modern API
rng.random((3,4)) # uniform [0,1)
rng.integers(0, 10, (3,4)) # random ints
rng.normal(0, 1, (100,10)) # Gaussian: weight init
rng.uniform(-1, 1, 1000) # uniform range
rng.choice([1,2,3,4,5], size=3, replace=False) # sampling
rng.shuffle(arr) # in-place shuffle
rng.permutation(100) # shuffled indices: train/val split
# Legacy API (still widely seen)
np.random.seed(42)
np.random.randn(3, 3) # standard normal
np.random.randint(0, 100, (5,))
# Distributions used in ML
rng.standard_normal((batch, features)) # standard normal draws; rescale them for Xavier/He-style init
rng.binomial(n=1, p=0.5, size=1000) # Bernoulli: coin flip
rng.multinomial(n=1, pvals=[.2,.3,.5]) # categorical sampling
2.5 Advanced Indexing & Masking
a = np.arange(10) # [0,1,2,3,4,5,6,7,8,9]
# ── Boolean (mask) indexing ──
mask = a > 5
a[mask] # [6,7,8,9] (returns a copy)
a[a % 2 == 0] # [0,2,4,6,8]
# Set values using mask
a[a < 0] = 0 # ReLU in one line!
# ── Fancy indexing ──
idx = [1, 3, 7]
a[idx] # [1,3,7] (returns a copy)
# 2D fancy indexing
mat = np.arange(25).reshape(5,5)
rows = [0, 2, 4]
cols = [1, 3, 0]
mat[rows, cols] # [1, 13, 20]: elements (0,1),(2,3),(4,0)
# np.where: conditional selection / replacement
x = np.array([-2, -1, 0, 1, 2])
np.where(x > 0, x, 0) # [0,0,0,1,2] (ReLU again!)
np.where(x > 0, 1, -1) # +1 for positive, -1 otherwise (note: 0 maps to -1, unlike np.sign)
# np.nonzero: indices of nonzero elements
np.nonzero(x) # (array([0,1,3,4]),): every index except the one holding 0
# np.take: select elements by index
np.take(a, [2,4,6]) # a[[2,4,6]]
2.6 Memory Efficiency & Performance
# ── Views vs Copies ──
a = np.array([1, 2, 3, 4, 5])
b = a[1:4] # VIEW: shares memory
b[0] = 99 # a is also modified!
c = a[1:4].copy() # COPY: independent
# Check if it's a view
b.base is a # True
# ── dtype selection: critical for memory in DL ──
# float64: 8 bytes (default Python float)
# float32: 4 bytes (standard in deep learning)
# float16: 2 bytes (mixed precision training)
# int32: 4 bytes
# int8: 1 byte (quantized models)
X = np.random.randn(1000, 1000).astype(np.float32)
print(X.nbytes) # 4000000 bytes (4 MB) vs 8 MB for float64
# ── Contiguous arrays: for C/Fortran interop ──
a.flags['C_CONTIGUOUS'] # row-major (C order)
a.flags['F_CONTIGUOUS'] # column-major (Fortran order)
np.ascontiguousarray(a) # ensure C-contiguous
# ── np.einsum: readable and efficient tensor ops ──
# Matrix multiplication
np.einsum('ij,jk->ik', A, B) # same as A @ B
# Batch matrix multiply
np.einsum('bij,bjk->bik', A_batch, B_batch)
# Trace
np.einsum('ii->', A)
# Outer product
np.einsum('i,j->ij', a, b)
Operation                  Python List   NumPy Array      Speedup
Sum of 1M ints             ~60ms         ~1ms             ~60x
Element-wise multiply      ~100ms        ~2ms             ~50x
Memory (1M floats)         ~28 MB        ~8 MB (float64)  3.5x
Matrix multiply 1000x1000  Manual loops  ~5ms (BLAS)      1000x+
Boolean indexing           List comp     Native (C)       ~30x
Figures are indicative only; measure on your own hardware.
2.7 NumPy's Role in ML
ML Operation               NumPy Implementation                            Notes
Linear regression forward  y_pred = X @ w + b                              Matrix-vector product
MSE loss                   np.mean((y_pred - y)**2)                        Vectorized over batch
Gradient descent step      w -= lr * grad                                  Broadcasting update
Softmax                    e = np.exp(z - z.max(axis=1, keepdims=True));   Numerically stable with max
                           e / e.sum(axis=1, keepdims=True)                subtraction
ReLU activation            np.maximum(0, z)                                Element-wise ufunc
One-hot encoding           np.eye(n_classes)[labels]                       Fancy indexing trick
Train/test split           idx = np.random.permutation(n)                  Shuffle then slice
Normalization              (X - X.mean(0)) / X.std(0)                      Broadcasting over axis=0
Cosine similarity          x @ y / (norm(x) * norm(y))                     norm = np.linalg.norm (1-D vectors)
PCA (manual)               U,s,Vt = np.linalg.svd(X_centered)              SVD decomposition
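Why the maximum is subtracted in softmax is easiest to see with large logits. The sketch below uses plain Python and the math module as a stand-in for the NumPy expression in the table; it handles one row of logits at a time.

```python
import math


def softmax(logits):
    """Softmax over one row of logits, shifted by the row maximum for stability."""
    shift = max(logits)
    # After the shift the largest exponent is 0, so exp() cannot overflow.
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) on its own raises OverflowError; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
```

Subtracting a constant from every logit leaves the result unchanged, because the common factor cancels between numerator and denominator.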
Section 3 — Pandas: Basic to Advanced
3.1 Series & DataFrame
import pandas as pd
import numpy as np
# ── Series: 1D labelled array ──
s = pd.Series([10, 20, 30, 40], index=['a','b','c','d'])
s['b'] # 20 (label-based)
s.iloc[1] # 20 (position-based; plain s[1] on a labelled index is deprecated)
s[['a','c']] # [10,30]
s[s > 15] # filter: [20,30,40]
# ── DataFrame: 2D table ──
df = pd.DataFrame({
    'age': [25, 30, 35, 28],
    'salary': [50000, 80000, 120000, 60000],
    'dept': ['Eng','Mkt','Eng','HR'],
    'score': [0.8, 0.9, 0.95, 0.7]
})
# ── Inspection ──
df.shape # (4, 4)
df.dtypes # column data types
df.info() # dtype + non-null counts + memory
df.describe() # statistics (count, mean, std, min, 25%, 50%, 75%, max)
df.head(2) # first 2 rows
df.tail(2) # last 2 rows
df.columns # Index(['age','salary','dept','score'])
df.index # RangeIndex(start=0, stop=4)
df.values # underlying data as a NumPy array
df.nunique() # unique count per column
df.value_counts('dept') # frequency of each dept
3.2 Data Loading
# CSV
df = pd.read_csv('[Link]')
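df = pd.read_csv('[Link]',
                 sep=',',
                 header=0,
                 index_col='id',
                 usecols=['id','age','salary','target','date'], # must include index_col and parse_dates columns
                 dtype={'age': np.int32},
                 parse_dates=['date'],
                 na_values=['NA','N/A','?'])
# For large files add chunksize=10000: read_csv then returns an iterator of DataFrames, not one DataFrame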
# Excel
df = pd.read_excel('[Link]', sheet_name='Sheet1')
# JSON
df = pd.read_json('[Link]')
df = pd.read_json('[Link]', orient='records')
# SQL
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///[Link]')
df = pd.read_sql('SELECT * FROM users WHERE active=1', engine)
# Parquet (recommended for large datasets — columnar format)
df = pd.read_parquet('[Link]')
df.to_parquet('[Link]', compression='snappy')
# Save
df.to_csv('[Link]', index=False)
df.to_excel('[Link]', index=False)
3.3 Data Cleaning & Missing Values
# ── Detect missing values ──
df.isnull().sum() # count NaN per column
df.isnull().mean() * 100 # % missing per column
df[df.isnull().any(axis=1)] # rows with any NaN
# ── Drop ──
df.dropna() # drop rows with any NaN
df.dropna(subset=['age','salary']) # only check these cols
df.dropna(thresh=3) # keep rows with >= 3 non-NaN
df.dropna(axis=1) # drop columns
# ── Fill ──
df.fillna(0) # fill all NaN with 0
df['age'].fillna(df['age'].mean()) # mean imputation
df['dept'].fillna('Unknown') # categorical fill
df.ffill() # forward fill (fillna(method='ffill') is deprecated)
df.bfill() # backward fill
# ── Advanced imputation (ML-based) ──
from sklearn.impute import SimpleImputer, KNNImputer
imp = KNNImputer(n_neighbors=5)
num = df.select_dtypes('number') # KNNImputer needs numeric columns
df_imputed = pd.DataFrame(imp.fit_transform(num), columns=num.columns)
# ── Duplicates ──
df.duplicated().sum() # count duplicates
df.drop_duplicates(inplace=True) # remove
df.drop_duplicates(subset=['age','dept']) # based on subset
# --- Data type issues ---
df['salary'] = pd.to_numeric(df['salary'], errors='coerce') # str→num
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['dept'] = df['dept'].astype('category') # save memory
df['age'] = df['age'].astype(np.int32)
# --- Outlier detection ---
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
df_clean = df[df['salary'].between(lower, upper)]
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary']))
df_clean = df[z_scores < 3]
3.4 Filtering, Selection & GroupBy
# --- Selection ---
df['age'] # single column → Series
df[['age','salary']] # multi-column → DataFrame
# loc: label-based (inclusive on both ends)
df.loc[0, 'age'] # row label 0, col 'age'
df.loc[0:2, ['age','dept']] # row labels 0-2 (inclusive), 2 cols
df.loc[df['age'] > 28] # boolean filter with loc
# iloc: position-based (exclusive end, like Python slicing)
df.iloc[0, 1] # row 0, col index 1
df.iloc[1:3, :] # rows 1,2 all cols
df.iloc[:, -1] # last column
# --- Boolean filtering ---
df[df['age'] > 28]
df[(df['age'] > 25) & (df['salary'] > 60000)] # AND
df[(df['dept'] == 'Eng') | (df['dept'] == 'HR')] # OR
df[~(df['dept'] == 'HR')] # NOT
df[df['dept'].isin(['Eng','Mkt'])]
df[df['dept'].str.startswith('E')] # string method
# query(): SQL-like string syntax (readable)
df.query('age > 25 and salary > 60000')
df.query('dept in ["Eng", "HR"]')
# --- GroupBy ---
# Split → Apply → Combine
grp = df.groupby('dept')
grp['salary'].mean() # mean salary per dept
grp['salary'].agg(['mean','std','min','max'])
grp.size() # count per group
# Multiple aggregations (common in feature engineering)
result = df.groupby('dept').agg(
    avg_salary = ('salary', 'mean'),
    max_score = ('score', 'max'),
    count = ('age', 'count')
).reset_index()
# transform: returns same shape as input (useful for features)
df['salary_rank_in_dept'] = df.groupby('dept')['salary'].rank()
df['dept_mean_salary'] = df.groupby('dept')['salary'].transform('mean')
# filter groups
big_depts = df.groupby('dept').filter(lambda g: len(g) >= 2)
3.5 Merge, Join & Concat
left = pd.DataFrame({'id':[1,2,3], 'name':['A','B','C']})
right = pd.DataFrame({'id':[2,3,4], 'score':[90,85,78]})
# --- merge (like SQL JOIN) ---
pd.merge(left, right, on='id', how='inner') # 2 rows
pd.merge(left, right, on='id', how='left') # 3 rows (keep all left)
pd.merge(left, right, on='id', how='right') # 3 rows (keep all right)
pd.merge(left, right, on='id', how='outer') # 4 rows (union)
# merge on different column names (the right key is renamed to 'emp_id' first)
right_emp = right.rename(columns={'id': 'emp_id'})
pd.merge(left, right_emp, left_on='id', right_on='emp_id')
# --- join (index-based) ---
left.set_index('id').join(right.set_index('id'), how='left')
# --- concat (stack DataFrames) ---
pd.concat([df1, df2], axis=0, ignore_index=True) # vertical stack
pd.concat([df1, df2], axis=1) # horizontal stack
pd.concat([df1, df2], keys=['train','val']) # multi-index
Operation       Use Case                     SQL Equivalent    Key Param
inner merge     Records in both tables       INNER JOIN        how='inner'
left merge      Keep all left, match right   LEFT JOIN         how='left'
outer merge     Keep all records             FULL OUTER JOIN   how='outer'
concat axis=0   Stack new rows               UNION ALL         ignore_index=True
concat axis=1   Add columns side-by-side     Lateral join      axis=1
join            Index-based merge            JOIN on index     on parameter
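The row counts quoted for each join type can be checked directly. The sketch below reuses the left/right frames from this section and adds indicator=True, a standard pd.merge option that is not used in the listing above, to show which side each row came from.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 85, 78]})

# Row count produced by each join type on the shared key 'id'
row_counts = {how: len(pd.merge(left, right, on='id', how=how))
              for how in ('inner', 'left', 'right', 'outer')}
print(row_counts)

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
outer = pd.merge(left, right, on='id', how='outer', indicator=True)
print(outer[['id', '_merge']])
```

Filtering on the _merge column is a convenient way to find keys that failed to match before they turn into NaN-filled feature rows.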
3.6 Time Series & Feature Engineering
# --- Time series ---
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
# Resampling (change frequency)
df.resample('D').mean() # daily mean
df.resample('W').sum() # weekly sum
df.resample('ME').last() # month-end last value
# Rolling window (moving average, volatility)
df['ma_7'] = df['close'].rolling(7).mean()
df['std_30'] = df['close'].rolling(30).std()
df['ema_12'] = df['close'].ewm(span=12).mean()
# Datetime components (read from the DatetimeIndex set above)
df['year'] = df.index.year
df['month'] = df.index.month
df['weekday'] = df.index.dayofweek
df['is_weekend'] = df.index.dayofweek >= 5
# --- Feature Engineering ---
# Binning continuous → categorical
df['age_group'] = pd.cut(df['age'], bins=[0,25,35,100],
                         labels=['young','mid','senior'])
# qcut: equal-frequency bins
df['salary_quartile'] = pd.qcut(df['salary'], q=4,
                                labels=['Q1','Q2','Q3','Q4'])
# Label encoding
df['dept_code'] = df['dept'].astype('category').cat.codes
# One-hot encoding
df_ohe = pd.get_dummies(df, columns=['dept'], drop_first=True)
# Lag features (important for time series ML)
df['lag_1'] = df['value'].shift(1)
df['lag_7'] = df['value'].shift(7)
df['diff_1'] = df['value'].diff(1)
# Target encoding (mean of target per category)
# Compute the means on training rows only; using all rows leaks the target into the feature
means = df.groupby('dept')['salary'].transform('mean')
df['dept_salary_mean'] = means
3.7 Performance Tips & Role in ML Pipelines
# --- Memory reduction ---
def reduce_memory(df):
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32') # keeps ~7 significant digits
        elif df[col].dtype == 'int64':
            mn, mx = df[col].min(), df[col].max() # check both bounds so negatives do not overflow
            if mn >= -128 and mx <= 127: df[col] = df[col].astype('int8')
            elif mn >= -32768 and mx <= 32767: df[col] = df[col].astype('int16')
            elif mn >= -2147483648 and mx <= 2147483647: df[col] = df[col].astype('int32')
        elif df[col].dtype == 'object' and df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    return df
# --- Fast operations ---
# Use vectorized string methods (not apply+lambda)
df['name'].str.upper() # fast
df['name'].str.contains('AI') # fast
# Avoid iterrows: very slow
# BAD: for idx, row in df.iterrows(): ...
# GOOD: df['new_col'] = df['col1'] + df['col2'] (vectorized)
# OK: df.apply(func, axis=1) (still slow but better than iterrows)
# BEST: Use numpy operations on .to_numpy() / .values
# eval() and query() can use numexpr (when installed), which is faster on large frames
df = df.eval('new_col = salary * score + age * 100') # eval returns a new DataFrame
# --- pandas in ML pipeline ---
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Typical ML workflow
# 1. Load with pd.read_csv / read_parquet
# 2. Clean: dropna, fillna, fix dtypes
# 3. Feature engineer: new columns, encoding
# 4. Split: X = df[features], y = df[target]
# 5. Convert: df.values or X.to_numpy() → pass to sklearn
X = df[['age','salary','score']].to_numpy()
y = df['dept_code'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42) # hold out a test set
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train) # scaler statistics come from the training split only
Section 4 — Matplotlib: Basic to Advanced
4.1 Core Plot Types
import matplotlib.pyplot as plt
import numpy as np
# --- Object-oriented API (preferred for complex plots) ---
fig, ax = plt.subplots(figsize=(8, 4))
# --- Line plot ---
x = np.linspace(0, 10, 200)
ax.plot(x, np.sin(x), color='royalblue', lw=2,
        linestyle='--', label='sin(x)')
ax.plot(x, np.cos(x), color='tomato', lw=2, label='cos(x)')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('Trigonometric Functions')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
fig.savefig('[Link]', dpi=150, bbox_inches='tight')
# --- Scatter plot ---
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1], c=y, cmap='viridis',
           s=30, alpha=0.7, edgecolors='none')
fig.colorbar(ax.collections[0], ax=ax, label='Class')
# --- Bar chart ---
categories = ['Accuracy','Precision','Recall','F1']
values = [0.92, 0.89, 0.94, 0.91]
colors = ['#2196F3','#4CAF50','#FF9800','#E91E63']
fig, ax = plt.subplots()
bars = ax.bar(categories, values, color=colors, width=0.5, edgecolor='white')
ax.bar_label(bars, fmt='%.2f', padding=3) # value on top of bar
ax.set_ylim(0, 1.1)
# --- Horizontal bar ---
ax.barh(categories, values, color=colors)
# --- Histogram ---
fig, ax = plt.subplots()
ax.hist(data, bins=30, color='steelblue', edgecolor='white',
        alpha=0.8, density=True) # density=True for PDF
ax.set_xlabel('Value'); ax.set_ylabel('Density')
4.2 Customization & Subplots
# --- Subplots: the Swiss Army knife ---
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Model Analysis Dashboard', fontsize=16, fontweight='bold')
# Access subplots
ax = axes[0, 0] # row 0, col 0
ax = axes[1, 2] # row 1, col 2
# Flatten for easy iteration
for ax in axes.flatten()[4:]: # e.g. the last two panels are unused
    ax.set_visible(False) # hide unused
# Shared axes
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
# gridspec: unequal subplot sizes
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12, 6)) # fresh figure for the custom layout
gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.4, wspace=0.3)
ax_big = fig.add_subplot(gs[:, 0]) # spans both rows
ax_tr = fig.add_subplot(gs[0, 1])
ax_br = fig.add_subplot(gs[1, 1])
# --- Styling ---
plt.style.use('seaborn-v0_8-whitegrid') # built-in theme
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 12, 'figure.dpi': 120})
# --- Annotations ---
ax.annotate('Peak',
            xy=(peak_x, peak_y),
            xytext=(peak_x+0.5, peak_y+0.1),
            arrowprops=dict(arrowstyle='->', color='red'),
            fontsize=10, color='red')
ax.axhline(y=threshold, color='red', linestyle='--', label='Threshold')
ax.axvline(x=split_point, color='green', linestyle=':')
ax.fill_between(x, y_lower, y_upper, alpha=0.2, label='CI')
# --- Twin axes (two y scales) ---
ax2 = ax.twinx()
ax.plot(x, loss, 'b-', label='Loss')
ax2.plot(x, accuracy, 'r-', label='Accuracy')
ax.set_ylabel('Loss', color='blue')
ax2.set_ylabel('Accuracy', color='red')
4.3 Advanced ML Visualizations
import matplotlib.pyplot as plt
import numpy as np
# --- Confusion Matrix ---
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_true, y_pred)
fig, ax = plt.subplots(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel('Predicted'); ax.set_ylabel('True')
ax.set_title('Confusion Matrix')
# --- ROC Curve ---
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots() # new axes so the curve is not drawn over the heatmap
ax.plot(fpr, tpr, lw=2, label=f'AUC = {roc_auc:.3f}')
ax.plot([0,1],[0,1],'k--', label='Random') # chance line
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
# --- Loss / Accuracy curves ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(train_loss, label='Train Loss')
ax1.plot(val_loss, label='Val Loss', linestyle='--')
ax2.plot(train_acc, label='Train Acc')
ax2.plot(val_acc, label='Val Acc', linestyle='--')
for ax in (ax1, ax2):
    ax.legend(); ax.grid(True, alpha=0.3)
# --- Feature importance ---
importances = model.feature_importances_
sorted_idx = np.argsort(importances)[::-1][:15] # indices of the 15 largest
fig, ax = plt.subplots()
ax.barh([feature_names[i] for i in sorted_idx[::-1]],
        importances[sorted_idx[::-1]]) # reversed so the largest bar is on top
ax.set_title('Top 15 Feature Importances')
# --- Correlation heatmap ---
corr = df.corr(numeric_only=True) # skip non-numeric columns
mask = np.triu(np.ones_like(corr, dtype=bool)) # hide the redundant upper triangle
fig, ax = plt.subplots()
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', center=0, ax=ax)
# --- Learning curve ---
from sklearn.model_selection import learning_curve
sizes, train_sc, val_sc = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1,1,10))
fig, ax = plt.subplots()
ax.fill_between(sizes, train_sc.mean(1)-train_sc.std(1),
                train_sc.mean(1)+train_sc.std(1), alpha=0.1)
ax.plot(sizes, train_sc.mean(1), 'o-', label='Train')
ax.plot(sizes, val_sc.mean(1), 'o-', label='Val')
4.4 Saving, Exporting & Best Practices
# --- Saving figures ---
fig.savefig('[Link]', dpi=300, bbox_inches='tight')
fig.savefig('[Link]', bbox_inches='tight') # vector format
fig.savefig('[Link]', bbox_inches='tight') # editable in Inkscape
fig.savefig('[Link]', bbox_inches='tight') # LaTeX / journals
# --- In-memory save (for APIs / web) ---
from io import BytesIO
buf = BytesIO()
fig.savefig(buf, format='png', dpi=150, bbox_inches='tight')
buf.seek(0)
img_bytes = buf.read() # send over HTTP
# --- Closing figures to free memory ---
plt.close(fig) # close specific figure
plt.close('all') # close all figures
# --- Best practices ---
# 1. Always use fig, ax = plt.subplots() over plt.plot()
# 2. Set DPI >= 150 for reports, >= 300 for publications
# 3. Use tight_layout() or constrained_layout=True
# 4. Label everything: title, xlabel, ylabel, legend
# 5. Use colorblind-friendly palettes (viridis, plasma, cividis)
# 6. Avoid 3D plots unless truly necessary (they distort perception)
# 7. Close figures after saving in loops to prevent memory leak
# 8. For interactive plots: use plotly or bokeh instead
Section 5 — End-to-End Mini Project
We will walk through a complete ML workflow: loading the Titanic dataset, cleaning it, engineering features,
building a classifier, and visualizing results. Every step uses Python, NumPy, Pandas, and Matplotlib
together.
Step 1 — Data Loading & Inspection
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
# Load (use Titanic or any CSV)
df = pd.read_csv('[Link]')
print(df.shape) # (891, 12)
print(df.dtypes)
print(df.isnull().sum())
# Age: 177 missing, Cabin: 687 missing, Embarked: 2 missing
print(df.describe())
Step 2 — Data Cleaning
# Drop high-missingness column
df.drop('Cabin', axis=1, inplace=True)
# Fill missing Age with median grouped by class
df['Age'] = df.groupby('Pclass')['Age'].transform(
    lambda x: x.fillna(x.median()))
# Fill missing Embarked with mode (assign back; inplace on a column selection is unreliable)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop irrelevant columns
df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)
print(df.isnull().sum()) # all zeros now
Step 3 — Feature Engineering
# Family size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
# Title from Name (if we had kept it)
# df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Age bins
df['AgeBin'] = pd.cut(df['Age'], bins=[0,12,18,35,60,100],
                      labels=[0,1,2,3,4]).astype(int)
# Fare log transform (reduce skew)
df['LogFare'] = np.log1p(df['Fare'])
# Encode categorical
df['Sex'] = (df['Sex'] == 'male').astype(int)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
print(df.head())
print(df.shape) # (891, 13): 8 kept columns + 4 new features + 2 dummies - Embarked
Step 4 — Train a Model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
X = df.drop('Survived', axis=1).to_numpy(dtype=np.float32)
y = df['Survived'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train) # fit on training data only
X_test_sc = scaler.transform(X_test) # reuse the training statistics
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_sc, y_train)
y_pred = clf.predict(X_test_sc)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Cross-validation
cv_scores = cross_val_score(clf, X_train_sc, y_train, cv=5)
print(f'CV Accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
Step 5 — Visualisation & Insights
feature_names = df.drop('Survived',axis=1).columns
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Titanic ML Analysis', fontsize=15, fontweight='bold')
# --- Panel 1: Feature importance ---
axes[0].barh([feature_names[i] for i in sorted_idx],
             importances[sorted_idx], color='steelblue')
axes[0].set_title('Feature Importances')
# --- Panel 2: Survival by class & sex ---
pivot = df.groupby(['Pclass','Sex'])['Survived'].mean().unstack()
pivot.plot(kind='bar', ax=axes[1], color=['#EF5350','#42A5F5'])
axes[1].set_title('Survival Rate by Class & Sex')
axes[1].set_ylabel('Survival Rate')
axes[1].legend(['Female (0)','Male (1)'])
# --- Panel 3: Age distribution by survival ---
for survived, grp in df.groupby('Survived')['Age']:
    lbl = 'Survived' if survived else 'Did not survive'
    axes[2].hist(grp, bins=25, alpha=0.6, label=lbl, density=True)
axes[2].set_title('Age Distribution by Survival')
axes[2].set_xlabel('Age'); axes[2].legend()
plt.tight_layout()
fig.savefig('titanic_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
Section 6 — Reference Tables & Interview Guide
6.1 Time & Space Complexity Cheat Sheet
Structure / Op   Access     Search   Insert     Delete     Notes
list (end)       O(1)       O(n)     O(1)*      O(1)       *amortized append; pop() from the end is O(1)
list (front)     O(1)       O(n)     O(n)       O(n)       Shifts all elements
dict             O(1)*      O(1)*    O(1)*      O(1)*      *average; worst O(n)
set              -          O(1)*    O(1)*      O(1)*      Hash table
tuple            O(1)       O(n)     -          -          Immutable
deque            O(n)       O(n)     O(1)       O(1)       Both ends O(1); O(n) access is for the middle
heapq            O(1) min   O(n)     O(log n)   O(log n)   Priority queue
np.ndarray       O(1)       O(n)     O(n)       O(n)       Contiguous memory
pd.DataFrame     O(1) col   O(n)     O(n)       O(n)       Column-oriented
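To make a few rows of the table concrete, here is a small standard-library sketch. The variable names are illustrative only, and the complexity notes in the comments restate the table rather than measure anything.

```python
from collections import deque
import heapq

# deque: O(1) appends and pops at both ends
dq = deque([2, 3])
dq.appendleft(1)       # O(1); list.insert(0, x) would shift every element
dq.append(4)
first = dq.popleft()   # O(1); list.pop(0) is O(n)

# heapq: O(log n) push and pop, O(1) peek at the minimum via heap[0]
heap = []
for priority in (5, 1, 3):
    heapq.heappush(heap, priority)
smallest = heapq.heappop(heap)

# set: average O(1) membership test, versus an O(n) scan for a list
ids_list = list(range(100_000))
ids_set = set(ids_list)
found = 99_999 in ids_set
```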
6.2 Common Pitfalls
■ Mutable default argument
Problem: def f(lst=[]): lst.append(1) - the same list is shared across calls
Fix: Use def f(lst=None): if lst is None: lst = []
■ == None vs is None
Problem: arr == None compares element-wise on NumPy arrays, so 'if arr == None' raises 'ambiguous truth value'
Fix: Always use 'is None' or 'is not None'
■ Modifying list while iterating
Problem: for x in lst: lst.remove(x) - skips elements
Fix: Iterate over a copy: for x in lst.copy()
■ Integer caching surprise
Problem: a=1000; b=1000; a is b → can be False (CPython caches only small ints, -5 to 256)
Fix: Use == for value comparison, 'is' only for None/True/False
■ Broadcasting shape mismatch
Problem: np.array([1,2,3]) + np.array([[1],[2]]) → shape (2,3), not an error!
Fix: Explicitly reshape arrays; always check shapes first
■ pandas chained indexing
Problem: df['col'][mask] = val - may not modify original
Fix: Use df.loc[mask, 'col'] = val
■ pandas SettingWithCopy
Problem: df2 = df[df.age>25]; df2['x'] = 1 - SettingWithCopyWarning
Fix: Use df2 = df[df.age>25].copy() or loc-based assignment
■ float precision
Problem: 0.1 + 0.2 == 0.3 → False in Python
Fix: Use math.isclose(a, b, rel_tol=1e-9) for float comparison
■ Generator exhaustion
Problem: gen = (x for x in range(5)); sum(gen); sum(gen) → second call returns 0
Fix: Generators can only be consumed once; recreate or use list()
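Several of the pure-Python pitfalls above are easy to reproduce. The following is a minimal sketch; the helper names append_bad and append_good are made up for illustration.

```python
import math

def append_bad(item, bucket=[]):      # the default list is created once, at def time
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):   # a fresh list on every call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

append_bad(1)
bad_second = append_bad(2)            # still holds the 1 from the first call
append_good(1)
good_second = append_good(2)

# Removing while iterating skips elements; iterating over a copy does not
nums = [1, 2, 3, 4]
for n in nums:
    nums.remove(n)
leftover = nums

nums2 = [1, 2, 3, 4]
for n in nums2.copy():
    nums2.remove(n)

# Float comparison
exact = (0.1 + 0.2 == 0.3)
close = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# Generator exhaustion
gen = (x for x in range(5))
first_sum, second_sum = sum(gen), sum(gen)
```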
6.3 Top Interview Questions & Answers
■ Q: What is the GIL and how does it affect ML code?
The Global Interpreter Lock (GIL) prevents multiple native threads from executing Python bytecode
simultaneously. For CPU-bound ML code, use multiprocessing (separate processes, each with its own GIL) or
NumPy (many of its C-level routines release the GIL). I/O-bound code still benefits from asyncio or
threading, because the GIL is released while a thread waits on I/O.
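A minimal sketch of the I/O-bound case, assuming time.sleep as a stand-in for a network or disk wait (the function fake_download is hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(i):
    time.sleep(0.2)        # stands in for an I/O wait; sleeping releases the GIL
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, range(4)))
elapsed = time.perf_counter() - start

# The four waits overlap, so the total is typically near 0.2 s rather than 0.8 s
print(results, f'{elapsed:.2f}s')
```

For CPU-bound work the same code with ProcessPoolExecutor is the usual substitute, since threads would take turns holding the GIL.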
■ Q: Difference between deep copy and shallow copy?
Shallow copy (copy.copy) copies the object but not nested objects, so nested mutable objects are shared.
Deep copy (copy.deepcopy) recursively copies everything. For ML: df.copy() is a deep copy of the
DataFrame; a NumPy slice is a VIEW (shallow); .copy() on an array makes an independent copy.
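The difference shows up as soon as a nested object is mutated. A short sketch, using a made-up config dict and a small array:

```python
import copy
import numpy as np

config = {'layers': [64, 32], 'lr': 0.01}
shallow = copy.copy(config)        # new dict, same inner list
deep = copy.deepcopy(config)       # new dict, new inner list

config['layers'].append(16)        # mutate the nested list in place

arr = np.arange(5)
view = arr[1:4]                    # basic slice: a view on arr's buffer
independent = arr[1:4].copy()      # owns its own data
view[0] = 99                       # writes through to arr
```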
■ Q: When would you use a generator over a list?
When data is too large to fit in memory (e.g., streaming millions of rows), when you only need one pass, or
when you want to compose lazy pipelines. PyTorch DataLoader and tf.data follow the same lazy,
one-batch-at-a-time iteration pattern.
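A lazy pipeline can be sketched with plain generators. The row source below (read_rows) is hypothetical and simply fabricates rows in memory; in practice it would wrap a file or database cursor.

```python
import sys

def read_rows(n_rows):
    """Hypothetical row source: yields one row at a time instead of building a list."""
    for i in range(n_rows):
        yield {'id': i, 'value': i * 0.5}

def keep_even(rows):
    return (r for r in rows if r['id'] % 2 == 0)

def values(rows):
    return (r['value'] for r in rows)

# Nothing is computed until sum() pulls items through the pipeline
total = sum(values(keep_even(read_rows(100_000))))

lazy = (i for i in range(100_000))
eager = [i for i in range(100_000)]
print(sys.getsizeof(lazy), sys.getsizeof(eager))   # the generator object stays small
```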
■ Q: What is broadcasting and why does it matter in ML?
Broadcasting is NumPy's rule for performing element-wise operations on arrays with different but
compatible shapes without copying data. It makes batch operations (normalize features, add bias vectors)
efficient and concise — the backbone of forward/backward pass implementations.
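As a rough illustration, assuming a random feature matrix and a toy dense layer (all names here are made up), feature normalization and bias addition both rely on broadcasting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # (samples, features)

# (100, 3) - (3,): the per-feature mean is stretched across all 100 rows
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

# Dense layer: (100, 3) @ (3, 4) -> (100, 4), then the (4,) bias is added to every row
W = rng.normal(size=(3, 4))
b = np.zeros(4)
out = X_norm @ W + b

# Shape surprise: (3,) + (3, 1) silently broadcasts to (3, 3)
surprise = np.array([1, 2, 3]) + np.array([[1], [2], [3]])
```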
■ Q: Explain pandas .loc vs .iloc vs boolean indexing
.loc uses labels (inclusive end), .iloc uses integer positions (exclusive end, like Python slicing), boolean
indexing filters rows by a True/False mask. For ML: use .loc for feature selection by name, .iloc for
positional splits, boolean for data cleaning filters.
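The inclusive/exclusive difference is easiest to see when index labels are not 0, 1, 2. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 31, 45, 28],
                   'dept': ['Eng', 'HR', 'Eng', 'Mkt']},
                  index=[10, 20, 30, 40])   # labels differ from positions

by_label = df.loc[10:30, 'age']       # labels 10, 20, 30 (end included)
by_position = df.iloc[0:2, 0]         # positions 0, 1 (end excluded)
by_mask = df[df['age'] > 25]          # rows where the mask is True
```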
■ Q: What are decorators and where are they used in ML frameworks?
Decorators are functions that wrap other functions to add behaviour (timing, logging, caching, validation). In
ML: @tf.function traces a Python function into a TensorFlow graph; @torch.no_grad() disables
gradient tracking during inference; @property exposes computed attributes like model.num_parameters.
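A minimal timing decorator, written as a sketch: the names timed and train_step are hypothetical, and storing the duration on the wrapper is just one simple way to expose it.

```python
import functools
import time

def timed(func):
    """Hypothetical timing decorator; records the last call's duration on the wrapper."""
    @functools.wraps(func)               # keep func.__name__ and the docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_seconds = time.perf_counter() - start
        return result
    wrapper.last_seconds = None
    return wrapper

@timed
def train_step(batch):
    """Stand-in for real work."""
    return sum(x * x for x in batch)

loss = train_step(range(1000))
print(train_step.__name__, train_step.last_seconds)
```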
■ Q: How does Python manage memory? What is reference counting?
Python uses reference counting as its primary mechanism: each object tracks how many references point
to it. When the count drops to zero, the object is deallocated. A cyclic garbage collector handles reference
cycles. When memory is tight, drop references to large NumPy arrays explicitly (del arr); gc.collect() only
helps if the arrays are caught in a reference cycle.
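Both mechanisms can be observed with weak references. This sketch assumes CPython, where reference counting frees objects immediately; the Node class is made up for the demonstration.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

# Reference counting: the object is freed as soon as the last reference goes away
a = Node()
probe = weakref.ref(a)        # a weak reference does not keep the object alive
del a
freed_by_refcount = probe() is None

# A cycle keeps both counts above zero, so the cyclic collector has to step in
gc.disable()                  # keep automatic collection from interfering with the demo
x, y = Node(), Node()
x.other, y.other = y, x
probe_cycle = weakref.ref(x)
del x, y
alive_before_gc = probe_cycle() is not None
gc.collect()
freed_by_gc = probe_cycle() is None
gc.enable()
```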
■ Q: Difference between map(), filter() and list comprehension?
All three transform sequences. List comprehensions are more Pythonic and readable, and are often slightly
faster than map()/filter() with a lambda because they avoid a function call per element. map() and filter()
return lazy iterators. For ML use: NumPy ufuncs and whole-array expressions are preferred over all three
for numerical data (np.vectorize is a convenience wrapper around a Python loop, not a speedup).
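A short comparison, using a small made-up list, shows that the two styles give the same result and that map() is lazy:

```python
nums = [1, 2, 3, 4, 5, 6]

# Functional style and comprehension style produce the same list
squares_fp = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, nums)))
squares_lc = [n * n for n in nums if n % 2 == 0]

lazy = map(str, nums)     # nothing is computed yet
first = next(lazy)        # pulls exactly one item
rest = list(lazy)         # the iterator resumes after the consumed item
```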
■ Q: What is vectorization and why is it critical in ML?
Vectorization replaces explicit Python loops with C-level array operations (NumPy ufuncs, matrix multiply).
A Python loop over 1M elements may take 100ms; the NumPy equivalent takes ~1ms. In deep learning, the
entire forward pass is a sequence of matrix multiplications — essentially vectorized operations.
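The gap can be measured on your own machine with a sketch like the one below. The array size and the affine transform are arbitrary choices, and the actual timings depend on hardware, so none are quoted here.

```python
import time
import numpy as np

x = np.random.default_rng(0).random(200_000)
w, b = 2.0, 0.5

t0 = time.perf_counter()
loop_out = np.empty_like(x)
for i in range(x.size):            # one interpreter round-trip per element
    loop_out[i] = w * x[i] + b
loop_s = time.perf_counter() - t0

t0 = time.perf_counter()
vec_out = w * x + b                # a single expression; the loop runs in compiled code
vec_s = time.perf_counter() - t0

print(f'loop {loop_s * 1e3:.1f} ms, vectorized {vec_s * 1e3:.2f} ms')
```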
■ Q: How would you detect and handle data leakage in a pandas pipeline?
Data leakage occurs when test-set information is used during training (e.g., fitting a scaler on all data). Fix:
always fit preprocessors (StandardScaler, imputers, encoders) only on training data, then transform both.
Use sklearn Pipeline to enforce this. Check for temporal leakage in time series by using time-based splits.
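The fit-on-train-only rule can be sketched without any ML library. The example below uses plain NumPy standardization as a stand-in for StandardScaler and synthetic data in which the later rows are deliberately shifted; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[150:] += 5.0                        # make the held-out rows look different

X_train, X_test = X[:150], X[150:]    # ordered split, no shuffling

# Leaky: statistics computed on all rows, so test data shapes the training features
mu_leaky = X.mean(axis=0)

# Correct: fit on training rows only, then apply the same transform to both splits
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_sc = (X_train - mu) / sigma
X_test_sc = (X_test - mu) / sigma
```

Wrapping the scaler and model in an sklearn Pipeline gives the same guarantee automatically, including inside cross-validation folds.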
Quick-Reference: Python Ecosystem Interconnections
Concept          Python                NumPy                  Pandas                  Matplotlib
Data container   list, dict            ndarray                DataFrame/Series        -
Iteration        for, generators       np.nditer, vectorize   iterrows (avoid)        -
Functional ops   map, filter, reduce   ufuncs, where          apply, map, transform   -
Math             math module           [Link], [Link]         describe(), corr()      -
I/O              open(), pathlib       np.save/load (.npy)    read_csv/to_csv         savefig
Missing data     None                  np.nan                 NaN, isnull()           -
Type system      dynamic               dtype per array        dtype per column        -
Memory model     ref count + GC        contiguous buffer      column store            -
Performance tip  __slots__, deque      vectorize, einsum      eval(), query()         close(fig)
ML integration   Base language         Tensor math backbone   Data pipeline           Result visualization
Python & AI/ML Comprehensive Study Guide • All sections complete • Good luck!