
AI ML Python Study Guide

The document is a comprehensive study guide for Python and AI/ML, covering topics from core Python concepts to advanced libraries like NumPy, Pandas, and Matplotlib. It includes sections on data structures, object-oriented programming, and practical applications in machine learning, along with interview preparation materials. The guide is suitable for beginners to advanced learners and emphasizes real-world use cases in AI/ML.

Uploaded by

mambiswas769
Python & AI/ML

Comprehensive Study Guide

NumPy · Pandas · Matplotlib


From Basics to Advanced — AI/ML Focused

Covers: Core Python · NumPy · Pandas · Matplotlib


Data Structures · OOP · Generators · Decorators
Vectorization · Feature Engineering · Visualization

2024 Edition | Beginner → Advanced | Interview-Ready


Table of Contents
1. Python — Core to Advanced
1.1 Basics: Syntax, Variables, Data Types, Operators
1.2 Control Flow: Loops & Conditionals
1.3 Functions (all variants incl. Decorators)
1.4 Data Structures (List, Tuple, Set, Dict)
1.5 Object-Oriented Programming
1.6 File & Exception Handling
1.7 Modules, Packages, Virtual Environments
1.8 Iterators & Generators
1.9 Functional Programming
1.10 Memory Management & Performance
1.11 Python in the AI/ML Ecosystem

2. NumPy — Basic to Advanced


2.1 ndarray Creation, Indexing, Slicing
2.2 Broadcasting & Vectorization
2.3 Mathematical & Linear Algebra Operations
2.4 Random Module
2.5 Advanced Indexing & Masking
2.6 Memory Efficiency & Performance
2.7 Role of NumPy in ML

3. Pandas — Basic to Advanced


3.1 Series & DataFrame
3.2 Data Loading (CSV, Excel, JSON)
3.3 Data Cleaning & Missing Values
3.4 Filtering, Selection, GroupBy
3.5 Merge, Join, Concat
3.6 Time Series & Feature Engineering
3.7 Performance Tips & ML Pipelines

4. Matplotlib — Basic to Advanced


4.1 Line, Bar, Scatter, Histogram
4.2 Customization & Subplots
4.3 Advanced Visualizations
4.4 Saving/Exporting & ML Use Cases

5. End-to-End Mini Project


5.1 Data → Preprocessing → Visualization → Insights

6. Reference Tables & Interview Guide


6.1 Complexity Cheat Sheet
6.2 Common Pitfalls
6.3 Top Interview Questions & Answers
Section 1 — Python: Core to Advanced

1.1 Basics: Syntax, Variables, Data Types, Operators


Python is an interpreted, high-level, dynamically-typed language. Its clean syntax makes it the dominant
language in data science and AI/ML.

Variables & Dynamic Typing


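Python is dynamically typed: a name can be bound to an object of any type, and types are checked at runtime
rather than declared. Duck typing is a related idea: code relies on what an object can do, not on its declared
class. Variable names are case-sensitive and should follow the snake_case convention.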

# Variable assignment — no type declaration needed

x = 10 # int

name = 'Alice' # str

pi = 3.14159 # float

flag = True # bool

data = None # NoneType

# Multiple assignment

a, b, c = 1, 2, 3

x = y = z = 0

# Type checking

print(type(x)) # <class 'int'>

print(isinstance(x, int)) # True

# Type conversion

s = str(42) # '42'

n = int('100') # 100

f = float('3.14') # 3.14

Core Data Types


Type       Example         Mutable?   AI/ML Use
int        x = 42          No         Index, count, label
float      x = 3.14        No         Weights, probabilities
complex    x = 2+3j        No         Signal processing, FFT
str        x = 'hello'     No         Text, feature names
bool       x = True        No         Masks, flags
list       x = [1,2,3]     Yes        Dataset samples
tuple      x = (1,2)       No         Fixed configs, coords
dict       x = {'a':1}     Yes        Hyperparameters, configs
set        x = {1,2,3}     Yes        Unique classes, vocab
NoneType   x = None        n/a        Missing values

Operators
# Arithmetic

5 + 3, 5 - 3, 5 * 3, 5 / 3 # 8, 2, 15, 1.666...

5 // 3 # 1 (floor division — common in ML indexing)

5 % 3 # 2 (modulo)

2 ** 10 # 1024 (power — used in learning rate schedules)

# Comparison → returns bool

x == y, x != y, x > y, x >= y

# Logical

True and False # False

True or False # True

not True # False

# Bitwise (used in masks, GPU kernels)

0b1010 & 0b1100 # 0b1000 = 8

0b1010 | 0b1100 # 0b1110 = 14

1 << 3 # 8 (left shift)

# Identity & membership

x is None # identity check (prefer over == None)

3 in [1, 2, 3] # True

5 not in [1,2,3] # True


Note: In AI/ML code, use 'is None' (identity) rather than '== None' to check for a missing value. NumPy arrays
and pandas objects overload ==, so 'arr == None' returns an element-wise result, and using that result in an
if statement raises an ambiguous truth-value error.
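To make the difference concrete without depending on any library, here is a minimal sketch. `AlwaysEqual` is a hypothetical class invented for this illustration; it overloads `__eq__` much as array types do, while `is` cannot be overloaded at all.

```python
class AlwaysEqual:
    """Hypothetical class whose == is overloaded to always return True."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# == dispatches to the overloaded __eq__, so the answer is misleading.
equals_none = (obj == None)  # noqa: E711 (written this way on purpose)

# 'is' compares object identity and cannot be overloaded.
is_none = obj is None

print(equals_none, is_none)
```

The identity check gives the answer you actually want, whatever the object's class decides `==` should mean.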

1.2 Control Flow: Loops & Conditionals


# if / elif / else
score = 0.87
if score >= 0.90:
    print('Excellent')
elif score >= 0.75:
    print('Good')                  # this branch runs
else:
    print('Needs improvement')

# Ternary expression
label = 'positive' if score > 0.5 else 'negative'

# for loop with range
for epoch in range(1, 6):
    print(f'Epoch {epoch}')

# enumerate: index + value (very common in ML training loops)
dataset = [0.1, 0.4, 0.7]
for idx, val in enumerate(dataset):
    print(idx, val)

# zip: iterate multiple iterables in parallel
preds  = [0, 1, 1]
labels = [0, 1, 0]
for p, l in zip(preds, labels):
    print('pred:', p, 'label:', l)

# while loop
loss = 1.0
while loss > 0.01:
    loss *= 0.9

# List comprehension (Pythonic, faster than map+lambda for simple ops)
squares = [x**2 for x in range(10)]
evens = [x for x in range(20) if x % 2 == 0]

# Dict comprehension
word_len = {w: len(w) for w in ['hello', 'world', 'ai']}

# break / continue / pass
for x in range(100):
    if x == 5: break               # early exit
    if x % 2: continue             # skip odd
    pass                           # no-op placeholder

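Tip: List comprehensions are usually faster than an equivalent for-loop with append() in CPython (often quoted
at around 30%, though the gap varies by version and workload), because they avoid the repeated attribute
lookup and method call for append on every iteration.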
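The claim is easy to check on your own machine. The sketch below times both forms with `timeit`; the function names `with_loop` and `with_comprehension` are made up for this example, and the measured gap will differ between Python versions and hardware, so no particular ratio is promised.

```python
import timeit


def with_loop(n):
    """Build the list with an explicit loop and append()."""
    out = []
    for x in range(n):
        out.append(x ** 2)
    return out


def with_comprehension(n):
    """Build the same list with a comprehension."""
    return [x ** 2 for x in range(n)]


n = 10_000
loop_time = timeit.timeit(lambda: with_loop(n), number=50)
comp_time = timeit.timeit(lambda: with_comprehension(n), number=50)
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```

Both functions return identical lists, so the comparison isolates the cost of the loop machinery.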

1.3 Functions

Definition, Scope, Default & Keyword Arguments


# Basic function
def greet(name, greeting='Hello'):     # 'greeting' has a default
    return f'{greeting}, {name}!'

greet('Alice')                         # Hello, Alice!
greet('Bob', 'Hi')                     # Hi, Bob!
greet(greeting='Hey', name='Carol')    # keyword args: order-free

# *args: variable positional arguments (a tuple inside the function)
def sum_all(*args):
    return sum(args)

sum_all(1, 2, 3, 4)                    # 10

# **kwargs: variable keyword arguments (a dict inside the function)
def build_model(**kwargs):
    for k, v in kwargs.items():
        print(f'{k}: {v}')

build_model(layers=3, lr=0.001, dropout=0.2)

# Combined signature. Order matters: positional, *args, keyword-only, **kwargs
def train(data, *callbacks, epochs=10, **hyperparams):
    pass
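To see how a call is split across the four parameter kinds, the sketch below reuses the `train` signature but returns the bound arguments instead of doing any training. The argument values are arbitrary illustrations.

```python
def train(data, *callbacks, epochs=10, **hyperparams):
    """Return the bound arguments so the binding is visible."""
    return {
        "data": data,
        "callbacks": callbacks,
        "epochs": epochs,
        "hyperparams": hyperparams,
    }


# Extra positionals land in callbacks; unknown keywords land in hyperparams.
bound = train([1, 2, 3], print, len, epochs=5, lr=0.01, momentum=0.9)
print(bound)
```

Because `epochs` comes after `*callbacks`, it is keyword-only: `train(data, 5)` would put 5 into `callbacks` rather than `epochs`.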

Lambda Functions
# Lambda = anonymous single-expression function

square = lambda x: x ** 2

square(5) # 25

# Very useful with sorted(), map(), filter()

students = [('Alice', 88), ('Bob', 95), ('Carol', 72)]

sorted_students = sorted(students, key=lambda s: s[1], reverse=True)

# [('Bob', 95), ('Alice', 88), ('Carol', 72)]

# In pandas: apply a lambda to a column

# df['score_norm'] = df['score'].apply(lambda x: (x - df['score'].mean()) / df['score'].std())

Recursion
# Factorial
def factorial(n):
    if n <= 1: return 1              # base case
    return n * factorial(n - 1)      # recursive case

# Tree traversal (common in decision trees)
def dfs(node):
    if node is None: return
    print(node.val)
    dfs(node.left)
    dfs(node.right)

# Python's default recursion limit is 1000
import sys
sys.setrecursionlimit(10000)         # increase if needed

Decorators — With Real-World Use Cases


A decorator is a function that wraps another function to extend its behaviour without modifying it. Decorators
are heavily used in web and ML frameworks (Flask routes, TensorFlow, PyTorch).

import time, functools

# --- Basic decorator ---
def timer(func):
    @functools.wraps(func)                 # preserves __name__, __doc__
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f'{func.__name__} ran in {end-start:.4f}s')
        return result
    return wrapper

@timer
def train_model(epochs):
    time.sleep(0.1)

train_model(10)                            # e.g. train_model ran in 0.1002s

# --- Logger decorator ---
def logger(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__} with {args}, {kwargs}')
        return func(*args, **kwargs)
    return wrapper

# --- Decorator with arguments (factory pattern) ---
def retry(max_attempts=3):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1: raise
                    print(f'Attempt {attempt+1} failed: {e}')
        return wrapper
    return decorator

@retry(max_attempts=5)
def fetch_data(url): ...

# --- Stacking decorators ---
# Decorators are applied bottom-up: logger wraps predict first, then timer
# wraps the result. On a call, timer's wrapper therefore runs first
# (outermost) and its timing includes the logging.
@timer
@logger
def predict(x): return x * 2

# --- Class-based decorator ---
class Memoize:
    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]

@Memoize
def fib(n):
    if n < 2: return n
    return fib(n-1) + fib(n-2)

Real-world ML decorators: @tf.function (compiles a Python function into a TensorFlow graph), @torch.no_grad()
(disables gradient tracking during inference), @app.route() (Flask endpoint registration; FastAPI uses
@app.get() and similar), @property (getters in model classes).
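The order in which stacked decorators execute is a common source of confusion, so here is a small sketch that records it. The `tag` decorator factory and the `calls` list are hypothetical names used only for this demonstration.

```python
import functools

calls = []  # records the order in which the wrappers execute


def tag(name):
    """Hypothetical decorator factory that logs entry to and exit from a wrapper."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(f"enter {name}")
            result = func(*args, **kwargs)
            calls.append(f"exit {name}")
            return result
        return wrapper
    return decorator


@tag("outer")
@tag("inner")
def predict(x):
    return x * 2


result = predict(21)
print(calls)
```

The decorator closest to the function is applied first, but the top-most one ends up outermost, so its wrapper is entered first and exited last.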

1.4 Data Structures

List — Ordered, Mutable, Allows Duplicates


lst = [1, 2, 3, 4, 5]

# Indexing & slicing

lst[0] # 1 (first)

lst[-1] # 5 (last)

lst[1:4] # [2,3,4]

lst[::2] # [1,3,5] (step=2)

lst[::-1] # [5,4,3,2,1] (reverse)

# Common methods
lst.append(6)        # [1,2,3,4,5,6]
lst.extend([7,8])    # extend from iterable
lst.insert(0, 0)     # insert at index
lst.remove(3)        # remove first occurrence
lst.pop()            # remove & return last
lst.pop(0)           # remove & return index 0
lst.sort()           # in-place sort O(n log n)
sorted(lst)          # new sorted list
lst.reverse()        # in-place reverse
lst.index(4)         # find index
lst.count(2)         # count occurrences

# List of lists (2D — used for matrices before NumPy)

matrix = [[1,2,3],[4,5,6],[7,8,9]]

matrix[1][2] # 6
# Flatten with comprehension

flat = [x for row in matrix for x in row]

Operation         List              Complexity       Notes
Append            lst.append(x)     O(1) amortized   Fast, common in loops
Insert at i       lst.insert(i,x)   O(n)             Shifts elements
Access by index   lst[i]            O(1)             Direct memory offset
Search            x in lst          O(n)             Use set for O(1) lookup
Delete by index   del lst[i]        O(n)             Shifts elements
Sort              lst.sort()        O(n log n)       Timsort (stable)
Slice copy        lst[a:b]          O(b-a)           Creates new list

Tuple — Ordered, Immutable


t = (1, 2, 3)

t = 1, 2, 3 # parentheses optional

single = (42,) # single-element needs trailing comma!

# Useful as dict keys (lists can't be dict keys)

coords = {(0,0): 'origin', (1,0): 'right'}

# Named tuple — self-documenting, used in dataset pipelines

from collections import namedtuple

Sample = namedtuple('Sample', ['features', 'label'])

s = Sample(features=[0.1, 0.9], label=1)

print(s.label) # 1

# Tuple unpacking

x, y, z = (10, 20, 30)

first, *rest = (1, 2, 3, 4, 5) # rest = [2,3,4,5]

Dictionary — Key-Value, Ordered (Python 3.7+), Mutable


d = {'lr': 0.001, 'batch': 32, 'epochs': 100}

# Access & defaults
d['lr']                      # 0.001 (KeyError if the key is missing)
d.get('dropout', 0.0)        # 0.0 (safe default)

# Modification
d['optimizer'] = 'adam'
d.update({'lr': 0.0001, 'weight_decay': 1e-5})

# Iteration
for key in d: pass           # iterate keys
for val in d.values(): pass
for k, v in d.items(): pass  # most common

# Dict comprehension
squared = {x: x**2 for x in range(5)}

# Merge dicts (Python 3.9+); d1 and d2 stand for any two dicts
merged = d1 | d2

# defaultdict: avoids KeyError (text stands for any input string)
from collections import defaultdict
word_count = defaultdict(int)
for word in text.split():
    word_count[word] += 1

# Counter: instant frequency count
from collections import Counter
c = Counter(['cat','dog','cat','bird','cat'])
c.most_common(2)             # [('cat',3), ('dog',1)]

# OrderedDict (legacy: plain dicts are ordered since 3.7)
from collections import OrderedDict

Set — Unordered, Unique Elements


s = {1, 2, 3, 4}
s.add(5)         # {1,2,3,4,5}
s.discard(6)     # no error if missing (unlike remove, which raises KeyError)


# Set operations (useful for NLP / class analysis)

a = {1, 2, 3, 4}

b = {3, 4, 5, 6}

a | b # union {1,2,3,4,5,6}

a & b # intersect {3,4}

a - b # difference {1,2}

a ^ b # symmetric diff {1,2,5,6}

# O(1) membership test — much faster than list for large sets

vocab = set(words) # build once

'hello' in vocab # O(1)

Interview: Why is set lookup O(1)? Sets are implemented as hash tables: the hash of an element maps directly
to a bucket, so a membership test takes O(1) time on average instead of a linear scan.
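One way to observe this is to count equality comparisons. In the sketch below, `Token` is a hypothetical hashable class written for this example; it increments a counter each time `__eq__` runs, so the list scan and the hash lookup can be compared directly.

```python
class Token:
    """Hypothetical hashable wrapper that counts equality checks."""
    eq_calls = 0

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        Token.eq_calls += 1
        return isinstance(other, Token) and self.text == other.text


words = [Token(f"w{i}") for i in range(1000)]
vocab = set(words)                 # build the hash table once
target = Token("w999")             # equal to the last element, but a new object

Token.eq_calls = 0
found_in_list = target in words    # compares against elements one by one
list_checks = Token.eq_calls

Token.eq_calls = 0
found_in_set = target in vocab     # hashes target, then checks only that bucket
set_checks = Token.eq_calls

print(list_checks, set_checks)
```

The list needs one comparison per element until it finds a match, while the set needs only a handful at most, regardless of how many elements it holds.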

1.5 Object-Oriented Programming


# --- Class definition ---
class NeuralNetwork:
    # Class attribute (shared across all instances)
    framework = 'PyTorch'

    def __init__(self, layers, lr=0.001):     # constructor
        self.layers = layers                  # instance attribute
        self.lr = lr
        self._weights = None                  # '_'  = private by convention
        self.__bias = 0.0                     # '__' = name-mangled

    def __repr__(self):                       # developer-facing string
        return f'NeuralNetwork(layers={self.layers}, lr={self.lr})'

    def __str__(self):                        # user-facing string
        return f'{self.framework} model with {len(self.layers)} layers'

    def forward(self, x):
        return x                              # placeholder

    @classmethod
    def from_config(cls, config: dict):
        return cls(config['layers'], config['lr'])

    @staticmethod
    def relu(x):                              # no self/cls: utility method
        return max(0, x)

    @property
    def num_params(self):
        return sum(l['units'] for l in self.layers)

# --- Inheritance ---
class CNN(NeuralNetwork):                     # CNN inherits NeuralNetwork
    def __init__(self, layers, lr, kernel_size=3):
        super().__init__(layers, lr)          # call parent __init__
        self.kernel_size = kernel_size

    def forward(self, x):                     # method override (polymorphism)
        print(f'CNN forward with kernel {self.kernel_size}')
        return super().forward(x)

# --- Multiple inheritance ---
class Trainable:
    def fit(self, X, y): pass

class Serializable:
    def save(self, path): pass

class Model(NeuralNetwork, Trainable, Serializable):
    pass      # MRO (Method Resolution Order) defines lookup order

print(Model.__mro__)                          # shows resolution chain

# --- Dunder (magic) methods ---
class Dataset:
    def __init__(self, data):
        self.data = data

    def __len__(self): return len(self.data)
    def __getitem__(self, i): return self.data[i]
    def __iter__(self): return iter(self.data)
    def __contains__(self, x): return x in self.data

ds = Dataset([1,2,3,4,5])
len(ds)               # 5
ds[2]                 # 3
for x in ds: pass
3 in ds               # True

OOP Pillar      Concept                     Python Mechanism                ML Example
Encapsulation   Hide internal state         _attr, __attr, @property        Model weights hidden from user
Inheritance     Reuse parent behaviour      class Child(Parent):            CNNLayer extends Layer
Polymorphism    Same interface, diff impl   Method override / duck typing   model.forward() for CNN/RNN
Abstraction     Hide complexity             ABC, abstract methods           sklearn BaseEstimator
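The abstraction and polymorphism rows can be illustrated with the standard `abc` module. The classes `BaseModel`, `Doubler` and `Thresholder` below are hypothetical and exist only for this sketch; they are not taken from scikit-learn or any other library.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical abstract interface: subclasses must implement predict."""

    @abstractmethod
    def predict(self, x):
        ...

    def predict_many(self, xs):
        # Shared behaviour built on top of the abstract method.
        return [self.predict(x) for x in xs]


class Doubler(BaseModel):
    def predict(self, x):
        return 2 * x


class Thresholder(BaseModel):
    def predict(self, x):
        return int(x > 0.5)


# Polymorphism: the same call works on every subclass.
models = [Doubler(), Thresholder()]
outputs = [m.predict_many([0.2, 0.9]) for m in models]

# Abstraction: the base class itself cannot be instantiated.
try:
    BaseModel()
except TypeError as exc:
    abstract_error = str(exc)

print(outputs)
```

Callers only need to know the `predict` interface, which is the same idea that lets scikit-learn estimators be swapped inside a pipeline.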

1.6 File Handling & Exception Handling


# --- File I/O ---
# Always use a context manager (with): it closes the file automatically.
# File names below are placeholders.
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()                 # whole file as one string

with open('data.txt', 'r') as f:
    lines = f.readlines()              # list of lines

with open('output.txt', 'w') as f:
    f.write('Hello, AI!\n')

with open('train.log', 'a') as f:      # append mode
    f.write('epoch 10 done\n')

# --- pathlib (modern, preferred) ---
from pathlib import Path

p = Path('data/models')
p.mkdir(parents=True, exist_ok=True)
for f in p.glob('*.pt'):               # glob pattern matching
    print(f.name, f.suffix)

# --- Exception handling ---
# load_model and X are assumed to be defined elsewhere.
try:
    model = load_model('model.pkl')
    preds = model.predict(X)
except FileNotFoundError as e:
    print(f'Model not found: {e}')
except ValueError as e:
    print(f'Bad input shape: {e}')
except Exception as e:                 # catch-all (use sparingly)
    print(f'Unexpected error: {e}')
    raise                              # re-raise so the error is not swallowed
else:
    print('Prediction succeeded')      # runs only if no exception was raised
finally:
    print('Cleanup')                   # always runs

# Custom exceptions
class DataValidationError(ValueError):
    def __init__(self, msg, column=None):
        super().__init__(msg)
        self.column = column

raise DataValidationError('NaN found', column='age')
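A custom exception earns its keep when the caller uses the extra attribute. The sketch below repeats the exception class so it runs on its own and adds a hypothetical `validate` helper (an assumption for illustration, not part of any library) that raises it.

```python
import math


class DataValidationError(ValueError):
    """Same shape as the custom exception above: a message plus a column."""

    def __init__(self, msg, column=None):
        super().__init__(msg)
        self.column = column


def validate(rows, column):
    """Hypothetical helper: raise if any row holds NaN in the given column."""
    for i, row in enumerate(rows):
        value = row[column]
        if isinstance(value, float) and math.isnan(value):
            raise DataValidationError(f"NaN found in row {i}", column=column)
    return True


rows = [{"age": 31.0}, {"age": float("nan")}]
try:
    validate(rows, "age")
    failed_column = None
except DataValidationError as exc:
    failed_column = exc.column        # structured detail, no string parsing
    message = str(exc)

print(failed_column)
```

Subclassing `ValueError` means existing `except ValueError` handlers still catch it, while new code can handle the more specific type.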

1.7 Modules, Packages & Virtual Environments


# Import styles
import numpy               # standard: use numpy.array()
import numpy as np         # aliased: use np.array()
from numpy import array    # direct import
from numpy import *        # wildcard: AVOID (pollutes namespace)

# __name__ guard: code only runs when the file is executed directly
if __name__ == '__main__':
    main()

# Package structure (module names are examples)
# mypackage/
# ├── __init__.py          (makes it a package)
# ├── utils.py
# └── models/
#     ├── __init__.py
#     └── cnn.py

# Virtual environments
# python -m venv venv                  # create
# source venv/bin/activate             # activate (Linux/Mac)
# venv\Scripts\activate                # activate (Windows)
# pip install numpy pandas             # install packages
# pip freeze > requirements.txt        # save deps
# pip install -r requirements.txt      # restore

# Useful standard library modules for ML
import os, sys, json, csv, re
from datetime import datetime
from collections import defaultdict, Counter, deque
from itertools import chain, product, combinations
import functools, operator
import multiprocessing, concurrent.futures
import logging, warnings
import pickle              # serialization (standard library)
import joblib              # serialization (third-party: pip install joblib)

1.8 Iterators & Generators


# --- Iterator protocol ---
class Range:
    def __init__(self, n): self.n = n; self.i = 0
    def __iter__(self): return self
    def __next__(self):
        if self.i >= self.n: raise StopIteration
        val = self.i; self.i += 1; return val

# --- Generator function (yield) ---
def data_loader(file_path, batch_size=32):
    '''Yields mini-batches: memory efficient for large datasets'''
    batch = []
    with open(file_path) as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch: yield batch              # last partial batch

# 'data.txt' is a placeholder; train_step is assumed to be defined elsewhere
for batch in data_loader('data.txt'):
    train_step(batch)                  # never loads the full file into memory

# --- Generator expression (lazy list comprehension) ---
gen = (x**2 for x in range(1_000_000)) # nothing computed yet
next(gen)                              # compute on demand

# --- Send values INTO a generator ---
def accumulator():
    total = 0
    while True:
        val = yield total
        if val is None: break
        total += val

acc = accumulator()
next(acc)          # prime the generator
acc.send(10)       # 10
acc.send(20)       # 30

# --- itertools (memory-efficient combinatorics) ---
from itertools import islice, chain, product, cycle

first_5 = list(islice(gen, 5))         # take the next 5 items from the generator
all_data = chain(train, val, test)     # chain iterables lazily

Generators embody the streaming idea behind PyTorch's DataLoader and TensorFlow's tf.data pipeline: both
process datasets larger than RAM by producing batches on demand.
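The same batching pattern works for any iterable, not only files. Below is a minimal sketch built on `itertools.islice`; the helper name `batched` is chosen for this example and the input stream is synthetic.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materialising the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return                        # input exhausted
        yield batch


stream = (x * x for x in range(10))       # lazy source, nothing computed yet
batches = list(batched(stream, 4))
print(batches)
```

Only one batch is held in memory at a time, and the final batch is simply shorter when the input does not divide evenly.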

1.9 Functional Programming


from functools import reduce, partial, lru_cache

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# map — apply function to every element (lazy, returns iterator)

squares = list(map(lambda x: x**2, nums)) # [1,4,9,16,...]

# filter — keep elements where function returns True

evens = list(filter(lambda x: x % 2 == 0, nums)) # [2,4,6,8,10]


# reduce — fold list into single value

product = reduce(lambda a, b: a * b, nums) # 3628800

# In practice, prefer comprehensions or NumPy over map/filter

squares_comp = [x**2 for x in nums] # more readable

# partial — fix some arguments (creates new function)

def power(base, exp): return base ** exp

square = partial(power, exp=2)

cube = partial(power, exp=3)

square(4) # 16

# lru_cache — memoize expensive function calls

@lru_cache(maxsize=None)           # maxsize=None = unbounded
def expensive_feature(param):
    # simulate slow feature extraction
    return param ** 3 + param ** 2

# functools.wraps (already covered in decorators)

# zip, enumerate, any, all — built-in functional tools

any(x > 5 for x in nums) # True (short-circuits)

all(x > 0 for x in nums) # True

sum(x**2 for x in nums) # 385 (generator expression)
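To confirm that `lru_cache` really skips repeated work, the decorated function exposes `cache_info()`. The sketch below uses a hypothetical `slow_square` function as a stand-in for an expensive computation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(n):
    """Hypothetical stand-in for an expensive computation."""
    return n * n


# Five calls, but only two distinct arguments.
results = [slow_square(n) for n in (3, 4, 3, 3, 4)]
info = slow_square.cache_info()
print(results, info)
```

Arguments must be hashable for the cache to work, so pass tuples rather than lists or arrays.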

1.10 Memory Management & Performance


import sys, tracemalloc

# Memory sizes (typical 64-bit CPython; exact values vary by version)
print(sys.getsizeof([]))               # 56 bytes (empty list)
print(sys.getsizeof(range(1000)))      # 48 bytes (range is lazy!)

# --- Python memory model ---
# Everything is an object. CPython caches the small integers -5 to 256.
a = 256; b = 256; print(a is b)        # True (same cached object)
a = 257; b = 257; print(a is b)        # implementation detail: False when the
                                       # lines are run separately in the REPL,
                                       # often True when compiled together.
                                       # Compare values with ==, not is.

# Reference counting + cyclic GC
import gc
gc.collect()                           # force garbage collection

# --- __slots__: eliminates the per-instance __dict__ ---
class Point:                           # without __slots__: roughly 184 bytes per instance
    __slots__ = ['x', 'y']             # with __slots__: roughly 56 bytes
    def __init__(self, x, y): self.x = x; self.y = y

# --- Profiling ---
# cProfile (function-level); train_loop is assumed to be defined elsewhere
import cProfile
cProfile.run('train_loop()')

# timeit: micro-benchmarks
import timeit
t = timeit.timeit('[x**2 for x in range(1000)]', number=1000)

# tracemalloc: memory tracking
tracemalloc.start()
# ... run code ...
snap = tracemalloc.take_snapshot()
for stat in snap.statistics('lineno')[:5]: print(stat)

# --- Performance tips ---
# 1. Use NumPy arrays instead of Python lists for numbers
# 2. Use generators for large data (don't load all at once)
# 3. Avoid global variables in hot loops (LOAD_GLOBAL is slow)
# 4. Use local variable aliases in tight loops
# 5. String concatenation: use ''.join(lst) not += in a loop
# 6. Use set/dict for O(1) lookups instead of list O(n)
# 7. Use collections.deque for O(1) popleft instead of list.pop(0)
# 8. Use multiprocessing for CPU-bound, asyncio for I/O-bound tasks
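Tips 5 and 7 can be shown in a few lines. This is only a sketch with toy data; the variable names are illustrative.

```python
from collections import deque

# Tip 7: deque supports O(1) operations at both ends.
queue = deque([1, 2, 3])
queue.append(4)              # O(1) at the right end
first = queue.popleft()      # O(1) at the left end; list.pop(0) is O(n)

# Tip 5: build the pieces first, then join once.
parts = [str(i) for i in range(5)]
joined = ",".join(parts)     # one pass instead of repeated += on a string

print(first, list(queue), joined)
```

The difference is negligible for a handful of items and becomes significant once queues or strings grow to many thousands of elements.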

1.11 Python in the AI/ML Ecosystem


Library/Framework Purpose Key Abstractions

NumPy Numerical computing ndarray, broadcasting, ufuncs

Pandas Data manipulation DataFrame, Series, GroupBy

Matplotlib/Seaborn Visualization Figure, Axes, FacetGrid

Scikit-learn Classical ML fit/transform/predict API

PyTorch Deep learning Tensor, Module, autograd

TensorFlow/Keras Deep learning Tensor, Layer, Model

Hugging Face NLP/Transformers Tokenizer, Trainer, Pipeline

OpenCV Computer vision Mat, imread, VideoCapture

NLTK/spaCy NLP preprocessing Token, Doc, Span

XGBoost/LightGBM Gradient boosting DMatrix, Booster

MLflow/W&B Experiment tracking Run, Artifact, Metric

FastAPI ML model serving APIRouter, Pydantic model


Section 2 — NumPy: Basic to Advanced

2.1 ndarray Creation, Indexing & Slicing


NumPy's ndarray is the fundamental data structure for numerical computing in Python. All major ML
frameworks use NumPy-compatible array formats internally.

import numpy as np

# --- Creation ---
a = np.array([1, 2, 3, 4, 5])                  # 1D from list
b = np.array([[1,2,3],[4,5,6]])                # 2D (shape 2,3)
c = np.array([1.0, 2, 3], dtype=np.float32)    # explicit dtype

np.zeros((3,4))            # zeros matrix
np.ones((2,3,4))           # ones tensor (3D)
np.eye(5)                  # identity matrix
np.full((3,3), 7)          # filled with 7
np.arange(0, 10, 0.5)      # like range but float-capable
np.linspace(0, 1, 100)     # 100 evenly spaced points in [0,1]
np.logspace(0, 3, 10)      # 10 log-spaced points in [1, 1000]
np.empty((2,2))            # uninitialized (fast allocation)

# --- Array attributes ---
a = np.array([[1,2,3],[4,5,6]])
a.shape        # (2, 3)
a.ndim         # 2
a.dtype        # int64 (platform-dependent default)
a.size         # 6 (total elements)
a.nbytes       # 48 (memory in bytes)
a.itemsize     # 8 (bytes per element)

# --- Indexing & Slicing ---
a[0]           # first row: [1,2,3]
a[1, 2]        # row 1, col 2: 6
a[:, 1]        # all rows, col 1: [2,5]
a[0, :]        # row 0 (same as a[0])
a[0:2, 1:3]    # sub-matrix: [[2,3],[5,6]]
a[::1, ::-1]   # reverse columns

# --- Reshaping ---
a.reshape(3, 2)             # new shape (same data, a view when possible)
a.reshape(-1, 1)            # column vector (rows inferred)
a.flatten()                 # 1D copy
a.ravel()                   # 1D view when possible (faster than flatten)
a.T                         # transpose (view, not copy)
np.expand_dims(a, axis=0)   # add dimension: (1,2,3)
np.squeeze(a)               # remove size-1 dimensions
a[np.newaxis, :]            # same as expand_dims

2.2 Broadcasting & Vectorization


Broadcasting is NumPy's mechanism for performing element-wise operations on arrays of different shapes
without copying data. It is the key to efficient ML computations.

# --- Broadcasting rules ---
# 1. If arrays differ in ndim, prepend 1s to the smaller shape
# 2. Arrays with size 1 along a dim are stretched to match
# 3. Shapes must match or one must be 1

a = np.array([[1],[2],[3]])        # shape (3,1)
b = np.array([10, 20, 30])         # shape (3,)

# b is treated as (1,3), then both broadcast to (3,3)
a + b
# [[11,21,31],
#  [12,22,32],
#  [13,23,33]]

# --- Common ML broadcasting patterns ---
# Feature standardization: (batch, features) - (1, features)
X = np.random.randn(1000, 128)     # 1000 samples, 128 features
mean = X.mean(axis=0)              # (128,)
std = X.std(axis=0)                # (128,)
X_norm = (X - mean) / std          # broadcasts (1000,128) - (128,)

# Bias addition in neural net layers
W = np.random.randn(128, 64)       # weight matrix
b = np.zeros(64)                   # bias vector (64,)
Z = X @ W + b                      # (1000,64) + (64,) broadcasts to (1000,64)

# --- Vectorization vs Python loop ---
# SLOW: pure Python
result = [x**2 + 2*x for x in range(1000000)]

# FAST: NumPy vectorized (typically 50-200x faster)
x = np.arange(1000000)
result = x**2 + 2*x

# Universal functions (ufuncs): applied element-wise in C
np.exp(x), np.log(x), np.sqrt(x), np.abs(x)
np.sin(x), np.cos(x), np.tanh(x)
np.maximum(x, 0)                   # ReLU!
np.clip(x, 0, 1)                   # clip values into [0, 1]

Note: np.maximum(x, 0) is exactly the ReLU activation. np.clip(logits, -500, 500) guards against overflow in
exp-based functions such as softmax. Vectorize everything that runs in a training loop.
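The three broadcasting rules given earlier in this section can be written out as a small shape calculator. This is a pure-Python sketch for building intuition; `broadcast_shape` is a name chosen for this example and is not NumPy's own implementation.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: left-pad the shorter shape with 1s.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b:
            result.append(dim_a)          # Rule 3: sizes already match
        elif dim_a == 1:
            result.append(dim_b)          # Rule 2: stretch a along this axis
        elif dim_b == 1:
            result.append(dim_a)          # Rule 2: stretch b along this axis
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((3, 1), (3,)))          # the a + b example
print(broadcast_shape((1000, 128), (128,)))   # the X - mean example
```

Checking shapes this way before running a computation makes it easier to spot an accidental broadcast, such as (n,) against (n,1) producing an (n,n) result.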

2.3 Mathematical & Linear Algebra Operations


# --- Aggregate operations ---
a = np.array([[1,2,3],[4,5,6]])

np.sum(a)                  # 21 (all elements)
np.sum(a, axis=0)          # [5,7,9] (sum per column)
np.sum(a, axis=1)          # [6,15] (sum per row)
np.mean(a, axis=0)
np.std(a, axis=0)
np.min(a), np.max(a)
np.argmin(a, axis=1)       # index of min in each row
np.argmax(a, axis=1)       # index of max in each row: predicted class in classification!
np.cumsum(a)               # cumulative sum

# --- Linear algebra (np.linalg) ---
A = np.random.randn(4, 4)

np.dot(A, A.T)             # matrix multiplication (A·Aᵀ)
A @ A.T                    # same, cleaner syntax (Python 3.5+)
np.linalg.det(A)           # determinant
np.linalg.inv(A)           # inverse
np.linalg.norm(A)          # Frobenius norm
np.linalg.norm(A, axis=1)  # row-wise L2 norm

# Eigendecomposition (PCA, graph algorithms)
# For symmetric matrices such as A @ A.T, np.linalg.eigh is the more robust choice.
eigenvalues, eigenvectors = np.linalg.eig(A @ A.T)

# SVD (dimensionality reduction, matrix factorization)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Solve linear system Ax = b
b = np.random.randn(4)
x = np.linalg.solve(A, b)

# QR decomposition
Q, R = np.linalg.qr(A)

# Least squares (linear regression!)
# min ||Xw - y||^2
X = np.random.randn(100, 4)
y = np.random.randn(100)
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

2.4 Random Module


rng = np.random.default_rng(seed=42)      # recommended modern API

rng.random((3,4))                         # uniform [0,1)
rng.integers(0, 10, (3,4))                # random ints
rng.normal(0, 1, (100,10))                # Gaussian: weight init
rng.uniform(-1, 1, 1000)                  # uniform range
rng.choice([1,2,3,4,5], size=3, replace=False)   # sampling
rng.shuffle(arr)                          # in-place shuffle (arr: any array)
rng.permutation(100)                      # shuffled indices: train/val split

# Legacy API (still widely seen)
np.random.seed(42)
np.random.randn(3, 3)                     # standard normal
np.random.randint(0, 100, (5,))

# Distributions used in ML
rng.standard_normal((batch, features))    # unit Gaussian; scale by sqrt(1/fan_in) for Xavier-style init
rng.binomial(n=1, p=0.5, size=1000)       # Bernoulli: coin flip
rng.multinomial(n=1, pvals=[.2,.3,.5])    # categorical sampling

2.5 Advanced Indexing & Masking


a = np.arange(10)          # [0,1,2,3,4,5,6,7,8,9]

# --- Boolean (mask) indexing ---
mask = a > 5
a[mask]                    # [6,7,8,9] (returns a copy)
a[a % 2 == 0]              # [0,2,4,6,8]

# Set values using a mask
a[a < 0] = 0               # ReLU in one line!

# --- Fancy indexing ---
idx = [1, 3, 7]
a[idx]                     # [1,3,7] (returns a copy)

# 2D fancy indexing
mat = np.arange(25).reshape(5,5)
rows = [0, 2, 4]
cols = [1, 3, 0]
mat[rows, cols]            # [1, 13, 20]: elements (0,1), (2,3), (4,0)

# np.where: conditional selection / replacement
x = np.array([-2, -1, 0, 1, 2])
np.where(x > 0, x, 0)      # [0,0,0,1,2] (ReLU again!)
np.where(x > 0, 1, -1)     # [-1,-1,-1,1,1] (sign-like; note that 0 maps to -1)

# np.nonzero: indices of nonzero elements
np.nonzero(x)              # (array([0, 1, 3, 4]),): every index except the 0 at position 2

# np.take & np.put
np.take(a, [2,4,6])        # a[[2,4,6]]

2.6 Memory Efficiency & Performance


# --- Views vs Copies ---
a = np.array([1, 2, 3, 4, 5])
b = a[1:4]                 # VIEW: shares memory
b[0] = 99                  # a is also modified!
c = a[1:4].copy()          # COPY: independent

# Check if it's a view
b.base is a                # True

# --- dtype selection: critical for memory in DL ---
# float64: 8 bytes (NumPy default, same as a Python float)
# float32: 4 bytes (standard in deep learning)
# float16: 2 bytes (mixed precision training)
# int32:   4 bytes
# int8:    1 byte  (quantized models)
X = np.random.randn(1000, 1000).astype(np.float32)
print(X.nbytes)            # 4,000,000 bytes (4 MB) vs 8 MB for float64

# --- Contiguous arrays: for C/Fortran interop ---
a.flags['C_CONTIGUOUS']    # row-major (C order)
a.flags['F_CONTIGUOUS']    # column-major (Fortran order)
np.ascontiguousarray(a)    # ensure C-contiguous

# --- np.einsum: readable and efficient tensor ops ---
# Matrix multiplication
np.einsum('ij,jk->ik', A, B)                   # same as A @ B

# Batch matrix multiply
np.einsum('bij,bjk->bik', A_batch, B_batch)

# Trace
np.einsum('ii->', A)

# Outer product
np.einsum('i,j->ij', a, b)

Operation                    Python List     NumPy Array        Speedup
Sum of 1M ints               ~60 ms          ~1 ms              ~60x
Element-wise multiply        ~100 ms         ~2 ms              ~50x
Memory (1M floats)           ~28 MB          ~8 MB (float64)    ~3.5x
Matrix multiply 1000x1000    Manual loops    ~5 ms (BLAS)       1000x+
Boolean indexing             List comp       Native (C)         ~30x

2.7 NumPy's Role in ML


ML Operation                NumPy Implementation                               Notes
Linear regression forward   y_pred = X @ w + b                                 Matrix-vector product
MSE loss                    np.mean((y_pred - y)**2)                           Vectorized over batch
Gradient descent step       w -= lr * grad                                     Broadcasting update
Softmax                     np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   Subtract z.max(axis=1, keepdims=True) from z first for numerical stability
ReLU activation             np.maximum(0, z)                                   Element-wise ufunc
One-hot encoding            np.eye(n_classes)[labels]                          Fancy indexing trick
Train/test split            idx = np.random.permutation(n)                     Shuffle then slice
Normalization               (X - X.mean(0)) / X.std(0)                         Broadcasting over axis=0
Cosine similarity           X @ Y.T / (norm(X)*norm(Y))                        Linear algebra
PCA (manual)                U,s,Vt = np.linalg.svd(X_centered)                 SVD decomposition
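The softmax row mentions numerical stability through max subtraction. Here is a minimal pure-Python sketch of why the shift matters, using `math.exp` on a plain list; in NumPy the same shift would be applied per row. The function name `softmax` and the sample logits are illustrative only.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted by the max for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]   # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) on its own raises OverflowError, yet the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs)
```

Subtracting a constant from every logit does not change the result, because the factor exp(-shift) cancels between the numerator and the denominator.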


Section 3 — Pandas: Basic to Advanced

3.1 Series & DataFrame


import pandas as pd
import numpy as np

# --- Series: 1D labelled array ---
s = pd.Series([10, 20, 30, 40], index=['a','b','c','d'])
s['b']             # 20 (label-based)
s.iloc[1]          # 20 (position-based; plain s[1] is deprecated for labelled indexes)
s[['a','c']]       # [10,30]
s[s > 15]          # filter: [20,30,40]

# --- DataFrame: 2D table ---
df = pd.DataFrame({
    'age':    [25, 30, 35, 28],
    'salary': [50000, 80000, 120000, 60000],
    'dept':   ['Eng','Mkt','Eng','HR'],
    'score':  [0.8, 0.9, 0.95, 0.7]
})

# --- Inspection ---
df.shape           # (4, 4)
df.dtypes          # column data types
df.info()          # dtype + non-null counts + memory
df.describe()      # statistics (count, mean, std, min, 25%, 50%, 75%, max)
df.head(2)         # first 2 rows
df.tail(2)         # last 2 rows
df.columns         # Index(['age','salary','dept','score'])
df.index           # RangeIndex(start=0, stop=4, step=1)
df.values          # underlying NumPy array
df.nunique()       # unique count per column
df.value_counts('dept')    # frequency of each dept

3.2 Data Loading


# CSV

df = pd.read_csv('[Link]')

df = pd.read_csv('[Link]',

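    sep=',',
    header=0,
    index_col='id',
    usecols=['id','date','age','salary','target'], # must include the index_col and parse_dates columns
    dtype={'age': np.int32},
    parse_dates=['date'],
    na_values=['NA','N/A','?'],
    chunksize=10000) # with chunksize set, the result is an iterator of DataFrame chunks, not a DataFrame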

# Excel

df = pd.read_excel('[Link]', sheet_name='Sheet1')

# JSON

df = pd.read_json('[Link]')

df = pd.read_json('[Link]', orient='records')

# SQL

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///[Link]')

df = pd.read_sql('SELECT * FROM users WHERE active=1', engine)

# Parquet (recommended for large datasets — columnar format)

df = pd.read_parquet('[Link]')

df.to_parquet('[Link]', compression='snappy')

# Save
df.to_csv('[Link]', index=False)

df.to_excel('[Link]', index=False)
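
Because `chunksize` turns `read_csv` into an iterator, the result has to be consumed chunk by chunk. The sketch below assumes a tiny in-memory CSV (the `csv_text` string and its column names are made up for illustration) in place of a large file, and accumulates a running mean without holding every row in memory.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large to load at once.
csv_text = "id,salary\n" + "\n".join(f"{i},{1000 + i}" for i in range(10))

total, n_rows = 0, 0
# chunksize makes read_csv return an iterator that yields DataFrames of up to 4 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['salary'].sum()
    n_rows += len(chunk)

mean_salary = total / n_rows
```

The same loop shape works for filtering or writing each chunk to Parquet; only the per-chunk body changes.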

3.3 Data Cleaning & Missing Values


# --- Detect missing values ---
df.isnull().sum() # count NaN per column
df.isnull().mean() * 100 # % missing per column
df[df.isnull().any(axis=1)] # rows with any NaN
# --- Drop ---
df.dropna() # drop rows with any NaN
df.dropna(subset=['age','salary']) # only check these cols
df.dropna(thresh=3) # keep rows with >= 3 non-NaN
df.dropna(axis=1) # drop columns that contain any NaN
# --- Fill ---
df.fillna(0) # fill all NaN with 0
df['age'].fillna(df['age'].mean()) # mean imputation
df['dept'].fillna('Unknown') # categorical fill
df.ffill() # forward fill (fillna(method='ffill') is deprecated in pandas 2.x)
df.bfill() # backward fill
# None of the calls above modify df; assign the result to keep it
# --- Advanced imputation (ML-based) ---
from sklearn.impute import SimpleImputer, KNNImputer
imp = KNNImputer(n_neighbors=5)
num = df.select_dtypes('number') # KNNImputer accepts numeric columns only
df_imputed = pd.DataFrame(imp.fit_transform(num), columns=num.columns, index=num.index)

# ■■ Duplicates ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

df.duplicated().sum() # count duplicates

df.drop_duplicates(inplace=True) # remove

df.drop_duplicates(subset=['age','dept']) # based on subset

# ■■ Data type issues ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


df['salary'] = pd.to_numeric(df['salary'], errors='coerce') # str→num

df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

df['dept'] = df['dept'].astype('category') # save memory

df['age'] = df['age'].astype(np.int32)

# ■■ Outlier detection ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

Q1 = df['salary'].quantile(0.25)

Q3 = df['salary'].quantile(0.75)

IQR = Q3 - Q1

lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR

outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]

df_clean = df[df['salary'].between(lower, upper)]

# Z-score method

from scipy import stats

z_scores = np.abs(stats.zscore(df['salary'])) # assumes the column has no NaN
df_clean = df[z_scores < 3]
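
A small worked case shows how the two outlier rules can disagree. The salary values below are hypothetical. With only eight points, a single extreme value inflates the standard deviation so much that its z-score stays under 3, while the IQR fences still flag it.

```python
import pandas as pd

salary = pd.Series([50, 52, 55, 58, 60, 61, 63, 300])  # hypothetical values, in thousands

# IQR rule: fences at 1.5 * IQR beyond the quartiles.
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = salary[(salary < lower) | (salary > upper)]

# Z-score rule with the population standard deviation (ddof=0, scipy's default).
z = (salary - salary.mean()) / salary.std(ddof=0)
z_outliers = salary[z.abs() >= 3]
```

On small samples the IQR rule is usually the safer default, since the quartiles are barely moved by one extreme point.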

3.4 Filtering, Selection & GroupBy


# ■■ Selection ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

df['age'] # single column → Series

df[['age','salary']] # multi-column → DataFrame

# loc: label-based (inclusive on both ends)
df.loc[0, 'age'] # row 0, col 'age'
df.loc[0:2, ['age','dept']] # rows 0-2 (inclusive), 2 cols
df.loc[df['age'] > 28] # boolean filter with loc
# iloc: position-based (exclusive end, like Python slicing)
df.iloc[0, 1] # row 0, col index 1
df.iloc[1:3, :] # rows 1,2 all cols
df.iloc[:, -1] # last column


# ■■ Boolean filtering ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

df[df['age'] > 28]

df[(df['age'] > 25) & (df['salary'] > 60000)] # AND

df[(df['dept'] == 'Eng') | (df['dept'] == 'HR')] # OR

df[~(df['dept'] == 'HR')] # NOT

df[df['dept'].isin(['Eng','Mkt'])]

df[df['dept'].[Link]('E')] # string method

# query(): SQL-like string syntax (readable)
df.query('age > 25 and salary > 60000')
df.query('dept in ["Eng", "HR"]')

# --- GroupBy ---
# Split → Apply → Combine
grp = df.groupby('dept')
grp['salary'].mean() # mean salary per dept
grp['salary'].agg(['mean','std','min','max'])
grp.size() # count per group
# Multiple aggregations (common in feature engineering)
result = df.groupby('dept').agg(
    avg_salary = ('salary', 'mean'),
    max_score = ('score', 'max'),
    count = ('age', 'count')
).reset_index()
# transform: returns same shape as input (useful for features)
df['salary_rank_in_dept'] = df.groupby('dept')['salary'].rank()
df['dept_mean_salary'] = df.groupby('dept')['salary'].transform('mean')
# filter groups
big_depts = df.groupby('dept').filter(lambda g: len(g) >= 2)
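
The difference between aggregating and transforming is easiest to see side by side. This sketch reuses the dept and salary values from 3.1; the column name `salary_vs_dept_mean` is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['Eng', 'Mkt', 'Eng', 'HR'],
    'salary': [50000, 80000, 120000, 60000],
})

per_dept = df.groupby('dept')['salary'].agg('mean')        # one row per group
per_row = df.groupby('dept')['salary'].transform('mean')   # one value per original row

# Because transform keeps the original index, it can be used directly in arithmetic.
df['salary_vs_dept_mean'] = df['salary'] - per_row
```

`agg` collapses to three rows (one per department), while `transform` returns four values aligned with the input, which is why it suits feature columns.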


3.5 Merge, Join & Concat
left = pd.DataFrame({'id':[1,2,3], 'name':['A','B','C']})
right = pd.DataFrame({'id':[2,3,4], 'score':[90,85,78]})
# --- merge (like SQL JOIN) ---
pd.merge(left, right, on='id', how='inner') # 2 rows
pd.merge(left, right, on='id', how='left') # 3 rows (keep all left)
pd.merge(left, right, on='id', how='right') # 3 rows (keep all right)
pd.merge(left, right, on='id', how='outer') # 4 rows (union)
# merge on different column names (here the right key is renamed to 'emp_id' for illustration)
pd.merge(left, right.rename(columns={'id': 'emp_id'}), left_on='id', right_on='emp_id')
# --- join (index-based) ---
left.set_index('id').join(right.set_index('id'), how='left')
# --- concat (stack DataFrames) ---
pd.concat([df1, df2], axis=0, ignore_index=True) # vertical stack
pd.concat([df1, df2], axis=1) # horizontal stack
pd.concat([df1, df2], keys=['train','val']) # multi-index

Operation       Use Case                     SQL Equivalent    Key Param
inner merge     Records in both tables       INNER JOIN        how='inner'
left merge      Keep all left, match right   LEFT JOIN         how='left'
outer merge     Keep all records             FULL OUTER JOIN   how='outer'
concat axis=0   Stack new rows               UNION ALL         ignore_index=True
concat axis=1   Add columns side-by-side     Lateral join      axis=1
join            Index-based merge            JOIN on index     on parameter
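
When a merge returns fewer or more rows than expected, the `indicator` parameter of `pd.merge` helps to see where each row came from. The sketch below reuses the `left` and `right` frames from this section; the variable names `merged` and `counts` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 85, 78]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(left, right, on='id', how='outer', indicator=True)
counts = merged['_merge'].value_counts()

unmatched_left = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
```

Checking these counts before switching to an inner join makes silent row loss visible.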

3.6 Time Series & Feature Engineering


# ■■ Time series ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()

# Resampling (change frequency)
df.resample('D').mean() # daily mean
df.resample('W').sum() # weekly sum
df.resample('ME').last() # month-end last value ('ME' needs pandas >= 2.2; older versions use 'M')

# Rolling window (moving average, volatility)

df['ma_7'] = df['close'].rolling(7).mean()

df['std_30'] = df['close'].rolling(30).std()

df['ema_12'] = df['close'].ewm(span=12).mean()

# Datetime components
df['year'] = df.index.year
df['month'] = df.index.month
df['weekday'] = df.index.dayofweek # Monday=0 ... Sunday=6
df['is_weekend'] = df.index.dayofweek >= 5

# ■■ Feature Engineering ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

# Binning continuous → categorical
df['age_group'] = pd.cut(df['age'], bins=[0,25,35,100],
                         labels=['young','mid','senior'])
# qcut: equal-frequency bins
df['salary_quartile'] = pd.qcut(df['salary'], q=4,
                                labels=['Q1','Q2','Q3','Q4'])
# Label encoding
df['dept_code'] = df['dept'].astype('category').cat.codes

# One-hot encoding

df_ohe = pd.get_dummies(df, columns=['dept'], drop_first=True)

# Lag features (important for time series ML)


df['lag_1'] = df['value'].shift(1)

df['lag_7'] = df['value'].shift(7)

df['diff_1'] = df['value'].diff(1)

# Target encoding (mean of target per category); compute the means on training rows only to avoid leakage
means = df.groupby('dept')['salary'].transform('mean')
df['dept_salary_mean'] = means
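
Target encoding leaks label information if the category means are computed on rows that are later used for evaluation. A minimal sketch of a train-only version follows; the `train` and `test` frames, the `target` column, and the 'Ops' category are hypothetical, and real pipelines often add smoothing or out-of-fold encoding on top of this.

```python
import pandas as pd

train = pd.DataFrame({'dept': ['Eng', 'Eng', 'Mkt', 'HR'],
                      'target': [1, 0, 1, 0]})
test = pd.DataFrame({'dept': ['Eng', 'Mkt', 'Ops']})   # 'Ops' never appears in train

# Learn the encoding from training rows only.
dept_means = train.groupby('dept')['target'].mean()
global_mean = train['target'].mean()

train['dept_te'] = train['dept'].map(dept_means)
# Unseen categories map to NaN, so fall back to the global training mean.
test['dept_te'] = test['dept'].map(dept_means).fillna(global_mean)
```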

3.7 Performance Tips & Role in ML Pipelines


# ■■ Memory reduction ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

def reduce_memory(df):
    for col in df.columns:
        dtype = df[col].dtype
        if dtype == 'float64':
            df[col] = df[col].astype('float32') # halves memory, keeps about 7 significant digits
        elif dtype == 'int64':
            # downcast picks the smallest integer type that holds both the min and the max
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif dtype == 'object' and df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    return df

# ■■ Fast operations ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

# Use vectorized string methods (not apply+lambda)

df['name'].[Link]() # fast

df['name'].[Link]('AI') # fast

# Avoid iterrows (very slow)
# BAD: for idx, row in df.iterrows(): ...
# SLOW: df.apply(func, axis=1) (better than iterrows, but still a Python-level loop over rows)
# GOOD: df['new_col'] = df['col1'] + df['col2'] (vectorized)
# BEST for heavy numeric work: NumPy operations on df[col].to_numpy()
# eval() and query() can use numexpr when it is installed, which helps on large frames
df = df.eval('new_col = salary * score + age * 100') # returns a new DataFrame

# ■■ pandas in ML pipeline ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Typical ML workflow
# 1. Load with pd.read_csv / read_parquet
# 2. Clean: dropna, fillna, fix dtypes
# 3. Feature engineer: new columns, encoding
# 4. Split: X = df[features], y = df[target]
# 5. Convert: df.values or X.to_numpy() → pass to sklearn
X = df[['age','salary','score']].to_numpy()
y = df['dept_code'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train) # the scaler is fitted on the training split only
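
The memory-reduction idea from this section can be checked with `memory_usage(deep=True)`. The frame below is synthetic (column names and sizes are assumptions chosen for illustration), and the exact byte counts depend on the pandas version, so only the direction of the change is asserted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': np.arange(1000, dtype='int64') % 90,   # values 0..89 fit in int8
    'score': np.linspace(0, 1, 1000),             # float64 by default
    'dept': ['Eng', 'Mkt', 'HR', 'Ops'] * 250,    # low-cardinality strings
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
small['age'] = pd.to_numeric(small['age'], downcast='integer')
small['score'] = small['score'].astype('float32')
small['dept'] = small['dept'].astype('category')
after = small.memory_usage(deep=True).sum()
```

Downcasting floats trades precision for space, so it is worth confirming that float32 is acceptable for the column before applying it blindly.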
Section 4 — Matplotlib: Basic to Advanced

4.1 Core Plot Types


import matplotlib.pyplot as plt
import numpy as np
# --- Object-oriented API (preferred for complex plots) ---
fig, ax = plt.subplots(figsize=(8, 4))
# --- Line plot ---
x = np.linspace(0, 10, 200)
ax.plot(x, np.sin(x), color='royalblue', lw=2,
        linestyle='--', label='sin(x)')
ax.plot(x, np.cos(x), color='tomato', lw=2, label='cos(x)')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('Trigonometric Functions')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
fig.savefig('[Link]', dpi=150, bbox_inches='tight')
# --- Scatter plot (X: 2D feature array, y: class labels) ---
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1], c=y, cmap='viridis',
           s=30, alpha=0.7, edgecolors='none')
fig.colorbar(ax.collections[0], ax=ax, label='Class')
# --- Bar chart ---
categories = ['Accuracy','Precision','Recall','F1']
values = [0.92, 0.89, 0.94, 0.91]
colors = ['#2196F3','#4CAF50','#FF9800','#E91E63']
fig, ax = plt.subplots()
bars = ax.bar(categories, values, color=colors, width=0.5, edgecolor='white')
ax.bar_label(bars, fmt='%.2f', padding=3) # value on top of bar
ax.set_ylim(0, 1.1)
# --- Horizontal bar ---
ax.barh(categories, values, color=colors)
# --- Histogram ---
fig, ax = plt.subplots()
ax.hist(data, bins=30, color='steelblue', edgecolor='white',
        alpha=0.8, density=True) # density=True for PDF
ax.set_xlabel('Value'); ax.set_ylabel('Density')

4.2 Customization & Subplots


# --- Subplots: the Swiss Army knife ---
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Model Analysis Dashboard', fontsize=16, fontweight='bold')
# Access subplots
ax = axes[0, 0] # row 0, col 0
ax = axes[1, 2] # row 1, col 2
# Flatten for easy iteration
for ax in axes.flatten():
    ax.set_visible(False) # hides a panel; apply only to the unused ones
# Shared axes
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
# gridspec: unequal subplot sizes
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12, 6)) # fresh figure for the custom layout
gs = gridspec.GridSpec(2, 3, hspace=0.4, wspace=0.3)
ax_big = fig.add_subplot(gs[:, 0]) # spans both rows
ax_tr = fig.add_subplot(gs[0, 1])
ax_br = fig.add_subplot(gs[1, 1])
# --- Styling ---
plt.style.use('seaborn-v0_8-whitegrid') # built-in theme
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 12, 'figure.dpi': 120})
# --- Annotations ---
ax.annotate('Peak',
            xy=(peak_x, peak_y),
            xytext=(peak_x+0.5, peak_y+0.1),
            arrowprops=dict(arrowstyle='->', color='red'),
            fontsize=10, color='red')
ax.axhline(y=threshold, color='red', linestyle='--', label='Threshold')
ax.axvline(x=split_point, color='green', linestyle=':')
ax.fill_between(x, y_lower, y_upper, alpha=0.2, label='CI')
# --- Twin axes (two y scales) ---
ax2 = ax.twinx()
ax.plot(x, loss, 'b-', label='Loss')
ax2.plot(x, accuracy, 'r-', label='Accuracy')
ax.set_ylabel('Loss', color='blue')
ax2.set_ylabel('Accuracy', color='red')

4.3 Advanced ML Visualizations


import matplotlib.pyplot as plt
import numpy as np
# --- Confusion Matrix ---
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_true, y_pred)
fig, ax = plt.subplots(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel('Predicted'); ax.set_ylabel('True')
ax.set_title('Confusion Matrix')
# --- ROC Curve ---
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
fig, ax = plt.subplots()
ax.plot(fpr, tpr, lw=2, label=f'AUC = {roc_auc:.3f}')
ax.plot([0,1],[0,1],'k--', label='Random') # chance line
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
# --- Loss / Accuracy curves ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(train_loss, label='Train Loss')
ax1.plot(val_loss, label='Val Loss', linestyle='--')
ax2.plot(train_acc, label='Train Acc')
ax2.plot(val_acc, label='Val Acc', linestyle='--')
for ax in (ax1, ax2):
    ax.legend(); ax.grid(True, alpha=0.3)
# --- Feature importance ---
importances = model.feature_importances_
sorted_idx = np.argsort(importances)[::-1][:15] # indices of the 15 largest
fig, ax = plt.subplots()
ax.barh([feature_names[i] for i in sorted_idx[::-1]],
        importances[sorted_idx[::-1]]) # reversed so the largest bar ends up on top
ax.set_title('Top 15 Feature Importances')
# --- Correlation heatmap ---
corr = df.corr(numeric_only=True) # skip non-numeric columns
mask = np.triu(np.ones_like(corr, dtype=bool)) # hide the redundant upper triangle
fig, ax = plt.subplots()
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', center=0, ax=ax)
# --- Learning curve ---
from sklearn.model_selection import learning_curve
sizes, train_sc, val_sc = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1,1,10))
fig, ax = plt.subplots()
ax.fill_between(sizes, train_sc.mean(1)-train_sc.std(1),
                train_sc.mean(1)+train_sc.std(1), alpha=0.1)
ax.plot(sizes, train_sc.mean(1), 'o-', label='Train')
ax.plot(sizes, val_sc.mean(1), 'o-', label='Val')

4.4 Saving, Exporting & Best Practices


# --- Saving figures ---
fig.savefig('[Link]', dpi=300, bbox_inches='tight')
fig.savefig('[Link]', bbox_inches='tight') # vector format
fig.savefig('[Link]', bbox_inches='tight') # editable in Inkscape
fig.savefig('[Link]', bbox_inches='tight') # LaTeX / journals
# --- In-memory save (for APIs / web) ---
from io import BytesIO
buf = BytesIO()
fig.savefig(buf, format='png', dpi=150, bbox_inches='tight')
buf.seek(0)
img_bytes = buf.read() # send over HTTP
# --- Closing figures to free memory ---
plt.close(fig) # close specific figure
plt.close('all') # close all figures
# --- Best practices ---
# 1. Always use fig, ax = plt.subplots() over plt.plot()
# 2. Set DPI >= 150 for reports, >= 300 for publications
# 3. Use tight_layout() or constrained_layout=True
# 4. Label everything: title, xlabel, ylabel, legend
# 5. Use colorblind-friendly palettes (viridis, plasma, cividis)
# 6. Avoid 3D plots unless truly necessary (they distort perception)
# 7. Close figures after saving in loops to prevent memory leak
# 8. For interactive plots: use plotly or bokeh instead
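
The in-memory save and the "close figures" advice combine naturally into one helper. The sketch below assumes a headless environment (hence the Agg backend) and a hypothetical function name `render_png`; it is one possible arrangement, not the only one.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render without a display
import matplotlib.pyplot as plt
from io import BytesIO

def render_png(values, dpi=100):
    """Plot values as a line and return the PNG bytes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(values)
    ax.set_xlabel('step'); ax.set_ylabel('value')
    buf = BytesIO()
    try:
        fig.savefig(buf, format='png', dpi=dpi, bbox_inches='tight')
    finally:
        plt.close(fig)  # release the figure even if saving fails
    return buf.getvalue()

img_bytes = render_png([1, 3, 2, 5])
```

Closing inside `finally` means a loop that renders many figures does not accumulate open figures when one of them fails to save.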


Section 5 — End-to-End Mini Project
We will walk through a complete ML workflow: loading the Titanic dataset, cleaning it, engineering features,
building a classifier, and visualizing results. Every step uses Python, NumPy, Pandas, and Matplotlib
together.

Step 1 — Data Loading & Inspection


import pandas as pd, numpy as np

import matplotlib.pyplot as plt

from pathlib import Path

# Load (use Titanic or any CSV)

df = pd.read_csv('[Link]')

print(df.shape) # (891, 12)

print([Link])

print(df.isnull().sum())

# Age: 177 missing, Cabin: 687 missing, Embarked: 2 missing

print([Link]())

Step 2 — Data Cleaning


# Drop high-missingness column
df.drop('Cabin', axis=1, inplace=True)
# Fill missing Age with median grouped by class
df['Age'] = df.groupby('Pclass')['Age'].transform(
    lambda x: x.fillna(x.median()))
# Fill missing Embarked with mode (assign back; inplace on a selected column may not update df)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop irrelevant columns
df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)
print(df.isnull().sum()) # all zeros now


Step 3 — Feature Engineering
# Family size

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Title from Name (if we had kept it)

# df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Age bins

df['AgeBin'] = pd.cut(df['Age'], bins=[0,12,18,35,60,100],
                      labels=[0,1,2,3,4]).astype(int)

# Fare log transform (reduce skew)

df['LogFare'] = np.log1p(df['Fare'])

# Encode categorical

df['Sex'] = (df['Sex'] == 'male').astype(int)

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

print([Link]())

print(df.shape) # (891, 13): 8 kept columns + 4 engineered, Embarked replaced by 2 dummies

Step 4 — Train a Model


from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
X = df.drop('Survived', axis=1).to_numpy(dtype=np.float32)
y = df['Survived'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train) # fit on the training split only
X_test_sc = scaler.transform(X_test) # reuse the training statistics
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_sc, y_train)
y_pred = clf.predict(X_test_sc)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Cross-validation
cv_scores = cross_val_score(clf, X_train_sc, y_train, cv=5)
print(f'CV Accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')

Step 5 — Visualisation & Insights


feature_names = df.drop('Survived',axis=1).columns
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Titanic ML Analysis', fontsize=15, fontweight='bold')
# --- Panel 1: Feature importance ---
axes[0].barh([feature_names[i] for i in sorted_idx],
             importances[sorted_idx], color='steelblue')
axes[0].set_title('Feature Importances')
# --- Panel 2: Survival by class & sex ---
pivot = df.groupby(['Pclass','Sex'])['Survived'].mean().unstack()
pivot.plot(kind='bar', ax=axes[1], color=['#EF5350','#42A5F5'])
axes[1].set_title('Survival Rate by Class & Sex')
axes[1].set_ylabel('Survival Rate')
axes[1].legend(['Female (0)','Male (1)'])
# --- Panel 3: Age distribution by survival ---
for survived, grp in df.groupby('Survived')['Age']:
    lbl = 'Survived' if survived else 'Did not survive'
    axes[2].hist(grp, bins=25, alpha=0.6, label=lbl, density=True)
axes[2].set_title('Age Distribution by Survival')
axes[2].set_xlabel('Age'); axes[2].legend()
plt.tight_layout()
fig.savefig('titanic_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
Section 6 — Reference Tables & Interview Guide

6.1 Time & Space Complexity Cheat Sheet


Structure / Op Access Search Insert Delete Notes

list (end) O(1) O(n) O(1)* O(n) *amortized

list (front) O(1) O(n) O(n) O(n) Shifts all elements

dict O(1)* O(1)* O(1)* O(1)* *average; worst O(n)

set — O(1)* O(1)* O(1)* Hash table

tuple O(1) O(n) — — Immutable

deque O(n) O(n) O(1) O(1) Both ends O(1)

heapq O(1) min O(n) O(log n) O(log n) Priority queue

np.ndarray O(1) O(n) O(n) O(n) Contiguous memory

pd.DataFrame O(1) col O(n) O(n) O(n) Column-oriented

6.2 Common Pitfalls


■ Mutable default argument
Problem: def f(lst=[]): lst.append(1) (the default list is created once and shared across calls)
Fix: Use def f(lst=None): if lst is None: lst = []

■ == None vs is None
Problem: arr == None compares a NumPy array element-wise; using the result in an if statement raises 'ambiguous truth value'
Fix: Always use 'is None' or 'is not None'

■ Modifying list while iterating
Problem: for x in lst: lst.remove(x) (skips elements)
Fix: Iterate over a copy: for x in lst.copy()

■ Integer caching surprise
Problem: a=1000; b=1000; a is b can be False (CPython only caches the integers -5 to 256)
Fix: Use == for value comparison, 'is' only for None/True/False

■ Broadcasting shape mismatch
Problem: np.array([1,2,3]) + np.array([[1],[2]]) gives shape (2,3), not an error!
Fix: Explicitly reshape arrays; always check shapes first

■ pandas chained indexing
Problem: df['col'][mask] = val (may not modify the original)
Fix: Use df.loc[mask, 'col'] = val

■ pandas SettingWithCopy
Problem: df2 = df[df.age>25]; df2['x'] = 1 (SettingWithCopyWarning)
Fix: Call .copy() on the filtered frame, or use loc-based assignment

■ float precision
Problem: 0.1 + 0.2 == 0.3 → False in Python
Fix: Use math.isclose(a, b, rel_tol=1e-9) for float comparison

■ Generator exhaustion
Problem: gen = (x for x in range(5)); sum(gen); sum(gen) → the second call returns 0
Fix: Generators can only be consumed once; recreate or use list()
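
Three of these pitfalls can be reproduced with the standard library alone. The function names `append_bad` and `append_good` are illustrative, not from the guide.

```python
import math

def append_bad(item, bucket=[]):      # the default list is created once, at definition time
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):   # a fresh list is created on every call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

bad_first = list(append_bad(1))   # snapshot before the second call mutates the shared list
bad_second = append_bad(2)        # the 1 from the previous call is still there
good_first = append_good(1)
good_second = append_good(2)

gen = (x for x in range(5))
first_sum = sum(gen)              # consumes the generator
second_sum = sum(gen)             # nothing left to iterate

floats_equal = (0.1 + 0.2 == 0.3)
floats_close = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)
```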

6.3 Top Interview Questions & Answers


■ Q: What is GIL and how does it affect ML code?
The Global Interpreter Lock (GIL) prevents multiple native threads from executing Python bytecode
simultaneously. For CPU-bound ML code, use multiprocessing (bypasses GIL) or NumPy (releases GIL in
C extensions). I/O-bound code uses asyncio/threading effectively.

■ Q: Difference between deep copy and shallow copy?

Shallow copy (copy.copy) copies the object but not nested objects, so nested mutable objects are shared.
Deep copy (copy.deepcopy) recursively copies everything. For ML: df.copy() copies the DataFrame's data by
default (deep=True); a NumPy slice is a VIEW that shares memory with the original; .copy() on an array
makes an independent copy.
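
A short sketch makes the sharing behaviour concrete. All variable names are illustrative.

```python
import copy
import numpy as np

nested = [[1, 2], [3, 4]]
shallow = copy.copy(nested)       # new outer list, same inner lists
deep = copy.deepcopy(nested)      # inner lists are copied too
nested[0].append(99)              # mutate an inner list

arr = np.arange(5)
view = arr[1:4]                   # slicing returns a view onto the same buffer
independent = arr[1:4].copy()     # separate buffer
arr[2] = -1                       # visible through the view, not through the copy
```

`np.shares_memory(arr, view)` is a handy check when it is unclear whether an operation returned a view.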

■ Q: When would you use a generator over a list?

When data is too large to fit in memory (e.g., streaming millions of rows), when you only need one pass, or
when you want to compose lazy pipelines. PyTorch's DataLoader and tf.data follow the same lazy,
batch-at-a-time iteration pattern.

■ Q: What is broadcasting and why does it matter in ML?


Broadcasting is NumPy's rule for performing element-wise operations on arrays with different but
compatible shapes without copying data. It makes batch operations (normalize features, add bias vectors)
efficient and concise — the backbone of forward/backward pass implementations.
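
The two uses named in the answer (feature normalization and adding a bias vector) look like this in practice. The array values and the weight matrix `W` are made up for illustration; the last line reproduces the silent shape expansion that makes broadcasting a common source of bugs.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])          # shape (3, 2): 3 samples, 2 features

mu = X.mean(axis=0)                  # shape (2,)
sigma = X.std(axis=0)                # shape (2,)
X_norm = (X - mu) / sigma            # (3, 2) with (2,) broadcasts across rows

W = np.ones((2, 4))                  # hypothetical weights: 2 inputs, 4 units
b = np.arange(4.0)                   # shape (4,): one bias per unit
out = X_norm @ W + b                 # (3, 4) + (4,) adds the bias to every row

pitfall = np.array([1, 2, 3]) + np.array([[1], [2]])   # (3,) + (2, 1) gives (2, 3)
```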
■ Q: Explain pandas .loc vs .iloc vs boolean indexing
.loc uses labels (inclusive end), .iloc uses integer positions (exclusive end, like Python slicing), boolean
indexing filters rows by a True/False mask. For ML: use .loc for feature selection by name, .iloc for
positional splits, boolean for data cleaning filters.
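
The inclusive versus exclusive end is easiest to remember with an index that does not start at zero. The frame below is a hypothetical example built for that purpose.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 28]}, index=[10, 11, 12, 13])

by_label = df.loc[10:12]       # labels 10, 11, 12: the end label is included
by_position = df.iloc[0:2]     # positions 0 and 1: the end position is excluded
by_mask = df[df['age'] > 28]   # rows where the condition is True
```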

■ Q: What are decorators and where are they used in ML frameworks?

Decorators are functions that wrap other functions to add behaviour (timing, logging, caching, validation). In
ML: @tf.function compiles a Python function into a TensorFlow graph; @torch.no_grad() disables
gradient tracking during inference; @property exposes computed attributes like model.num_parameters.
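
A timing decorator is a common first example. The sketch below uses only the standard library; the names `timed`, `dot`, and the `last_seconds` attribute are illustrative choices rather than any framework's API.

```python
import functools
import time

def timed(func):
    """Record the wall-clock duration of each call on the wrapper (illustrative)."""
    @functools.wraps(func)            # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_seconds = time.perf_counter() - start
        return result
    wrapper.last_seconds = None
    return wrapper

@timed
def dot(a, b):
    """Plain-Python dot product."""
    return sum(x * y for x, y in zip(a, b))

value = dot([1, 2, 3], [4, 5, 6])
```

Without `functools.wraps`, `dot.__name__` would report `wrapper`, which makes logs and stack traces harder to read.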

■ Q: How does Python manage memory? What is reference counting?

Python uses reference counting as its primary mechanism: each object tracks how many references point
to it. When the count drops to zero, the object is deallocated. A cyclic garbage collector handles reference
cycles. When memory is tight, drop references to large NumPy arrays explicitly (del arr); gc.collect() is
only needed to reclaim objects caught in reference cycles.

■ Q: Difference between map(), filter() and list comprehension?

All three transform sequences. List comprehensions are generally considered more Pythonic and readable, and
are often slightly faster than map() or filter() with a lambda because they avoid a function call per
element. map() and filter() return lazy iterators. For ML use: vectorized NumPy operations (ufuncs) are
preferred over all three for numerical data.

■ Q: What is vectorization and why is it critical in ML?


Vectorization replaces explicit Python loops with C-level array operations (NumPy ufuncs, matrix multiply).
A Python loop over 1M elements may take 100ms; the NumPy equivalent takes ~1ms. In deep learning, the
entire forward pass is a sequence of matrix multiplications — essentially vectorized operations.
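
To see that the two styles compute the same thing, here is a minimal comparison. The array size and the scalar `w` are arbitrary, and no timing is claimed because it depends on the machine.

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)
w = 0.5

loop_result = [w * xi + 1.0 for xi in x.tolist()]   # one Python-level operation per element
vec_result = w * x + 1.0                            # a single C-level pass over the buffer
```

Wrapping each line in `timeit` is a quick way to measure the gap on a given machine.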

■ Q: How would you detect and handle data leakage in a pandas pipeline?
Data leakage occurs when test-set information is used during training (e.g., fitting a scaler on all data). Fix:
always fit preprocessors (StandardScaler, imputers, encoders) only on training data, then transform both.
Use sklearn Pipeline to enforce this. Check for temporal leakage in time series by using time-based splits.
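
The "fit on training data only" rule can be shown without any ML library. The sketch below standardizes a synthetic feature matrix (the sizes, seed, and 80/20 split are assumptions) using statistics taken from the training rows alone.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # hypothetical feature matrix

idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Statistics come from the training split only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and are then applied unchanged to both splits.
X_train_sc = (X_train - mu) / sigma
X_test_sc = (X_test - mu) / sigma
```

The scaled test set will generally not have exactly zero mean or unit variance, which is expected: it is being measured against the training distribution.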
Quick-Reference: Python Ecosystem Interconnections
Concept Python NumPy Pandas Matplotlib

Data container list, dict ndarray DataFrame/Series —

Iteration for, generators [Link], vectorize iterrows (avoid) —

Functional ops map,filter,reduce ufuncs,where apply,map,transform —

Math math module [Link], [Link] describe(), corr() —

I/O open(), pathlib np.save/load .npy read_csv/to_csv savefig

Missing data None np.nan NaN, isnull() —

Type system dynamic dtype per array dtype per column —

Memory model ref count + GC contiguous buffer column store —

Performance tip __slots__, deque vectorize, einsum eval(), query() close(fig)

ML integration Base language Tensor math backbone Data pipeline Result visualization

Python & AI/ML Comprehensive Study Guide • All sections complete • Good luck! ■
