1. Introduction to Machine Learning (ML)
1. Machine Learning is a subfield of artificial intelligence (AI) that enables systems to
learn from data without being explicitly programmed.
2. It focuses on developing algorithms that improve automatically through experience.
3. ML helps in discovering patterns and making decisions based on data.
4. It is widely used in applications like email filtering, speech recognition, image
processing, and recommendation systems.
5. It uses mathematical models that are trained on data to make predictions or
decisions.
6. Machine learning is data-driven, meaning the quality and quantity of data affect the
performance.
7. It often involves data preprocessing, model training, evaluation, and prediction
phases.
8. ML differs from traditional programming, where rules are defined manually.
9. Algorithms improve by minimizing errors between predicted and actual outcomes.
10. It supports automation of tasks that are hard to program manually.
11. ML is broadly categorized into supervised, unsupervised, and reinforcement
learning.
12. Real-life examples: Google search, Netflix recommendations, and fraud detection.
13. ML uses features (input variables) to predict labels or outputs.
14. It can handle both structured (tables) and unstructured (text, images) data.
15. Requires iterative experimentation to improve performance.
2. The Three Types of Machine Learning
a. Supervised Learning
1. Learns from labeled data (input-output pairs).
2. Main goal is to predict outputs for new inputs.
3. Tasks: Classification (predict categories), Regression (predict numbers).
4. Example: Email spam detection (spam or not).
5. Requires training data and test data for evaluation.
b. Unsupervised Learning
6. Learns from unlabeled data (only inputs).
7. Used to find hidden patterns or groupings.
8. Tasks: Clustering, association rules, dimensionality reduction.
9. Example: Customer segmentation in marketing.
c. Reinforcement Learning
10. Agent learns by interacting with an environment.
11. Gets rewards or penalties based on actions.
12. Used in robotics, games (like chess), and self-driving cars.
13. Focuses on learning optimal strategies (policies).
14. Involves exploration (trying new actions) and exploitation (using known actions).
15. The choice of learning type depends on the problem type and data availability.
3. Making Predictions with Supervised Learning
1. Uses labeled datasets where inputs and outputs are known.
2. The model learns patterns from training data to predict outputs.
3. Requires a clear relationship between input and output.
4. Model is trained to minimize prediction errors.
5. Common algorithms: Linear Regression, Decision Trees, SVM, Neural Networks.
6. Steps: Data preparation → model training → evaluation → prediction.
7. Can be used for forecasting, e.g., sales, stock prices, weather.
8. Performance is measured with metrics like accuracy, precision, recall, or RMSE.
9. Needs a good balance of bias and variance to avoid both underfitting and overfitting.
10. Works best when sufficient labeled data is available.
11. Errors are often analyzed using confusion matrix (for classification).
12. Cross-validation is used to ensure the model generalizes well.
13. Predictions help in automating decisions and planning.
14. Examples: Loan approval, cancer detection, exam score prediction.
15. Can be binary, multi-class, or multi-label prediction.
4. Classification for Predicting Class Labels
1. Classification is a supervised learning task.
2. Goal is to predict a discrete label (category/class).
3. Example: Classifying animals into dog, cat, or bird.
4. Input features (like weight, size) are mapped to output class.
5. Common algorithms: Logistic Regression, Decision Tree, KNN, SVM, Naive
Bayes.
6. Data must be preprocessed, e.g., label encoding, normalization.
7. Output can be binary (yes/no) or multi-class (multiple categories).
8. Uses metrics like accuracy, precision, recall, F1-score.
9. Handles problems like spam filtering, disease detection, etc.
10. Training uses input data paired with the correct class labels.
11. Prediction is based on decision boundaries learned from data.
12. Confusion matrix helps visualize performance.
13. Overfitting can occur if model learns noise instead of pattern.
14. Cross-validation improves reliability of predictions.
15. Many classifiers output a probability distribution over the classes; the predicted class is the most probable one.
5. Regression for Predicting Continuous Outcomes
1. Regression is another form of supervised learning.
2. It predicts continuous numerical values.
3. Example: Predicting house prices based on area, location.
4. Common algorithms: Linear Regression, Polynomial Regression, SVR, Decision
Trees.
5. Model fits a function that best maps inputs to real values.
6. Evaluated using metrics like Mean Absolute Error, MSE, RMSE, and R² score.
7. Data preprocessing includes scaling and removing outliers.
8. Regression line is fitted to minimize prediction error (cost).
9. Simple Linear Regression uses one input variable.
10. Multiple Linear Regression uses two or more features.
11. Overfitting/underfitting problems must be checked.
12. Visualization often done using scatter plots with prediction line.
13. Used in domains like finance, economics, marketing.
14. Handles trend analysis and forecasting.
15. Predicts values that are not categorized but measured.
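The fitting and evaluation steps in this section can be sketched for simple linear regression. This is a minimal sketch in plain Python using the closed-form least-squares solution; the function names (fit_simple_linear, rmse) and the toy data are made up for illustration and are not from these notes.

```python
import math

def fit_simple_linear(xs, ys):
    """Return (slope, intercept) minimizing squared error for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy data (made up): the targets follow y = 2x + 1 exactly
areas = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear(areas, prices)
predictions = [slope * x + intercept for x in areas]
error = rmse(prices, predictions)
```

Because the toy targets lie exactly on a line, the fitted line recovers slope 2 and intercept 1 and the RMSE is zero; real data would leave a non-zero error.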
6. Solving Interactive Problems with Reinforcement Learning
1. Reinforcement Learning (RL) is about learning by doing and getting feedback.
2. An agent interacts with an environment and learns through trial and error.
3. It uses rewards (positive feedback) and penalties (negative feedback) to improve.
4. The goal is to learn a policy—a strategy that tells the agent what action to take.
5. RL problems are often modeled as a Markov Decision Process (MDP).
6. Key components: Agent, Environment, State, Action, Reward.
7. The agent selects actions to maximize cumulative future reward.
8. Example: A robot learning to walk or a game bot learning to play chess.
9. RL differs from supervised learning in that the agent learns from reward signals, which are often delayed, rather than from labeled correct outputs.
10. Two main types: Model-Free (Q-learning, SARSA) and Model-Based RL.
11. Uses exploration vs exploitation: try new actions vs use known best ones.
12. Q-learning is a popular algorithm that uses a Q-table for learning values.
13. Deep RL combines RL with neural networks (e.g., Deep Q-Networks).
14. Real-world applications: robotics, game AI, recommendation engines.
15. Challenges: large state spaces, delayed rewards, and safe exploration.
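The Q-table idea from item 12 can be illustrated with a single tabular Q-learning update. This is only a sketch: the two-state environment, the action names, and the values of the learning rate alpha and discount factor gamma are assumptions chosen for the example.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to q_table in place and return the new value."""
    # Value of the best action available in the next state
    best_next = max(q_table[next_state].values())
    # Target = immediate reward + discounted future value
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

# Hypothetical two-state environment with actions "left" and "right"
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}

# The agent took "right" in s0, received reward 0, and landed in s1
new_value = q_update(q_table, "s0", "right", reward=0.0, next_state="s1")
```

Even though the immediate reward is zero, the value of ("s0", "right") rises because s1 already has a valuable action; this is how delayed rewards propagate backwards through the table.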
7. Discovering Hidden Structures with Unsupervised Learning
1. Unsupervised learning works with unlabeled data.
2. The goal is to find patterns, groupings, or structure in the data.
3. Common tasks include clustering, association rule mining, and dimensionality
reduction.
4. It is exploratory in nature – no specific output labels are provided.
5. Algorithms try to find similarities and differences between data points.
6. Often used for data analysis, pattern recognition, and feature extraction.
7. Real-life example: grouping customers by purchasing behavior.
8. Helps in understanding data distribution and anomaly detection.
9. Common algorithms: K-Means, Hierarchical Clustering, DBSCAN, PCA.
10. Requires preprocessing like normalization or scaling.
11. Can reveal natural clusters or structures not known before.
12. Used in bioinformatics, market research, and image compression.
13. Performance is often evaluated using silhouette scores, elbow method, or
visualization.
14. It can be a first step before applying supervised methods.
15. Helps in feature engineering by identifying redundant or irrelevant features.
8. Finding Subgroups with Clustering
1. Clustering is a core task in unsupervised learning.
2. It involves grouping similar data points into clusters.
3. Each cluster contains data points that are more similar to each other than to those in
other clusters.
4. Most popular clustering algorithm: K-Means.
5. Other methods include Agglomerative (Hierarchical) and DBSCAN.
6. K-Means uses a centroid-based approach to form K clusters.
7. Hierarchical clustering builds a tree of clusters (dendrogram).
8. DBSCAN groups based on density of data points.
9. Used in customer segmentation, document classification, genomics, etc.
10. Requires setting parameters like number of clusters (K) or density threshold.
11. Visualization is typically done with scatter plots and PCA for dimensionality
reduction.
12. The elbow method helps decide optimal number of clusters.
13. Each algorithm has trade-offs: K-Means is fast but assumes spherical clusters.
14. Noise and outliers can affect results significantly.
15. Clustering can be used as a preprocessing step for other ML tasks.
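The centroid-based procedure behind K-Means can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the initial centroids are passed in by hand (real implementations usually pick them randomly or with a smarter seeding scheme), and the function names and toy points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, n_iterations=10):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iterations):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two visually obvious groups (toy data)
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the first three points end up in cluster 0 and the last three in cluster 1, with each centroid at the mean of its group.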
9. Dimensionality Reduction for Data Compression
1. Dimensionality reduction reduces the number of features in a dataset.
2. Helps to simplify models, remove noise, and speed up training.
3. Common techniques: PCA (Principal Component Analysis), t-SNE, LDA (Linear
Discriminant Analysis).
4. PCA works by finding new axes (principal components) that capture most
variance.
5. Reduces overfitting by removing redundant or correlated features.
6. Important for visualization of high-dimensional data.
7. Allows compression of data while retaining key information.
8. t-SNE is useful for visualizing data in 2D or 3D space.
9. Helps in feature selection for better modeling.
10. Makes it easier to handle data with hundreds or thousands of features.
11. Improves model interpretability and computational efficiency.
12. Often applied before clustering or classification.
13. Reduces curse of dimensionality, which affects distance-based algorithms.
14. Preserves important structure in the data.
15. Can also be used to remove noise or irrelevant features.
10. Basic Terminology and Notations in Machine Learning
1. Model: A mathematical representation that maps inputs to outputs.
2. Feature: Individual measurable input property (also called attribute).
3. Label: The output value in supervised learning.
4. Training set: Data used to train the model.
5. Test set: Data used to evaluate the model’s performance.
6. Validation set: Used during training to fine-tune the model.
7. Overfitting: Model performs well on training data but poorly on new data.
8. Underfitting: Model is too simple and fails to capture patterns.
9. Hyperparameters: Settings defined before training (e.g., learning rate).
10. Loss function: Measures error between prediction and actual value.
11. Epoch: One full pass over the training data during training.
12. Batch: Subset of data used in one training iteration.
13. Gradient Descent: Optimization technique to minimize loss function.
14. Accuracy: Ratio of correct predictions to total predictions.
15. Confusion Matrix: Table showing correct and incorrect predictions for
classification.
11. A Roadmap for Building Machine Learning Systems
1. Problem Definition: Clearly state the goal — classification, regression, etc.
2. Data Collection: Gather high-quality, relevant data.
3. Data Preprocessing: Clean data by handling missing values, outliers, etc.
4. Feature Selection/Engineering: Choose and create informative features.
5. Splitting the Dataset: Divide into training, validation, and test sets.
6. Model Selection: Choose an algorithm (e.g., Decision Tree, SVM).
7. Training the Model: Use training data to adjust model parameters.
8. Hyperparameter Tuning: Optimize settings like learning rate, depth, etc.
9. Cross-validation: Validate performance across different splits.
10. Evaluation: Use metrics (accuracy, precision, RMSE) on validation data.
11. Model Interpretation: Understand what features impact predictions.
12. Testing: Evaluate model on unseen test data for final accuracy.
13. Deployment: Integrate the model into a real-world application.
14. Monitoring: Track performance over time and retrain if needed.
15. Maintenance: Update the model as new data becomes available.
12. Preprocessing – Getting Data into Shape
1. Data Cleaning: Handle missing values (mean, median, drop).
2. Normalization/Scaling: Standardize feature ranges using MinMax or Z-score.
3. Encoding Categorical Data: Convert text to numbers using One-Hot or Label
Encoding.
4. Feature Extraction: Derive new features from raw data (e.g., date to weekday).
5. Removing Outliers: Use IQR or Z-score to remove extreme values.
6. Dealing with Imbalanced Data: Use SMOTE or resampling.
7. Text Preprocessing: Tokenize, remove stopwords, stem or lemmatize.
8. Image Preprocessing: Resize, normalize pixel values.
9. Noise Reduction: Smooth or filter unwanted variations in data.
10. Dimensionality Reduction: Apply PCA or t-SNE.
11. Handling Time-Series Data: Resample, fill gaps, lag features.
12. Data Transformation: Apply log or Box-Cox for skewed data.
13. Binning: Convert continuous variables into categorical.
14. Train/Test Split: Common ratio is 80:20 or 70:30.
15. Pipelines: Automate preprocessing steps with tools like Scikit-learn's Pipeline.
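The two scaling methods named in item 2 can be written out by hand to show what they compute. This is a minimal sketch with made-up helper names and data; it assumes the values are not all identical (otherwise the divisions fail), and in practice a library scaler would normally be used instead.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Center values at 0 with unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [20.0, 30.0, 40.0, 50.0, 60.0]   # toy feature column
scaled_01 = min_max_scale(ages)
standardized = z_score_scale(ages)
```

Min-max scaling maps the smallest value to 0 and the largest to 1, while Z-score scaling produces values with mean 0 and standard deviation 1.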
13. Evaluating Models and Predicting Unseen Data Instances
1. Accuracy: Percentage of correct predictions.
2. Precision: Correct positive predictions out of all predicted positives.
3. Recall: Correct positive predictions out of all actual positives.
4. F1 Score: Harmonic mean of precision and recall.
5. Confusion Matrix: Shows TP, TN, FP, FN.
6. ROC Curve: Graph of TPR vs. FPR.
7. AUC Score: Area under the ROC curve.
8. Cross-Validation: Use K-fold (e.g., 5-fold) for better estimates.
9. Bias-Variance Tradeoff: Balance between underfitting and overfitting.
10. Train-Test Split: Evaluate generalization on unseen data.
11. Learning Curves: Show training vs. validation error.
12. Baseline Models: Compare against simple models (e.g., predict average).
13. Overfitting Detection: High train score, low test score.
14. Hyperparameter Tuning: GridSearchCV, RandomizedSearchCV.
15. Deployment Testing: A/B testing, real-world data checks.
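The definitions of accuracy, precision, recall, and F1 above follow directly from the four confusion-matrix counts. The sketch below computes them for a binary problem; the function names and the example labels are illustrative assumptions, and it assumes at least one positive prediction and one actual positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Derive the standard metrics from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # correct positives / predicted positives
    recall = tp / (tp + fn)             # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Here TP = 3, FN = 1, FP = 1, and TN = 5, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.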
14. Artificial Neurons – A Brief History
1. Inspired by the structure of biological neurons in the brain.
2. First model: McCulloch-Pitts Neuron (1943).
3. Inputs are multiplied by weights, summed, then passed to an activation function.
4. The output is binary (0 or 1) in early models.
5. The Perceptron was introduced by Frank Rosenblatt in 1958.
6. Early excitement faded due to limitations (e.g., XOR problem).
7. Interest revived with backpropagation in the 1980s.
8. Modern neurons use activation functions like ReLU, Sigmoid, Tanh.
9. Neural networks are layers of neurons (input, hidden, output).
10. Key to deep learning, which uses many hidden layers.
11. Each weight update is based on error minimization.
12. Neurons learn feature importance through weight adjustments.
13. Used in pattern recognition, image processing, and NLP.
14. The bias term adjusts the output independent of input.
15. Artificial neurons are foundational units of neural networks.
15. Implementing a Perceptron Learning Algorithm in Python
1. The Perceptron is the simplest neural network model.
2. It classifies linearly separable data.
3. Takes inputs x, multiplies with weights w, sums them: z = w·x + b.
4. Applies an activation function (e.g., step function).
5. Output: 1 if result ≥ threshold, else 0.
6. Weight update rule: w = w + η * (y - ŷ) * x.
7. Python libraries: NumPy for matrix operations.
8. Training involves multiple epochs (passes through data).
9. Convergence occurs if data is linearly separable.
10. Used to understand core ML mechanics.
11. Small code snippet can train on 2D inputs (AND, OR).
12. Ideal for binary classification problems.
13. Decision boundary is a straight line (or hyperplane in higher dimensions).
14. Doesn’t work on non-linear problems like XOR.
15. Basis for more advanced models like multilayer perceptrons.
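The update rule from item 6 can be tried on the AND function mentioned in item 11. This is a minimal sketch in plain Python rather than NumPy, to keep it dependency-free; the threshold is folded into the bias (so the step fires when w·x + b ≥ 0), and the learning rate of 0.5 and the epoch limit are arbitrary choices for the example.

```python
def predict(weights, bias, x):
    """Step activation: 1 if w.x + b >= 0, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0.0 else 0

def train_perceptron(samples, labels, eta=0.5, n_epochs=20):
    """Perceptron learning rule: w = w + eta * (y - y_hat) * x, sample by sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(n_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            update = eta * (y - predict(weights, bias, x))
            if update != 0.0:
                # Misclassified sample: nudge the boundary towards it
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:
            break  # every sample is classified correctly
    return weights, bias

# Truth table of the logical AND function (linearly separable)
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Replacing the labels with the XOR truth table [0, 1, 1, 0] would leave the loop running until the epoch limit without ever reaching zero errors, which is the limitation noted in item 14.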
16. Training a Perceptron Model on the Iris Dataset
1. Iris dataset has 150 samples and 3 classes of flowers.
2. Features: sepal length/width, petal length/width.
3. Often simplified to binary (e.g., Setosa vs. Versicolor).
4. Use scikit-learn to load and preprocess data.
5. Split into training and test sets (e.g., 70:30).
6. Initialize weights and bias as zeros.
7. Apply the perceptron learning rule to update weights.
8. Use epochs (iterations) to refine model.
9. Visualize decision boundaries using matplotlib.
10. Accuracy usually high due to linear separability.
11. Evaluate using metrics like precision and recall.
12. Simple baseline model for understanding data behavior.
13. Petal length and width are most influential features.
14. Can be extended to multiclass using One-vs-All strategy.
15. Good starter example to demonstrate model training flow.
17. Adaptive Linear Neurons (Adaline) and Gradient Descent
1. ADALINE = ADAptive LInear NEuron.
2. Similar to perceptron but uses a continuous output.
3. Uses identity function (linear activation) instead of step.
4. Error is based on squared difference between predicted and actual output.
5. Cost function: Sum of Squared Errors (SSE).
6. Weight update uses gradient descent to minimize SSE.
7. Gradients are partial derivatives of cost w.r.t. weights.
8. Smaller learning rates give more stable but slower convergence.
9. Larger rates can cause overshooting.
10. Adaline can handle noisy data better than perceptron.
11. Learning continues until the cost function converges.
12. Sensitive to feature scaling, hence normalization is crucial.
13. Forms basis of more advanced linear classifiers.
14. Offers better stability in weight updates.
15. Implemented similarly to perceptron but updates after all samples (batch learning).
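Batch gradient descent for Adaline can be sketched as follows. The sketch assumes the common convention of a cost equal to half the sum of squared errors (the factor 1/2 only simplifies the gradient), 0/1 targets with a 0.5 decision threshold, and made-up, already standardized data; none of these details are fixed by the notes above.

```python
def net_input(weights, bias, x):
    """Weighted sum w.x + b; with the identity activation this is also the output."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_adaline(samples, targets, eta=0.1, n_epochs=50):
    """Batch gradient descent on the cost 0.5 * sum((y - output)^2)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    costs = []
    for _ in range(n_epochs):
        outputs = [net_input(weights, bias, x) for x in samples]
        errors = [y - out for y, out in zip(targets, outputs)]
        # One update per epoch, using the gradient summed over all samples
        for j in range(len(weights)):
            weights[j] += eta * sum(e * x[j] for e, x in zip(errors, samples))
        bias += eta * sum(errors)
        # Cost measured with the weights used at the start of this epoch
        costs.append(0.5 * sum(e ** 2 for e in errors))
    return weights, bias, costs

# Toy, already standardized 1-D feature (made-up values) with 0/1 targets
samples = [(-1.5,), (-0.5,), (0.5,), (1.5,)]
targets = [0, 0, 1, 1]

weights, bias, costs = train_adaline(samples, targets)
predictions = [1 if net_input(weights, bias, x) >= 0.5 else 0 for x in samples]
```

Unlike the perceptron, the weights change on every epoch even when all samples are already classified correctly, because the update is driven by the continuous error rather than by misclassifications.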
1. Choosing a Classification Algorithm
1. Classification predicts discrete labels (e.g., spam vs. not spam).
2. Choice of algorithm depends on:
o Dataset size
o Linear or non-linear nature
o Noise and feature scaling
3. For small, clean datasets → Logistic Regression or SVM.
4. For large, complex datasets → Random Forest or Gradient Boosting.
5. For real-time/online learning → Stochastic Gradient Descent (SGD).
6. For non-linear decision boundaries → KNN, SVM with kernels, Neural Nets.
7. Interpretability matters? Use Decision Trees or Logistic Regression.
8. Need probability scores? Choose Logistic Regression or Naive Bayes.
9. High-dimensional data? Prefer SVM or Logistic Regression with regularization.
10. Imbalanced data? Use balanced class weights or sampling methods.
11. Use GridSearchCV to tune hyperparameters across classifiers.
12. Try multiple models and compare using cross-validation scores.
13. Avoid overfitting by choosing simpler models first.
14. Evaluate with precision, recall, F1, ROC AUC.
15. Scikit-learn makes switching algorithms easy via a common API.
2. Training a Perceptron via Scikit-learn
1. A perceptron is a binary linear classifier.
2. In Scikit-learn: from sklearn.linear_model import Perceptron.
3. It uses a step function to make predictions.
4. Weights are updated using stochastic gradient descent.
5. Code:
```python
clf = Perceptron(max_iter=1000, eta0=0.1, random_state=1)
clf.fit(X_train, y_train)
```
6. eta0 is the learning rate; max_iter controls training epochs.
7. Works best with linearly separable data.
8. Predict using clf.predict(X_test).
9. Accuracy: accuracy_score(y_test, y_pred).
10. Visualize decision boundary using matplotlib (2D only).
11. Doesn’t output probabilities (only class labels).
12. You can scale data using StandardScaler.
13. It is sensitive to feature scaling.
14. Use confusion matrix for evaluation.
15. Not suitable for non-linear classification.
3. Learning the Weights of the Logistic Cost Function
1. Logistic regression uses a sigmoid function to model output.
2. Output: a probability between 0 and 1.
3. Logistic cost function (binary cross-entropy):
$$J(w) = -\sum \left[ y \log(h) + (1 - y) \log(1 - h) \right]$$
4. h is the hypothesis or prediction (sigmoid output).
5. Weights are learned via gradient descent to minimize cost.
6. If prediction is close to true label, cost is low.
7. If prediction is far off, cost becomes large.
8. Convex nature ensures global minimum.
9. Derivatives are used to adjust weights iteratively.
10. Gradient:
$$\frac{\partial J}{\partial w} = \sum (h - y)x$$
11. The cost function penalizes wrong confident predictions more.
12. L2 regularization can be added to control overfitting.
13. Solvers like liblinear, lbfgs, saga are used in Scikit-learn.
14. Weight updates happen until convergence (or max_iter).
15. Good understanding of this helps in tuning learning rate and tolerance.
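The cost and gradient formulas in this section can be checked numerically with a few lines of plain Python. This is a sketch under stated assumptions: the bias is folded into the weights by giving every sample a leading constant 1.0, and the data, learning rate, and iteration count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(weights, samples, labels):
    """Return (J, dJ/dw) for the summed binary cross-entropy."""
    cost = 0.0
    gradient = [0.0] * len(weights)
    for x, y in zip(samples, labels):
        h = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        cost -= y * math.log(h) + (1 - y) * math.log(1.0 - h)
        for j, xj in enumerate(x):
            gradient[j] += (h - y) * xj   # matches dJ/dw = sum((h - y) * x)
    return cost, gradient

# Each sample starts with a constant 1.0 so that weights[0] acts as the bias
samples = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
labels = [0, 0, 1, 1]

weights = [0.0, 0.0]
initial_cost, _ = cost_and_gradient(weights, samples, labels)

eta = 0.1  # learning rate (assumed value)
for _ in range(100):
    _, gradient = cost_and_gradient(weights, samples, labels)
    weights = [w - eta * g for w, g in zip(weights, gradient)]

final_cost, _ = cost_and_gradient(weights, samples, labels)
```

With all-zero weights every prediction is 0.5, so the initial cost is 4·ln 2; gradient descent then lowers it as the weight on the feature grows positive.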
4. Training a Logistic Regression Model with Scikit-learn
1. Use from sklearn.linear_model import LogisticRegression.
2. Code:
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```
3. Supports binary and multiclass classification.
4. Automatically applies L2 regularization (can be disabled).
5. Predict class: model.predict(X_test).
6. Predict probability: model.predict_proba(X_test).
7. Feature importance: model.coef_.
8. Use penalty='l1' for L1 (Lasso-style) regularization, which yields a sparse model and needs a compatible solver such as liblinear or saga.
9. Use solver='liblinear' for small datasets.
10. Works best with scaled features (use StandardScaler).
11. Tune regularization strength via C parameter.
12. Multiclass problems can use a softmax-based (multinomial) formulation; older scikit-learn versions select it with multi_class='multinomial'.
13. Evaluate with ROC, AUC, confusion matrix.
14. Good default classifier with fast training.
15. Interpretability makes it a great baseline model.
5. Logistic Regression Intuition and Conditional Probabilities
1. Logistic regression models P(y=1 | x).
2. Uses sigmoid function to squeeze values between 0 and 1.
3. Probability output is:
$$P(y=1|x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
4. Can be interpreted as odds using log-odds (logit):
$$\log\left(\frac{p}{1-p}\right) = w \cdot x + b$$
5. The curve is S-shaped.
6. Threshold (usually 0.5) is applied to classify.
7. Gives confidence in predictions (unlike Perceptron).
8. Based on maximum likelihood estimation.
9. Handles binary classification by default.
10. Can be extended to multiclass using softmax.
11. Feature weights indicate feature importance.
12. Coefficients show direction and strength of influence.
13. Probabilities allow risk-based decision-making.
14. Can be regularized to avoid overfitting.
15. Common in medical, marketing, and risk modeling.
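The relationship between the sigmoid, the log-odds, and the 0.5 threshold can be made concrete with a short sketch. The weight, bias, and input below are hypothetical numbers for a one-feature model, not values from any fitted classifier.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p to log-odds; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical model: one feature with weight w and bias b
w, b = 1.5, -3.0
x = 3.0

z = w * x + b                          # log-odds of the positive class
p = sigmoid(z)                         # P(y=1 | x)
predicted_class = 1 if p >= 0.5 else 0  # threshold at 0.5
```

Thresholding the probability at 0.5 is the same as checking whether the log-odds w·x + b is non-negative, which is why the decision boundary of logistic regression is linear.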
6. Tackling Overfitting via Regularization
1. Overfitting occurs when a model learns noise instead of patterns.
2. It performs well on training data but poorly on test data.
3. Regularization helps reduce overfitting by penalizing large weights.
4. In logistic regression and linear models, two main types:
o L1 Regularization (Lasso) – Encourages sparsity (some weights = 0).
o L2 Regularization (Ridge) – Shrinks all weights towards zero.
5. Regularization is controlled using C parameter in Scikit-learn:
o Smaller C = stronger regularization.
6. Regularization term is added to cost function:
$$J(w) = \text{Loss} + \lambda \sum w^2 \ (\text{L2})$$
7. Helps prevent the model from becoming too complex.
8. Forces model to rely only on important features.
9. Reduces model variance while slightly increasing bias.
10. L1 regularization can be used for feature selection.
11. Must normalize or scale features for effective regularization.
12. In Scikit-learn, use:
```python
LogisticRegression(penalty='l2', C=1.0)
```
13. Can combine both via Elastic Net (penalty='elasticnet', which requires solver='saga' and an l1_ratio value).
14. Essential for high-dimensional datasets (many features).
15. Should be tuned via cross-validation or GridSearchCV.
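The effect of the penalty term on the cost can be seen by computing it directly. The weights, the regularization strength lam (the λ of the formula above), and the unregularized loss value below are all hypothetical numbers chosen for illustration.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): the L1 (Lasso) penalty term."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum(w^2): the L2 (Ridge) penalty term."""
    return lam * sum(w ** 2 for w in weights)

weights = [0.5, -2.0, 0.0]
lam = 0.1          # hypothetical regularization strength
data_loss = 1.2    # hypothetical unregularized loss value

cost_l2 = data_loss + l2_penalty(weights, lam)
cost_l1 = data_loss + l1_penalty(weights, lam)
```

The large weight (-2.0) dominates the L2 penalty because it is squared, whereas the L1 penalty grows only linearly with it. Scikit-learn's C is the inverse of the regularization strength, so a larger lam here corresponds to a smaller C.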
7. Maximum Margin Classification with Support Vector Machines (SVM)
1. SVM aims to find the optimal hyperplane that separates classes.
2. The best hyperplane is the one with the maximum margin.
3. Margin = distance between the hyperplane and nearest data points (support vectors).
4. Support vectors are critical data points on the margin boundary.
5. A large margin reduces generalization error.
6. In 2D, hyperplane is a line; in 3D, it's a plane.
7. Mathematically:
$$w \cdot x + b = 0$$
8. Objective: minimize $\|w\|$ while correctly classifying points.
9. SVM uses convex optimization techniques to find weights.
10. Works well with high-dimensional data.
11. In Scikit-learn:
```python
from sklearn.svm import SVC
clf = SVC(kernel='linear')
```
12. SVC with kernel='linear' builds a linear SVM.
13. Effective even when number of features > samples.
14. Works well for text classification, bioinformatics.
15. Hard margin vs. soft margin: real-world data needs soft margin.
8. Maximum Margin Intuition
1. Larger margin = better generalization.
2. Margins act as buffer zones between classes.
3. Even slight data shifts won't misclassify if margin is large.
4. Support vectors are the only points that affect the decision boundary.
5. Increasing margin reduces model complexity.
6. Helps avoid overfitting by ignoring non-critical data points.
7. Focuses on hard-to-classify samples.
8. Doesn’t care about points far from the boundary.
9. Geometry-based approach unlike probabilistic models (e.g., Logistic Regression).
10. Trade-off between margin size and classification error is controlled via C.
11. Intuition: a “safe corridor” for separation.
12. Ideal when classes are well-separated.
13. Boosts confidence in classification results.
14. Margin width affects sensitivity to noise.
15. Explains why SVM performs well even with fewer data points.
9. Dealing with the Non-Linearly Separable Case Using Slack Variables
1. Real-world data often isn’t perfectly linearly separable.
2. Slack variables $\xi_i$ allow some misclassification.
3. They "relax" the margin constraint:
$$y_i(w \cdot x_i + b) \geq 1 - \xi_i$$
4. With $\xi_i \geq 0$: $\xi_i = 0$ means the point is on or outside the margin, $0 < \xi_i \leq 1$ means it is inside the margin but not misclassified, and $\xi_i > 1$ means it is misclassified.
5. Soft-margin SVM optimizes margin while minimizing slack penalties.
6. Introduces a trade-off parameter C:
o High C → less tolerance for misclassifications.
o Low C → wider margin, more tolerance.
7. Helps handle noisy or overlapping data.
8. Makes SVM more practical for real-world problems.
9. In Scikit-learn:
```python
SVC(C=1.0)
```
10. You can tune C with GridSearch for best results.
11. The cost function becomes:
$$\frac{1}{2} \|w\|^2 + C \sum \xi_i$$
12. The optimization balances margin maximization and slack minimization.
13. Prevents model from becoming too rigid.
14. Ensures generalization on test data.
15. Makes SVM flexible for nearly all datasets.
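The slack values and the soft-margin cost can be computed for a fixed hyperplane to see how they behave. This sketch does not train an SVM: the weights, bias, and points are hand-picked assumptions, labels are taken to be -1 or +1, and each slack is set to the smallest value that satisfies the relaxed constraint (which is what it equals at the optimum).

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w.x + b)) for a label y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, samples, labels, C=1.0):
    """0.5 * ||w||^2 + C * sum of slacks."""
    norm_sq = sum(wi ** 2 for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(samples, labels))
    return 0.5 * norm_sq + C * total_slack

# Hand-picked hyperplane x1 = 0 (not the result of any optimization)
w, b = (1.0, 0.0), 0.0

samples = [(2.0, 0.0), (0.5, 1.0), (-0.5, 0.0), (-3.0, 1.0)]
labels = [1, 1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(samples, labels)]
objective = soft_margin_objective(w, b, samples, labels, C=1.0)
```

The four points show the three cases: zero slack (outside the margin), slack 0.5 (inside the margin but on the correct side), and slack 1.5 (misclassified). Raising C makes the same slacks cost more, which pushes the optimizer towards fewer violations.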
10. Solving Non-linear Problems Using a Kernel SVM
1. Kernel SVM maps data to higher dimensions to make it linearly separable.
2. Kernel trick: Compute dot products in higher-dimensional space without actual
transformation.
3. Common kernels:
o RBF (Gaussian): Good for most problems.
o Polynomial: Good for curved decision boundaries.
o Sigmoid: Mimics neural nets.
4. RBF kernel formula:
$$K(x, x') = \exp(-\gamma \|x - x'\|^2)$$
5. In Scikit-learn:
```python
clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
```
6. gamma controls the influence of a training point.
o Low gamma → smooth boundary.
o High gamma → tight, complex boundary that can overfit.
7. C controls regularization (soft margin).
8. Kernel SVM works well for image, handwriting, bioinformatics.
9. Transforms circular, spiral, or XOR-like datasets into separable ones.
10. Great for capturing nonlinear relationships.
11. Can overfit if gamma or C are too high.
12. RBF is default kernel in SVC.
13. Doesn’t work well with very large datasets (slow training).
14. You can visualize decision regions in 2D using meshgrid plots.
15. Essential for solving complex real-world classification problems.
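The RBF kernel formula and the role of gamma can be explored with a direct implementation. This is a small sketch with made-up points; it evaluates the kernel for one pair of points only and is not how an SVM library computes its kernel matrix internally.

```python
import math

def rbf_kernel(x, x_prime, gamma=0.1):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_dist)

a = (0.0, 0.0)
b = (3.0, 4.0)   # squared distance from a is 25

same = rbf_kernel(a, a)                     # identical points
low_gamma = rbf_kernel(a, b, gamma=0.01)    # wide influence
high_gamma = rbf_kernel(a, b, gamma=1.0)    # very local influence
```

Identical points always give a similarity of 1. With a low gamma the distant point still counts as fairly similar, while with a high gamma its similarity drops to almost zero, so each training point influences only its immediate neighbourhood.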
1. Decision Tree Learning
1. A Decision Tree is a flowchart-like model that splits data into branches based on
feature values.
2. It recursively divides data to form a tree structure with decision nodes and leaf
nodes.
3. Internal nodes test features; leaf nodes give predictions.
4. Works for classification and regression tasks.
5. Splits are chosen to maximize purity (i.e., homogeneity) of the resulting subsets.
6. Common splitting criteria:
o Gini Impurity (default in Scikit-learn)
o Entropy (Information Gain)
7. Gini: Measures probability of mislabeling a random sample.
8. Entropy: Measures the amount of uncertainty/disorder in the data.
9. Scikit-learn code:
```python
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy')
tree.fit(X_train, y_train)
```
10. Easy to interpret and visualize.
11. Prone to overfitting if the tree is too deep.
12. Control complexity using max_depth, min_samples_split.
13. Can be visualized using plot_tree() or export_graphviz().
14. No need for feature scaling.
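15. Can handle both numerical and categorical data in principle, although scikit-learn's trees expect categorical features to be numerically encoded first.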
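The Gini impurity from item 7 is short enough to compute by hand. The sketch below uses the usual definition, one minus the sum of squared class proportions; the function name and the example labels are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_i^2), where p_i is the proportion of class i in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["spam"] * 4)                     # one class only
mixed = gini_impurity(["spam", "spam", "ham", "ham"])  # evenly mixed
```

A node containing a single class has impurity 0, and an even two-class mix has impurity 0.5, the maximum for two classes. A split is good when it moves the child nodes towards 0.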
2. Maximizing Information Gain – Getting the Most Bang for the Buck
1. Information Gain (IG) helps pick the best feature for a split.
2. IG = entropy before split − weighted entropy after split.
3. Higher IG means a better feature for splitting.
4. Intuitively: "How much uncertainty does this feature remove?"
5. Used in decision trees for selecting best nodes.
6. Formula for entropy:
$$H(S) = -\sum p_i \log_2 p_i$$
7. Scikit-learn allows criterion='entropy' to use info gain.
8. Weighted entropy considers the size of subsets after a split.
9. Splits that produce purer subsets yield higher IG.
10. Greedy algorithm picks best split locally, not globally.
11. Works well when features are informative.
12. Reduces tree depth by focusing on most relevant attributes.
13. Should be combined with pruning or depth limits.
14. Can be visualized by inspecting feature importances.
15. Helps avoid irrelevant splits and overfitting.
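Entropy and information gain can be computed directly from the formulas in items 2 and 6. This is a minimal sketch; the function names and the toy label lists are illustrative, and each candidate split is represented simply as a list of child label lists.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# Two candidate splits of the same parent node, both into equal halves
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
```

The parent has 1 bit of entropy. The first split removes all of it (IG = 1), while the second leaves both children as mixed as the parent (IG = 0), even though the two splits are equally balanced in size.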
3. Building a Decision Tree
1. Start from the root node and pick the best feature using IG or Gini.
2. Split data based on feature value.
3. Repeat recursively for each subset.
4. Stop when:
o Max depth is reached
o All samples belong to one class
o No features left
5. By default, scikit-learn keeps splitting until every leaf is pure or holds fewer than min_samples_split samples.
6. Hyperparameters to control size:
o max_depth, min_samples_split, min_samples_leaf
7. Binary or multi-way splits possible.
8. Nodes store feature, threshold, and class prediction.
9. Easy to interpret but not always generalizable.
10. Code:
```python
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
```
11. Use feature_importances_ to interpret feature relevance.
12. Overfitting can be mitigated by pruning or ensemble methods.
13. Can output probabilities using predict_proba().
14. Performs well on small to medium datasets.
15. Works well without preprocessing or feature scaling.
4. Combining Weak to Strong Learners via Random Forests
1. A Random Forest is an ensemble of decision trees.
2. It combines predictions of many independent trees to improve accuracy.
3. Each tree is trained on a random subset of data (bootstrap sampling).
4. Also uses a random subset of features per split.
5. Helps reduce overfitting seen in single decision trees.
6. Prediction is based on majority vote (classification) or average (regression).
7. Scikit-learn code:
```python
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
```
8. n_estimators: Number of trees.
9. max_features: Controls randomness; smaller values increase diversity.
10. Very robust to noisy data and outliers.
11. Handles missing values better than single trees.
12. Feature importances can be calculated.
13. Performs well with high-dimensional data.
14. Slower to predict than single tree but more accurate.
15. Excellent baseline model for classification tasks.
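Two ingredients of a random forest, bootstrap sampling (item 3) and majority voting (item 6), can be sketched without building any trees. The helper names, the seed, and the list of per-tree votes are hypothetical; a real forest would obtain the votes from trees trained on the bootstrap samples.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most frequent prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
data = list(range(10))          # stand-in for ten training samples
sample = bootstrap_sample(data, rng)

# Hypothetical predictions from five trees for one input
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
forest_prediction = majority_vote(tree_votes)
```

Because sampling is done with replacement, a bootstrap sample has the same size as the original data but typically repeats some items and omits others, which is what makes the individual trees differ from one another.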
5. K-Nearest Neighbors (KNN) – A Lazy Learning Algorithm
1. KNN classifies based on the majority label of the k-nearest points.
2. It is instance-based: it stores all training data.
3. No training phase → “lazy learner”.
4. Prediction time increases with dataset size.
5. Distance metrics used:
o Euclidean (default)
o Manhattan
o Minkowski
6. Code:
python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
7. Works best with scaled features.
8. k must be chosen carefully (an odd k avoids tied votes in binary classification).
9. k=1 → very sensitive to noise.
10. k too high → decision boundary becomes too smooth (underfitting).
11. No internal model; just data lookup and voting.
12. Great for non-linear decision boundaries.
13. Simple and effective for small datasets.
14. Can be used for regression too.
15. Memory inefficient on large datasets.
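To see why scaling and the choice of k matter, the sketch below compares raw and standardized features for a few values of k. The Wine dataset and the candidate k values are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine has features on very different scales (assumed example dataset).
X, y = load_wine(return_X_y=True)

results = {}
for k in (1, 5, 15):
    raw = KNeighborsClassifier(n_neighbors=k)
    scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Store mean 5-fold accuracy as (raw, scaled).
    results[k] = (
        cross_val_score(raw, X, y, cv=5).mean(),
        cross_val_score(scaled, X, y, cv=5).mean(),
    )
    print(f"k={k:2d}  raw={results[k][0]:.3f}  scaled={results[k][1]:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on the training folds only, so the distance computation never sees statistics from the held-out fold.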
6. Building Good Training Sets – Data Preprocessing
1. Data preprocessing is crucial to prepare raw data for machine learning models.
2. It improves accuracy, speed, and generalization.
3. Common preprocessing steps:
o Handling missing values
o Scaling features
o Encoding categorical variables
o Removing outliers or noise
4. Helps in transforming inconsistent, noisy, or incomplete data.
5. Ensures data is in the right shape and format for ML algorithms.
6. Scikit-learn provides tools like StandardScaler, SimpleImputer, OneHotEncoder.
7. Preprocessing improves model training efficiency and performance.
8. Important for algorithms that are sensitive to scale (e.g., KNN, SVM).
9. Also includes feature extraction, dimensionality reduction, etc.
10. Normalizing data reduces the influence of dominant features.
11. Training and test data must be preprocessed with the same parameters, learned from the training data only.
12. Use Pipeline in Scikit-learn to automate preprocessing and modeling.
13. Enables reproducibility and consistent transformation.
14. Data preprocessing should be built into the ML pipeline rather than applied by hand.
15. Without preprocessing, even the best models may underperform.
7. Dealing with Missing Data
1. Missing data is common in real-world datasets.
2. It can occur due to entry errors, sensor failure, skipped responses, etc.
3. Strategies to deal with missing data:
o Removing samples or features
o Imputing values
4. If the missing rate is low, remove rows using:
python
df.dropna()
5. Columns can be removed with:
python
df.drop(columns=['feature'])
6. For imputation:
python
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
7. Strategies: mean, median, most_frequent, or constant.
8. Never fit a separate imputer on the test set.
9. Fit the imputer on the training data, then apply the same learned statistics to the test set.
10. Dropping too much data can hurt model performance.
11. Imputation helps retain valuable data instead of discarding.
12. More advanced options include KNN-based and regression-based imputation.
13. Some models like XGBoost can handle missing values directly.
14. Always analyze why data is missing before choosing a strategy.
15. Visual tools like heatmaps (e.g., seaborn's sns.heatmap) help reveal patterns in the missing values.
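The rule of imputing with training statistics only can be sketched on a tiny hand-made array. The numbers are hypothetical and chosen so the column means are easy to verify by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries (np.nan).
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 50.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns column means 3.0 and 20.0
X_test_imp = imputer.transform(X_test)        # reuses the training means

print(imputer.statistics_)
print(X_test_imp)
```

The test rows are filled with 3.0 and 20.0, the training means, not with values computed from the test set itself.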
8. Understanding the Scikit-learn Estimator API
1. Scikit-learn uses a unified interface for all models: the Estimator API.
2. Every model is a Python class with .fit(), .predict() methods.
3. Steps:
o Initialize: model = ClassifierName(params)
o Train: model.fit(X_train, y_train)
o Predict: model.predict(X_test)
4. .fit() learns the model from training data.
5. .predict() returns predictions on unseen data.
6. Many estimators also support:
o .predict_proba() for probabilities
o .score() for model evaluation
7. Pipelines are built using:
python
from sklearn.pipeline import Pipeline
8. Preprocessing can be chained with modeling in one estimator.
9. Consistent API allows interchangeable models with same code.
10. Grid search for tuning:
python
from sklearn.model_selection import GridSearchCV
11. Cross-validation:
python
from sklearn.model_selection import cross_val_score
12. Before .fit() is called, an estimator holds only its hyperparameters and no learned state.
13. Once fitted, models store learned parameters (coef_, feature_importances_, etc.).
14. Scikit-learn API encourages clean, modular, and reusable ML code.
15. Works seamlessly with Pandas and NumPy.
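Because every estimator exposes the same methods, models can be swapped without touching the surrounding code. A minimal sketch, assuming the Iris dataset and three arbitrarily chosen classifiers:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Three different models, one identical workflow.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from training data
    y_pred = model.predict(X_test)              # predict unseen data
    scores[name] = model.score(X_test, y_test)  # mean accuracy
    print(f"{name}: {scores[name]:.3f}")
```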
9. Handling Categorical Data
1. ML models require numerical input – categorical data must be converted.
2. Two types of categorical variables:
o Ordinal: Have meaningful order (e.g., low < medium < high)
o Nominal: No intrinsic order (e.g., red, green, blue)
3. Scikit-learn can encode both using preprocessing tools.
9.1 Mapping Ordinal Features
1. Ordinal features are mapped to integers based on rank.
2. Manual encoding example:
python
mapping = {'low': 1, 'medium': 2, 'high': 3}
df['size'] = df['size'].map(mapping)
3. Preserves order of values.
4. Use only when order is logically meaningful.
9.2 Encoding Class Labels
1. Target values must also be numerical.
2. Use LabelEncoder:
python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
3. Converts ['yes', 'no'] → [1, 0]
4. Many libraries require integer class labels; Scikit-learn classifiers also accept string labels and encode them internally.
9.3 Performing One-Hot Encoding on Nominal Features
1. Use for non-ordinal categorical variables.
2. Transforms column into multiple binary columns:
o color = ['red', 'green'] → color_red, color_green
3. Use:
python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
X = ohe.fit_transform(X)
4. Avoids false assumptions about order.
5. In pandas:
python
pd.get_dummies(df)
10. Partitioning Dataset into Training and Test Sets
1. Dataset should be split to assess model performance.
2. Training set: used to train the model.
3. Test set: used to evaluate on unseen data.
4. Helps detect overfitting and gives an estimate of generalization.
5. Standard split: 80/20 or 70/30.
6. Scikit-learn:
python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
7. Use random_state for reproducibility.
8. For classification, use stratify=y to maintain class balance.
9. Validation set can also be added for tuning hyperparameters.
10. Use cross-validation for more robust testing.
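The effect of stratify=y and random_state is easiest to see on imbalanced labels. The synthetic 90/10 data below is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 90 of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both subsets keep the original 10% share of class 1.
print("Train:", len(y_train), "samples,", int(y_train.sum()), "positives")
print("Test: ", len(y_test), "samples,", int(y_test.sum()), "positives")
```

Without stratify, a small test set could by chance contain very few or no samples of the minority class.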
11. Bringing Features onto the Same Scale
1. Many models assume features are on similar scale.
2. Features like age (0–100) and income (10,000–1,000,000) can distort results.
3. Common scaling techniques:
o Min-Max Scaling: Rescales to [0, 1]
o Standardization: Mean = 0, Std = 1
4. Standardization:
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
5. MinMax:
python
from sklearn.preprocessing import MinMaxScaler
X_scaled = MinMaxScaler().fit_transform(X)
6. Required for:
o KNN
o SVM
o PCA
o Models trained with gradient descent
7. Not required for:
o Decision Trees
o Random Forests
8. Must be applied to test set using training parameters.
9. Helps models converge faster and perform better.
10. Prevents bias towards high-magnitude features.
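A small worked example of both scalers, fitted on training data and then applied to a test sample. The age and income values are hypothetical and chosen so the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns: age and income.
X_train = np.array([[20.0, 20000.0], [40.0, 60000.0], [60.0, 100000.0]])
X_test = np.array([[30.0, 80000.0]])

std = StandardScaler().fit(X_train)  # learns per-column mean and std
mm = MinMaxScaler().fit(X_train)     # learns per-column min and max

X_test_std = std.transform(X_test)   # (x - mean) / std
X_test_mm = mm.transform(X_test)     # (x - min) / (max - min)

print("Means:", std.mean_)
print("Standardized:", X_test_std.round(3))
print("Min-max:", X_test_mm)
```

For the test sample, min-max scaling gives (30 - 20) / 40 = 0.25 for age and (80000 - 20000) / 80000 = 0.75 for income, so both features end up on a comparable scale.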
12. Selecting Meaningful Features
1. Feature selection removes irrelevant or redundant features.
2. Reduces overfitting, improves accuracy, and speeds up training.
3. Two main types:
o Filter methods: Based on stats (e.g., correlation)
o Wrapper methods: Based on model performance
12.1 Sparse Solutions with L1 Regularization
1. L1 regularization encourages weights of irrelevant features to become zero.
2. Feature selection happens automatically.
3. Use:
python
from sklearn.linear_model import LogisticRegression
LogisticRegression(penalty='l1', solver='liblinear')
4. Great for high-dimensional data.
5. Useful for text classification, gene selection, etc.
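The sparsity effect can be observed by counting non-zero coefficients as the regularization gets stronger (smaller C). The dataset and the C values are assumptions for illustration; recent Scikit-learn releases may emit a deprecation warning for the penalty argument.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 features
X_std = StandardScaler().fit_transform(X)   # L1 penalties assume comparable scales

nonzero = {}
for C in (1.0, 0.1, 0.01):  # smaller C means stronger regularization
    lr = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lr.fit(X_std, y)
    nonzero[C] = int(np.count_nonzero(lr.coef_))
    print(f"C={C}: {nonzero[C]} of {X.shape[1]} coefficients are non-zero")
```

Features whose coefficient is exactly zero have no influence on the prediction, which is what makes L1 act as a built-in feature selector.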
12.2 Sequential Feature Selection Algorithms
1. A wrapper method that selects features based on model performance.
2. Types:
o Forward Selection: Add features one by one
o Backward Elimination: Start with all, remove one by one
3. Scikit-learn:
python
from sklearn.feature_selection import SequentialFeatureSelector
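4. Greedily searches for a feature subset with good performance; the result is not guaranteed to be the global optimum.
5. Can be slow, since the model is retrained for every candidate feature at each step.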
6. Best used when feature count is manageable.
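A minimal sketch of forward selection with SequentialFeatureSelector. The Iris dataset, the KNN estimator, and the target of two features are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 4 features
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start with no features, add the best one at each step.
sfs = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)

mask = sfs.get_support()        # boolean mask of the selected features
X_selected = sfs.transform(X)   # keep only the selected columns

print("Selected feature mask:", mask)
print("Reduced shape:", X_selected.shape)
```

Setting direction="backward" gives backward elimination with the same interface.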
1. Compressing Data via Dimensionality Reduction
1. Dimensionality reduction is the process of reducing the number of input features
(dimensions) while retaining most of the original information.
2. It helps reduce computational cost, training time, and overfitting.
3. High-dimensional datasets often suffer from the curse of dimensionality—models
perform poorly with too many irrelevant features.
4. It improves data visualization (e.g., projecting to 2D/3D).
5. Two major types:
o Unsupervised (e.g., PCA)
o Supervised (e.g., LDA)
6. It transforms data into a new feature space.
7. Reduces redundancy and multicollinearity among features.
8. Can improve model generalization and interpretation.
9. Dimensionality reduction often precedes classification/clustering.
10. Can act as a form of feature selection or transformation.
11. It can uncover hidden patterns in the data.
12. Some techniques are linear (PCA, LDA), others are nonlinear (Kernel PCA).
13. Does not necessarily improve accuracy but improves efficiency.
14. A balance must be struck between compression and information loss.
15. Always apply same transformation to training and test data.
2. Principal Component Analysis (PCA) – Unsupervised
1. PCA is a linear, unsupervised dimensionality reduction technique.
2. It finds new axes (principal components) along which the variance in the data is
maximized.
3. It rotates the data onto a new coordinate system.
4. The first component captures the most variance, the second the next most, etc.
5. PCA does not use labels (unsupervised).
6. PCA uses eigenvalues and eigenvectors from the covariance matrix of the dataset.
7. Principal components are orthogonal and uncorrelated.
8. PCA ranks components by their explained variance.
9. It helps with data compression while preserving trends and patterns.
10. Used in image compression, exploratory analysis, and visualization.
PCA in Scikit-learn:
python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
11. explained_variance_ratio_ shows how much variance each component explains.
12. PCA is sensitive to feature scaling, so standardize before applying:
python
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
13. PCA reduces noise by eliminating lower-variance components.
14. It is best suited to continuous data whose features are linearly correlated.
15. Visualization is clearer in 2D PCA projections.
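The eigenvalue view of PCA described above can be reproduced by hand and compared with Scikit-learn. This is a sketch under the assumption that the data is standardized first; Iris is used only as an example dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 1. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 2. Eigen decomposition, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top two eigenvectors (the principal components).
W = eigvecs[:, :2]
X_manual = X_std @ W
manual_ratio = eigvals[:2] / eigvals.sum()

# Same result from Scikit-learn (component signs may be flipped).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Manual ratio: ", manual_ratio.round(4))
print("sklearn ratio:", pca.explained_variance_ratio_.round(4))
```

The sign of each component is arbitrary, so the two projections are compared by absolute value.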
3. Total and Explained Variance in PCA
1. Total variance is the sum of variances across all features.
2. PCA redistributes this variance among fewer components.
3. Explained variance shows how much of the original data is captured by each
component.
4. Used to decide the number of components to retain.
5. Plotting cumulative explained variance helps in thresholding:
python
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.cumsum(pca.explained_variance_ratio_))
6. A common threshold is 95%, meaning 95% of the total variance is retained.
7. Helps avoid the underfitting that results from discarding too many components.
8. Retain enough components to keep performance while reducing dimensions.
9. Important for balancing compression and information loss.
10. Useful in selecting the ideal number of dimensions for visualization and modeling.
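The number of components for a given threshold can also be computed directly instead of read off a plot. A sketch, assuming the Breast Cancer Wisconsin data and the 95% threshold mentioned above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Fit with all components, then accumulate the explained variance ratios.
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", n_components)

# Passing a float in (0, 1) asks Scikit-learn to pick that number itself.
pca_95 = PCA(n_components=0.95).fit(X_std)
print("Chosen by PCA(n_components=0.95):", pca_95.n_components_)
```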
4. Feature Transformation with PCA
1. Original features are projected onto principal components.
2. This results in new features that are linear combinations of the originals.
3. The transformed space is lower-dimensional and concentrates most of the variance in the first few features.
4. Components can’t be interpreted directly like original features.
5. Inverse transformation is possible using:
python
X_orig = pca.inverse_transform(X_pca)
6. Often used in image reconstruction and noise filtering.
7. Removes correlated and redundant features.
8. The new features help downstream models train faster and generalize better.
9. It also aids visualization when reduced to 2D or 3D.
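How much information the inverse transformation recovers depends on the number of components kept. The sketch below measures reconstruction error on the digits dataset; the dataset and the component counts are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 8x8 images flattened to 64 features

errors = {}
for k in (2, 16, 64):
    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))  # compress, then reconstruct
    errors[k] = float(np.mean((X - X_rec) ** 2))     # mean squared error
    print(f"{k:2d} components: reconstruction MSE = {errors[k]:.4f}")
```

With all 64 components the reconstruction is exact up to floating-point error; with fewer, the discarded low-variance directions are lost.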
5. Linear Discriminant Analysis (LDA) – Supervised
1. LDA is a supervised dimensionality reduction technique.
2. Unlike PCA, LDA uses class labels to maximize class separation.
3. It aims to maximize between-class variance and minimize within-class variance.
4. It projects data onto a linear subspace that best separates the classes.
5. Ideal for classification problems.
6. Works best when data is linearly separable.
7. Computes two matrices:
o Within-class scatter matrix $S_W$
o Between-class scatter matrix $S_B$
8. Finds projection that maximizes:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
9. Number of LDA components ≤ (number of classes - 1)
LDA in Scikit-learn:
python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
10. Great for high-dimensional classification tasks (e.g., gene data).
11. Enhances model performance by reducing noise.
12. Especially useful when classes are well-separated in fewer dimensions.
13. Visualization of class clusters becomes easier.
14. Regularization can be applied to LDA if data is noisy or ill-conditioned.
15. Scikit-learn also supports quadratic discriminant analysis (QDA) for more complex
boundaries.
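The limit of (number of classes - 1) components can be checked directly. A sketch assuming the Wine dataset, which has 13 features and 3 classes:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # 13 features, 3 classes
X_std = StandardScaler().fit_transform(X)

# With 3 classes, at most 3 - 1 = 2 discriminants are available.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)  # labels are required, unlike PCA
print("Reduced shape:", X_lda.shape)
print("Explained variance ratio:", lda.explained_variance_ratio_.round(3))

# Asking for more components than n_classes - 1 is rejected.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_std, y)
    too_many_accepted = True
except ValueError as err:
    too_many_accepted = False
    print("Rejected:", err)
```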
6. Kernel Principal Component Analysis (Kernel PCA) – Nonlinear
1. Kernel PCA extends PCA to handle nonlinear data distributions.
2. Uses the kernel trick to project data into a higher-dimensional space without
computing it explicitly.
3. Helps separate data that is not linearly separable in original space.
4. Popular kernels:
o Radial Basis Function (RBF)
o Polynomial
o Sigmoid
Kernel PCA in Scikit-learn:
python
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
5. gamma sets the width of the RBF kernel: larger values make the influence of each data point more local.
6. Kernel PCA works on patterns like concentric circles, spirals, moons, etc.
7. Ideal for complex feature sets where linear PCA fails.
8. Visualization becomes more meaningful in transformed space.
Example 1: Separating Half-Moon Shapes
python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
X, y = make_moons(n_samples=100, noise=0.1)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
• With a suitable gamma, the half-moon shapes become linearly separable in the transformed space.
Example 2: Separating Concentric Circles
python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
X, y = make_circles(n_samples=100, factor=.3, noise=.05)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
• With a suitable gamma, the circular data transforms into linearly separable classes.
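Whether a projection really is linearly separable depends on gamma and on the data, so it is worth checking rather than assuming. One possible check, sketched below, is to fit a linear classifier on the projected features and look at its training accuracy. The gamma values and the use of LogisticRegression as the linear probe are assumptions for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Baseline: a linear classifier on the original coordinates.
baseline = LogisticRegression().fit(X, y).score(X, y)
print(f"Original space: accuracy = {baseline:.2f}")

# Same linear classifier on Kernel PCA projections for several gamma values.
accuracy = {}
for gamma in (0.5, 5, 15):
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
    accuracy[gamma] = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)
    print(f"gamma={gamma}: accuracy = {accuracy[gamma]:.2f}")
```

An accuracy close to 1.0 indicates that the classes can be separated by a straight line in the projected space for that gamma.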
7. Projecting New Data Points with Kernel PCA
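1. Once fitted, kpca.transform() can project new, unseen data into the reduced
space.
2. However, inverse_transform() is only available in Scikit-learn when the model is created with fit_inverse_transform=True, and the reconstruction is approximate, unlike linear PCA.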
3. You must retain the same fitted model to apply transformation on test data.
4. Helps visualize test data in same latent space as training.
5. Use same kernel and parameters to ensure consistency.
6. Transform test set just like PCA/LDA for model prediction.
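A short sketch of projecting held-out points with an already fitted Kernel PCA model. The half-moon data, the split, and the gamma value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# fit_inverse_transform=True is needed for inverse_transform() later on.
kpca = KernelPCA(
    n_components=2, kernel="rbf", gamma=15, fit_inverse_transform=True
)
X_train_kpca = kpca.fit_transform(X_train)  # fit on training data only
X_test_kpca = kpca.transform(X_test)        # project unseen points

# Approximate mapping back to the original space.
X_test_back = kpca.inverse_transform(X_test_kpca)

print("Train projection:", X_train_kpca.shape)
print("Test projection: ", X_test_kpca.shape)
print("Mean reconstruction error:", float(np.mean((X_test - X_test_back) ** 2)))
```

The fitted model keeps the training samples internally, because projecting a new point requires evaluating the kernel between that point and every training point.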
1. Supervision
• PCA: Unsupervised – does not use class labels.
• LDA: Supervised – uses class labels to maximize class separation.
2. Goal / Objective
• PCA: Find directions (principal components) that maximize variance in data.
• LDA: Find directions (discriminants) that maximize separation between classes
(maximize between-class variance, minimize within-class variance).
3. Use Case
• PCA: Useful for data compression and feature extraction.
• LDA: Useful for improving classification accuracy by enhancing class
separability.
4. Output Dimensions
• PCA: Output dimensions ≤ min(number of features, number of samples)
• LDA: Output dimensions ≤ (number of classes - 1)
5. Feature Transformation
• PCA: Transforms to orthogonal axes capturing maximum variance.
• LDA: Transforms to axes that best separate classes.
6. Basis of Computation
• PCA: Eigen decomposition of the covariance matrix of the input features.
• LDA: Eigen decomposition of the scatter matrices (within-class and between-
class).
7. Class Information
• PCA: Ignores class labels (treats all data equally).
• LDA: Uses class labels to compute distances between and within classes.
8. Type of Learning
• PCA: Unsupervised learning technique.
• LDA: Supervised learning technique (assumes labeled data).
9. Interpretability
• PCA: New features are combinations of original features, hard to interpret.
• LDA: New features are combinations that maximize class distinction, somewhat
interpretable in classification context.
10. Noise Handling
• PCA: May retain high variance noise as a component.
• LDA: Focuses on class separation, may ignore noise if it doesn't affect class
boundaries.
11. Sensitivity to Scaling
• PCA is sensitive to feature scale and should be applied to standardized features; standardizing before LDA is also common practice.
12. Ideal Applications
• PCA: Image compression, exploratory analysis, preprocessing before unsupervised
models.
• LDA: Classification tasks like face recognition, document classification, and gene
expression analysis.
13. Output Quality
• PCA: Projects data in a way that retains information but not necessarily useful for
class separation.
• LDA: Projects data to enhance class clustering and discrimination.
14. Applicability in Classification
• PCA: Not optimized for classification.
• LDA: Optimized for improving classification performance.
15. Linearity
• Both PCA and LDA are linear techniques, but can be extended using kernels
(e.g., Kernel PCA, Kernel LDA) for nonlinear mappings.
Summary Table:
Aspect            | PCA                        | LDA
Type              | Unsupervised               | Supervised
Goal              | Maximize variance          | Maximize class separability
Uses labels       | No                         | Yes
Output Dimensions | Up to n_features           | Up to n_classes - 1
Focus             | Data structure             | Class discrimination
Best for          | Compression, visualization | Classification
1. Streamlining Workflows with Pipelines
1. Pipelines automate the ML workflow by chaining preprocessing and modeling steps.
2. Scikit-learn's Pipeline ensures consistent data transformation during training and
testing.
3. It reduces code duplication, prevents data leakage, and makes models
reproducible.
4. Example:
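python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)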
5. Helps in grid search or cross-validation by wrapping everything into one object.
6. Ensures test data is transformed only with parameters learned on training data.
7. Transformers (e.g., StandardScaler) are followed by estimators (e.g.,
LogisticRegression).
8. Can be extended to include custom steps using FunctionTransformer.
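A custom step can be added with FunctionTransformer, as mentioned above. In this sketch the log transform, the dataset, and the step names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # all features are non-negative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # custom, stateless step
    ("scaler", StandardScaler()),            # fitted on training data only
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# At prediction time the same fitted steps are applied to the test data.
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The scaler's stored mean equals the mean of the log-transformed training data, which confirms that no test-set statistics leak into the preprocessing.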
2. Loading the Breast Cancer Dataset
1. The Breast Cancer Wisconsin dataset is a common dataset for binary classification.
2. It’s built into Scikit-learn via:
python
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
3. Contains 30 features and 569 samples, labeled as benign or malignant.
4. Ideal for evaluating classifiers and feature selection techniques.
5. Real-world biomedical dataset, useful for trying pipelines, grid search, etc.
3. Combining Transformers and Estimators in Pipelines
1. You can chain preprocessing (e.g., scaling, encoding) with modeling (e.g., SVM).
2. Example:
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svm', SVC())
])
3. Pipelines allow parameter tuning across all stages using GridSearchCV.
4. Simplifies workflow and keeps training/testing transformations aligned.
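Tuning across pipeline stages uses parameter names of the form step__parameter. A minimal sketch, assuming the Breast Cancer Wisconsin data and a small, arbitrary parameter grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# "svm__C" addresses the C parameter of the step named "svm".
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)  # scaling is refitted inside every CV fold

print("Best parameters:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy:    {grid.score(X_test, y_test):.3f}")
```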
4. Using k-Fold Cross-Validation
1. Holdout method splits data once into training and test sets (e.g., 80/20 split).
2. The performance estimate can vary considerably depending on how the data happens to be split.
3. K-Fold CV divides data into k parts (e.g., 5 folds), trains on k-1 and tests on 1.
4. Repeats this k times to average performance scores.
5. Helps provide a more reliable estimate of model generalization.
6. Example:
python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5)
7. Variants include StratifiedKFold (preserves class ratio) and Leave-One-Out CV.
8. Reduces the risk of tuning a model to one particular train/test split.
9. Enables fair comparison between different models.
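The k-fold procedure can be written out as an explicit loop to show what cross_val_score does internally. This is a sketch, assuming StratifiedKFold with 5 folds and a scaled logistic regression pipeline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Manual loop: train on k-1 folds, evaluate on the remaining fold.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    manual_scores.append(pipe.score(X[test_idx], y[test_idx]))

# The same evaluation in one call.
auto_scores = cross_val_score(pipe, X, y, cv=cv)

print("Manual:", np.round(manual_scores, 3))
print("Auto:  ", auto_scores.round(3))
print(f"Mean accuracy: {auto_scores.mean():.3f} +/- {auto_scores.std():.3f}")
```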
5. Learning Curves – Debugging Bias and Variance
1. Learning curves plot model performance vs. number of training samples.
2. Help detect underfitting (high bias) or overfitting (high variance).
3. Example using learning_curve:
python
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=5)
4. Large gap between training and test scores → overfitting.
5. Both low → underfitting.
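The returned score arrays have one row per training size and one column per fold, so they are usually averaged across folds before being read or plotted. A sketch, assuming a scaled logistic regression as the estimator:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of the training folds
    shuffle=True, random_state=0,
)

# Average over the folds (axis 1) to get one value per training size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
gap = train_mean - test_mean  # large gap suggests high variance

for n, tr, te, g in zip(train_sizes, train_mean, test_mean, gap):
    print(f"n={n:3d}  train={tr:.3f}  validation={te:.3f}  gap={g:.3f}")
```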
6. Validation Curves – Tuning Hyperparameters
1. Validation curves plot model score vs. a changing hyperparameter.
2. Help identify best hyperparameter values.
3. Example:
python
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC
param_range = [0.001, 0.01, 0.1, 1.0, 10.0]
train_scores, test_scores = validation_curve(SVC(), X, y,
    param_name='C', param_range=param_range, cv=3)
4. Sweet spot is the value where test score is highest and stable.
5. A test score that drops while the training score keeps rising signals overfitting.
7. Confusion Matrix
1. A matrix that summarizes true positives, false positives, false negatives, and true
negatives.
2. Example:
python
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
3. Helpful in binary and multiclass classification.
4. Allows calculation of precision, recall, accuracy, and F1-score.
5. A perfect classifier would have all predictions on the diagonal of the matrix.
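For a binary problem the four cells can be unpacked directly from the matrix. The labels below are hypothetical and small enough to count by hand.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}  accuracy={accuracy:.2f}")
```

Here one positive is missed (FN = 1) and one negative is flagged by mistake (FP = 1), so 8 of 10 predictions sit on the diagonal.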
8. Optimizing Precision and Recall
1. Precision = TP / (TP + FP) – how many predicted positives are correct.
2. Recall = TP / (TP + FN) – how many actual positives were found.
3. F1-score = harmonic mean of precision and recall.
4. Choose metric depending on problem:
o Precision important in spam filtering (false positives costly).
o Recall important in medical diagnosis (false negatives costly).
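The formulas above can be checked against Scikit-learn on a small example. The labels are hypothetical and constructed so that TP = 3, FP = 1, and FN = 2.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 5 actual positives, 4 predicted positives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

tp, fp, fn = 3, 1, 2  # counted by hand from the lists above

precision_manual = tp / (tp + fp)  # 3 / 4
recall_manual = tp / (tp + fn)     # 3 / 5
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision:.3f} (manual {precision_manual:.3f})")
print(f"Recall:    {recall:.3f} (manual {recall_manual:.3f})")
print(f"F1-score:  {f1:.3f} (manual {f1_manual:.3f})")
```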
9. ROC Curve and AUC
1. ROC (Receiver Operating Characteristic) curve plots TPR vs. FPR at various
thresholds.
2. Closer the curve to top-left, the better.
3. AUC (Area Under Curve) gives single score (1.0 = perfect, 0.5 = random).
4. Example:
python
from sklearn.metrics import roc_curve, roc_auc_score
y_pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
5. Used especially when classes are imbalanced.
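A self-contained version of the ROC workflow, sketched under the assumption of a scaled logistic regression on the Breast Cancer Wisconsin data. It also shows that the AUC is simply the area under the (FPR, TPR) points.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ROC analysis needs scores or probabilities, not hard class predictions.
y_pred_prob = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_score = roc_auc_score(y_test, y_pred_prob)
auc_from_curve = auc(fpr, tpr)  # trapezoidal area under the curve

print(f"AUC: {auc_score:.3f} (from curve: {auc_from_curve:.3f})")
```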
10. Scoring for Multiclass Classification
1. Accuracy alone can be misleading for imbalanced or multiclass problems.
2. Use macro/micro-averaged:
o Precision
o Recall
o F1-score
3. Scikit-learn supports scoring with:
python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
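4. Micro-average pools the true positives, false positives, and false negatives of all classes, so frequent classes dominate the result.
5. Macro-average computes the metric per class and takes the unweighted mean, treating all classes equally.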
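The difference between the two averages shows up clearly on imbalanced labels. The three-class labels below are hypothetical: class 0 is frequent and predicted well, classes 1 and 2 are rare and each lose half their samples.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: six samples of class 0, two each of 1 and 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
micro = f1_score(y_true, y_pred, average="micro")   # pooled over all samples

print("Per-class F1:", per_class.round(3))
print(f"Macro F1: {macro:.3f}")
print(f"Micro F1: {micro:.3f}")
```

The micro average (0.800) equals plain accuracy here and is carried by the frequent class, while the macro average (about 0.730) is pulled down by the two rare classes.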