What is Machine Learning?
Enables computers to learn without being explicitly programmed
Improves with experience (data)
Uses algorithms to recognize patterns
Learns from feedback, much as humans do
Key to AI and Data Science
Powers predictive systems
Applications in Data Science
Regression: Predicts continuous values
Classification: Categorizes data
NLP: Finds names in text
Image/Voice recognition
Customer segmentation
Predictive maintenance
Root Cause Analysis in ML
Focuses on interpretation
Identifies causes, not just predictions
Business process optimization
Healthcare insights (e.g., diabetes)
Traffic jam analysis
Supports strategic decisions
Clustering in Machine Learning
Groups similar data (unsupervised)
Used in market segmentation
No need for labeled data
Reveals natural patterns
Aids data cleaning
Common in exploratory analysis
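As a minimal sketch of the clustering idea on this slide, the example below groups unlabeled points with scikit-learn's KMeans. The synthetic two-blob data and the choice of two clusters are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative data, not from the slides).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# No labels are passed in: the algorithm only sees the feature matrix.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = model.fit_predict(X)

# Number of points assigned to each cluster.
print(np.bincount(cluster_ids))
```

The cluster numbers themselves are arbitrary; only the grouping carries meaning.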
ML in the Data Science Process
Supports all phases
Guides problem framing
Automates data prep
Detects data patterns
Powers model training
Enables real-time insights
ML in Setting Goals & Data Retrieval
Leverages past models
Refines problem framing
NLP for unstructured data
Extracts info from PDFs
Identifies key data sources
Automates extraction
ML in Data Preparation
Cleans and structures data
Clustering helps catch and fix typos
Groups similar entries
Enhances quality
Assists transformation
Streamlines preprocessing
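One way to picture "grouping similar entries to fix typos" is a greedy string-similarity grouping. The sketch below is a simple stand-in for a real clustering algorithm, using only the standard library; the function name group_similar, the 0.8 threshold, and the city names are all hypothetical.

```python
from difflib import SequenceMatcher


def group_similar(entries, threshold=0.8):
    """Greedy grouping: add each entry to the first group whose
    representative (first member) is similar enough, else start a new group."""
    groups = []
    for entry in entries:
        for group in groups:
            ratio = SequenceMatcher(None, entry.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(entry)
                break
        else:
            groups.append([entry])
    return groups


# Hypothetical messy column with spelling variants.
cities = ["Amsterdam", "Amsterdm", "amsterdam", "Rotterdam", "Roterdam", "Utrecht"]
groups = group_similar(cities)
for group in groups:
    print(group)
```

Each group can then be mapped to one canonical spelling, for example its most frequent member.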
ML in Exploration & Modeling
Detects patterns
Reduces dimensionality (PCA)
Enables automated EDA
Feature selection
Trains predictive models
Compares model accuracy
Presentation & Automation
Builds dashboards
Auto-generates reports
Enables API deployment
Supports decision-making
Repeats tasks at scale
Delivers real-time results
Python ML Ecosystem
Libraries cover full ML lifecycle
Pandas, NumPy: In-memory data
Scikit-learn: Core ML toolkit
TensorFlow: Deep learning
PyCUDA/Numba: GPU acceleration
PySpark: Big data ML
In-Memory Data Libraries
Pandas: Data manipulation
NumPy: Numerical arrays
Matplotlib: Visuals
SciPy: Scientific computing
StatsModels: Regression & stats
SymPy: Symbolic math
Tools for Optimizing Operations
Numba: JIT compilation
PyCUDA: GPU acceleration
Cython: Fast Python/C hybrid
PP: Parallel Python
Blaze: Out-of-core operations
PySpark: Big data interface
Python Tools Used in Machine Learning
Scikit-learn Overview
User-friendly ML library
Built on NumPy, SciPy, matplotlib
Supports classification, regression
Feature selection & model eval
API consistency
Ideal for beginners
Scikit-learn Use Cases
SVM, Decision Trees, kNN
Linear/Ridge regression
K-means, DBSCAN clustering
Text classification
Sales prediction
Customer segmentation
TensorFlow & PyTorch
TensorFlow: Google’s deep learning library
PyTorch: Dynamic-graph deep learning framework from Meta (formerly Facebook)
GPU acceleration
TensorBoard support
Used in NLP, vision
Suitable for research & prod
Keras, StatsModels, NLTK
Keras: High-level neural network API, bundled with TensorFlow
StatsModels: Econometrics
NLTK: Text mining toolkit
Easy neural network prototyping
Rich statistical summaries
Supports POS tagging
LightGBM & XGBoost
Gradient boosting tools
Optimized for structured data
Fast and accurate
Used in Kaggle competitions
Great for large datasets
Support parallel training
Supporting Tools: Pandas & NumPy
Pandas: DataFrames, manipulation
NumPy: Array ops, linear algebra
Foundation for ML libraries
Fast, efficient structures
Crucial in data prep
Enable feature engineering
Feature Engineering
Identifies predictors
Extracts and transforms features
Creates interaction variables
Uses modeling for new features
Avoids availability bias
Enhances predictive power
Model Training
Learns patterns from data
Optimizes model parameters
Requires labeled data (for supervised learning)
Uses Python libraries
Evaluated with metrics
Trained to generalize
Model Validation & Selection
Measures prediction accuracy
Classification error rate, mean squared error (MSE)
Train/test splits
Cross-validation (K-Fold)
Regularization (L1/L2)
Prevents overfitting
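The validation ideas on this slide can be combined in a few lines. The sketch below, which assumes synthetic regression data and an arbitrary grid of L2 penalty strengths, scores a Ridge model with 5-fold cross-validation and reports the mean squared error for each setting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# alpha is the strength of the L2 penalty; larger values shrink coefficients more.
mse_by_alpha = {}
for alpha in (0.1, 1.0, 10.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    mse_by_alpha[alpha] = -scores.mean()  # scikit-learn returns negated MSE

for alpha, mse in mse_by_alpha.items():
    print(f"alpha={alpha}: cross-validated MSE={mse:.4f}")
```

Selecting the alpha with the lowest cross-validated error is a common, though not the only, model selection rule.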
Predicting New Observations
Applies trained model to new data
Requires similar preparation
Produces scores or labels
Supports automation
Enables real-time inference
Core to production ML
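"Requires similar preparation" is the point that most often goes wrong in practice. One hedged way to handle it in scikit-learn is to bundle preparation and model in a pipeline, so new observations pass through the same steps automatically. The iris data and the choice of scaler and classifier below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Treat the held-out rows as "new observations" arriving after training.
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline stores the scaling learned on the training data
# and reapplies it unchanged at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

labels = model.predict(X_new)        # hard class labels
scores = model.predict_proba(X_new)  # per-class probabilities (scores)

print(labels[:5])
print(scores[:5].round(3))
```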
Types of ML: Supervised
Labeled training data
Regression: Continuous outputs
Classification: Categorical outputs
Trains on input-output pairs
Measures performance
Common in real-world ML
Supervised Algorithms
Linear & Logistic Regression
Decision Trees
Random Forests
SVM
kNN
Neural Networks
Types of ML: Unsupervised
No labeled data
Finds hidden structures
Clustering (K-means, DBSCAN)
Dimensionality Reduction (PCA)
Reveals groupings
Used for exploration
Semi-Supervised Learning
Combines labeled/unlabeled data
Uses label propagation
Reduces labeling costs
Active learning helps choose which examples to label
Used in NLP, vision
Improves model accuracy
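Label propagation can be tried with scikit-learn's LabelPropagation, which treats the label -1 as "unlabeled". The sketch below hides roughly 70% of the iris labels (an arbitrary choice for illustration) and lets the model infer them from the remaining labeled points.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Hide about 70% of the labels; -1 is scikit-learn's marker for "unlabeled".
rng = np.random.default_rng(0)
hidden = rng.random(len(y)) < 0.7
y_partial = y.copy()
y_partial[hidden] = -1

model = LabelPropagation()
model.fit(X, y_partial)

# transduction_ holds the label assigned to every training point,
# including the ones that were hidden.
inferred = model.transduction_[hidden]
accuracy = (inferred == y[hidden]).mean()
print(f"share of hidden labels recovered: {accuracy:.2f}")
```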
Case Study: Digit Recognition
Uses MNIST dataset
Naïve Bayes classifier
Data flattened from 2D → 1D
Train/test split
Uses confusion matrix
Iterative learning improves accuracy
Digit Classifier – Steps
Load data using scikit-learn
Display digits visually
Flatten and label data
Train Naïve Bayes model
Predict and evaluate
Visualize results
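The steps above can be sketched as follows. This is not necessarily the exact code behind the case study: it assumes the small 8x8 digits dataset bundled with scikit-learn as a stand-in for the digit images, uses a Gaussian Naïve Bayes variant, and leaves out the plotting steps so it runs without a display.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the bundled 8x8 digit images (assumed stand-in dataset).
digits = load_digits()

# Flatten each 2D image (8x8) into a 1D vector of 64 pixel values.
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the Naïve Bayes model and predict the held-out digits.
clf = GaussianNB().fit(X_train, y_train)
predicted = clf.predict(X_test)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, predicted)
print(cm)
```

Large off-diagonal entries in the matrix show which pairs of digits the classifier confuses most often.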
PCA & Wine Quality
Dataset: Red wine attributes
Apply PCA to reduce features
Capture latent variables
Explain variance using components
Fewer features, better model
5 components capture 77% of the variance
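A hedged sketch of the PCA workflow follows. The red wine quality file is not bundled with scikit-learn, so this example substitutes the different, built-in load_wine dataset purely to show the mechanics; its variance figures will not match the 77% quoted on the slide.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): scikit-learn's bundled wine data,
# not the red wine quality data used in the case study.
X, _ = load_wine(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} components: {share:.1%} of variance")
```

Plotting explained_variance_ratio_ against the component number gives the usual scree plot.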
Latent Structure Analysis
PCA finds hidden patterns
Variables: acidity, sulphates, etc.
Reduced complexity
Interpret latent dimensions
Improves model accuracy
Visualized via scree plots
Large Data Challenges
Memory overload
Slow I/O and CPU delays
Processing bottlenecks
Unscalable algorithms
Inefficient storage
Requires new strategies
Techniques for Large Data
Online learning
Block algorithms
Streaming models
Sparse data formats
Parallelization
Use of GPUs and clusters
Online Algorithms
One observation at a time
Memory-efficient
Good for streaming
Avoids storing full dataset
Learns incrementally
Example: Perceptron
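To illustrate learning one observation at a time, here is a minimal perceptron written from scratch with NumPy. The streaming data generator, its margin parameter, and all function names are hypothetical; the point is that each observation updates the weights and is then discarded.

```python
import numpy as np


def make_stream(n_points, seed=0, margin=0.2):
    """Yield (x, target) pairs one at a time; target is +1 or -1.
    Points too close to the true boundary x0 + x1 = 0 are skipped."""
    rng = np.random.default_rng(seed)
    produced = 0
    while produced < n_points:
        x = rng.uniform(-1.0, 1.0, size=2)
        if abs(x[0] + x[1]) < margin:
            continue
        produced += 1
        yield x, (1 if x[0] + x[1] > 0 else -1)


def predict(weights, bias, x):
    return 1 if weights @ x + bias > 0 else -1


def perceptron_stream(stream, n_features, learning_rate=1.0):
    """Update the weights only when the current observation is misclassified.
    Nothing from the stream is stored after it has been processed."""
    weights = np.zeros(n_features)
    bias = 0.0
    for x, target in stream:
        if predict(weights, bias, x) != target:
            weights += learning_rate * target * x
            bias += learning_rate * target
    return weights, bias


weights, bias = perceptron_stream(make_stream(2000), n_features=2)
print(weights, bias)
```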
Mini-batch vs Online
Full batch: All data
Mini-batch: Small batches
Online: One at a time
Streaming: Each observation seen once, never revisited
Ideal for Twitter, logs
Adaptive to changing data
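The regimes above differ mainly in how much data the learner sees per step. The sketch below feeds the same synthetic data to scikit-learn's SGDClassifier through partial_fit with three batch sizes. Note one caveat: inside each partial_fit call scikit-learn still updates the weights sample by sample, so the batch size here controls how much data must be in memory at once rather than the gradient averaging. The helper iter_batches and the data are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(X, y, batch_size):
    """Yield consecutive slices of the data, batch_size rows at a time."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


# Synthetic binary classification data (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
classes = np.array([0, 1])  # partial_fit needs the full label set up front

results = {}
for name, batch_size in [("online", 1), ("mini-batch", 50), ("full batch", len(X))]:
    clf = SGDClassifier(random_state=0)
    n_calls = 0
    for X_part, y_part in iter_batches(X, y, batch_size):
        clf.partial_fit(X_part, y_part, classes=classes)
        n_calls += 1
    results[name] = (n_calls, clf.score(X, y))

for name, (n_calls, accuracy) in results.items():
    print(f"{name}: {n_calls} partial_fit calls, training accuracy {accuracy:.3f}")
```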
Block & MapReduce
Block: Process in chunks
MapReduce: Split + aggregate
Parallel processing
Use Dask, bcolz for blocks
Hadoop/Disco for MapReduce
Good for logs, images
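The split-then-aggregate pattern can be shown on a single machine with the standard library. This sketch only mimics the MapReduce idea; a real Hadoop job distributes the map and reduce steps across nodes. The log lines and function names are made up.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count tokens within one chunk, independently of other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


# Hypothetical log lines, processed in blocks of two.
log_lines = [
    "GET /index.html",
    "GET /about.html",
    "POST /login",
    "GET /index.html",
]
chunks = [log_lines[i:i + 2] for i in range(0, len(log_lines), 2)]

partials = map(map_chunk, chunks)  # could run in parallel, one worker per chunk
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Because each chunk is mapped independently, the map step is the part that parallelizes.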
Efficient Data Structures
Sparse: Saves memory on 0s
Trees: Hierarchical search
Hash tables: Fast retrieval
Fit for NLP, search, clustering
Used in databases
Core for large data
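The memory saving from sparse storage is easy to measure. The sketch below, using SciPy's CSR format on a made-up matrix that is almost entirely zeros, compares the bytes used by the dense and sparse representations.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix with only three non-zero entries (illustrative).
dense = np.zeros((1000, 1000))
dense[0, 1] = 1.0
dense[500, 2] = 3.0
dense[999, 999] = 7.0

# CSR stores only the non-zero values plus their column indices and row pointers.
compact = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = compact.data.nbytes + compact.indices.nbytes + compact.indptr.nbytes

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes for {compact.nnz} stored values")
```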
Specialized Tools
Cython: Speeds up Python
Numexpr: Fast math expressions
Theano: Deep learning with GPU
Dask: Parallel computing
Blaze: SQL for Python
bcolz: Compressed arrays
Programming Tips
Use existing libraries
Optimize for hardware
Profile before optimizing
Compile or use JIT
Avoid loading all data
Use generators, subsets
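As one illustration of "avoid loading all data", the sketch below computes a mean by iterating over a file-like object line by line, keeping only a running total in memory. The in-memory StringIO stands in for a large file on disk, and the function name is hypothetical.

```python
import io


def running_mean(lines):
    """Consume an iterable of numeric strings one at a time.
    Only the running total and count are kept in memory."""
    total, count = 0.0, 0
    for line in lines:
        total += float(line)
        count += 1
    return total / count if count else float("nan")


# Stand-in for open("big_file.txt"): file objects yield one line per iteration.
fake_file = io.StringIO("1.0\n2.0\n3.0\n4.0\n")
mean_value = running_mean(fake_file)
print(mean_value)
```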
Case Study: Malicious URLs
Detects unsafe websites
Huge sparse dataset (3M+ features)
Uses SGDClassifier
Applies online learning
Evaluated with precision/recall
Shows real-world ML scaling
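The overall shape of this case study can be sketched on synthetic data. Everything below is an assumption for illustration: the data is randomly generated rather than the real URL dataset, the feature count is cut to 10,000 so the example runs quickly, and make_batch is a hypothetical helper. What carries over is the pattern of sparse input, incremental training with SGDClassifier.partial_fit, and evaluation with precision and recall.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score

N_FEATURES = 10_000  # assumption: far smaller than the real 3M+ features


def make_batch(n_rows, seed):
    """Hypothetical generator of one sparse batch.
    Rows labeled 1 get an extra non-zero in one of the first five columns."""
    rng = np.random.default_rng(seed)
    noise = sparse.random(n_rows, N_FEATURES, density=0.001,
                          format="csr", random_state=seed)
    y = rng.integers(0, 2, size=n_rows)
    bad_rows = np.flatnonzero(y == 1)
    signal_cols = rng.integers(0, 5, size=bad_rows.size)
    signal = sparse.csr_matrix(
        (np.ones(bad_rows.size), (bad_rows, signal_cols)),
        shape=(n_rows, N_FEATURES),
    )
    return (noise + signal).tocsr(), y


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Online learning: each batch is used once and then dropped.
for seed in range(20):
    X_batch, y_batch = make_batch(500, seed)
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a batch the model has never seen.
X_test, y_test = make_batch(500, seed=999)
predicted = clf.predict(X_test)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```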
Case Study: Recommender System
Movie recommendation via MySQL
LSH + Hamming Distance
Groups similar users
Uses compressed bit strings
Fast lookup with indexing
Built inside a database
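The core ideas of this case study (bit strings, a locality-sensitive hash, and Hamming distance) can be sketched in plain Python, independent of the MySQL implementation. The users, the movie indices, and the sampled bit positions below are hypothetical.

```python
from collections import defaultdict


def to_bits(watched):
    """Pack a set of watched movie indices into one integer used as a bit string."""
    value = 0
    for movie in watched:
        value |= 1 << movie
    return value


def hamming(a, b):
    """Number of bit positions in which two bit strings differ."""
    return bin(a ^ b).count("1")


def lsh_key(bits, positions):
    """Locality-sensitive hash: sample a fixed subset of bit positions.
    Users who agree on those positions land in the same bucket."""
    return tuple((bits >> p) & 1 for p in positions)


# Hypothetical viewing histories over 8 movies.
users = {
    "ann": to_bits({0, 1, 2, 5}),
    "bob": to_bits({0, 1, 2, 6}),
    "cat": to_bits({3, 4, 7}),
}

# Bucket users by their hash key (the "fast lookup with indexing" step).
positions = (0, 1, 3)
buckets = defaultdict(list)
for name, bits in users.items():
    buckets[lsh_key(bits, positions)].append(name)

# Within a bucket, rank candidates by Hamming distance.
distance_ann_bob = hamming(users["ann"], users["bob"])
distance_ann_cat = hamming(users["ann"], users["cat"])
print(dict(buckets))
print(distance_ann_bob, distance_ann_cat)
```

Hashing first means the exact distance only has to be computed against the few users in the same bucket, not against everyone.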