Machine Learning Applications and Techniques

Data Science notes

Uploaded by

kishore kumar

What is Machine Learning?

 Enables computers to learn without being explicitly programmed
 Improves with experience (data)
 Uses algorithms to recognize patterns
 Learns like humans with feedback
 Key to AI and Data Science
 Powers predictive systems
Applications in Data Science

 Regression: Predicts continuous values


 Classification: Categorizes data
 NLP: Named-entity recognition (finds names in text)
 Image/Voice recognition
 Customer segmentation
 Predictive maintenance
Root Cause Analysis in ML

 Focuses on interpretation
 Identifies causes, not just predictions
 Business process optimization
 Healthcare insights (e.g., diabetes)
 Traffic jam analysis
 Supports strategic decisions
Clustering in Machine Learning

 Groups similar data (unsupervised)


 Used in market segmentation
 No need for labeled data
 Reveals natural patterns
 Aids data cleaning
 Common in exploratory analysis
ML in the Data Science Process

 Supports all phases


 Guides problem framing
 Automates data prep
 Detects data patterns
 Powers model training
 Enables real-time insights
ML in Setting Goals & Data Retrieval

 Leverages past models


 Refines problem framing
 NLP for unstructured data
 Extracts info from PDFs
 Identifies key data sources
 Automates extraction
ML in Data Preparation

 Cleans and structures data


 Clustering fixes typos
 Groups similar entries
 Enhances quality
 Assists transformation
 Streamlines preprocessing
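
The idea that clustering can fix typos by grouping similar entries can be sketched roughly as below. This is only an illustration, not the method the slides have in mind: the helper names (similarity, group_similar), the 0.8 threshold, and the city list are all assumptions, and the grouping is a simple greedy pass based on the standard library's difflib rather than a full clustering algorithm.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(entries, threshold=0.8):
    """Greedy grouping (hypothetical helper): each entry joins the first group
    whose representative (its first member) is at least `threshold` similar."""
    groups = []
    for entry in entries:
        for group in groups:
            if similarity(entry, group[0]) >= threshold:
                group.append(entry)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([entry])
    return groups


# Made-up example data: the same cities typed in slightly different ways.
cities = ["Amsterdam", "Amstrdam", "amsterdam", "Rotterdam", "Roterdam", "Utrecht"]
groups = group_similar(cities)
for group in groups:
    print(group[0], "<-", group[1:])
```

Each group's first member can then serve as the canonical spelling for the others. The threshold would need tuning on real data.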
ML in Exploration & Modeling

 Detects patterns
 Reduces dimensionality (PCA)
 Enables automated EDA
 Feature selection
 Trains predictive models
 Compares model accuracy
Presentation & Automation

 Builds dashboards
 Auto-generates reports
 Enables API deployment
 Supports decision-making
 Repeats tasks at scale
 Delivers real-time results
Python ML Ecosystem

 Libraries cover full ML lifecycle


 Pandas, NumPy: In-memory data
 Scikit-learn: Core ML toolkit
 TensorFlow: Deep learning
 PyCUDA/Numba: GPU acceleration
 PySpark: Big data ML
Data in Memory Libraries

 Pandas: Data manipulation


 NumPy: Numerical arrays
 Matplotlib: Visuals
 SciPy: Scientific computing
 StatsModels: Regression & stats
 SymPy: Symbolic math
Optimizing Operations Tools

 Numba: JIT compilation


 PyCUDA: GPU acceleration
 Cython: Fast Python/C hybrid
 PP: Parallel Python
 Blaze: Out-of-core operations
 PySpark: Big data interface
Python Tools Used in Machine Learning
Scikit-learn Overview

 User-friendly ML library
 Built on NumPy, SciPy, matplotlib
 Supports classification, regression
 Feature selection & model eval
 API consistency
 Ideal for beginners
Scikit-learn Use Cases

 SVM, Decision Trees, kNN


 Linear/Ridge regression
 K-means, DBSCAN clustering
 Text classification
 Sales prediction
 Customer segmentation
TensorFlow & PyTorch

 TensorFlow: Google’s deep learning lib


 PyTorch: Facebook’s dynamic DL tool
 GPU acceleration
 TensorBoard support
 Used in NLP, vision
 Suitable for research & prod
Keras, StatsModels, NLTK

 Keras: Wrapper for TensorFlow


 StatsModels: Econometrics
 NLTK: Text mining toolkit
 Easy neural network prototyping
 Rich statistical summaries
 Supports POS tagging
LightGBM & XGBoost

 Gradient boosting tools


 Optimized for structured data
 Fast and accurate
 Used in Kaggle competitions
 Great for large datasets
 Support parallel training
Supporting Tools: Pandas & NumPy

 Pandas: DataFrames, manipulation


 NumPy: Array ops, linear algebra
 Foundation for ML libraries
 Fast, efficient structures
 Crucial in data prep
 Enable feature engineering
Feature Engineering

 Identifies predictors
 Extracts and transforms features
 Creates interaction variables
 Uses modeling for new features
 Avoids availability bias
 Enhances predictive power
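
As a small illustration of an interaction variable, the sketch below multiplies two existing features to create a third. The helper add_interaction and the house measurements are hypothetical, chosen only to show the mechanics.

```python
def add_interaction(rows, a, b):
    """Return copies of the rows with an extra `<a>_x_<b>` product feature.

    Hypothetical helper for illustration; the input rows are left unchanged.
    """
    name = f"{a}_x_{b}"
    return [{**row, name: row[a] * row[b]} for row in rows]


# Made-up data: width and length alone may predict price poorly,
# while their product (floor area) may predict it well.
houses = [
    {"width_m": 5.0, "length_m": 10.0},
    {"width_m": 8.0, "length_m": 12.5},
]
engineered = add_interaction(houses, "width_m", "length_m")
print(engineered[0])
```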
Model Training

 Learns patterns from data


 Optimizes model parameters
 Requires labeled data (for supervised learning)
 Uses Python libraries
 Evaluated with metrics
 Trained to generalize
Model Validation & Selection

 Measures prediction accuracy


 Classification error rate, MSE
 Train/test splits
 Cross-validation (K-fold)
 Regularization (L1/L2)
 Prevents overfitting
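
K-fold cross-validation and MSE can be sketched in plain Python as follows. The fold splitter, the tiny target list, and the "predict the training mean" baseline are assumptions made for illustration; in practice scikit-learn's KFold or cross_val_score would usually be used, and the data would be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up targets
fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    # Baseline "model": always predict the mean of the training targets.
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    held_out = [y[i] for i in test_idx]
    fold_errors.append(mse(held_out, [train_mean] * len(held_out)))

cv_mse = sum(fold_errors) / len(fold_errors)
print(fold_errors, cv_mse)
```

Every observation is held out exactly once, so the averaged error estimates how the model behaves on data it has not seen.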
Predicting New Observations

 Applies trained model to new data


 Requires similar preparation
 Produces scores or labels
 Supports automation
 Enables real-time inference
 Core to production ML
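
The point that new observations require similar preparation can be illustrated with a scaler whose parameters are learned once on the training data and then reused unchanged. The Standardizer class and the numbers below are hypothetical; scikit-learn's StandardScaler follows the same fit-then-transform pattern.

```python
from statistics import mean, pstdev


class Standardizer:
    """Minimal fit/transform scaler (illustrative, not a library class)."""

    def fit(self, values):
        # Learn the parameters from the training data only.
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against zero spread
        return self

    def transform(self, values):
        # Reuse the stored training parameters for any later data.
        return [(v - self.mean_) / self.std_ for v in values]


train = [10.0, 20.0, 30.0]          # made-up training feature
scaler = Standardizer().fit(train)

new_observations = [20.0, 40.0]     # arrives later, e.g. in production
new_scaled = scaler.transform(new_observations)
print(new_scaled)
```

Refitting the scaler on the new data instead would feed the model inputs on a different scale from the one it was trained on.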
Types of ML: Supervised

 Labeled training data


 Regression: Continuous outputs
 Classification: Categorical outputs
 Trains on input-output pairs
 Measures performance
 Common in real-world ML
Supervised Algorithms

 Linear & Logistic Regression


 Decision Trees
 Random Forests
 SVM
 KNN
 Neural Networks
Types of ML: Unsupervised

 No labeled data
 Finds hidden structures
 Clustering (K-means, DBSCAN)
 Dimensionality Reduction (PCA)
 Reveals groupings
 Used for exploration
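
A minimal sketch of clustering without labels, assuming scikit-learn and NumPy are installed. The two synthetic blobs are invented for the example; K-means is told only how many clusters to look for, never which point belongs where.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: two well-separated 2-D blobs (an assumption
# made so the expected grouping is obvious).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)  # cluster index for each row of X

print(model.cluster_centers_)
```

The cluster indices themselves are arbitrary (0 and 1 may swap between runs with different seeds); only the grouping is meaningful.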
Semi-Supervised Learning

 Combines labeled/unlabeled data


 Uses label propagation
 Reduces labeling costs
 Active learning helps
 Used in NLP, vision
 Improves model accuracy
Case Study: Digit Recognition

 Uses MNIST dataset


 Naïve Bayes classifier
 Data flattened from 2D → 1D
 Train/test split
 Uses confusion matrix
 Iterative learning improves accuracy
Digit Classifier – Steps

 Load data using scikit-learn


 Display digits visually
 Flatten and label data
 Train Naïve Bayes model
 Predict and evaluate
 Visualize results
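
The steps above can be sketched with scikit-learn roughly as follows. One assumption to note: the code uses scikit-learn's small bundled digits set (load_digits, 8x8 images) rather than the full MNIST data, and GaussianNB as the Naïve Bayes variant; the split ratio and random seed are arbitrary choices. The plotting steps are left out.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the bundled 8x8 digit images (a small stand-in for MNIST).
digits = load_digits()

# Flatten each 2D image (8x8) into a 1D feature vector (64 values).
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

# Hold out a quarter of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train the Naive Bayes model and predict the held-out digits.
model = GaussianNB().fit(X_train, y_train)
predicted = model.predict(X_test)

accuracy = accuracy_score(y_test, predicted)
cm = confusion_matrix(y_test, predicted)  # rows: true digit, columns: predicted
print(f"accuracy: {accuracy:.3f}")
print(cm)
```

Off-diagonal entries of the confusion matrix show which digits the model mixes up.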
PCA & Wine Quality

 Dataset: Red wine attributes


 Apply PCA to reduce features
 Capture latent variables
 Explain variance using components
 Fewer features, simpler model
 First 5 components retain about 77% of the information (variance)
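
A hedged sketch of how explained variance is read off a PCA fit. The red wine quality data is not bundled with scikit-learn, so this example substitutes the library's built-in load_wine dataset (a different wine dataset with 13 attributes) purely as a stand-in; the figures it prints therefore will not match the 77% quoted above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): NOT the red wine quality data from the slides.
X = load_wine().data

# PCA is scale-sensitive, so standardize each attribute first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)
X_reduced = pca.transform(X_scaled)  # 13 attributes -> 5 components

# Cumulative share of the total variance captured by the first 1..5 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, share in enumerate(cumulative, start=1):
    print(f"{i} component(s): {share:.1%} of variance")
```

Plotting explained_variance_ratio_ against the component number gives the scree plot used to decide how many components to keep.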
Latent Structure Analysis

 PCA finds hidden patterns


 Variables: acidity, sulphates, etc.
 Reduced complexity
 Interpret latent dimensions
 Improves model accuracy
 Visualized via scree plots
Large Data Challenges

 Memory overload
 Slow I/O and CPU delays
 Processing bottlenecks
 Unscalable algorithms
 Inefficient storage
 Requires new strategies
Techniques for Large Data

 Online learning
 Block algorithms
 Streaming models
 Sparse data formats
 Parallelization
 Use of GPUs and clusters
Online Algorithms

 One observation at a time


 Memory-efficient
 Good for streaming
 Avoids storing full dataset
 Learns incrementally
 Example: Perceptron
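
A minimal perceptron sketch showing what learning one observation at a time looks like. The function names, the learning rate, and the toy logical-AND stream are assumptions for illustration; the outer loop merely simulates a longer stream by replaying four observations.

```python
def perceptron_update(weights, bias, x, label, lr=1.0):
    """Process ONE observation (label is +1 or -1) and return updated parameters."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    predicted = 1 if activation > 0 else -1
    if predicted != label:
        # Only mistakes change the model: nudge the boundary toward the label.
        weights = [w + lr * label * xi for w, xi in zip(weights, x)]
        bias += lr * label
    return weights, bias


def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy stream: logical AND, encoded with labels +1 / -1.
stream = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]

weights, bias = [0.0, 0.0], 0.0
for _ in range(20):               # replay to imitate a long stream
    for x, label in stream:       # the model only ever sees one observation
        weights, bias = perceptron_update(weights, bias, x, label)

print(weights, bias)
```

Only the weights and bias are kept in memory; each observation can be discarded as soon as it has been processed.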
Mini-batch vs Online

 Full batch: All data


 Mini-batch: Small batches
 Online: One at a time
 Streaming: Each observation seen once, never revisited
 Ideal for Twitter, logs
 Adaptive to changing data
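
The difference between the batch sizes can be sketched with a generator that cuts any stream into chunks. The function name mini_batches is hypothetical; setting batch_size to 1 gives the online case, and a batch size at least as large as the data gives full batch.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield lists of up to `batch_size` items without loading the whole stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


observations = range(7)  # stands in for tweets, log lines, and so on

batches = list(mini_batches(observations, 3))    # mini-batch
online = list(mini_batches(observations, 1))     # one at a time
full = list(mini_batches(observations, 100))     # everything at once
print(batches)
```

Because the generator pulls items lazily, it works equally well on a file handle or network feed that never fits in memory.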
Block & MapReduce

 Block: Process in chunks


 MapReduce: Split + aggregate
 Parallel processing
 Use Dask, bcolz for blocks
 Hadoop/Disco for MapReduce
 Good for logs, images
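
The split-then-aggregate idea behind MapReduce can be imitated on a single machine as below. This is only a conceptual sketch with made-up log lines and hypothetical function names; it does not use Hadoop or Disco, where the map calls would run in parallel on different nodes.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words inside one block of lines, independently of others."""
    return Counter(word for line in lines for word in line.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


log_lines = [
    "error disk full",
    "warning disk slow",
    "error network down",
    "error disk full",
]

# Block processing: split the data into chunks of two lines each.
chunks = [log_lines[i:i + 2] for i in range(0, len(log_lines), 2)]

partials = [map_chunk(chunk) for chunk in chunks]    # could run in parallel
totals = reduce(reduce_counts, partials, Counter())  # aggregate the results
print(totals.most_common(3))
```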
Efficient Data Structures

 Sparse: Stores only nonzero values, saving memory


 Trees: Hierarchical search
 Hash tables: Fast retrieval
 Fit for NLP, search, clustering
 Used in databases
 Core for large data
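
A small sketch of why sparse formats save memory: only the nonzero entries are stored, here in a hash table (a Python dict) keyed by position. The helper names are hypothetical; for real workloads scipy.sparse provides optimized equivalents.

```python
def to_sparse(dense):
    """Keep only nonzero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(a, b):
    """Dot product that touches only positions stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)


u = to_sparse([0, 0, 3, 0, 0, 0, 2, 0])
v = to_sparse([1, 0, 4, 0, 0, 0, 0, 5])

print(u)                 # two stored entries instead of eight
print(sparse_dot(u, v))  # only index 2 is nonzero in both vectors
```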
Specialized Tools

 Cython: Speeds up Python


 Numexpr: Fast math expressions
 Theano: Deep learning with GPU
 Dask: Parallel computing
 Blaze: SQL for Python
 Bcolz: Compressed arrays
Programming Tips

 Use existing libraries


 Optimize for hardware
 Profile before optimizing
 Compile or use JIT
 Avoid loading all data
 Use generators, subsets
Case Study: Malicious URLs

 Detects unsafe websites


 Huge sparse dataset (3M+ features)
 Uses SGDClassifier
 Applies online learning
 Evaluated with precision/recall
 Shows real-world ML scaling
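
A rough sketch of online learning with SGDClassifier, assuming scikit-learn, SciPy, and NumPy are installed. The real malicious URL data is not used here: the sparse matrix below is synthetic and far smaller (50 features instead of millions), and the chunk size of 200 is an arbitrary choice. What carries over is the pattern of feeding the model one chunk at a time through partial_fit.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in data (assumption): mostly-zero binary features with
# labels generated by a hidden linear rule.
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 50
dense = (rng.random((n_samples, n_features)) < 0.1).astype(float)
true_weights = rng.normal(size=n_features)
y = (dense @ true_weights > 0).astype(int)
X = sparse.csr_matrix(dense)  # sparse storage, as with the URL features

X_train, y_train = X[:1600], y[:1600]
X_test, y_test = X[1600:], y[1600:]

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit must be told all classes up front

# Online learning: the model sees the training data one chunk at a time.
for start in range(0, X_train.shape[0], 200):
    stop = start + 200
    model.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

predicted = model.predict(X_test)
accuracy = model.score(X_test, y_test)
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With real data each chunk would be read from disk inside the loop, so the full dataset never has to sit in memory.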
Case Study: Recommender System

 Movie recommendation via MySQL


 LSH + Hamming Distance
 Groups similar users
 Uses compressed bit strings
 Fast lookup with indexing
 Built inside a database
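
The Hamming distance part of this approach can be sketched in plain Python. Several assumptions apply: each user's profile is taken to be a bit string in which every bit marks whether a given movie was liked, the user names and bit patterns are invented, and the code runs in memory rather than inside MySQL and omits the locality-sensitive hashing step that would narrow the candidates first.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two bit strings differ."""
    return bin(a ^ b).count("1")


def nearest_users(target, profiles, k=2):
    """Return the k users whose bit strings are closest to the target's."""
    others = [
        (hamming_distance(profiles[target], bits), user)
        for user, bits in profiles.items()
        if user != target
    ]
    return [user for _, user in sorted(others)[:k]]


# Hypothetical profiles: one bit per movie, 1 = liked.
profiles = {
    "ann": 0b10110010,
    "bob": 0b10110011,
    "cat": 0b01001101,
    "dan": 0b10100010,
}

print(hamming_distance(profiles["ann"], profiles["cat"]))
print(nearest_users("ann", profiles))
```

Movies liked by the nearest users but not yet seen by the target are natural candidates to recommend.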
