Machine Learning in Python
Rohith Mohan
GradQuant
Spring 2018
What is Machine Learning?
• Getting computers to program themselves
• Coding is the bottleneck, let data dictate programming
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
Formal Definitions
• Arthur Samuel (1959)
• “Machine Learning: Field of study that gives computers the ability to learn
without being explicitly programmed.”
• While at IBM, created a checkers program that learned by playing tens of thousands of games against itself
• Tom Mitchell (1998)
• “Well-posed Learning Problem: A computer program is said to learn from
experience E with respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with experience E.”
Source: Andrew Ng, Machine Learning (Coursera)
Machine Learning
• Developed out of initial work in Artificial Intelligence (AI)
• Increased availability of large datasets and advances in computing architecture have boosted its use in recent years
Usage
• Natural Language Processing + Computer Vision
• Mining and clustering gene expression data to identify individuals
• Reproducing human behavior (True AI)
• Recommendation algorithms
Common steps in ML workflow
• Collect data (various sources, UCI data repository, news orgs, Kaggle)
• Prepare data (exploratory analysis, feature selection, regularization)
• Select and train model (train and test datasets, what model?)
• Evaluate model (accuracy, precision, ROC curves, F1 score)
• Optimize performance (change model, # of features, scaling)
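The steps above can be sketched end to end with scikit-learn. This is a minimal illustration rather than a recommended recipe: the bundled breast cancer dataset, the 30% test split, and the choice of logistic regression are all assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (a dataset bundled with scikit-learn stands in for a real source)
X, y = load_breast_cancer(return_X_y=True)

# 2. Prepare data: hold out a test set before any fitting happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Select and train a model (scaler and classifier chained in one pipeline,
#    so the scaler is fit on the training data only)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluate the model on data it has not seen
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(f"accuracy={test_accuracy:.3f}  F1={test_f1:.3f}")
```

Step 5 (optimizing performance) would loop back from here: swap the model, change the features or scaling, and re-evaluate.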
scikit-learn
Preprocessing
• Clean data and deal with missing values, etc.
• Feature scaling - rescaling features so they are on comparable scales
• Normalization - rescaling features into a similar range (e.g. 0 to 1 or -1 to 1)
• Square footage of a house (100s of sq ft) vs # of rooms (1-5)
• Standardization - centering and scaling each feature (subtract mean & divide by SD)
• Many others (regularization, imputation, generating polynomial features, etc.)
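A small sketch of these preprocessing steps, assuming a made-up array of houses (square footage, number of rooms) with one missing value. It fills the gap with the column mean, then shows range rescaling with `MinMaxScaler` and mean/SD scaling with `StandardScaler`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical houses: [square footage, number of rooms]; one value is missing
houses = np.array([
    [800.0, 2.0],
    [1200.0, 3.0],
    [np.nan, 4.0],
    [2000.0, 5.0],
])

# Fill the missing square footage with the mean of that column
imputed = SimpleImputer(strategy="mean").fit_transform(houses)

# Rescale each column to the range [0, 1]
range_scaled = MinMaxScaler().fit_transform(imputed)

# Subtract each column's mean and divide by its standard deviation
mean_sd_scaled = StandardScaler().fit_transform(imputed)

print("imputed square footage:", imputed[2, 0])
print("range-scaled min/max:", range_scaled.min(axis=0), range_scaled.max(axis=0))
print("mean/SD-scaled mean:", mean_sd_scaled.mean(axis=0).round(6))
```

After either transform, square footage and room count sit on comparable scales, so neither dominates a distance or gradient computation purely because of its units.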
Importance of feature scaling
Comparison of scaling
• StandardScaler
• RobustScaler
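One way to see the difference between the two scalers is to feed both a feature with a single extreme outlier. The data below is invented for illustration. `StandardScaler` uses the mean and SD, which the outlier distorts, while `RobustScaler` uses the median and interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature with one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

standard = StandardScaler().fit_transform(values)  # mean and SD
robust = RobustScaler().fit_transform(values)      # median and IQR

# How spread out are the five typical points after each transform?
standard_spread = standard[:5].max() - standard[:5].min()
robust_spread = robust[:5].max() - robust[:5].min()
print(f"StandardScaler spread of typical points: {standard_spread:.4f}")
print(f"RobustScaler spread of typical points:   {robust_spread:.4f}")
```

With `StandardScaler` the typical points are squeezed into a very narrow band because the outlier inflates the SD. `RobustScaler` keeps them well separated.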
Train Test (Cross Validate?)
• Why do we need to split up our datasets?
• Overfitting: performance on the training data overstates how well the model generalizes
• Split dataset
• Train – data the model is fit on
• Test – held-out data used to evaluate the model
• Holding out 20-40% of the data for testing is common
• Validation set?
• Cross-validation
• Split the training set into k folds, train on k-1 folds and evaluate on the remaining one, rotating through all folds (more computationally expensive but conserves data)
• Hyper-parameter tuning
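The split, cross-validation, and tuning ideas above might look like this in scikit-learn. The iris dataset, the k-nearest-neighbors classifier, and the candidate values for `n_neighbors` are assumptions chosen only to keep the sketch short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that is never used for fitting or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# 5-fold cross-validation on the training set only
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5
)
print("CV accuracy per fold:", cv_scores.round(3))

# Hyper-parameter tuning: cross-validate each candidate value of n_neighbors
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5
)
search.fit(X_train, y_train)
print("best n_neighbors:", search.best_params_["n_neighbors"])

# Final check on the untouched test set
test_score = search.score(X_test, y_test)
print(f"test accuracy: {test_score:.3f}")
```

Here the cross-validation folds play the role of a validation set, so the test set is only touched once, at the very end.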
Bias-variance tradeoff
• Underfitting: high bias
• Overfitting: high variance
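A rough way to see the tradeoff is to fit polynomials of increasing degree to noisy samples of a known curve and compare training and test error. The target function, noise level, and degrees below are assumptions for the sketch, and the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)


def true_fun(x):
    """Curve the noisy samples are drawn from (chosen for illustration)."""
    return np.cos(1.5 * np.pi * x)


# Small noisy training set and a larger test set from the same curve
x_train = np.sort(rng.rand(30))
y_train = true_fun(x_train) + rng.randn(30) * 0.1
x_test = np.sort(rng.rand(200))
y_test = true_fun(x_test) + rng.randn(200) * 0.1

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train[:, None], y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train[:, None]))
    test_mse = mean_squared_error(y_test, model.predict(x_test[:, None]))
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

A straight line (degree 1) is too rigid for this curve and errs on both sets, which is the high-bias case. A very high degree can chase the noise in the 30 training points, so its test error may diverge from its training error, which is the high-variance case.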
How to select a model?
Supervised vs Unsupervised Learning
• Supervised
• Regression, classification
• Input variables, output variable, learn mapping of input to output
• Unsupervised
• Clustering, association, etc.
• No labeled outputs (no correct answers and no teacher)
• Semi-supervised
• Only part of the dataset is labeled (e.g. a partially labeled set of images)
• Real-world problems often mix both techniques
Feature Selection
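One simple form of feature selection is univariate: score each feature against the target and keep the top few. The sketch below assumes scikit-learn's bundled diabetes dataset and an arbitrary choice of k=3. It is only one of several selection strategies.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

data = load_diabetes()
X, y = data.data, data.target

# Keep the 3 features with the strongest univariate linear relationship to y
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

selected_names = [
    name
    for name, keep in zip(data.feature_names, selector.get_support())
    if keep
]
print("kept features:", selected_names)
print("shape before:", X.shape, "after:", X_selected.shape)
```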
Regression
• Linear regression (OLS)
• Prediction
• Multiple variables/features?
• Feature selection
• Length, width of a house (combine into area?)
• Regularization
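The house example can be made concrete with synthetic data. Everything below is invented for illustration: prices are generated as roughly 100 per square foot plus noise, and ordinary least squares is fit once on raw length and width and once on the single engineered feature, area.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Hypothetical houses: length and width in feet; price depends on area
length = rng.uniform(20, 60, size=200)
width = rng.uniform(20, 60, size=200)
price = 100.0 * length * width + rng.normal(0, 5000, size=200)

# Model 1: ordinary least squares on the two raw features
X_raw = np.column_stack([length, width])
r2_raw = LinearRegression().fit(X_raw, price).score(X_raw, price)

# Model 2: a single engineered feature, area = length * width
X_area = (length * width).reshape(-1, 1)
area_model = LinearRegression().fit(X_area, price)
r2_area = area_model.score(X_area, price)

print(f"R^2 with length, width: {r2_raw:.3f}")
print(f"R^2 with area:          {r2_area:.3f}")
print(f"estimated price per sq ft: {area_model.coef_[0]:.1f}")
```

Because the price really depends on the product of the two inputs, one well-chosen feature fits better than two raw ones in a linear model.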
Regularization
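As a minimal sketch of what regularization does to a linear model, the example below fits plain least squares, ridge (L2 penalty), and lasso (L1 penalty) to synthetic data in which only 3 of 20 features matter. The data and the `alpha` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.RandomState(0)

# Hypothetical data: 50 samples, 20 features, only the first 3 matter
X = rng.randn(50, 20)
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.randn(50) * 0.5

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can set coefficients to zero

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print(f"coefficient norm, OLS:   {ols_norm:.3f}")
print(f"coefficient norm, ridge: {ridge_norm:.3f}")
print(f"lasso coefficients set exactly to zero: {lasso_zeros} of 20")
```

Larger `alpha` means a stronger penalty. In practice it is usually chosen by cross-validation rather than fixed by hand as it is here.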
Classification – Logistic Regression
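Logistic regression turns a linear score into a probability with the sigmoid function and then thresholds it to get a class. The sketch below, which assumes the bundled breast cancer dataset, recomputes the probabilities by hand from `decision_function` to show that relationship for the binary case.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Linear score w.x + b for each test sample
scores = clf.decision_function(X_test)

# Sigmoid of the score gives the estimated probability of class 1
manual_probabilities = 1.0 / (1.0 + np.exp(-scores))
probabilities = clf.predict_proba(X_test)[:, 1]

# predict() assigns class 1 where the probability exceeds 0.5
labels = clf.predict(X_test)
print("first five probabilities:", probabilities[:5].round(3))
print("first five labels:       ", labels[:5])
```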
Classification – SVM
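A short sketch of a support vector machine classifier, assuming the synthetic two-moons dataset so that the classes are not linearly separable. It fits one SVM with a linear kernel and one with an RBF kernel. The values of `C` and `gamma` are defaults rather than tuned choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles with some noise
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Linear kernel: a straight-line decision boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# RBF kernel: a curved decision boundary
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
print("support vectors per class (RBF):", rbf_svm.n_support_)
```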
Evaluating Performance
• Accuracy – how many predictions are correct out of the entire
dataset?
• Can be a flawed metric
• Precision and Recall
[Link]
Evaluating Performance
• Accuracy – fraction of all predictions that are correct
• Can be a flawed metric (e.g. when classes are imbalanced)
• Precision (TP / (TP + FP)) and Recall (TP / (TP + FN))
• ROC curves
• F1 score (harmonic mean of precision and recall)
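To see why accuracy alone can mislead, consider an invented imbalanced problem with 95 negatives and 5 positives. The labels, predictions, and scores below are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical imbalanced problem: 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority class
y_always_negative = np.zeros(100, dtype=int)
baseline_accuracy = accuracy_score(y_true, y_always_negative)  # 0.95
baseline_recall = recall_score(y_true, y_always_negative)      # 0.0

# Hypothetical predictions: 4 of 5 positives found, 2 false alarms
y_pred = y_true.copy()
y_pred[95] = 0  # one positive missed (false negative)
y_pred[:2] = 1  # two negatives flagged (false positives)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# ROC AUC needs scores rather than hard labels; these are made up
y_score = np.where(y_pred == 1, 0.9, 0.1)
auc = roc_auc_score(y_true, y_score)

print(f"baseline accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} AUC={auc:.3f}")
```

The always-negative baseline reaches 95% accuracy while finding none of the positives, which is why precision, recall, and F1 are reported alongside accuracy.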
Classification - K-Nearest Neighbors
• Robust to noisy training data
• More effective with larger datasets
• Need to determine parameter K (number of nearest neighbors)
• What type of distance metric?
• High computation cost at prediction time (distances to the training points must be computed)
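The two choices named above, K and the distance metric, map directly onto `n_neighbors` and `metric` in scikit-learn's `KNeighborsClassifier`. The sketch below tries a few combinations on the iris dataset. The candidate values are arbitrary, and features are standardized first because distances are sensitive to scale.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for k in (1, 5, 15):
    for metric in ("euclidean", "manhattan"):
        knn = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=k, metric=metric),
        )
        knn.fit(X_train, y_train)
        results[(k, metric)] = knn.score(X_test, y_test)

for (k, metric), accuracy in results.items():
    print(f"k={k:2d} metric={metric:9s} test accuracy={accuracy:.3f}")
```

Comparing settings on the test set as done here is only for demonstration. For real tuning, the comparison would be made with cross-validation on the training data.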
Clustering
• Unsupervised learning
• Can help you understand structure of your data
• Various types of clustering: K-means, hierarchical (e.g. with Ward linkage)
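A brief sketch comparing two of these clustering methods on synthetic data. The three blob centers are invented for the example. No labels are passed to either algorithm, and the generated labels are used only afterwards to check the result.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic groups (centers chosen for illustration)
X, true_labels = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.6,
    random_state=0,
)

# K-means with 3 clusters
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering with Ward linkage
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Adjusted Rand index: 1.0 means identical groupings, whatever the label names
agreement = adjusted_rand_score(kmeans_labels, ward_labels)
kmeans_vs_truth = adjusted_rand_score(true_labels, kmeans_labels)
print(f"K-means vs Ward agreement: {agreement:.3f}")
print(f"K-means vs generated groups: {kmeans_vs_truth:.3f}")
```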
K-means
• Randomly choose k centroids
• Assign each point to its nearest centroid to form clusters
• Take the mean of each cluster as its new centroid
• Repeat until convergence
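The four steps above can be written out directly in NumPy. This is a bare-bones sketch for understanding the loop, not a replacement for a library implementation: the function name `kmeans`, its parameters, and the two synthetic blobs are all assumptions of the example.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two hypothetical, well-separated blobs of 50 points each
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print("centroids:\n", centroids.round(2))
```

In practice `sklearn.cluster.KMeans` does the same job with smarter initialization and multiple restarts, since the result of plain K-means can depend on the random starting centroids.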