ML Python Spring2018 Part1

The document provides an overview of machine learning, defining it as a field that enables computers to learn from data without explicit programming. It outlines the workflow of machine learning, including data collection, preparation, model selection, evaluation, and optimization, while distinguishing between supervised, unsupervised, and semi-supervised learning. Additionally, it discusses key concepts such as feature scaling, model evaluation metrics, and various algorithms used in machine learning.

Uploaded by

anh nguyen

Machine Learning in Python

Rohith Mohan
GradQuant
Spring 2018
What is Machine Learning?

• Getting computers to program themselves
• Coding is the bottleneck, let data dictate programming

Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Formal Definitions
• Arthur Samuel (1959)
• “Machine Learning: Field of study that gives computers the ability to learn
without being explicitly programmed.”
• While at IBM, wrote a checkers program that learned by playing tens of thousands of games against itself

• Tom Mitchell (1998)
• “Well-posed Learning Problem: A computer program is said to learn from
experience E with respect to some task T and some performance measure P, if
its performance on T, as measured by P, improves with experience E.”

Source: Andrew Ng, Machine Learning (Coursera)
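Mitchell's definition can be made concrete with a small experiment. The sketch below is an illustration, not part of the original slides. It assumes scikit-learn is installed and takes the task T to be classifying handwritten digits (the bundled `load_digits` dataset), the performance measure P to be accuracy on a fixed held-out set, and the experience E to be the number of labeled training examples. The 1-nearest-neighbor classifier and the sample sizes are arbitrary choices made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Fixed held-out set used to measure performance P (accuracy).
X_test, y_test = X[-500:], y[-500:]


def accuracy_after(n_examples):
    """Train on the first n_examples (experience E) and return accuracy (P)."""
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[:n_examples], y[:n_examples])
    return model.score(X_test, y_test)


scores = {n: accuracy_after(n) for n in (20, 100, 1000)}
for n, acc in scores.items():
    print(f"E = {n:4d} training examples -> P = {acc:.3f} accuracy")
```

If accuracy rises as the number of training examples grows, the program is "learning" in Mitchell's sense.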


Machine Learning
• Developed out of initial work in Artificial Intelligence (AI)
• Greater availability of large datasets and advances in computing architecture have boosted its use in recent years

[Link]
Usage
• Natural language processing and computer vision
• Mining and clustering gene expression data to identify individuals
• Reproducing human behavior (True AI)

[Link]
breakthroughs-of-2017-that-might-just-change-the-world-1222695/

• Recommendation algorithms
Common steps in ML workflow
• Collect data (various sources: UCI data repository, news organizations, Kaggle)
• Prepare data (exploratory analysis, feature selection, regularization)
• Select and train a model (train and test datasets, which model?)
• Evaluate the model (accuracy, precision, ROC curves, F1 score)
• Optimize performance (change model, number of features, scaling)
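The steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not a workflow taken from the slides: the bundled breast cancer dataset stands in for collected data, and the logistic regression model, the 30% test split, and the two metrics are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (a bundled dataset stands in for a real source).
X, y = load_breast_cancer(return_X_y=True)

# 2. Prepare data: hold out a test set. Scaling happens inside the pipeline
#    so that it is learned from the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Select and train a model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluate the model on data it has not seen.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(f"accuracy = {test_accuracy:.3f}, F1 = {test_f1:.3f}")

# 5. Optimize: change the model, the features, or the scaling and repeat 3-4.
```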
scikit-learn

[Link]
Preprocessing
• Clean data and deal with missing values, etc.
• Feature scaling - rescaling features so they are on comparable scales
• Normalization (min-max scaling) - getting various features into a similar range (e.g. -1 to 1)
• Square footage of a house (100s of sq ft) vs # of rooms (1-5)
• Standardization - subtract the mean and divide by the SD, so each feature has mean 0 and SD 1
• Many others (regularization, imputation, generating polynomial features, etc.)
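The two kinds of scaling listed above, rescaling into a fixed range and subtracting the mean then dividing by the SD, correspond to scikit-learn's `MinMaxScaler` and `StandardScaler`. The sketch below applies both to a made-up table of houses. The numbers are hypothetical and chosen only to echo the square footage vs. rooms example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical houses: [square footage, number of rooms].
X = np.array([
    [800.0, 1.0],
    [1200.0, 2.0],
    [1500.0, 3.0],
    [2400.0, 5.0],
])

# Rescale each column into the range -1 to 1.
X_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Subtract each column's mean and divide by its standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_range)
print(X_std)
```

After either transform the two columns are on comparable scales, so square footage no longer dominates distance-based or gradient-based methods simply because its raw values are larger.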

Importance of feature scaling

[Link]
Comparison of scaling: StandardScaler
Comparison of scaling: RobustScaler
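The original slides showed this comparison as figures. As a rough stand-in, the sketch below scales one made-up feature containing a single extreme outlier. `StandardScaler` uses the mean and SD, which the outlier distorts. `RobustScaler` uses the median and interquartile range, which it barely affects. The data values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with four ordinary values and one extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X).ravel()
robust = RobustScaler().fit_transform(X).ravel()

print("StandardScaler:", standard.round(3))
print("RobustScaler:  ", robust.round(3))

# Spread of the four ordinary values after each transform.
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()
print(f"spread of ordinary values: {standard_spread:.3f} vs {robust_spread:.3f}")
```

With `StandardScaler` the four ordinary values are squeezed into a narrow band because the outlier inflates the SD. With `RobustScaler` they keep a usable spread.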
Train Test (Cross Validate?)
• Why do we need to split up our datasets?
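• To detect overfitting: a model scored on the data it was trained on looks better than it really is
• Split dataset
• Train – the data the model is fit on
• Test – held-out data used to evaluate the performance of the model
• Holding out 20-40% of the data for testing is usually enough
• Validation set?
• Cross-validation
• Split the training set into subsets (folds), train on all but one fold and evaluate on the remaining one, rotating through the folds (more computationally expensive, but conserves data)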
• Hyper-parameter tuning
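A minimal sketch of the split, cross-validation, and tuning ideas above, using scikit-learn's `train_test_split`, `cross_val_score`, and `GridSearchCV`. The iris dataset, the linear SVM, and the candidate values of `C` are assumptions chosen for illustration. Any estimator and parameter grid would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 40% of the data for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# 5-fold cross-validation on the training set only.
cv_scores = cross_val_score(SVC(kernel="linear", C=1), X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores.round(3))

# Hyper-parameter tuning: cross-validate each candidate C, then refit the
# best one on the whole training set.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print("best C:", search.best_params_["C"])

# The test set is used once, at the very end.
test_accuracy = search.score(X_test, y_test)
print(f"test accuracy = {test_accuracy:.3f}")
```

Here the cross-validation folds play the role of a validation set, so no separate validation split is carved out.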
Bias-variance tradeoff

Underfitting: high bias
Overfitting: high variance
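One way to see the tradeoff is to fit polynomials of increasing degree to noisy data and compare training and test error. The sketch below is an assumption-laden illustration in the spirit of scikit-learn's underfitting/overfitting example. The cosine target, noise level, sample sizes, and degrees 1, 4, and 15 are choices made here, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(x):
    """The underlying function the models try to recover."""
    return np.cos(1.5 * np.pi * x)


rng = np.random.RandomState(0)
x_train = np.sort(rng.rand(30))
y_train = true_fun(x_train) + rng.randn(30) * 0.1
x_test = np.sort(rng.rand(100))
y_test = true_fun(x_test) + rng.randn(100) * 0.1

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train[:, None], y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train[:, None]))
    test_mse[degree] = mean_squared_error(y_test, model.predict(x_test[:, None]))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The degree-1 model is too simple to follow the curve (high bias), so both errors are large. A very high degree typically drives the training error down while the test error grows (high variance).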

[Link] [Link]
plot-underfitting-overfitting-py
How to select a model?
[Link]
Supervised vs Unsupervised Learning
• Supervised
• Regression, classification
• Given input variables and an output variable, learn the mapping from input to output

• Unsupervised
• Clustering, association, etc.
• No correct answers and no teacher; the algorithm finds structure in unlabeled data

• Semi-supervised
• e.g. a partially labeled dataset of images
• Real-world problems often mix both techniques
Regression
• Linear regression (OLS)

• Prediction
• Multiple variables/features?
• Feature selection
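A minimal sketch of ordinary least squares with several features, using scikit-learn's `LinearRegression`. The data are synthetic, generated from a known linear relationship (an assumption made so the fitted coefficients can be checked), and the noise level is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with two features and a known linear relationship:
# y = 4 + 3*x0 - 2*x1 + small noise.
X = rng.uniform(0, 10, size=(200, 2))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

model = LinearRegression().fit(X, y)
print("intercept:", round(model.intercept_, 2))
print("coefficients:", model.coef_.round(2))

# Prediction for a new observation with x0 = 5 and x1 = 1.
y_new = model.predict(np.array([[5.0, 1.0]]))[0]
print("prediction:", round(y_new, 2))
```

The fitted intercept and coefficients should land close to the values used to generate the data, and the prediction close to 4 + 3*5 - 2*1 = 17.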

[Link] [Link]
Feature Selection

[Link]
Regression
• Linear regression (OLS)

• Prediction
• Multiple variables/features?
• Feature selection
• Length and width of a house (replace with area?)
• Regularization
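Regularization adds a penalty on coefficient size to the least-squares objective. The sketch below compares plain OLS with scikit-learn's `Ridge` (L2 penalty) and `Lasso` (L1 penalty) on synthetic data where only two of five features matter. The data and the `alpha` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Five features, but only the first two influence y.
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can set coefficients to exactly 0

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(f"{name:5s}", model.coef_.round(3))
```

Because the lasso zeroes out the coefficients of the three irrelevant features, it doubles as a simple form of feature selection.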

[Link] [Link]
Regularization

[Link]
Regularization

[Link]
[Link]
plot-train-error-vs-test-error-py
Classification – Logistic Regression
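A minimal sketch of logistic regression as a classifier, using scikit-learn's `LogisticRegression`. The one-feature synthetic dataset (two classes centered at -2 and +2) is an assumption made for illustration. The model outputs a probability for each class, and the predicted label is the class with the higher probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary problem: one feature, class 0 centered at -2, class 1 at +2.
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; each row sums to 1.
proba = clf.predict_proba(np.array([[-3.0], [0.0], [3.0]]))
print(proba.round(3))

test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy = {test_accuracy:.3f}")
```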

Classification – SVM
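A minimal sketch of a support vector machine classifier, using scikit-learn's `SVC` with a linear kernel. The two synthetic clusters, their centers, and `C=1.0` are assumptions chosen so that the classes are easy to separate.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two well-separated synthetic clusters (hypothetical data).
X, y = make_blobs(
    n_samples=200, centers=[(-3, -3), (3, 3)], cluster_std=1.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Only the support vectors (the points nearest the boundary) define the separator.
n_support = len(clf.support_vectors_)
test_accuracy = clf.score(X_test, y_test)
print(f"{n_support} support vectors out of {len(X_train)} training points")
print(f"test accuracy = {test_accuracy:.3f}")
```

For classes that are not linearly separable, swapping in `kernel="rbf"` is a common next step.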

[Link]
Evaluating Performance
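• Accuracy – fraction of predictions that are correct across the entire dataset
• Can be a flawed metric: with imbalanced classes, always predicting the majority class still scores well

• Precision and Recall
• Precision – of the examples predicted positive, the fraction that are truly positive
• Recall – of the truly positive examples, the fraction the model finds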

• ROC curves
• F1 score
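The metrics above can be computed with functions from `sklearn.metrics`. The labels and scores below are made up for illustration. They form a small imbalanced problem in which a useless majority-class predictor still reaches 80% accuracy.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Imbalanced toy labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A useless model that always predicts the majority class.
y_majority = [0] * 10
print(accuracy_score(y_true, y_majority))  # 0.8, despite finding no positives
print(recall_score(y_true, y_majority))    # 0.0

# Hypothetical scores from a real model, thresholded at 0.5.
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.7, 0.8, 0.35]
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 8/10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1/2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1/2
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.5
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve = 0.875
print(accuracy, precision, recall, f1, auc)
```

Both models have the same accuracy, but precision, recall, F1, and the ROC curve tell them apart.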
Classification - K-Nearest Neighbors
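• Robust to noisy training data (when K is large enough)
• More effective with larger datasets
• Need to determine the parameter K (number of nearest neighbors)
• Which distance metric to use?
• High computation cost at prediction time (distances to the training points must be computed)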
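Both open questions above, the value of K and the distance metric, can be settled empirically. The sketch below tries a few combinations with scikit-learn's `KNeighborsClassifier` and picks the best by cross-validated accuracy. The iris dataset and the candidate values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of K and two distance metrics, scoring each by
# 5-fold cross-validated accuracy.
results = {}
for metric in ("euclidean", "manhattan"):
    for k in (1, 3, 5, 7, 9, 11):
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        results[(k, metric)] = cross_val_score(clf, X, y, cv=5).mean()

best_k, best_metric = max(results, key=results.get)
print(f"best K = {best_k}, metric = {best_metric}, "
      f"accuracy = {results[(best_k, best_metric)]:.3f}")
```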
Clustering
• Unsupervised learning
• Can help you understand structure of your data

• Various types of clustering: K-means, hierarchical (e.g. with Ward linkage)

K-means
• Randomly choose k initial centroids
• Assign each point to its nearest centroid to form clusters
• Take the mean of each cluster as its new centroid
• Repeat until convergence
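The steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans`, the iteration cap, the empty-cluster handling, and the two synthetic blobs are all choices made here. In practice `sklearn.cluster.KMeans` adds smarter initialization and multiple restarts.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two synthetic blobs centered near (0, 0) and (10, 10).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(X, k=2)
print(centroids.round(2))
```

The result depends on the random initial centroids, which is why library implementations usually run the algorithm several times and keep the best clustering.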
