Deep learning on mobile

Deep Learning on mobile phones
- A Practitioner's Guide
Anirudh Koul

Anirudh Koul , @anirudhkoul , http://koul.ai
Head of AI & Research,
Aira
[lastname]@aira.io
Founder, Seeing AI
Previously at Microsoft

Why Deep Learning On Mobile?
Latency Privacy

Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 seconds : User’s flow of thought
10 seconds : Keeping the user’s attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:

Mobile Deep Learning Recipe
Mobile Inference Engine + Pretrained Model = DL App
(Efficient) (Efficient)

Use Cloud APIs for General Recognition Needs
• Microsoft Cognitive Services
• Clarifai
• Google Cloud Vision
• IBM Watson Services
• Amazon Rekognition

How to Choose a Computer Vision Based API?
Benchmark & Compare them
COCO-Text v2.0 for Text reading in the wild
• ~2k random images
• Candidate text has at least 2 characters together
• Direct word match
COCO-Val 2017 for Image Tagging in the wild
• ~4k random images
• Tag similarity match instead of word match

Recognize Text Benchmarks
Text API                        Accuracy
Amazon Rekognition              45.4%
Google Cloud Vision             33.4%
Microsoft Cognitive Services    55.4%
Evaluation criteria:
• Photos have candidate words of length >= 2
• Direct word match with ground truth
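
The evaluation criteria above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the whitespace tokenization, the case-insensitive comparison, and the definition of accuracy as the fraction of ground-truth words found in the API output are not taken from the benchmark itself.

```python
def direct_word_match_accuracy(predicted_text, ground_truth_words, min_length=2):
    """Fraction of ground-truth words (length >= min_length) found verbatim in the API output."""
    # Assumption: tokens are whitespace-separated and compared case-insensitively.
    predicted = {token.lower() for token in predicted_text.split()}
    candidates = [w.lower() for w in ground_truth_words if len(w) >= min_length]
    if not candidates:
        return 0.0
    matched = sum(1 for w in candidates if w in predicted)
    return matched / len(candidates)


# Hypothetical example: the API read "STOP Main St" from a street sign.
# The single-character word "a" is ignored because it is shorter than 2 characters.
accuracy = direct_word_match_accuracy("STOP Main St", ["stop", "main", "street", "a"])
print(f"{accuracy:.2f}")
```

Direct matching is strict: "St" does not match "street", which is one reason text-reading scores can look low even when the output is useful to a human.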

Image Tagging Benchmarks
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’
• Hard to do Precision-Recall since COCO ground truth tags are not exhaustive
• Lower # of tags for a given accuracy indicates higher F-measure
Image Tagging API               Accuracy    Avg #Tags
Amazon Rekognition              65%         14
Google Cloud Vision             47.6%       14
Microsoft Cognitive Services    50.0%       8
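
The remark about the number of tags and F-measure can be made concrete with a small sketch. The counts below are hypothetical and only illustrate the reasoning: if two APIs find the same number of correct tags, the one that returns fewer tags overall has higher precision, and therefore a higher F-measure.

```python
def f_measure(num_correct, num_returned, num_relevant):
    """F1 score from counts of correct, returned, and relevant tags."""
    if num_returned == 0 or num_relevant == 0:
        return 0.0
    precision = num_correct / num_returned
    recall = num_correct / num_relevant
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical image with 5 relevant tags; both APIs get 4 of them right.
f_few_tags = f_measure(num_correct=4, num_returned=8, num_relevant=5)
f_many_tags = f_measure(num_correct=4, num_returned=14, num_relevant=5)
print(f_few_tags > f_many_tags)
```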

Tips for reducing network latency
• For Text Recognition
• A compression setting of up to 90% has little effect on accuracy, but gives
drastic savings in size
• Resizing is dangerous: text recognition needs a minimum size for
recognition
• For image recognition
• Resize so that min(height, width) is 224, at 50% compression with
bilinear interpolation
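
As a rough sketch of the image recognition tip, the target dimensions can be computed so that the shorter side becomes 224 while the aspect ratio is preserved. The function name is hypothetical. The actual resampling (bilinear) and the compression step would be done by whatever image library the app already uses; reading "50% compression" as a JPEG quality setting of about 50 is an assumption.

```python
def resize_target(width, height, short_side=224):
    """Return (new_width, new_height) with the shorter side equal to short_side."""
    scale = short_side / min(width, height)
    # max() guards against the shorter side rounding down to short_side - 1.
    new_width = max(short_side, round(width * scale))
    new_height = max(short_side, round(height * scale))
    return new_width, new_height


landscape = resize_target(4032, 3024)  # typical 12 MP phone photo
portrait = resize_target(1080, 1920)   # typical portrait video frame
print(landscape, portrait)
```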

[Figure: energy to train a Convolutional Neural Network vs energy to use a Convolutional Neural Network]
Source: http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

Base Pretrained Model
ImageNet – 1000 Object Categorizer
VGG16
Inception-v3
Resnet-50
MobileNet
SqueezeNet

Running pre-trained models on mobile
Core ML
TensorFlow Lite
Caffe2

Apple’s Ecosystem
Metal (2014), BNNS + MPS (2016), CoreML (2017), CoreML2 (2018)

Apple’s Ecosystem
Metal
- low-level, low-overhead hardware-accelerated 3D graphic and
compute shader application programming interface (API)
- Available since iOS 8

Apple’s Ecosystem
Fast low-level primitives:
• BNNS – Basic Neural Network Subroutines
• Ideal case: Fully connected NN
• MPS – Metal Performance Shaders
• Ideal case: Convolutions
Inconvenient for large networks:
• Inception-v3 inference consisted of 1.5K lines of hard-coded model definition
• Libraries like Forge by Matthijs Hollemans provide abstraction

Apple’s Ecosystem
Convert Caffe/Tensorflow model to CoreML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Automatically minimizes memory footprint and power consumption

Apple’s Ecosystem
• Model quantization support down to 1 bit
• Batch API for improved performance
• Conversion support for MXNet, ONNX
• ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
• Create ML for quick training
• tf-coreml for direct conversion from TensorFlow

CoreML Benchmark - Pick a DNN for your mobile architecture
Top-1 accuracy in %, model size in MB, execution time in ms per device.
Model          Top-1 (%)   Size (MB)   iPhone 5S   iPhone 6   iPhone 6S/SE   iPhone 7   iPhone 8/X
Release year                           2013        2014       2015           2016       2017
VGG 16         71          553         7408        4556       235            181        146
Inception v3   78          95          727         637        114            90         78
Resnet 50      75          103         538         557        77             74         71
MobileNet      71          17          129         109        44             35         33
SqueezeNet     57          5           75          78         36             30         29
Huge improvement in GPU hardware in 2015

Putting out more frames than an art gallery

TensorFlow Ecosystem
TensorFlow (2015), TensorFlow Mobile (2016), TensorFlow Lite (2018)

TensorFlow (2015)
The full, bulky deal

TensorFlow Mobile (2016)
Easy pipeline to bring Tensorflow models to mobile
Excellent documentation
Optimizations to bring model to mobile

TensorFlow Lite (2018)
• Smaller
• Faster
• Minimal dependencies
• Easier to package & deploy
• Allows running custom operators
1 line conversion from Keras to TensorFlow Lite
• tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite

TensorFlow Lite is small
• ~75KB for core interpreter
• ~400KB for core interpreter + supported operations
• Compared to 1.5MB for Tensorflow Mobile

TensorFlow Lite is fast
• Takes advantage of on-device hardware acceleration
• Uses FlatBuffers
• Reduces code footprint, memory usage
• Reduces CPU cycles on serialization and deserialization
• Improves startup time
• Pre-fused activations
• Combining batch normalization layer with previous Convolution
• Interpreter uses static memory and static execution plan
• Decreases load time
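
To illustrate what combining a batch normalization layer with the previous convolution means, here is a minimal sketch for a single output channel. It is not TensorFlow Lite's converter code. The function names and the flat-list representation of a filter are assumptions made for illustration, and the folding formula is the standard one for inference-time batch normalization.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold one channel's batch-norm parameters into the preceding convolution."""
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias


def conv_then_batch_norm(x, weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference: convolution (as a dot product) followed by a separate batch norm."""
    conv = sum(w * xi for w, xi in zip(weights, x)) + bias
    return gamma * (conv - mean) / math.sqrt(var + eps) + beta


# Hypothetical values for one filter applied to one input patch.
x = [0.5, -1.0, 2.0]
w, b = [0.2, 0.4, -0.1], 0.3
gamma, beta, mean, var = 1.5, 0.1, 0.05, 0.8

folded_w, folded_b = fold_batch_norm(w, b, gamma, beta, mean, var)
fused = sum(wi * xi for wi, xi in zip(folded_w, x)) + folded_b
reference = conv_then_batch_norm(x, w, b, gamma, beta, mean, var)
print(abs(fused - reference) < 1e-9)
```

After folding, the batch normalization layer disappears at inference time, saving one pass over the activations.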

TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/

TensorFlow Lite Benchmarks - http://ai-benchmark.com/
• Crowdsourcing benchmarking with AI Benchmark android app
• By Andrey Ignatov from ETH
• 9 Tests
• E.g Semantic Segmentation, Image Super Resolution, Face Recognition

Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch

Recommendation for development
1. Train a model using Keras
2. For iOS:
• Convert to CoreML using coremltools
3. For Android:
• Convert to Tensorflow Lite using tflite_convert
Keras
.mlmodel file .tflite file
coremltools tflite_convert

Common Questions
“My app has become too big to download. What do I do?”
• iOS doesn’t allow apps over 150 MB to be downloaded over a cellular connection
• Solution : Download on demand, and compile on device
• 0 MB change to app size on first install

Common Questions
“Do I need to ship a new app update with every model improvement?”
• Making app updates is a decent amount of overhead, plus ~2 days
wait time
• Solution : Check for model updates, download and compile on device
• Easier solution – Use a framework for Model Management, e.g.
• Google ML Kit
• Fritz
• Numericcal

Common Questions
“Why does my app not recognize objects at top/bottom of screen?”
• Solution : Check the cropping used; by default, it’s a center crop
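
A small sketch shows why a center crop hides objects near the top or bottom of the screen. It assumes the crop is a square whose side equals the shorter image dimension. The function names are hypothetical, and the exact default behaviour depends on the framework in use.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def is_inside(box, x, y):
    """True if pixel (x, y) survives the crop."""
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


# Portrait 1080x1920 frame: the top and bottom 420 rows are never seen by the model.
box = center_crop_box(1080, 1920)
print(box)
print(is_inside(box, 540, 100))  # object near the top of the screen
```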

Learn Playing an Accordion
• From scratch : 3 months
• Knows Piano, Fine Tune Skills : 1 week

I got a dataset, Now What?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero”
- Andrej Karpathy

How to find pretrained models for my task?
Search “Model Zoo”
https://modelzoo.co
- 300+ models

AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever,Hinton’12]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11
n-dimensional feature representation

Deciding how to fine tune
Size of New Dataset   Similarity to Original Dataset   What to do?
Large                 High                             Fine tune.
Small                 High                             Don’t fine tune, it will overfit.
                                                       Train linear classifier on CNN features.
Small                 Low                              Train a classifier from activations in lower layers.
                                                       Higher layers are specific to the original dataset.
Large                 Low                              Train CNN from scratch.
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
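
The decision table above can be written down as a simple lookup, sketched below. The dictionary and function names are hypothetical, and what counts as a "large" dataset or "high" similarity is left to the practitioner, since the slide gives no thresholds.

```python
# Keys are (size of new dataset, similarity to original dataset).
FINE_TUNING_STRATEGY = {
    ("large", "high"): "Fine tune.",
    ("small", "high"): "Don't fine tune (it will overfit); train a linear classifier on CNN features.",
    ("small", "low"): "Train a classifier from activations in lower layers.",
    ("large", "low"): "Train CNN from scratch.",
}


def choose_strategy(dataset_size, similarity):
    """Look up the recommended approach for a new dataset."""
    return FINE_TUNING_STRATEGY[(dataset_size.lower(), similarity.lower())]


print(choose_strategy("Small", "High"))
```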

Could you train your own classifier ... without coding?
• Microsoft CustomVision.ai
• Unique: Under a minute training, Custom object detection
• Google AutoML
• Unique: Full CNN training, crowdsourced workers
• IBM Watson Visual recognition
• Baidu EZDL
• Unique: Custom Sound recognition

Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to make a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of object, the fewer images required.

Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use Fatkun Browser Extension to download images from Search Engine,
or use Bing Image Search API to programmatically download photos with
proper rights

CoreML exporter from customvision.ai
– Drag and drop training
5 minute shortcut to training, finetuning and
getting model ready in CoreML format
Drag and drop interface

Building a Crowdsourced Data Collector
in 1 month

Barcode recognition from Seeing AI
Aim : Help blind users identify products using barcode
Issue : Blind users don’t know where the barcode is
Live : Guide user in finding a barcode with audio cues
With Server : Decode barcode to identify product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7

Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify denomination of paper currency instantly
With Server : -
Tech : Task specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7

Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame

Daily challenge - Collected by volunteers

Building a production DL App
in 3 months

What you want : $200,000
What you can afford : $2000
Image credits: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers
(ILSVRC 2012)
Revolution of Depth
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

Revolution of Depth
[Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)]

Revolution of Depth
[Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015), labeled "Ultra deep"]

Revolution of Depth
[Figure: close-up of the first layers of ResNet, 152 layers]

Revolution of Depth vs Classification Accuracy
ImageNet Classification top-5 error (%)
ILSVRC'10              28.2
ILSVRC'11              25.8
ILSVRC'12 AlexNet      16.4
ILSVRC'13              11.7
ILSVRC'14 VGG          7.3
ILSVRC'14 GoogleNet    6.7
ILSVRC'15 ResNet       3.6
ILSVRC'16 Ensemble     2.9
Depth labels: shallow (before AlexNet), 8 layers (AlexNet), 19 layers (VGG), 22 layers (GoogleNet), 152 layers (ResNet)
ILSVRC'16 entry: ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network

Accuracy vs Operations Per Image Inference
[Figure: accuracy vs operations per image inference; bubble size is proportional to the number of parameters. Annotations: 552 MB, 240 MB, and "What we want"]
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016

Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/

iPhone X is more powerful than a Macbook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/

Strategies to get maximum efficiency from your CNN
Before training
• Pick an efficient architecture for your task
• Designing efficient layers
After training
• Pruning
• Quantization
• Network binarization

CoreML Benchmark - Pick a DNN for your mobile architecture
Top-1 accuracy in %, model size in MB, multiply-adds in millions, execution time in ms per device.
Model          Top-1 (%)   Size (MB)   Mult-Adds (M)   iPhone 5S   iPhone 6   iPhone 6S/SE   iPhone 7   iPhone 8/X
Release year                                           2013        2014       2015           2016       2017
VGG 16         71          553         15300           7408        4556       235            181        146
Inception v3   78          95          5000            727         637        114            90         78
Resnet 50      75          103         3900            538         557        77             74         71
MobileNet      71          17          569             129         109        44             35         33
SqueezeNet     57          5           800             75          78         36             30         29
Huge improvement in GPU hardware in 2015

MobileNet family
Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise
conv
Tune with two parameters – Width Multiplier and resolution multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
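
The saving from splitting a convolution into depthwise and pointwise parts can be checked with a short sketch based on the cost formulas in the MobileNets paper. The function names and the example layer shape are assumptions made for illustration; the width and resolution multipliers are applied by simply scaling channel counts and feature map size.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a standard KxK convolution on an fmap x fmap feature map."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_mult_adds(kernel, in_ch, out_ch, fmap, width_mult=1.0, res_mult=1.0):
    """Multiply-adds of a KxK depthwise conv followed by a 1x1 pointwise conv."""
    in_ch = int(width_mult * in_ch)
    out_ch = int(width_mult * out_ch)
    fmap = int(res_mult * fmap)
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
standard = standard_conv_mult_adds(3, 512, 512, 14)
separable = separable_conv_mult_adds(3, 512, 512, 14)
thinner = separable_conv_mult_adds(3, 512, 512, 14, width_mult=0.5)
print(f"separable / standard = {separable / standard:.3f}")
```

The ratio equals 1/out_ch + 1/kernel², so for 3x3 kernels the separable form needs roughly 8 to 9 times fewer multiply-adds.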

Efficient Classification Architectures
https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html
MobileNetV2 is the current favourite

Efficient Detection Architectures
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

Efficient Segmentation Architectures
ICNet - Image cascade
network

Tricks while designing your own network
• Dilated Convolutions
• Great for Segmentation / when the target object covers a large area of the image
• Replace NxN convolutions with Nx1 followed by 1xN
• Depthwise Separable Convolutions (e.g. MobileNet)
• Inverted residual block (e.g. MobileNetV2)
• Replacing large filters with multiple small filters
• 5x5 is slower than 3x3 followed by 3x3

Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions
>>
One layer of 7x7 convolution
Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with stack of 1xN and Nx1
Fewer parameters 
Less compute 
More non-linearity 
Better
Faster
Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
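
The parameter savings from small filters can be verified with simple arithmetic, sketched below. The sketch assumes the same number of channels C in and out of every layer and ignores biases; the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, channels):
    """Weights in one conv layer with `channels` inputs and outputs (biases ignored)."""
    return kernel_h * kernel_w * channels * channels


def receptive_field(num_layers, kernel):
    """Receptive field of stacked stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel - 1)


C = 64
one_7x7 = conv_params(7, 7, C)                              # 49 * C^2
three_3x3 = 3 * conv_params(3, 3, C)                        # 27 * C^2
one_5x5 = conv_params(5, 5, C)                              # 25 * C^2
two_3x3 = 2 * conv_params(3, 3, C)                          # 18 * C^2
one_3x3 = conv_params(3, 3, C)                              #  9 * C^2
split_3x3 = conv_params(3, 1, C) + conv_params(1, 3, C)     #  6 * C^2

print(receptive_field(3, 3), three_3x3 / one_7x7)
```

Three stacked 3x3 layers see the same 7x7 window as one 7x7 layer, with about 55% of the weights and two extra non-linearities in between.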

Selective training to keep networks shallow
Idea : Limit data augmentation to how your network will be used
Example : If making a selfie app, no benefit in rotating training images
beyond ±45 degrees. Your phone will rotate anyway.
Followed by WordLens / Google Translate
Example : Add blur if analyzing mobile phone frames

Pruning
Aim : Remove all connections
with absolute weights below a
threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
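
A minimal sketch of magnitude-based pruning on a flat list of weights follows. The function name, the example weights, and the threshold are hypothetical. In the cited paper the network is retrained after pruning to recover accuracy, a step this sketch leaves out.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


weights = [0.8, -0.03, 0.002, -0.6, 0.04, 0.3]
pruned = prune_by_magnitude(weights, threshold=0.05)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned, sparsity)
```

The zeroed connections only save space and time if the model is stored and executed in a sparse format.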

Observation : Most parameters are in Fully Connected Layers
AlexNet (240 MB) : 96% of all parameters
VGG-16 (552 MB) : 90% of all parameters

Pruning gets quickest model compression without accuracy loss
AlexNet 240 MB VGG-16 552 MB
First layer which directly interacts with image is sensitive
and cannot be pruned too much without hurting
accuracy

Weight Sharing
Idea : Cluster weights with similar values together, and store in a
dictionary.
Codebook
Huffman coding
HashedNets
Cons: Need a special inference engine, doesn’t work for most
applications
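
The clustering idea can be sketched with a tiny one-dimensional k-means that builds a codebook and replaces each weight by an index into it. This is an illustration only: the function name, the linear initialization of centroids, and the fixed iteration count are assumptions, and Huffman coding of the indices is not shown.

```python
def build_codebook(weights, num_clusters, iterations=10):
    """Cluster weights with 1-D k-means; return (codebook, index per weight)."""
    lo, hi = min(weights), max(weights)
    if num_clusters == 1:
        centroids = [(lo + hi) / 2]
    else:
        # Assumption: centroids start evenly spaced between min and max.
        centroids = [lo + i * (hi - lo) / (num_clusters - 1) for i in range(num_clusters)]

    def nearest(w):
        return min(range(num_clusters), key=lambda k: abs(w - centroids[k]))

    for _ in range(iterations):
        indices = [nearest(w) for w in weights]
        for k in range(num_clusters):
            members = [w for w, i in zip(weights, indices) if i == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, [nearest(w) for w in weights]


weights = [0.11, 0.09, -0.52, -0.48, 0.98, 1.02]
codebook, indices = build_codebook(weights, num_clusters=3)
restored = [codebook[i] for i in indices]
print(indices)
```

With 3 clusters, each weight is stored as a 2-bit index plus a shared 3-entry codebook instead of a 32-bit float.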

Filter Pruning - ThiNet
Idea : Discard whole filter if not important to predictions
Advantage:
• No change in architecture, other than thinning of filters per layer
• Can be further compressed with other methods
Just like feature selection, select filter to discard. Possible greedy
methods:
• Absolute weight sum of entire filter closest to 0
• Average percentage of ‘Zeros’ as outputs
• ThiNet – Collect statistics on the output of the next layer
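
The first greedy criterion in the list above, the absolute weight sum of a filter, can be sketched as follows. This is not ThiNet's method, which relies on statistics from the next layer's output. The function names and the flattened example filters are hypothetical.

```python
def filter_abs_sums(filters):
    """Sum of absolute weights for each filter (each filter is a flat list)."""
    return [sum(abs(w) for w in f) for f in filters]


def select_filters_to_keep(filters, num_keep):
    """Indices of the num_keep filters with the largest absolute weight sums."""
    sums = filter_abs_sums(filters)
    ranked = sorted(range(len(filters)), key=lambda i: sums[i], reverse=True)
    return sorted(ranked[:num_keep])


filters = [
    [0.5, -0.4, 0.3],
    [0.01, -0.02, 0.0],   # close to zero: candidate to discard
    [-0.9, 0.1, 0.2],
    [0.05, 0.05, -0.05],  # close to zero: candidate to discard
]
keep = select_filters_to_keep(filters, num_keep=2)
print(keep)
```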

SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base 4.8 MB
SqueezeNet compressed 0.5 MB
80.3% top-5 Accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"

Quantization
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In Practice:
• Ristretto + Caffe
• Automatic Network quantization
• Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8 bit quantization support
• Gemmlowp – Low precision matrix multiplication library
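
For intuition, here is a generic sketch of linear (affine) 8-bit quantization with optional stochastic rounding. It does not reproduce the scheme of any of the tools listed above. The function names are hypothetical, and using one min/max range for the whole list of values is an assumption.

```python
import math
import random


def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part, otherwise down."""
    floor = math.floor(x)
    return floor + (1 if rng.random() < (x - floor) else 0)


def quantize_linear(values, num_bits=8, rng=None):
    """Map floats to integer codes in [0, 2^num_bits - 1]; return (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    rounder = round if rng is None else (lambda x: stochastic_round(x, rng))
    codes = [min(levels, max(0, rounder((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_linear(codes, scale, offset):
    """Recover approximate float values from integer codes."""
    return [offset + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
codes, scale, offset = quantize_linear(weights)
restored = dequantize_linear(codes, scale, offset)

# Stochastic rounding is unbiased on average: 2.3 becomes 3 about 30% of the time.
rng = random.Random(0)
samples = [stochastic_round(2.3, rng) for _ in range(1000)]
print(codes)
```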

Quantizing CNNs in Practice
Reducing CoreML models to half size
# Load a model, lower its precision, and then save the smaller model.
import coremltools
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')

Quantizing CNNs in Practice
Reducing CoreML models to even smaller size
Choose bits and quantization mode
Bits from [1,2,4,8]
Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"]
• Lut = look up table
import coremltools
from coremltools.models.neural_network.quantization_utils import quantize_weights, compare_models
model = coremltools.models.MLModel('model.mlmodel')  # full-precision model to quantize
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_models(model, quantized_model, './sample_data/')

Binary weighted Networks
Idea : Reduce the weights to -1, +1
Speedup : Convolution operation can be approximated by only summation
and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
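
A minimal sketch of binarizing weights, so that a dot product needs only additions and subtractions plus one final multiplication. The per-filter scaling factor (the mean absolute weight) follows the cited XNOR-Net paper rather than the slide; the function names and example values are hypothetical.

```python
def binarize_weights(weights):
    """Approximate weights as alpha * sign(w), with alpha the mean absolute weight."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def binary_dot(inputs, alpha, signs):
    """Dot product using only additions and subtractions, then one scaling."""
    total = 0.0
    for x, s in zip(inputs, signs):
        if s > 0:
            total += x
        else:
            total -= x
    return alpha * total


weights = [0.5, -0.25, 0.75, -0.5]
inputs = [1.0, 2.0, 3.0, 4.0]
alpha, signs = binarize_weights(weights)
approx = binary_dot(inputs, alpha, signs)
print(alpha, signs, approx)
```

The result is an approximation of the full-precision dot product; networks are normally trained with binarization in the loop so they can adapt to it.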

XNOR-Net
Idea : Reduce both weights + inputs to -1, +1
Speedup : Convolution operation can be approximated by XNOR and
Bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
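
When both operands are restricted to -1/+1, a dot product reduces to an XNOR followed by a bit count. The sketch below packs the signs into Python integers to show the identity dot = 2 * matches - n. The function names and the bit layout are assumptions for illustration, and the scaling factors used in the paper are omitted.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and bit count."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask     # bit is 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n               # agreements add +1, disagreements add -1


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
fast = xnor_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(fast, reference)
```

On real hardware the same trick processes 32 or 64 weights per instruction, which is where the speedup comes from.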
