TinyML to TinyDL: Trade-offs and Advances
Survey
SHRIYANK SOMVANSHI, Texas State University, USA
MD MONZURUL ISLAM, Texas State University, USA
GAURAB CHHETRI, Texas State University, USA
ROHIT CHAKRABORTY, Texas State University, USA
MAHMUDA SULTANA MIMI, Texas State University, USA
arXiv:2506.18927v2 [[Link]] 25 Jun 2025
Authors’ Contact Information: Shriyank Somvanshi, Texas State University, San Marcos, USA, shriyank@[Link]; Md
Monzurul Islam, Texas State University, San Marcos, USA, monzurul@[Link]; Gaurab Chhetri, Texas State University,
San Marcos, USA, gaurab@[Link]; Rohit Chakraborty, Texas State University, San Marcos, USA, rohitchakraborty@
[Link]; Mahmuda Sultana Mimi, Texas State University, San Marcos, USA, qnb9@[Link]; Sawgat Ahmed Shuvo,
Texas State University, San Marcos, USA, sawgat@[Link]; Kazi Sifatul Islam, Texas State University, San Marcos, USA,
kazi_sifat@[Link]; Syed Aaqib Javed, Texas State University, San Marcos, USA, [Link]@[Link]; Sharif Ahmed
Rafat, Texas State University, San Marcos, USA, sarafat@[Link]; Anandi Dutta, Ph.D., Texas State University, San Marcos,
USA, [Link]@[Link]; Subasish Das, Ph.D., Texas State University, San Marcos, USA, subasish@[Link].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@[Link].
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1557-735X/2025/5-ART
1 Introduction
Tiny Machine Learning (TinyML) has emerged as a rapidly growing paradigm that brings machine
learning capabilities to severely resource-constrained edge devices. Traditionally, machine learning
models demanded significant computational resources, making their deployment on microcontroller
units (MCUs) and embedded platforms impractical. However, advances in hardware design, model
compression, and embedded inference have allowed real-time intelligence to be embedded on-
device, leading to a new class of systems that execute complex analytics at the edge. As the field
evolves, a distinct subdomain called Tiny Deep Learning (TinyDL) has gained momentum, focusing
specifically on deploying deep learning models, rather than shallow classifiers, on low-power,
ultra-constrained hardware.
TinyML is typically defined as the deployment of machine learning inference tasks on devices
operating under 1 mW of power, often with only 32 to 512 kB of Static Random-Access Memory
(SRAM) and constrained flash storage. These devices, which usually lack an operating system and
hardware accelerators for floating-point operations, are capable of performing real-time analytics
while meeting stringent energy and memory budgets [1–3]. TinyDL builds upon this foundation
by emphasizing the use of deep neural networks, such as convolutional and transformer-based
architectures, under similar constraints. This term, introduced as early as 2017 with just-in-time
inference frameworks like TinyDL [4], now encompasses a range of state-of-the-art models such
as MCUNet, EfficientNet-lite, and DistilBERT variants that deliver strong accuracy with memory
footprints below 1 MB and latency below 20 milliseconds [5].
The rise of TinyML and TinyDL is primarily driven by limitations inherent in traditional cloud-
based machine learning workflows. Cloud inference introduces unacceptable round-trip latencies
in time-sensitive applications such as autonomous driving, drones, and wearables [6]. Moreover,
transmitting sensor data to the cloud raises substantial privacy concerns in healthcare and indus-
trial Internet of Things (IoT) contexts, where data sovereignty and user trust are paramount [7].
Finally, the energy consumption required to constantly stream data to remote servers introduces
a prohibitive cost, especially for battery-powered devices [8]. By shifting inference, and
increasingly lightweight learning, onto the device, TinyDL enables ultra-low-latency responses,
reduces dependency on cloud connectivity, and enhances data privacy [1].
Initially, TinyML systems relied on shallow models such as linear classifiers, decision trees,
or single-layer perceptrons. These models, while lightweight, were unable to match the repre-
sentational power of deep neural networks and required extensive manual feature engineering,
particularly for audio and vision tasks [9]. The transition toward TinyDL was made possible
by several interrelated advances. First, architectural innovations such as depthwise separable
convolutions, inverted residuals, and attention mechanisms made it possible to compress model
complexity without sacrificing accuracy [2]. Second, a suite of optimization techniques, including
quantization-aware training (QAT), structured pruning, knowledge distillation, and low-rank
factorization, dramatically reduced the runtime and memory demands of deep models [5]. Third,
Neural Architecture Search (NAS) frameworks that co-optimize model topology and deployment
constraints, such as MCUNet and TinyNAS, have demonstrated that ImageNet-scale
tasks can be executed on MCUs with just 480 kB of SRAM [10]. Additionally, new developments
in on-device and continual learning allow models to adapt in real-time under strict memory and
compute constraints, further extending the practicality of TinyDL systems [11].
limited in terms of their available resources. In doing so, it supports applications requiring low
latency, minimal power consumption, and enhanced data privacy by keeping data local [8, 9, 13].
The operational landscape of TinyML is shaped by three critical constraints: memory, power, and
compute limitations. Firstly, memory availability is exceptionally scarce. Devices typically include
SRAM ranging from a few kilobytes to several hundred kilobytes for runtime operations, and Flash
memory often under one megabyte for storing program code and ML models, though in rare cases
this may reach up to 2 MB [9, 12, 14–16]. This represents a stark contrast to conventional computing
platforms and necessitates the use of highly compact models [14]. Secondly, power consumption
is a paramount constraint. TinyML devices typically operate on ultra-low power budgets, often
in the milliwatt or even microwatt range [8, 9, 12, 15, 16]. This is essential for battery-powered
or energy-harvesting systems designed for long-term operation without frequent recharging or
maintenance [8, 15]. Thirdly, compute capabilities in these devices are limited. The MCUs generally
operate at clock speeds of several tens to a few hundred megahertz, and many lack Floating Point
Units (FPUs), which further constrains the deployment of typical ML models unless optimized
through quantization techniques [9, 12, 16]. These hardware limitations necessitate lightweight
and efficient models capable of running within constrained environments.
The hardware ecosystem supporting TinyML primarily consists of low-power MCUs integrated
with sensors that gather environmental data. Prominent examples include the ARM Cortex-M
series, such as the Cortex-M0, M4, and M7, which strike a balance between computational efficiency,
power consumption, and cost [3, 9, 17]. Other widely used platforms include the STM32 and ESP32
families [8, 16]. These MCUs are often paired with application-specific sensors, such as inertial
measurement units (IMUs) for motion tracking, microphones for voice command recognition, and
low-resolution cameras for vision tasks with constrained compute budgets [3, 9]. This combination
of efficient hardware and targeted sensors empowers TinyML to bring intelligence into everyday
objects, from wearables to smart infrastructure.
low power consumption (milliwatts or microwatts), and minimal memory footprint (kilobytes)
[5, 16]. Recent advancements in TinyDL have introduced neural network architectures specifically
designed for edge execution, including MobileNet, SqueezeNet, and Tiny-YOLO [22–24]. These
models are tailored to execute with fewer floating-point operations and reduced parameter counts,
enabling inference in real time even on MCUs without FPUs [12, 16]. Furthermore, hardware-
aware NAS and energy-efficient training paradigms are being actively explored to enhance model
deployment on edge platforms [8]. Within the context of TinyML, TinyDL is a specialized area.
TinyML encompasses all machine learning techniques, including classical algorithms, that can
be deployed on resource-limited devices [25]. TinyDL, however, specifically addresses the more
demanding challenge of implementing and running inherently more complex and resource-intensive
deep learning models under these same severe constraints [5, 20].
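[Figure: TinyML development workflow. Recoverable labels: Real World Data; Resource Rich Device
(1. Model Selection, 2. Training); model optimization (1. Pruning, 2. Knowledge Distillation,
3. Quantization); deployment (1. Manual Programming, 2. Code Generation, 3. TinyML Interpreters);
Microcontroller Units.]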
To synthesize the transition from TinyML to TinyDL, Figure 2 provides a side-by-side compar-
ison across four key dimensions: model size, hardware platforms, optimization techniques, and
representative application domains. While TinyML typically involves deploying classical machine
learning models under 250 KB on low-power MCUs, TinyDL enables compressed deep learning
models to run on resource-constrained devices through advances in hardware accelerators and
model optimization techniques. TinyDL leverages QAT, NAS, hardware-aware quantization (HAQ),
and knowledge distillation to maintain high accuracy within stringent memory and energy budgets.
As shown, this progression expands the use cases from simple tasks such as gesture recognition or
electrocardiogram (ECG) monitoring in TinyML to more complex applications such as speech
recognition, vision-based inference, and autonomous systems in TinyDL.
[Figure 2, partially recoverable. Model Size: TinyML ≤ 250 KB; TinyDL typically hundreds of KB to
under 1 MB. Optimization techniques listed in the figure include Post-Training Quantization (PTQ)
and Hardware-Aware Quantization (HAQ).]
Fig. 2. Comparative summary of TinyML and TinyDL across four key aspects: model size, hardware platforms,
optimization techniques, and representative use cases
3 Hardware Platforms for TinyML and TinyDL
TinyML and TinyDL applications run on highly resource-constrained devices. This section surveys
the hardware platforms enabling TinyML, from general MCUs to new specialized AI accelerators,
and how these platforms are evaluated. Despite their modest specs, these devices can perform
meaningful ML tasks at the edge by balancing performance, power, and accuracy through careful
design and benchmarking.
3.1 MCUs
MCUs form the backbone of many TinyML deployments, bringing intelligence to the extreme
edge. They are single-chip computers designed for low power operation, often running on batteries
in remote sensors or wearables [27]. MCUs operate under strict hardware constraints: low clock
speeds (typically tens to a few hundred MHz) and limited on-chip memory (often only tens to a
few hundred kilobytes of RAM). For example, a typical MCU might have approximately 128 KB of
RAM and 1 MB of flash storage, versus the gigabytes of memory and storage on a modern smart-
phone [28]. Because they usually run on small batteries, energy efficiency is paramount – TinyML
devices must consume mere milliwatts or microwatts to run for long periods [27]. Despite these
limitations, MCUs are capable of running surprisingly complex ML workloads when models are
optimized. Continuous improvements in MCU hardware (e.g. more efficient 32-bit ARM Cortex-M
processors) have made them powerful and energy-efficient enough to handle small neural networks
within tight power budgets [17]. In other words, it is now feasible to deploy machine learning
on battery-operated MCU-based sensors in the field. Equally important is the software toolchain
that supports MCUs. Frameworks like TensorFlow Lite for Microcontrollers (TFLite Micro) and platforms
like Edge Impulse allow developers to compress and deploy trained models on tiny devices. For
instance, TFLite Micro can convert a neural network into a form that runs in as little as 16 KB of
RAM on an MCU. Techniques such as 8-bit quantization and pruning are used to shrink model size
and computation so that inference can execute in real time on limited CPU and memory.
• Clock Speed: Often 20–200 MHz clock rates (much lower than GHz-class processors) [27],
  which limits the raw compute throughput of MCUs.
• Memory: On the order of kilobytes to a few hundred kilobytes of RAM and usually under
  a few MB of flash storage [29]. This means models must be very small and efficient.
• Power/Battery: Designed for ultra-low power use; many MCU systems run on coin-cell
  batteries or energy harvesters. Power consumption is in the milliwatt or even microwatt
  range, so the device can operate for months or years [29].
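To make the memory constraint concrete, the following minimal sketch checks whether a model fits the flash and RAM budget of the typical MCU mentioned above (about 128 KB of RAM and 1 MB of flash). The parameter count and peak activation size are hypothetical values chosen for illustration, and the check assumes that weights reside in flash and activations in RAM, ignoring code size, stack, and runtime overhead.

```python
def model_bytes(num_params: int, bits_per_weight: int) -> int:
    """Storage needed for the weights alone, in bytes (rounded up)."""
    return (num_params * bits_per_weight + 7) // 8


def fits_on_mcu(num_params, bits_per_weight, peak_activation_bytes,
                flash_bytes, ram_bytes):
    """Return (fits_flash, fits_ram) for a simple two-budget check.

    Assumption: weights live in flash and the peak activation buffer lives
    in RAM; code size, stack, and runtime overhead are ignored.
    """
    weights = model_bytes(num_params, bits_per_weight)
    return weights <= flash_bytes, peak_activation_bytes <= ram_bytes


KB = 1024
FLASH = 1024 * KB   # 1 MB of flash
RAM = 128 * KB      # 128 KB of RAM

# Hypothetical model: 300,000 parameters and a 96 KB peak activation buffer.
PARAMS = 300_000
PEAK_ACT = 96 * KB

fp32_fit = fits_on_mcu(PARAMS, 32, PEAK_ACT, FLASH, RAM)
int8_fit = fits_on_mcu(PARAMS, 8, PEAK_ACT, FLASH, RAM)

print("FP32 weights:", model_bytes(PARAMS, 32), "bytes, (flash, RAM) fit:", fp32_fit)
print("INT8 weights:", model_bytes(PARAMS, 8), "bytes, (flash, RAM) fit:", int8_fit)
```

Under these assumptions the 32-bit version exceeds the flash budget while the 8-bit version fits, which is the kind of arithmetic that motivates quantization on such devices.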
Even with these constraints, MCUs have demonstrated the ability to run useful ML inference
tasks at the edge. By using optimized models, an MCU can perform tasks like keyword spotting
(KWS), gesture recognition, anomaly detection, or simple image classification entirely on-device
[30]. For example, researchers have successfully deployed a voice wake-word detector and simple
vision models on tiny boards like the Espressif ESP32 and STMicroelectronics STM32 series MCUs.
The ESP32 (a dual-core MCU up to 240 MHz with approximately 520 KB RAM) and various
STM32 Cortex-M variants (e.g. an M7 at 216 MHz with a few hundred KB of RAM) are popular
choices that, with quantized models, can handle basic deep learning tasks under tight memory and
energy constraints. Recent studies and white papers highlight that the combination of improving
MCU hardware and clever model optimizations enables these chips to support machine learning
workloads that were once thought impossible on such limited devices [27]. MCUs provide a flexible,
low-cost platform for TinyML, albeit one that demands extreme efficiency in model design.
3.2.1 Google Edge Tensor Processing Unit. The Google Edge Tensor Processing Unit (TPU) is a
small application-specific integrated circuit (ASIC) designed by Google to accelerate TensorFlow
Lite models at the edge. Each Edge TPU can perform 4 trillion operations per second (4 TOPS) while
consuming about 2 W of power (roughly 2 TOPS/W) [31]. In practical terms, an Edge TPU can run
vision models like MobileNet V2 at nearly 400 frames per second in a power-efficient manner [32],
far beyond what a typical MCU could achieve. These chips often come as co-processors (e.g., in the
Coral Edge TPU USB sticks or M.2 modules) that pair with an MCU or microprocessor, offloading the
heavy math of neural networks. By handling matrix multiplications and convolutions in dedicated
hardware, the Edge TPU enables real-time image and audio inference on the edge device with
minimal latency and modest power use.
3.2.2 Syntiant Neural Decision Processors Series. Syntiant’s Neural Decision Processors (NDP) are
ultra-low-power neural accelerators aimed at always-on workloads like KWS and sensor analytics.
They use a custom deep neural network inference engine that runs models efficiently with parallel
multiply-accumulate (MAC) units and an optimized data path for minimal idle cycles [31]. For
example, the Syntiant NDP120 can continuously listen for voice commands using only a few
microwatts. In MLPerf Tiny benchmark tests, the NDP120 was able to perform a KWS inference in
about 4.3 ms while consuming only 35 µJ of energy per inference (at 30 MHz operation) [33]. This is
orders of magnitude more energy-efficient than running the same task on a generic MCU. The NDP
chips achieve this by being ASICs optimized for neural workloads: they store and process neural
network layers on-chip to avoid costly memory accesses, and they integrate small digital signal
processor (DSP) cores for preprocessing tasks. Syntiant’s platform demonstrates how specialized
silicon can deliver real-time AI within a milliwatt power budget.
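The relationship between energy per inference, latency, and average power can be worked through with a short calculation. The sketch below uses the NDP120 figures reported above (35 µJ and 4.3 ms per inference). The inference rate, the zero idle power, and the coin-cell capacity and voltage are hypothetical assumptions introduced only to illustrate the arithmetic.

```python
ENERGY_PER_INFERENCE_J = 35e-6   # 35 uJ per KWS inference (figure reported above)
LATENCY_S = 4.3e-3               # 4.3 ms per inference (figure reported above)


def active_power_w(energy_j, latency_s):
    """Average power drawn while an inference is running."""
    return energy_j / latency_s


def average_power_w(energy_j, inferences_per_s, idle_power_w=0.0):
    """Long-run average power for a given inference rate plus idle draw."""
    return energy_j * inferences_per_s + idle_power_w


def battery_life_days(capacity_mah, voltage_v, avg_power_w):
    """Idealized battery life, ignoring self-discharge and converter losses."""
    stored_energy_j = capacity_mah / 1000.0 * 3600.0 * voltage_v
    return stored_energy_j / avg_power_w / 86400.0


# Hypothetical duty cycle: 10 inferences per second, idle power neglected.
RATE_HZ = 10
p_active = active_power_w(ENERGY_PER_INFERENCE_J, LATENCY_S)
p_avg = average_power_w(ENERGY_PER_INFERENCE_J, RATE_HZ)

# Hypothetical battery: a 225 mAh coin cell at 3.0 V.
days = battery_life_days(225, 3.0, p_avg)

print(f"active power:  {p_active * 1e3:.2f} mW")
print(f"average power: {p_avg * 1e6:.0f} uW at {RATE_HZ} inferences/s")
print(f"battery life:  {days:.1f} days")
```

The calculation shows why duty cycling matters: the power drawn during an inference is in the milliwatt range, while the long-run average depends on how often inference is triggered.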
3.2.3 Himax WiseEye WE-I Plus. The Himax WE-I Plus (HX6537-A) is an example of an AI-enabled
MCU/ASIC tailored for vision and sensor inferencing at the edge. It combines a 400 MHz DSP with
dedicated hardware accelerators (for tasks like image processing, HOG feature extraction, and JPEG
encoding) in an ultra-low-power design [34]. Uniquely, the WE-I Plus is event-driven: it stays in a
near-standby mode until its camera or sensor accelerator detects a trigger (e.g., motion or a person
in view), then the DSP wakes to run a neural network inference [34]. This architecture is highly
power-efficient. In fact, when running a person-detection CNN (TinyML vision model), the average
power consumption can be under 5 mW, an exceptionally low figure for an image recognition task.
By leveraging an ASIC with built-in neural accelerators, the Himax WE-I Plus achieves real-time
vision inference (e.g., detecting human presence in a frame) using only a fraction of the energy
that a general-purpose MCU would require for the same task [34].
These examples illustrate the importance of custom AI silicon for TinyML. Specialized edge AI
chips like the Edge TPU, Syntiant NDP, Himax WE-I, as well as others (e.g. Intel’s Movidius Myriad
X visual processing unit and various analog neural chips), focus on the common computational
patterns of ML algorithms. By implementing neural network operations (matrix multiplies, convo-
lutions, etc.) in hardware, they achieve far higher throughput per watt than a CPU. Innovations
such as parallel MAC arrays, on-chip memory for weights/activations, and streamlined dataflows
allow these ASICs to perform inference with minimal wasted energy. Many also include features
like built-in DSPs or camera interfaces to handle sensor data directly. The result is low-power,
real-time inference: tasks like wake-word detection or gesture recognition can run continuously
on the edge without exhausting a battery. Table 2 summarizes key TinyML hardware platforms,
including MCUs and neural accelerators, along with their specifications and typical use cases.
| Platform | Type | Processor / Clock | Memory (RAM/Flash) | Notable Features | Typical TinyML Use Case |
|---|---|---|---|---|---|
| Espressif ESP32 [27] | MCU (Wi-Fi SoC) | Dual-core 32-bit MCU @ 240 MHz | 520 KB SRAM, 4 MB flash (external) | Wi-Fi/Bluetooth integrated; low cost | IoT sensors, simple KWS |
| STM32 (e.g., STM32F7) [27] | MCU (Cortex-M7) | 216 MHz ARM Cortex-M7 MCU | ∼320 KB RAM, 1 MB flash | DSP instructions, optional FPU | Industrial sensing, audio classification |
| Google EdgeTPU [32] | ASIC Neural Accelerator | Custom ASIC @ ∼200 MHz (equiv.) | Uses host memory (external DRAM) | 4 TOPS (∼400 FPS MobileNet) at 2 W; PCIe/USB interfaces | High-speed vision (object detection, etc.) |
| Syntiant NDP120 [33] | Neural co-processor ASIC | Programmable DNN core @ 30–100 MHz | On-chip memory for models | ∼35 μJ per inference (KWS); always-on capability | Always-listening AI (wake word, anomaly detection) |
| Himax WE-I Plus [34] | AI MCU/ASIC with DSP | 400 MHz DSP + accelerators | 2 MB SRAM, 4 MB flash (typical dev board) | Camera interface; <5 mW person detection [34] | Ultra-low-power vision (people counting, etc.) |
TinyML models, due to their limited capacity and computational constraints, typically exhibit
higher generalization errors. Their reduced ability to learn complex features makes them more
prone to poor performance on unseen or out-of-distribution data. For instance, in the Visual Wake
Words (VWW) task, lightweight convolutional models such as MobileNet and MCUNet have achieved 85–90%
accuracy within 200–250 KB memory budgets [40, 41]. In contrast, traditional pipelines using
handcrafted features like HOG with classical classifiers (e.g., SVM, decision trees) tend to perform
significantly worse, typically in the 70–75% accuracy range, due to their limited ability to capture
spatial hierarchies and generalize to real-world images under constrained memory [42, 43].
4.1.2 High Feature Engineering Burden. Feature engineering involves manually selecting or trans-
forming raw sensor data (e.g., audio, motion, temperature) into meaningful inputs for traditional
models like decision trees or SVMs. While essential, this process is time-consuming, requires
domain expertise, and often fails to capture complex patterns as effectively as deep learning. In
TinyML, these limitations are amplified due to the resource constraints of edge devices, making
manual pipelines impractical for real-time, scalable applications.
• Post-Training Quantization (PTQ): One-shot float → INT8 conversion; ∼4× size drop
  with negligible accuracy loss [9].
• Quantization-Aware Training (QAT): Simulates quantization noise during training, enabling 4-
  or even 2-bit inference while preserving accuracy [48].
• Mixed-Precision Schemes (e.g., hardware-aware quantization (HAQ)): Per-layer
  bit-width selection (2/4/8 bit) based on latency/energy targets [49].
• Custom Numeric Formats (e.g., TENT): Tapered or block-floating formats tuned per
  layer; up to 31% energy savings over INT8 baselines [50].
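The one-shot float → INT8 conversion in the first item can be illustrated with a minimal per-tensor affine quantizer. This is a generic sketch and not the procedure of any specific toolchain: the function names and the example weights are hypothetical, and real converters typically add calibration data and per-channel scales.

```python
def quantize_int8(values):
    """Per-tensor affine quantization of a list of floats to INT8.

    Returns the integer codes plus the (scale, zero_point) needed to decode.
    """
    lo = min(min(values), 0.0)            # widen the range so 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights for illustration.
weights = [-0.62, -0.1, 0.0, 0.33, 0.91]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 4 * len(weights)             # 32-bit floats
int8_bytes = 1 * len(weights)             # 8-bit integers (scale/zero-point overhead ignored)

print("codes:", codes)
print("max reconstruction error:", max_error, "with scale", scale)
print("size ratio FP32/INT8:", fp32_bytes / int8_bytes)
```

Each weight moves from four bytes to one, which is the source of the roughly 4× size reduction, and the reconstruction error stays within one quantization step.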
[Figure: TinyML hardware platforms, grouped into microcontrollers and accelerator/AI-optimized
chips. Recoverable labels: ARM Cortex-M, Kendryte K210, Coral, IMU, Raspberry Pi, Arduino Portenta
H7, Edge Impulse-ready devices, OpenMV Cam H7.]
goes further by running deep learning on tiny MCUs. For language tasks, TinyBERT and DistilBERT
are smaller, faster versions of BERT. TinyBERT uses teacher-student training to keep accuracy high,
and DistilBERT keeps 97% of BERT’s performance with fewer parameters. Figure 4 summarizes key
TinyDL breakthroughs from 2016 to 2024, illustrating the progression from early lightweight CNNs
like SqueezeNet and MobileNet to advanced models such as MCUNet, TinyBERT, and RedMule
that enable deep learning on microcontrollers.
[Figure 4: timeline of TinyDL breakthroughs. Recoverable entries: SqueezeNet (UC Berkeley &
Stanford University), AlexNet-level accuracy with 50× fewer parameters in <0.5 MB; MobileNet v2
(Google), inverted residual structure; DistilBERT (Hugging Face), smaller and faster than BERT;
TinyBERT (Huawei), layerwise knowledge distillation; MobileViT (Apple), combines CNNs and
Transformers for mobile vision; RedMule, first real training engine for MCUs.]
5.1.3 Hardware-Aware Architecture Design. Recent developments in lightweight CNN design have
emphasized hardware-aware optimization through NAS specifically tailored for MCU constraints
[63]. MCUNet demonstrates this approach by achieving 68.7% ImageNet accuracy with only 0.51
MB model size through joint optimization of network architecture and inference scheduling [41].
The framework employs a two-stage NAS that first optimizes the search space to fit resource
constraints, then specializes the network architecture within the optimized space [41].
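The two-stage idea can be sketched as a toy search. Everything in the block below is an assumption made for illustration: the memory and compute proxies, the candidate widths and resolutions, the SRAM budget, and the use of a compute proxy as a stand-in for capacity. It is not the MCUNet implementation, and a real system would rank candidates by measured accuracy and measured memory.

```python
import random

SRAM_BUDGET_KB = 320                      # hypothetical SRAM budget
WIDTHS = (0.5, 1.0, 1.5, 2.0)             # hypothetical width multipliers
RESOLUTIONS = (96, 128, 160, 224)         # hypothetical input resolutions
KERNELS, EXPANSIONS = (3, 5, 7), (3, 4, 6)
NUM_BLOCKS = 6


def sample_arch(rng):
    """One architecture = a (kernel size, expansion ratio) choice per block."""
    return [(rng.choice(KERNELS), rng.choice(EXPANSIONS)) for _ in range(NUM_BLOCKS)]


def peak_memory_kb(arch, width, res):
    """Toy proxy: the widest expanded activation dominates peak memory."""
    channels = 16 * width
    return max(e for _, e in arch) * channels * (res / 4) ** 2 / 1024


def mflops(arch, width, res):
    """Toy compute proxy, used here as a stand-in for model capacity."""
    channels = 16 * width
    return sum(k * k * e * channels * channels * (res / 8) ** 2 for k, e in arch) / 1e6


def choose_search_space(rng, samples=200, min_feasible=0.2):
    """Stage 1: pick the (width, resolution) space whose feasible samples
    have the highest mean compute under the memory budget."""
    best_space, best_score = None, -1.0
    for width in WIDTHS:
        for res in RESOLUTIONS:
            archs = [sample_arch(rng) for _ in range(samples)]
            ok = [a for a in archs if peak_memory_kb(a, width, res) <= SRAM_BUDGET_KB]
            if len(ok) < min_feasible * samples:
                continue                  # too few candidates fit: skip this space
            score = sum(mflops(a, width, res) for a in ok) / len(ok)
            if score > best_score:
                best_space, best_score = (width, res), score
    return best_space


def specialize(rng, space, trials=500):
    """Stage 2: random search inside the chosen space, keeping only
    architectures that respect the memory budget."""
    width, res = space
    feasible = [a for a in (sample_arch(rng) for _ in range(trials))
                if peak_memory_kb(a, width, res) <= SRAM_BUDGET_KB]
    return max(feasible, key=lambda a: mflops(a, width, res), default=None)


rng = random.Random(0)
space = choose_search_space(rng)
best = specialize(rng, space)
print("chosen (width, resolution):", space)
print("best architecture:", best)
print("peak memory (KB):", peak_memory_kb(best, *space))
```

The design point the sketch tries to convey is the ordering: the search space is first narrowed to one that fits the device, and only then is an individual network selected inside it.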
5.2.2 DistilBERT Compression Strategy. Utilizing a different distillation approach, DistilBERT
reduces the size of the original BERT model by 40% while retaining 97% of its language understanding
capabilities [55]. The model achieves 91.3% F1 score on SQuAD v1.1 with 66 million parameters,
demonstrating the effectiveness of student-teacher training with temperature-scaled softmax
distributions [55].
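The temperature-scaled softmax used in student-teacher training can be illustrated with a generic soft-target loss. The sketch below shows only the soft-target term in its commonly used form (a KL divergence scaled by the squared temperature). It is not claimed to be the complete DistilBERT objective, and the logits and temperature are hypothetical values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]

print("teacher at T=1:", softmax(teacher, 1.0))
print("teacher at T=4:", softmax(teacher, 4.0))
print("soft-target loss at T=2:", distillation_loss(student, teacher, 2.0))
```

Raising the temperature flattens the teacher distribution, so the student also receives a signal about the relative likelihood of the non-target classes.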
5.2.3 MCU Deployment Challenges. Recent research has focused on optimizing transformer deployment
specifically for MCUs, addressing unique challenges posed by the multi-head self-attention
mechanism [64]. The primary bottlenecks include the high memory footprint of intermediate attention
results and frequent data marshaling operations [64]. Novel approaches such as Fused-Weight
Self-Attention (FWSA) and Depth-First Tiling have been developed to mitigate these challenges,
achieving up to a 6.19× reduction in peak memory usage while maintaining computational accuracy
[64].
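One algebraic identity that weight fusion in self-attention can exploit is that the query and key projections may be multiplied together offline, so the attention scores are computed without materializing separate query and key buffers. The sketch below demonstrates only this identity for a single head without biases. The exact formulation in [64] may differ, and the matrices are hypothetical.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


# Hypothetical shapes: sequence length S=3, embedding size E=2, projection size P=3.
X = [[1, 2], [3, 4], [5, 6]]          # S x E input tokens
Wq = [[1, 0, 2], [0, 1, 1]]           # E x P query projection
Wk = [[2, 1, 0], [1, 0, 1]]           # E x P key projection

# Unfused: materialize Q and K (two S x P buffers), then compute Q K^T.
Q = matmul(X, Wq)
K = matmul(X, Wk)
scores_unfused = matmul(Q, transpose(K))

# Fused: precompute Wq Wk^T once offline (E x E), then compute (X Wqk) X^T.
Wqk = matmul(Wq, transpose(Wk))
scores_fused = matmul(matmul(X, Wqk), transpose(X))

print("unfused scores:", scores_unfused)
print("fused scores:  ", scores_fused)
```

Both paths produce the same score matrix, while the fused path needs a single S x E intermediate instead of two S x P buffers. Whether this is a net saving depends on the relative sizes of E and P.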
Table 3. Summary of TinyDL Models with Size, Inference Speed, and Task Accuracy
| Model Name | Architecture | Size (MB) | Latency (ms) | Accuracy (%) | Target Task | Hardware |
|---|---|---|---|---|---|---|
| TinyBERT-4L [55] | Transformer | 14.5 | 5.0 | 96.8 (SST-2) | Text Classification | Mobile SoC |
| DistilBERT [55] | Transformer | 66.0 | 7.0 | 91.3 (SQuAD) | QA / NLP Tasks | Mobile GPU |
| MobileNetV2-0.35 [60] | CNN | 3.4 | ∼32 | 71.8 (ImageNet) | Image Classification | STM32H7 MCU |
| SqueezeNet v1.1 [61, 62] | CNN | 4.8 | ∼20 | 58.38 (ImageNet) | Object Detection | Kendryte K210 |
| MCUNet-256kB [41] | CNN + NAS | 0.51 | 12.0 | 70.7 (ImageNet) | Image Classification | STM32F746 |
| EfficientNet-Lite0 [60] | CNN | 4.7 | 45.0 | 75.1 (ImageNet) | Image Classification | EdgeTPU |
| DS-CNN (MLPerf) [43] | 1D-CNN | 0.05 | 20.0 | >90.0 (Commands) | Wake Word Detection | ARM Cortex-M4 |
| MobileNet (VWW) [43] | CNN | 0.32 | 8.0 | 80.0 (VWW) | Visual Wake Words | STM32 MCU |
| Deep AutoEncoder (AD) [43] | Autoencoder | 0.27 | 15.0 | 85.0 (AD Bench) | Anomaly Detection | MCU Platform |
| ResNet (IC) [43] | CNN | 0.096 | 25.0 | 85.0 (ImageNet) | Image Classification | STM32 MCU |
| Transformer-FWSA [64] | Transformer | 2.1 | 180.0 | 78.2 (NLP Tasks) | NLP Tasks | STM32F746 |
| SquishedNet [61] | CNN | 0.95 | 156.0 | 77.0 (CIFAR-10) | Image Classification | Nvidia Jetson TX1 |
5.3.1 Quantization Methodologies. Quantization represents one of the most effective approaches
for model compression in TinyML systems [66, 67]. PTQ converts trained models from floating-point
to reduced-precision representations, typically INT8, achieving 4× model size reduction with
minimal accuracy degradation [65]. QAT incorporates quantization effects during training, enabling
more aggressive precision reduction while maintaining model performance [66]. Comparative
analysis shows that quantization generally outperforms pruning across various compression ratios,
with benefits becoming more pronounced at moderate compression levels [67].
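How QAT incorporates quantization effects during training can be illustrated with a quantize-dequantize ("fake quantization") step in the forward pass. The sketch below trains a single 4-bit weight on a toy regression task. The straight-through gradient rule, the data, the scale, and the learning rate are all assumptions chosen for illustration rather than the method of any cited work.

```python
import math


def fake_quant(x, scale, bits):
    """Quantize-dequantize: snap x to a signed `bits`-bit grid, return a float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = math.floor(x / scale + 0.5)               # round half up
    return max(qmin, min(qmax, q)) * scale


def ste_grad(x, scale, bits, upstream):
    """Straight-through estimator: pass the gradient where x lies inside the
    representable range and block it where the value was clipped."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return upstream if qmin * scale <= x <= qmax * scale else 0.0


# Toy task (hypothetical data): fit y = 0.7 * x with a single 4-bit weight.
xs = [1.0, 2.0, 3.0]
ys = [0.7 * x for x in xs]
SCALE, BITS, LR = 0.25, 4, 0.02

w = 0.0                                           # latent full-precision weight
for _ in range(200):
    wq = fake_quant(w, SCALE, BITS)               # forward pass uses the quantized weight
    grad_wq = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= LR * ste_grad(w, SCALE, BITS, grad_wq)   # update the latent weight

print("latent weight:", round(w, 4))
print("quantized weight used at inference:", fake_quant(w, SCALE, BITS))
```

The target value 0.7 is not on the 4-bit grid, so the quantized weight settles on a neighboring grid point. The loss is always evaluated with the quantized weight, which is what lets the network adapt to the reduced precision.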
5.3.2 Co-Design of Architecture and Runtime: MCUNet. MCUNet exemplifies a system-algorithm co-
design approach where both the neural architecture (TinyNAS) and inference engine (TinyEngine)
are jointly optimized to meet the extreme memory and compute constraints of MCUs. Unlike
traditional pipelines that first fix either the library or the model, MCUNet explores a larger design
space by integrating both dimensions. Figure 5 illustrates this co-optimization flow, highlighting
how it surpasses previous approaches limited to one-directional tuning.
5.3.3 Neural Network Pruning. Neural network pruning eliminates redundant or less important
parameters to reduce model complexity and memory footprint [65, 68]. Magnitude-based pruning
removes the weights with the smallest absolute values, providing a straightforward approach for parameter
reduction [68]. Structured pruning targets entire network components such as filters or layers,
enabling more significant architectural simplifications suitable for severely resource-constrained
environments [65]. Research demonstrates that structured pruning can achieve compression ratios
up to 13× without significant accuracy loss through iterative pruning and retraining cycles [68].
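The difference between magnitude-based and structured pruning can be shown with a minimal sketch. The filter values are hypothetical, the L1 norm is one common ranking criterion chosen here as an assumption, and the retraining step of an iterative prune-and-retrain cycle is omitted.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of weights with the
    smallest absolute value (ties at the threshold are pruned as well)."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_filters(filters, keep):
    """Structured pruning: keep the `keep` filters with the largest L1 norm
    and drop the others entirely. Returns the kept filters and their indices."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: sum(abs(w) for w in filters[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [filters[i] for i in kept], kept


# Hypothetical layer: four filters with three weights each.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.03],
    [0.5, -0.4, 0.1],
    [-0.05, 0.6, 0.02],
]

sparse = magnitude_prune(filters, 0.5)
small, kept = prune_filters(filters, keep=2)

print("magnitude-pruned weights:", sparse)
print("kept filter indices:", kept)
print("structured result shape:", len(small), "x", len(small[0]))
```

Magnitude pruning leaves the tensor shape unchanged and only introduces zeros, whereas structured pruning produces a smaller dense layer, which is usually easier to exploit on MCUs without sparse-kernel support.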
5.3.4 Joint Optimization Approaches. The combination of quantization and pruning techniques
has emerged as a powerful strategy for achieving maximum compression efficiency [66, 69].
Quantization-aware pruning yields more computationally efficient models than either technique
alone, particularly for ultra-low latency applications [66]. Joint optimization frameworks
demonstrate superior computational efficiency compared to sequential application of compression
techniques, with benefits varying based on target compression ratios and application requirements
[69].
[Figure 5 panels: (a) Search NN model on an existing library, e.g., ProxylessNAS, MnasNet.
(b) Tune deep learning library given a NN model, e.g., TVM.]
Fig. 5. MCUNet co-designs neural architecture (TinyNAS) and inference scheduling (TinyEngine) for MCU
efficiency. Unlike (a) architecture search and (b) runtime tuning done separately, (c) MCUNet integrates both
for improved accuracy and resource use [41]
5.3.5 Network Augmentation and Auxiliary Supervision. Network augmentation offers a comple-
mentary training-time optimization strategy, particularly relevant for TinyDL. As shown in Figure 6,
a tiny model is embedded into larger networks that share weights and provide auxiliary supervision.
This enables the tiny model to learn stronger representations without increasing its inference-time
footprint, making it ideal for resource-constrained deployments.
[Figure 6 panels: Step 1 and Step 2, applied to the same input. The training gradient combines
both parts: g = g_base + g_aug.]
Fig. 6. Network augmentation strategy: A tiny model is trained within a larger model to benefit from auxiliary
supervision, but only the tiny network is used during inference [70].
pipeline with integrated DSP processing and on-device testing. TFLite Model Maker [74] supports
fine-tuning and exporting models tailored for deployment on EdgeTPUs and mobile hardware.
PyTorch Mobile [75], while not suitable for MCUs, supports deployment of larger TinyDL models
(including Transformers) on higher-end mobile SoCs.
More specialized toolchains target performance tuning and low-level integration. CMSIS-NN [76]
provides hand-optimized kernels for ARM Cortex-M architectures and is often paired with TFLite
Micro for improved inference latency. MicroTVM [77], as an extension of the TVM compilation stack,
brings auto-tuning and graph optimization to MCU platforms like STM32 and ESP32. Glow [78],
developed by Meta, offers ahead-of-time graph lowering for hardware accelerators and NPUs. Tools
like DeepC convert Keras models into static C code, ideal for systems without dynamic memory
support [79], while MLPACK [80], although not TinyML-specific, is a lightweight C++ library
adaptable for embedded use. Vendor-specific tools such as X-CUBE-AI [81] for STM32 platforms
and academic solutions like QKeras with HLS4ML [82] for FPGA deployment demonstrate the
growing ecosystem of domain-targeted solutions. Commercial platforms such as OctoML [83] and
Nebullvm [84] further enhance deployment by automating compilation, quantization, and precision
tuning across edge platforms. These diverse tools reflect the increasing demand for streamlined,
hardware-aware TinyDL deployment pipelines.
• TensorFlow Lite Micro: de facto baseline; 8-bit integer quantization, huge community; no
  GUI [72].
• Edge Impulse Studio: drag-and-drop AutoML with built-in DSP blocks, ideal for newcomers [73].
• MicroTVM: TVM-based compiler autotuning for MCUs (STM32, ESP32) and RISC-V boards
  [77].
• Neuton TinyML: sub-kilobyte models that fit where even CMSIS-NN is too heavy [90].
Other tools address the needs of beginner users or are integrated within hardware-specific
ecosystems. Arduino IDE and ArduinoML [93] offer intuitive interfaces for model deployment on
AVR and Cortex-M boards but are limited to simpler use cases. The NXP eIQ Toolkit [94] provides
vendor-specific integration for NXP’s MCU and NPU portfolio, offering a graphical interface with
built-in support for quantization and pruning.
From the research perspective, Microsoft’s EdgeML [95] introduces novel model architectures
such as ProtoNN [101] for ultra-low-memory inference. For advanced users, the Google Colab and
TFLite workflow [96, 97] provides maximum scripting flexibility, allowing users to train models in
the cloud and convert them for deployment using the TFLite converter. Sony AI Studio [98, 99],
while limited to the Spresense board, offers a curated development environment for vision and
audio inference. Lastly, KaaEdge AI [100] addresses large-scale deployment and orchestration needs,
supporting federated learning pipelines and distributed edge intelligence.
Together, these platforms span a wide spectrum, from GUI-based environments to
optimization-first toolchains, offering developers diverse options depending on their expertise,
application complexity, and hardware targets.
achieved high accuracy with a minimal model size and fast inference time, addressing privacy and
latency concerns associated with camera-based systems [122]. Another notable development is an
IoT wristband designed for on-device, privacy-preserving HAR, providing a low-power, low-cost
solution that performs real-time activity classification without relying on cloud infrastructure
[123]. Overall, these advancements underscore TinyML’s pivotal role in enhancing consumer
healthcare data protection through AI-driven and privacy-preserving techniques [124], expanding
wearable technologies, and supporting intelligent medical devices with high autonomy and energy
efficiency [125].
efforts emphasize TinyML’s critical role in advancing industrial efficiency and environmental
sustainability by pushing AI capabilities closer to the data source [12, 149].
9.2.1 Quantization vs. Accuracy. Quantization reduces the precision of weights and activations,
typically from 32-bit floating point (FP32) to 8-bit integers (INT8), and sometimes lower. Moving
from FP32 to INT8 alone cuts weight storage by a factor of four and typically lowers latency and
energy usage, but it may reduce model accuracy.
Magnitude-based structured pruning of MobileNetV2 (50% weights removed) incurs
< 1% top-1 drop on ImageNet while shrinking model size by 1.9 times [65]. Jacob et al. [160]
found that PTQ can reduce MobileNetV1 accuracy on ImageNet by up to 3–4%, though QAT can
mitigate this. QAT introduces fake quantization nodes during training, allowing the network to
compensate for precision loss. Special training procedures and loss-aware quantization [161] are
employed for binary networks. Beyond QAT and PTQ, several quantization schemes have emerged
to address different levels of granularity and different trade-offs. For instance, per-channel quantization adjusts
the scale and zero-point for each output channel, leading to better numerical stability and often
improved accuracy compared to per-tensor quantization [162]. Mixed-precision quantization allows
different layers or operations to use different bit-widths (e.g., 8-bit for early layers and 4-bit for
later layers), striking a balance between efficiency and performance [163]. Techniques like DoReFa-
Net [164] and Learned Step Size Quantization [165] further improve accuracy by learning optimal
quantization parameters during training. In the TinyML context, these methods have enabled
the deployment of accurate models like ResNet and MobileNet variants on MCUs with under
256 KB of SRAM. Furthermore, hardware-aware quantization strategies are increasingly integrated
into deployment pipelines using tools such as TensorFlow Model Optimization Toolkit [166] and
PyTorch’s quantization API, enabling automated conversion and validation across platforms.
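The difference between per-tensor and per-channel quantization can be made concrete with a minimal sketch of asymmetric affine INT8 quantization. The helper names (`quant_params`, `quantize`, `dequantize`, `roundtrip_error`) and the toy weight values below are illustrative assumptions; they are not taken from the TensorFlow Model Optimization Toolkit or the PyTorch quantization API, and production toolchains add calibration and operator fusion that are omitted here.

```python
def quant_params(values, num_bits=8):
    """Return (scale, zero_point) for asymmetric affine quantization."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range must contain 0.0 so that zero maps to an integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to clamped signed integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate real values."""
    return [(q - zero_point) * scale for q in q_values]


def roundtrip_error(values, scale, zero_point):
    """Largest absolute error after quantizing and dequantizing."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights for two output channels with very different ranges.
small_channel = [-0.02, 0.01, 0.015, -0.005]
large_channel = [-2.0, 1.5, 0.7, -1.2]

# Per-tensor: one (scale, zero_point) shared by both channels.
shared = quant_params(small_channel + large_channel)
per_tensor_err = roundtrip_error(small_channel, *shared)

# Per-channel: the small channel gets its own (scale, zero_point).
own = quant_params(small_channel)
per_channel_err = roundtrip_error(small_channel, *own)

print(f"per-tensor error on small channel:  {per_tensor_err:.6f}")
print(f"per-channel error on small channel: {per_channel_err:.6f}")
```

With a shared scale, the wide-range channel dictates the step size and the small-range channel loses most of its resolution; giving each channel its own parameters removes that coupling at the cost of storing one scale and zero-point per channel.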
9.2.2 Constraints Beyond Quantization. TinyML models often face deployment-specific constraints
such as memory budgets (e.g., 64 KB of RAM), latency caps (e.g., 10 ms), and energy limits (e.g.,
1 mJ per inference). Frameworks like 𝜇NAS [167] and Once-for-All (OFA) [168] support constraint-
aware NAS to generate tailored models. In practice, trade-offs are carefully evaluated to balance
latency, accuracy, and reliability. Constraint-aware modeling goes beyond architectural choices and
encompasses compiler-level and deployment-time optimizations. For example, models must comply
with quantization compatibility constraints of hardware accelerators like the GAP8 SoC, which
supports only INT8 convolutions [169], or the Ambiq Apollo3 Blue, which requires careful SRAM
and DMA management to maintain sub-mW operation [170]. Real-time constraints also vary across
application domains: voice-triggered devices may tolerate 10–20 ms of latency, whereas
anomaly detection in industrial sensors may allow hundreds of milliseconds but must operate
within a strict energy envelope. To handle such variance, modern compilers such as TVM [157],
with its Relay intermediate representation, and Glow [78] perform cross-layer optimization and memory layout transformations
that respect such deployment constraints. Additionally, tools like MCUNetV2 [171] integrate NAS
with firmware-level profiling to co-optimize models for specific MCUs, achieving a better trade-off
across the energy-latency-accuracy spectrum.
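As a rough illustration of constraint-aware selection, the sketch below filters candidate models against the example budgets quoted in this subsection (64 KB of RAM, 10 ms latency, 1 mJ per inference) and keeps the most accurate feasible one. The candidate names and all measurements are hypothetical, and the energy estimate simply multiplies average active power by latency; frameworks such as 𝜇NAS or OFA search far larger spaces and rely on measured or modeled hardware costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float     # validation accuracy in [0, 1]
    peak_ram_kb: float  # peak activation + scratch memory
    latency_ms: float   # time per inference
    power_mw: float     # average active power during inference

    @property
    def energy_mj(self) -> float:
        # mW * ms gives microjoules; divide by 1000 to get millijoules.
        return self.power_mw * self.latency_ms / 1000.0


@dataclass(frozen=True)
class Budget:
    ram_kb: float = 64.0
    latency_ms: float = 10.0
    energy_mj: float = 1.0


def feasible(c: Candidate, b: Budget) -> bool:
    """A candidate is deployable only if it meets every budget at once."""
    return (c.peak_ram_kb <= b.ram_kb
            and c.latency_ms <= b.latency_ms
            and c.energy_mj <= b.energy_mj)


def select(candidates, budget):
    """Return the most accurate feasible candidate, or None if none fits."""
    ok = [c for c in candidates if feasible(c, budget)]
    return max(ok, key=lambda c: c.accuracy) if ok else None


# Hypothetical candidates; the numbers are made up for illustration only.
CANDIDATES = [
    Candidate("net_a", 0.93, peak_ram_kb=120, latency_ms=8, power_mw=50),
    Candidate("net_b", 0.90, peak_ram_kb=60, latency_ms=9, power_mw=100),
    Candidate("net_c", 0.91, peak_ram_kb=48, latency_ms=14, power_mw=40),
    Candidate("net_d", 0.86, peak_ram_kb=32, latency_ms=4, power_mw=80),
]

best = select(CANDIDATES, Budget())
print(best.name if best else "no feasible model")  # prints "net_b"
```

The most accurate candidate (`net_a`) is rejected on memory and `net_c` on latency, so the selection falls to `net_b`, which illustrates why accuracy alone is a poor ranking criterion under deployment budgets.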
9.3.1 Google Speech Commands. Developed by Warden et al., this dataset contains short spoken
words sampled at 16 kHz. It is the standard benchmark for KWS. Tasks involve recognizing a fixed
vocabulary such as “yes”, “no”, and “go”.
9.3.2 Visual Wake Words. This dataset is a binary classification task for detecting the presence
of a person in low-resolution images. It is used in wake-word-style visual triggers for cameras or
embedded vision systems.
9.3.3 Tiny ImageNet and CIFAR. These datasets serve as benchmarks for image classification under
low-resolution and low-memory conditions. Tiny ImageNet is more challenging due to its 200-class
design, while CIFAR remains widely used for comparison.
9.3.4 𝜇MLPerf Benchmark Suite. MLCommons introduced 𝜇MLPerf to provide standardized evalu-
ation across KWS, image classification, and anomaly detection. It includes metrics like accuracy,
model size, memory footprint, and energy per inference, making it one of the most comprehensive
benchmarks for TinyML systems.
during inference, requiring sophisticated memory scheduling strategies that consider the entire
network topology rather than layer-wise optimization [176].
memory limitations, but requires algorithmic innovations that fundamentally differ from traditional
approaches [181].
These challenges represent fundamental research opportunities that will shape the future
development of TinyDL [59, 183]. Addressing these issues requires interdisciplinary collaboration across
machine learning, computer systems, and hardware design communities to develop innovative
solutions that unlock the full potential of edge AI applications [183, 184].
11 Future Directions
As TinyDL systems mature and expand across diverse application domains, from healthcare and
smart homes to industrial automation and autonomous sensing, new challenges and technological
frontiers are emerging. Addressing these will require interdisciplinary advances in hardware design,
algorithmic efficiency, secure training, and adaptable software ecosystems. This section outlines
five promising directions for future exploration.
Neuromorphic architectures employing Spiking Neural Networks (SNNs) offer an alternative
computational paradigm designed for the ultra-low-power, event-driven processing typical of
brain-inspired systems. These architectures promise efficient always-on inference on MCU-scale
devices, which makes them well suited to continuous monitoring applications, with
hardware platforms like BrainChip’s Akida and Intel’s Loihi leading the way. To fully leverage
SNNs in TinyDL, research must advance surrogate-gradient training, event encoding techniques,
and software toolchains that map spiking models onto neuromorphic hardware seamlessly [65].
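The event-driven behavior of a spiking neuron can be sketched with a discrete-time leaky integrate-and-fire (LIF) model. This is a textbook simplification under assumed parameters (`leak`, `threshold`, hard reset to zero); it does not describe how Akida or Loihi implement their neurons, and the function name `lif_spikes` is purely illustrative.

```python
def lif_spikes(input_current, leak=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    The membrane potential decays by `leak` each step, accumulates the
    incoming current, and emits a spike (1) when it reaches `threshold`,
    after which it is reset to zero.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# Two sub-threshold inputs accumulate into a spike; silence produces none.
print(lif_spikes([0.6, 0.6, 0.0, 0.0, 1.2]))  # prints [0, 1, 0, 0, 1]
```

Downstream computation is only triggered at the time steps where a spike occurs, which is the property that lets neuromorphic hardware stay idle during quiet input. The threshold comparison is also non-differentiable, which is why surrogate-gradient methods are needed for training.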
Implementing federated learning (FL) in TinyDL contexts addresses privacy and adaptability
by enabling decentralized learning across devices without sharing raw data, which is crucial for distributed
sensor networks. Lightweight frameworks such as TinyFedTL and TinyMetaFed demonstrate
on-device aggregation of quantized updates, yet challenges remain in managing communication
overhead, heterogeneous device capabilities, and adversarial resilience [69, 178]. Future work must
focus on sparsified updates, asynchronous or hierarchical FL protocols, and secure aggregation
mechanisms amenable to TinyML constraints.
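One way to picture sparsified updates is top-k selection on each client followed by sample-weighted averaging on the aggregator. The sketch below is a simplified illustration with hypothetical function names (`top_k_sparsify`, `fed_avg`) and toy numbers; it is not the TinyFedTL or TinyMetaFed protocol, and it leaves out quantization of the transmitted values, secure aggregation, and asynchronous scheduling.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of an update as {index: value}."""
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return {i: update[i] for i in ranked[:k]}


def fed_avg(global_weights, sparse_updates, num_samples):
    """Apply sample-weighted sparse client updates to the global weights."""
    total = sum(num_samples)
    new_weights = list(global_weights)
    for update, n in zip(sparse_updates, num_samples):
        for index, delta in update.items():
            new_weights[index] += (n / total) * delta
    return new_weights


# Hypothetical round with two clients and a four-parameter model.
global_weights = [0.0, 0.0, 0.0, 0.0]
client_updates = [
    [0.5, -0.01, 0.02, -0.4],  # client 1, trained on 30 samples
    [0.1, 0.3, -0.02, 0.0],    # client 2, trained on 10 samples
]
sparse = [top_k_sparsify(u, k=2) for u in client_updates]
new_global = fed_avg(global_weights, sparse, num_samples=[30, 10])
print(new_global)
```

Each client transmits only two index-value pairs instead of four dense values, which halves the uplink payload in this toy case; the trade-off is that small coordinates are dropped unless the client accumulates them locally for a later round.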
Tiny Foundation Models refer to miniaturized versions of large pretrained models intended for
deployment on edge hardware. Promising techniques such as knowledge distillation, structured
pruning, and quantization, applied to models like TinyViT, have shown the potential to reduce
model size to MCU-suitable scales while preserving task performance [41, 63]. The next step is to
enable modular foundational architectures, where a general pre-trained “backbone” model supports
multiple lightweight task-specific heads, with workflows powered by on-device or
Edge AutoML-enabled fine-tuning.
Edge AutoML seeks to automate the process of designing, compressing, and
deploying TinyDL models on resource-constrained devices. Hardware-aware NAS
frameworks such as TinyNAS and Once-for-All Networks have demonstrated effective ways to
balance accuracy with memory and latency constraints [61, 64]. However, integrating AutoML into
full deployment pipelines remains an open challenge. Future research should focus on combining
AutoML with model compression strategies like quantization and pruning and incorporating
hardware feedback to generate models that are not only accurate but also energy-efficient and
deployable in real-world TinyDL scenarios.
Domain-specific accelerators, including NPUs, ASICs, FPGAs, and specialized RISC-V engines,
offer substantial gains in inference speed, energy efficiency, and model scalability for TinyDL.
Devices like the EdgeTPU and transformer-focused RISC-V extensions efficiently deliver quantized
convolution and attention workloads, outperforming general-purpose MCUs [64, 65]. The challenge
now is to develop advanced compilation toolchains that partition and schedule TinyDL models
across heterogeneous hardware, integrate with platforms like TFLite Micro and CMSIS-NN, and
maximize runtime configuration flexibility without sacrificing portability or ease of development
[77, 78].
12 Conclusions
This survey presents a comprehensive examination of the evolution from TinyML to TinyDL,
highlighting how the convergence of efficient model architectures, software toolchains, and hard-
ware platforms has enabled sophisticated on-device intelligence in severely resource-constrained
environments. We began by delineating the scope and distinction between TinyML and TinyDL,
emphasizing the growing need to embed deep learning capabilities, once reserved for data centers,
into low-power MCUs and edge devices. We have outlined the hardware advancements, including
the emergence of neural accelerators and specialized ASICs, that now support the deployment of
deep networks with kilobyte-scale memory footprints and milliwatt power budgets. Simultane-
ously, we explored the critical role of model optimization techniques such as quantization, pruning,
and joint compression strategies, as well as the contributions of NAS in tailoring architectures
to edge constraints. On the software side, we cataloged an extensive range of deployment frame-
works, compiler toolchains, and AutoML platforms that streamline the end-to-end TinyDL lifecycle.
Through domain-specific applications in vision, audio, healthcare, and industrial monitoring, we
demonstrated the transformative potential of TinyDL across sectors demanding low latency, energy
efficiency, and data privacy.
Looking ahead, TinyDL is poised to catalyze a new generation of edge-native intelligence. This
includes the development of neuromorphic architectures using spiking neural networks, federated
learning for decentralized personalization, and ultra-lightweight foundation models capable of
generalization across tasks and modalities. The co-design of hardware and software will become
increasingly central, as will the creation of standardized, energy-aware benchmarks to evaluate
system performance holistically. By bridging the conceptual, architectural, and practical aspects of
TinyDL, this survey aims to serve as a foundational resource for both researchers and practitioners.
It underscores the critical shift from cloud dependence to autonomous, efficient edge intelligence,
laying the groundwork for continued innovation in AI at the very edge of computing.
References
[1] Syed Ali Raza Zaidi, Ali M Hayajneh, Maryam Hafeez, and Qasim Zeeshan Ahmed. Unlocking edge intelligence
through tiny machine learning (tinyml). IEEE Access, 10:100867–100877, 2022.
[2] Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han. Tiny machine learning: Progress and futures
[feature]. IEEE Circuits and Systems Magazine, 23(3):8–34, 2023.
[3] Hui Han and Julien Siebert. Tinyml: A systematic review and synthesis of existing research. In 2022 International
Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 269–274. IEEE, 2022.
[4] Bita Darvish Rouhani, Azalia Mirhoseini, and Farinaz Koushanfar. Tinydl: Just-in-time deep learning solution for
constrained embedded systems. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4.
IEEE, 2017.
[5] Minh Tri Lê, Pierre Wolinski, and Julyan Arbel. Efficient neural networks for tiny machine learning: A comprehensive
review. arXiv preprint arXiv:2311.11883, 2023.
[6] Jihong Park, Sumudu Samarakoon, Mehdi Bennis, and Mérouane Debbah. Wireless network intelligence at the edge.
Proceedings of the IEEE, 107(11):2204–2239, 2019.
[7] Stanislava Soro. Tinyml for ubiquitous edge ai. arXiv preprint arXiv:2102.01255, 2021.
[8] Youssef Abadade, Anas Temouden, Hatim Bamoumen, Nabil Benamar, Yousra Chtouki, and Abdelhakim Senhaji
Hafid. A comprehensive survey on tinyml. IEEE Access, 11:96892–96922, 2023.
[9] Partha Pratim Ray. A review on tinyml: State-of-the-art and prospects. Journal of King Saud University-Computer and
Information Sciences, 34(4):1595–1623, 2022.
[10] Danilo Pau and Prem Kumar Ambrose. Automated neural and on-device learning for micro controllers. In 2022 IEEE
21st Mediterranean Electrotechnical Conference (MELECON), pages 758–763. IEEE, 2022.
[11] Giovanni Delnevo, Silvia Mirri, Catia Prandi, and Pietro Manzoni. An evaluation methodology to determine the
actual limitations of a tinyml-based solution. Internet of Things, 22:100729, 2023.
[12] Luigi Capogrosso, Federico Cunico, Dong Seon Cheng, Franco Fummi, and Marco Cristani. A machine learning-
oriented survey on tiny machine learning. IEEE Access, 12:23406–23426, 2024.
[13] Lina Bariah, Qiyang Zhao, Hang Zou, Yu Tian, Faouzi Bader, and Merouane Debbah. Large generative ai models for
telecom: The next big thing? IEEE Communications Magazine, 62(11):84–90, 2024.
[14] Imopishak Thingom and N Basanta Singh. A review on machine learning in iot devices. International Journal of
Digital Technologies, 2(1), 2023.
[15] Visal Rajapakse, Ishan Karunanayake, and Nadeem Ahmed. Intelligence at the extreme edge: A survey on reformable
tinyml. ACM Computing Surveys, 55(13s):1–30, 2023.
[16] Nasser Alajlan and Dalia M. Ibrahim. Tinyml: Enabling of inference deep learning models on ultra-low-power iot
edge devices for ai applications. Micromachines, 13(6):851, 2022.
[17] Abdussalam Elhanashi, Pierpaolo Dini, Sergio Saponara, and Qinghe Zheng. Advancements in tinyml: Applications,
limitations, and impact on iot devices. Electronics, 13(17):3562, 2024.
[18] Georgios Kornaros. Hardware-assisted machine learning in resource-constrained iot environments for security:
review and future prospective. IEEE Access, 10:58603–58622, 2022.
[19] Ismail Lamaakal, Ibrahim Ouahbi, Khalid El Makkaoui, Yassine Maleh, Paweł Pławiak, and Fahad Alblehai. A tinydl
model for gesture-based air handwriting arabic numbers and simple arabic letters recognition. IEEE Access, 2024.
[20] Zeinab E Ahmed, Aisha A Hashim, Rashid A Saeed, and Mamoon M Saeed. Tinyml network applications for smart
cities. In TinyML for Edge Intelligence in IoT and LPWAN Networks, pages 423–451. Elsevier, 2024.
[21] Norah N Alajlan and Dina M Ibrahim. Tinyml: Adopting tiny machine learning in smart
cities. Journal of Autonomous Intelligence, 7(4), 2024.
[22] Ivan Khokhlov, Egor Davydenko, Ilya Osokin, Ilya Ryakin, Azer Babaev, Vladimir Litvinenko, and Roman Gorbachev.
Tiny-yolo object detection supplemented with geometrical data. In 2020 IEEE 91st Vehicular Technology Conference
(VTC2020-Spring), pages 1–5. IEEE, 2020.
[23] Nithesh Singh Sanjay and Ali Ahmadinia. Mobilenet-tiny: A deep neural network-based real-time object detection for
rasberry pi. In 2019 18th IEEE international conference on machine learning and applications (ICMLA), pages 647–652.
IEEE, 2019.
[24] Brett Koonce. Squeezenet. In Convolutional neural networks with swift for tensorflow: image recognition and dataset
categorization, pages 73–85. Springer, 2021.
[25] Riku Immonen and Timo Hämäläinen. Tiny machine learning for resource-constrained microcontrollers. Journal of
Sensors, 2022(1):7437023, 2022.
[26] Danilo Pau, Abderrahim Khiari, and Davide Denaro. Online learning on tiny micro-controllers for anomaly detection
in water distribution systems. In 2021 IEEE 11th International Conference on Consumer Electronics (ICCE-Berlin), pages
1–6. IEEE, 2021.
[27] Rakhee Kallimani, Krishna Pai, Prasoon Raghuwanshi, Sridhar Iyer, and Onel LA López. Tinyml: Tools, applications,
challenges, and future research directions. Multimedia Tools and Applications, 83(10):29015–29045, 2024.
[28] Swapnil Sayan Saha, Sandeep Singh Sandha, and Mani Srivastava. Machine learning for microcontroller-class
hardware: A review. IEEE Sensors Journal, 22(22):21362–21390, 2022.
[29] Michael Fauscette. TinyML: Portable, Low Cost, Low Power Machine Learning, 2025.
[30] Sucheta Mandal. TinyML: Running Deep Learning Models on Microcontrollers, April 2025.
[31] Syntiant. Syntiant NDP120 Achieves Outstanding Results in Latest MLPerf Tiny v0.7 Benchmark Suite, 2025.
[32] Coral. M.2 Accelerator with Dual Edge TPU, 2025.
[33] Syntiant. Syntiant Core 2 Achieves Lowest Power Results in MLPerf Tiny v1.2 Benchmark Suite, 2025.
[34] Himax Technologies Inc. Himax Launches WiseEye WE-I Plus HX6537-A to Support AI Deep Learning with Google’s
TensorFlow Lite for Microcontrollers, June 2020.
[35] Salvatore Salamone. Real-time Analytics News for Week Ending June 19, June 2021.
[36] MLCommons. Benchmark MLPerf Inference: Tiny | MLCommons V1.1 Results, 2025.
[37] MLMark. Introducing the EEMBC MLMark Benchmark, 2025.
[38] Colby R Banbury, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel, Jeremy Holleman, Xinyuan Huang, Robert
Hurtado, David Kanter, Anton Lokhmotov, et al. Benchmarking tinyml systems: Challenges and direction. arXiv
preprint arXiv:2003.04821, 2020.
[39] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning
requires rethinking generalization, 2017.
[40] Aakanksha Chowdhery, Pete Warden, Jonathon Shlens, Andrew Howard, and Rocky Rhodes. Visual wake words
dataset. arXiv preprint arXiv:1906.05721, 2019.
[41] Ji Lin, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. Mcunet: Tiny deep learning on iot devices. Advances
in neural information processing systems, 33:11711–11722, 2020.
[42] Sangwon Lee, Jonghoon Choi, Sehoon Park, and Sungroh Yoon. Designing extremely memory-efficient cnns for
on-device vision tasks. IEEE Access, 8:49401–49413, 2020.
[43] Colby Banbury, Vijay Janapa Reddi, Peter Torelli, Jeremy Holleman, Nat Jeffries, Csaba Kiraly, Pietro Montino, David
Kanter, Sebastian Ahmed, Danilo Pau, et al. Mlperf tiny benchmark. arXiv preprint arXiv:2106.07597, 2021.
[44] Gaurav Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM
Computing Surveys, 55(12):1–37, 2023.
[45] Gabriel Signoretti, Marianne Silva, Pedro Andrade, Ivanovitch Silva, Emiliano Sisinni, and Paolo Ferrari. An evolving
tinyml compression algorithm for iot environments based on data eccentricity. Sensors, 21(12):4153, 2021.
[46] Urmish Thakker, Paul N Whatmough, Zhi-Gang Liu, Matthew Mattina, and Jesse Beu. Compressing language models
using doped kronecker products. arXiv preprint arXiv:2001.08896, 2020.
[47] Pan Hu, Junha Im, Zain Asgar, and Sachin Katti. Starfish: Resilient image compression for aiot cameras. In Proceedings
of the 18th Conference on Embedded Networked Sensor Systems, pages 395–408, 2020.
[48] Sek M Chai. Quantization-guided training for compact tinyml models. In Research Symposium on Tiny Machine
Learning, 2021.
[49] Manuele Rusci, Marco Fariselli, Alessandro Capotondi, and Luca Benini. Leveraging automated mixed-low-precision
quantization for tiny edge microcontrollers. In IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge,
and Mobile for Embedded Machine Learning: Second International Workshop, IoT Streams 2020, and First International
Workshop, ITEM 2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14-18, 2020, Revised Selected Papers
2, pages 296–308. Springer, 2020.
[50] Hamed Fatemi, Vedant Karia, Tej Pandit, and Dhireesha Kudithipudi. TENT: Efficient quantization of neural networks
on the tiny edge with tapered fixed point. In Proceedings of the Research Symposium on Tiny Machine Learning, 2020.
[51] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size, 2016.
[52] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,
and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
[53] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks, 2019.
[54] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, 2020.
[55] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling
bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
[56] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets
and transformer, 2022.
[57] Yvan Tortorella, Luca Bertaccini, Luca Benini, Davide Rossi, and Francesco Conti. Redmule: A mixed-precision matrix-
matrix operation engine for flexible and energy-efficient on-chip linear algebra and tinyml training acceleration,
2023.
[58] Hasib-Al Rashid, Argho Sarkar, Aryya Gangopadhyay, Maryam Rahnemoonfar, and Tinoosh Mohsenin. Tinyvqa:
Compact multimodal deep neural network for visual question answering on resource-constrained devices, 2024.
[59] Soroush Heydari and Qusay H Mahmoud. Tiny machine learning and on-device inference: A survey of applications,
challenges, and future directions. Sensors, 25(10):3191, 2025.
[60] Praneel Chand and Mansour Assaf. An empirical study on lightweight cnn models for efficient classification of used
electronic parts. Sustainability, 16(17):7607, 2024.
[61] Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, and Alexander Wong. Squishednets: Squishing squeezenet
further for edge device scenarios via deep evolutionary synthesis. arXiv preprint arXiv:1711.07459, 2017.
[62] Yuanyuan Xu, Genke Yang, Jiliang Luo, and Jianan He. An electronic component recognition algorithm based on
deep learning with a faster squeezenet. Mathematical Problems in Engineering, 2020(1):2940286, 2020.
[63] Colby Banbury, Chuteng Zhou, Igor Fedorov, Ramon Matas, Urmish Thakker, Dibakar Gope, Vijay Janapa Reddi,
Matthew Mattina, and Paul Whatmough. Micronets: Neural network architectures for deploying tinyml applications
on commodity microcontrollers. Proceedings of machine learning and systems, 3:517–532, 2021.
[64] Victor JB Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, and Luca Benini. Optimizing the deployment of
tiny transformers on low-power mcus. IEEE Transactions on Computers, 2024.
[65] Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. Pruning and quantization for deep neural
network acceleration: A survey. Neurocomputing, 461:370–403, 2021.
[66] Benjamin Hawks, Javier Duarte, Nicholas J Fraser, Alessandro Pappalardo, Nhan Tran, and Yaman Umuroglu. Ps and
qs: Quantization-aware pruning for efficient low latency neural network inference. Frontiers in Artificial Intelligence,
4:676564, 2021.
[67] Andrey Kuzmin, Markus Nagel, Mart Van Baalen, Arash Behboodi, and Tijmen Blankevoort. Pruning vs quantization:
Which is better? Advances in neural information processing systems, 36:62414–62427, 2023.
[68] KA Kumari, S Ahamad, T Patil, K Sardana, E Muniyandy, and D Pilli. Neural network pruning techniques for efficient
model compression. International Journal of Intelligent Systems and Applications in Engineering, 12(15s):565–575, 2024.
[69] Xinyu Zhang, Ian Colbert, Ken Kreutz-Delgado, and Srinjoy Das. Training deep neural networks with joint quantization
and pruning of features and weights.
[70] Han Cai, Chuang Gan, Ji Lin, and Song Han. Network augmentation for tiny deep learning. arXiv preprint
arXiv:2110.08890, 2021.
[71] N. Tan. AI on Microcontrollers: uTensor brings Deep-Learning to MCUs. FOSDEM 2018, 2018. Presentation.
[72] Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna
Natraj, Tiezhen Wang, et al. Tensorflow lite micro: Embedded machine learning for tinyml systems. Proceedings of
Machine Learning and Systems, 3:800–811, 2021.
[73] Shawn Hymel, Jan Trivedi, Louis Heller, Sandeep Sharma, Paul Fiedler, Ajay Patel, Anil Chandak, Abhishek Sinha,
and Thomas Schmid. Edge Impulse: An MLOps Platform for Tiny Machine Learning. arXiv preprint, arXiv:2212.03332,
2022. Available at [Link]
[74] Google AI Edge. TensorFlow Lite Model Maker. [Link] 2025. Accessed:
June 5, 2025.
[75] Towards Data Science. Deep Learning on your phone: PyTorch Lite Interpreter for mobile plat-
forms. [Link]
ae73d0b17eaa/3, 2025. Published on January 18, 2025. Accessed: June 5, 2025.
[76] Liangzhen Lai, Naveen Suda, and Vikas Chandra. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M
CPUs. arXiv preprint, arXiv:1801.06601, 2018. Available at [Link]
[77] C. Liu, M. Jobst, L. Guo, X. Shi, J. Partzsch, and C. Mayr. Deploying machine learning models to ahead-of-time runtime
on edge using microtvm. arXiv preprint, arXiv:2304.04842, 2023. Available at [Link]
[78] N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov, N. Gibson, J. Hegeman, M. Lele, R. Levenstein,
J. Montgomery, B. Maher, S. Nadathur, J. Olesen, J. Park, A. Rakhov, M. Smelyanskiy, and M. Wang. Glow: Graph
lowering compiler techniques for neural networks. arXiv preprint, arXiv:1805.00907, 2018. Available at https:
//[Link]/abs/1805.00907.
[79] Keras2c: A library for converting Keras neural networks to real-time compatible C. Engineering Applications of
Artificial Intelligence, 100:104188, 2021.
[80] R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, and A. G. Gray. Mlpack: A scalable c++
machine learning library. Journal of Machine Learning Research, 14(1):801–805, 2013.
[81] STMicroelectronics. X-CUBE-AI: STM32Cube Expansion Package. [Link]
[Link], 2025. Accessed: June 5, 2025.
[82] J. Duarte, E. Kreinar, J. Ngadiuba, et al. hls4ml: An open-source codesign workflow to empower scientific low-
power machine learning devices. In TinyML Research Symposium 2021, San Jose, CA, 2021. arXiv:2103.05579,
[Link]
[83] ARM Ltd. OctoML: Accelerating ML model deployment. [Link] 2025. ARM
Partner Catalog, Accessed: June 5, 2025.
[84] Nebuly Team. nebullvm: AI runtime optimization library. [Link] 2024. Accessed: June 5,
2025.
[85] Mauro Conti, Roberto Di Pietro, Luigi V. Mancini, and Alessandro Mei. (old) distributed data source verification in
wireless sensor networks. Inf. Fusion, 10(4):342–353, 2009.
[86] X. He. Accelerated linear algebra compiler for computationally efficient numerical models. PLOS ONE, 18(2):e0282265,
2023.
[87] J. Bai, F. Lu, K. Zhang, et al. Onnx: Open neural network exchange. GitHub repository, 2019.
[88] N. Vasilache et al. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions.
arXiv preprint, arXiv:1802.04730, 2018. Available at [Link]
[89] Qeexo. Qeexo automl user guides. [Link] 2025. Accessed: 2025-06-11.
[90] Neuton AI. Neuton ai. [Link] 2025. Accessed: 2025-06-11.
[91] Latent AI. Latent ai. [Link] 2025. Accessed: 2025-06-11.
[92] SensiML Corp. Sensiml toolkit technical overview. Technical report, SensiML Corp., April 2021. Rev. 1.1.
[93] Arduino. Get started with machine learning on arduino nano 33 ble sense. [Link]
ble-sense/get-started-with-machine-learning/, 2025. Accessed: 2025-06-11.
[94] NXP Semiconductors. eIQ Toolkit User Guide, 2024. Version 1.8.0.
[95] Microsoft. Edgeml: Machine learning for resource-constrained edge devices. [Link]
2025. Accessed: 2025-06-11.
[96] EdjeElectronics. Train a tensorflow lite 2 object detection model. [Link]
EdjeElectronics/TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi/blob/master/Train_TFLite2_
Object_Detction_Model.ipynb, 2025. Accessed: 2025-06-11.
[97] EdjeElectronics. Tensorflow lite object detection on android and raspberry pi. [Link]
TensorFlow-Lite-Object-Detection-on-Android-and-Raspberry-Pi, 2025. Accessed: 2025-06-11.
[98] Sony AI. Sony ai. [Link] 2025. Accessed: 2025-06-11.
[99] Sony. Model optimization toolkit. [Link] 2025. Accessed: 2025-06-11.
[100] KaaIoT Technologies, LLC. Kaaiot and supermicro collaborate to provide ai-powered iot solutions for the edge. https:
//[Link]/blog/kaaiot-and-supermicro-collaborate-to-provide-ai-powered-iot-solutions-for-the-edge, 2025.
Accessed: 2025-06-11.
[101] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh
Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. Protonn: compressed and accurate knn for resource-
scarce devices. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page
1331–1340. [Link], 2017.
[102] Yuxuan Liang, Yulin Han, and Fangming Jiang. Deep learning-based small object detection: A survey. In Proceedings
of the 2022 8th International Conference on Computing and Artificial Intelligence (ICCAI ’22), pages 432–438. ACM,
2022.
[103] Qianyun Lu and Boris Murmann. Improving the energy efficiency and robustness of tinyml computer vision using
log-gradient input images. In Proceedings of the tinyML Research Symposium (tinyML Research Symposium ’22). ACM,
March 2022. Also available as arXiv:2203.02571.
[104] S. B. Lakshman and N. U. Eisty. Software engineering approaches for tinyml based iot embedded vision: A systematic
literature review. arXiv preprint arXiv:2204.08702, 2022.
[105] Qian Feng, Xinxin Xu, and Zhenxing Wang. Deep learning-based small object detection: A survey. Mathematical
Biosciences and Engineering, 20(4):6551–6590, 2023.
[106] Adriel Monti De Nardi and Maxwell Eduardo Monteiro. Evaluation of the energy viability of smart iot sensors using
tinyml for computer vision applications: A case study. International Robotics & Automation Journal, 9(2):78–85, 2023.
[107] Colby Banbury, Emil Njor, Andrea Mattia Garavagno, Mark Mazumder, Matthew Stewart, Pete Warden, Manjunath
Kudlur, Nat Jeffries, and Vijay Janapa Reddi. Wake vision: A tailored dataset and benchmark suite for tinyml computer
vision applications. arXiv preprint arXiv:2405.00892v5, Jun 2025. ver. 5.
[108] S. You, Zhiyu Chen, Shangdong Li, Mengxue Wang, Tengfeng Feng, and Yimu Jiang. Yolite+: a lightweight multi-object
detection approach in traffic scenarios. Procedia Computer Science, 199:346–353, 2022.
[109] Andrew Barovic and Armin Moin. Tinyml for speech recognition. arXiv preprint arXiv:2504.16213, 2025.
[110] Ahmed Y. Radwan, Mohammad Shehab, and Mohamed-Slim Alouini. Tinyml nlp scheme for semantic wireless
sentiment classification with privacy preservation. arXiv preprint arXiv:2411.06291v3, April 2025. Accepted at EuCNC
& 6G Summit 2025.
[111] Ismail Lamaakal, Yassine Maleh, Khalid El Makkaoui, Ibrahim Ouahbi, Mohamed Essahraui, Mohamed F. Bouami,
Ahmed A. Abd El-Latif, May Almousa, Jun Peng, and Dusit Niyato. A comprehensive survey on tiny machine learning
for human behavior analysis (hba). IEEE Internet of Things Journal, 2025. In press.
[112] M. Pujari, A. Goel, and A. K. Pakina. Efficient tinyml architectures for on-device small language models: Privacy-
preserving inference at the edge. International Journal Science and Technology, 3(3):67–75, 2024.
[113] Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening, Salim Hariri, Sicong Shao, Pratik
Satam, and Soheil Salehi. Quantized transformer language model implementations on edge devices. In Proceedings of
the 2023 IEEE 22nd International Conference on Machine Learning and Applications (ICMLA), pages 104–111, 2023.
[114] Zhaolan Huang, Adrien Tousnakhoff, Polina Kozyr, Roman Rehausen, Felix Bießmann, Robert Lachlan, Cedric Adjih,
and Emmanuel Baccelli. Tinychirp: Bird song recognition using tinyml models on low-power wireless acoustic
sensors. arXiv preprint arXiv:2407.21453v2, September 2024. Accepted at IEEE IS2 2024.
[115] Vasileios Tsoukas, Eleni Boumpa, Georgios Giannakas, and Athanasios Kakarountas. A review of machine learning
and tinyml in healthcare. In Proceedings of the 25th Pan-Hellenic Conference on Informatics, pages 69–73, 2021.
[116] Norhen Abdennadher, Danilo Pau, and Arcangelo Bruna. Fixed complexity tiny reservoir heterogeneous network
for on-line ecg learning of anomalies. In 2021 IEEE 10th Global Conference on Consumer Electronics (GCCE), pages
233–237. IEEE, 2021.
[117] Anandi Dutta. A smart design framework for a novel reconfigurable multi-processor systems-on-chip (ASREM) architecture.
Ph.D. dissertation, University of Louisiana at Lafayette, Lafayette, LA, USA, 2016.
[118] Musa Dima Genemo. Federated learning for bronchus cancer detection using tiny machine learning edge devices.
Indonesian Journal of Data and Science, 5(1):64–69, 2024.
[119] Martin Ragot, Nicolas Martin, Sonia Em, Nico Pallamin, and Jean-Marc Diverrez. Emotion recognition using
physiological signals: laboratory vs. wearable sensors. In Advances in Human Factors in Wearable Technologies and
Game Design: Proceedings of the AHFE 2017 International Conference on Advances in Human Factors and Wearable
Technologies, July 17-21, 2017, The Westin Bonaventure Hotel, Los Angeles, California, USA 8, pages 15–22. Springer,
2018.
[120] Juan Antonio Domínguez-Jiménez, Kiara Coralia Campo-Landines, Juan C Martínez-Santos, Enrique J Delahoz, and
Sonia H Contreras-Ortiz. A machine learning model for emotion recognition from physiological signals. Biomedical
signal processing and control, 55:101646, 2020.
[121] Rita Laureanti, Marco Bilucaglia, Margherita Zito, Riccardo Circi, Alessandro Fici, Fiamma Rivetti, Riccardo Valesi,
Carlo Oldrini, Luca T Mainardi, and Vincenzo Russo. Emotion assessment using machine learning and low-cost
wearable devices. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society
(EMBC), pages 576–579. IEEE, 2020.
[122] Satyapreet Singh Yadav, Radha Agarwal, Kola Bharath, Sandeep Rao, and Chetan Singh Thakur. Tinyradar: Mmwave
radar based human activity classification for edge computing. In 2022 IEEE International Symposium on Circuits and
Systems (ISCAS), pages 2414–2417. IEEE, 2022.
[123] Bidyut Saha, Riya Samanta, Soumya Kanti Ghosh, and Ram Babu Roy. From wrist to world: Harnessing wearable imu
sensors and tinyml to enable smart environment interactions. In Proceedings of the Third International Conference on
AI-ML Systems, pages 1–3, 2023.
[124] Anita Christaline Johnvictor, M Poonkodi, N Prem Sankar, and Thinesh Vs. Tinyml-based lightweight ai healthcare
mobile chatbot deployment. Journal of Multidisciplinary Healthcare, pages 5091–5104, 2024.
[125] Mamta Bhamare, Pradnya V Kulkarni, Rashmi Rane, Sarika Bobde, and Ruhi Patankar. Tinyml applications and use
cases for healthcare. In TinyML for Edge Intelligence in IoT and LPWAN Networks, pages 331–353. Elsevier, 2024.
[126] Samson O Ooko and Simon M Karume. Application of tiny machine learning in predicative maintenance in industries.
Journal of Computing Theories and Applications, 2(1):131–150, 2024.
[127] Lachit Dutta and Swapna Bharali. Tinyml meets iot: A comprehensive survey. Internet of Things, 16:100461, 2021.
[128] J Manokaran and G Vairavel. Smart anomaly detection using data-driven techniques in iot edge: a survey. In
Proceedings of Third International Conference on Communication, Computing and Electronics Systems: ICCCES 2021,
pages 685–702. Springer, 2022.
[129] Vítor M Oliveira and António HJ Moreira. Edge ai system using a thermal camera for industrial anomaly detection.
In International Summit Smart City 360°, pages 172–187. Springer, 2021.
[130] Matteo Cardoni, Danilo Pietro Pau, Laura Falaschetti, Claudio Turchetti, and Marco Lattuada. Online learning of oil
leak anomalies in wind turbines with block-based binary reservoir. Electronics, 10(22):2836, 2021.
[131] Apostolos Xenakis, Anthony Karageorgos, Efthimios Lallas, Adriana E Chis, and Horacio González-Vélez. Towards
distributed iot/cloud based fault detection and maintenance in industrial automation. Procedia Computer Science,
151:683–690, 2019.
[132] Yap Yan Siang, Mohd Ridzuan Ahmad, and Mastura Shafinaz Zainal Abidin. Anomaly detection based on tiny machine
learning: A review. Open International Journal of Informatics, 9(Special Issue 2):67–78, 2021.
[133] Mattia Antonini, Miguel Pincheira, Massimo Vecchio, and Fabio Antonelli. A tinyml approach to non-repudiable
anomaly detection in extreme industrial environments. In 2022 IEEE International Workshop on Metrology for Industry
4.0 & IoT (MetroInd4.0&IoT), pages 397–402. IEEE, 2022.
[134] Muhammad Abubakar, Abdul Sattar, Hamid Manzoor, Khola Farooq, and Muhammad Yousif. Iiot: An infusion of
embedded systems, tinyml, and federated learning in industrial iot. Journal of Computing & Biomedical Informatics,
8(02), 2025.
[135] Martina Casiroli and Danilo Pietro Pau. Tiny machine learning business intelligence in the semiconductor industry:
A case study. In 2023 IEEE Global Conference on Artificial Intelligence and Internet of Things (GCAIoT), pages 9–16.
IEEE, 2023.
[136] Ramon Sanchez-Iborra and Antonio F Skarmeta. Tinyml-enabled frugal smart objects: Challenges and opportunities.
IEEE Circuits and Systems Magazine, 20(3):4–18, 2020.
[137] Hatim Bamoumen, Anas Temouden, Nabil Benamar, and Yousra Chtouki. How tinyml can be leveraged to solve
environmental problems: A survey. In 2022 International Conference on Innovation and Intelligence for Informatics,
Computing, and Technologies (3ICT), pages 338–343. IEEE, 2022.
[138] United Nations. World Population Projected to Reach 9.8 Billion in 2050, and 11.2 Billion in 2100. Online, 2017.
Accessed: YYYY-MM-DD.
[139] Alakananda Mitra, Sukrutha LT Vangipuram, Anand K Bapatla, Venkata KVV Bathalapalli, Saraju P Mohanty,
Elias Kougianos, and Chittaranjan Ray. Everything you wanted to know about smart agriculture. arXiv preprint
arXiv:2201.04754, 2022.
[140] Sarah Condran, Michael Bewong, Md Zahidul Islam, Lancelot Maphosa, and Lihong Zheng. Machine learning in
precision agriculture: a survey on trends, applications and evaluations over two decades. IEEE Access, 10:73786–73803,
2022.
[141] Yogeswaranathan Kalyani and Rem Collier. A systematic survey on the role of cloud, fog, and edge computing
combination in smart agriculture. Sensors, 21(17):5922, 2021.
[142] Vu Khanh Quy, Nguyen Van Hau, Dang Van Anh, Nguyen Minh Quy, Nguyen Tien Ban, Stefania Lanza, Giovanni
Randazzo, and Anselme Muzirafuti. Iot-enabled smart agriculture: architecture, applications, and challenges. Applied
Sciences, 12(7):3396, 2022.
[143] Nikesh Gondchawar, RS Kawitkar, et al. Iot based smart agriculture. International Journal of advanced research in
Computer and Communication Engineering, 5(6):838–842, 2016.
[144] G Sushanth and S Sujatha. Iot based smart agriculture system. In 2018 international conference on wireless
communications, signal processing and networking (WiSPNET), pages 1–4. IEEE, 2018.
[145] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis,
Mackenzie W Mathis, Frank Van Langevelde, Tilo Burghardt, et al. Perspectives in machine learning for wildlife
conservation. Nature communications, 13(1):792, 2022.
[146] David J Curnick, Alasdair J Davies, Clare Duncan, Robin Freeman, David MP Jacoby, Hugo TE Shelley, Cristian Rossi,
Oliver R Wearn, Michael J Williamson, and Nathalie Pettorelli. Smallsats: a new technological frontier in ecology and
conservation? Remote Sensing in Ecology and Conservation, 8(2):139–150, 2022.
[147] Kong Ka Hing, Mehran Behjati, Vala Saleh, Yap Kian Meng, Anwar PP Abdul Majeed, and Yufan Zheng. Edge
intelligence for wildlife conservation: Real-time hornbill call classification using tinyml. In International Conference
on Intelligent Manufacturing and Robotics, pages 476–488. Springer, 2024.
[148] Konkala Venkateswarlu Reddy, BS Karthikeya Reddy, Veerapu Goutham, Miriyala Mahesh, JS Nisha, Gopinath
Palanisamy, Mallikarjuna Golla, Swetha Purushothaman, Katangure Rithisha Reddy, and Varsha Ramkumar. Edge ai
in sustainable farming: Deep learning-driven iot framework to safeguard crops from wildlife threats. IEEE Access,
12:77707–77723, 2024.
[149] Ariel M Lorenzo, Rodrigo Barien, Neil Darwin Favila, Dennis Basa, Jay M Ventura, and Sherwin Catolos. Trees have
ears: An acoustic surveillance and tinyml-based for detecting illegal logging. In 2024 International Conference of
Adisutjipto on Aerospace Electrical Engineering and Informatics (ICAAEEI), pages 1–6. IEEE, 2024.
[150] Mangesh Pujari, Anil Kumar Pakina, and Ashwin Sharma. Enhancing cybersecurity in edge ai systems: A game-
theoretic approach to threat detection and mitigation. IOSR Journal of Computer Engineering, 25(3):65–73, 2023.
[151] Haoyu Ren, Darko Anicic, and Thomas A Runkler. Tinyol: Tinyml with online-learning on microcontrollers. In 2021
international joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2021.
[152] Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, and Vijay Janapa Reddi. Few-shot keyword spotting in
any language. arXiv preprint arXiv:2104.01454, 2021.
[153] Kavya Kopparapu and Eric Lin. Tinyfedtl: Federated transfer learning on tiny devices. arXiv preprint arXiv:2110.01107,
2021.
[154] Marc Monfort Grau, Roger Pueyo Centelles, and Felix Freitag. On-device training of machine learning models on
microcontrollers with a look at federated learning. In Proceedings of the Conference on Information Technology for
Social Good, pages 198–203, 2021.
[155] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[156] Pete Warden and Daniel Situnayake. Tinyml: Machine learning with tensorflow lite on arduino and ultra-low-power
microcontrollers. O’Reilly Media, 2019.
[157] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan
Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In
13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
[158] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation
through time. Advances in neural information processing systems, 29, 2016.
[159] Minsik Cho and Daniel Brand. Mec: Memory-efficient convolution for deep neural network. In International Conference
on Machine Learning, pages 815–824. PMLR, 2017.
[160] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
[161] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and
Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint
arXiv:1805.06085, 2018.
[162] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for
rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
[163] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed
precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8612–8620, 2019.
[164] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth
convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[165] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned
step size quantization. arXiv preprint arXiv:1902.08153, 2019.
[166] Tensorflow model optimization toolkit. [Link] Accessed: 2024-06-10.
[167] Boyu Chen, Peixia Li, Baopu Li, Chen Lin, Chuming Li, Ming Sun, Junjie Yan, and Wanli Ouyang. Bn-nas: Neural
architecture search with batch normalization. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 307–316, 2021.
[168] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize
it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
[169] Angelo Garofalo, Manuele Rusci, Francesco Conti, Davide Rossi, and Luca Benini. Pulp-nn: A computing library for
quantized neural network inference at the edge on risc-v based parallel ultra low power clusters. In 2019 26th IEEE
International Conference on Electronics, Circuits and Systems (ICECS), pages 33–36. IEEE, 2019.
[170] Ambiq micro apollo3 blue soc technical reference manual. [Link] Accessed: 2024-06-10.
[171] J Lin, WM Chen, H Cai, C Gan, and S Han. Mcunetv2: Memory-efficient patch-based inference for tiny deep learning.
arXiv preprint arXiv:2110.15352, 2021.
[172] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209,
2018.
[173] Yann Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
[174] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[175] Riya Adlakha and Eltahir Kabbar. The challenges of tinyml implementation: A literature review. 2024.
[176] Filip Svoboda, Javier Fernandez-Marques, Edgar Liberis, and Nicholas D Lane. Deep learning on microcontrollers: A
study on deployment costs and challenges. In Proceedings of the 2nd European Workshop on Machine Learning and
Systems, pages 54–63, 2022.
[177] Parin Shah, Yuvaraj Govindarajulu, Pavan Kulkarni, and Manojkumar Parmar. Enhancing tinyml security: Study of
adversarial attack transferability. arXiv preprint arXiv:2407.11599, 2024.
[178] Jacob Huckelberry, Yuke Zhang, Allison Sansone, James Mickens, Peter A Beerel, and Vijay Janapa Reddi. Tinyml
security: Exploring vulnerabilities in resource-constrained machine learning systems. arXiv preprint arXiv:2411.07114,
2024.
[179] Archit Parnami and Minwoo Lee. Learning from few examples: A summary of approaches to few-shot learning.
arXiv preprint arXiv:2203.04291, 2022.
[180] Yeonju Kim, Jeonghyeon Yoon, and Seungku Kim. A few-shot learning-based material recognition scheme using
smartphones. Applied Sciences, 15(1):430, 2025.
[181] Enrico Fini, Stéphane Lathuiliere, Enver Sangineto, Moin Nabi, and Elisa Ricci. Online continual learning under
extreme memory constraints. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XXVIII 16, pages 720–735. Springer, 2020.
[182] Peter Chang. Benchmarking ai compiler for the tinyml market. [Link]
Peter-Chang_tinyML-[Link], 2023. Presented at tinyML Asia 2023, November 16, 2023.
[183] Sukhpal Singh Gill, Muhammed Golec, Jianmin Hu, Minxian Xu, Junhui Du, Huaming Wu, Guneet Kaur Walia,
Subramaniam Subramanian Murugesan, Babar Ali, Mohit Kumar, et al. Edge ai: A taxonomy, systematic review and
future directions. Cluster Computing, 28(1):1–53, 2025.
[184] Xubin Wang, Zhiqing Tang, Jianxiong Guo, Tianhui Meng, Chenhao Wang, Tian Wang, and Weijia Jia. Empowering
edge intelligence: A comprehensive survey on on-device ai models. ACM Computing Surveys, 57(9):1–39, 2025.
Hardware advances, notably the shift from simple microcontroller units (MCUs) to enhanced microcontrollers paired with accelerators such as NPUs and Edge TPUs, have significantly expanded TinyDL's capabilities. They allow TinyDL to handle more complex applications such as speech recognition, NLP, and object detection, which require more compute and memory than typical TinyML use cases. These platforms also make it practical to deploy deeper neural networks and more accurate models optimized with techniques such as QAT and NAS. Medium-scale TinyDL deployment becomes feasible as a result, because this hardware balances power, performance, and accuracy.
TinyML relies on model compression techniques such as knowledge distillation, pruning, and quantization to sustain performance on low-power IoT devices. Knowledge distillation trains a smaller "student" model to mimic the outputs of a larger "teacher" model, compressing the teacher's knowledge into a form that runs on constrained hardware such as MCUs. Structured pruning removes low-importance structures such as channels or filters to reduce model size, and quantization represents weights and activations at lower bit-widths, which reduces both storage and computational load. Together these techniques enable capable inference without extensive on-device computation, improving efficiency and battery life on IoT devices.
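The student-teacher objective described above can be sketched in a few lines. This is a minimal illustration, not a method taken from any cited work: the temperature of 4.0, the weighting `alpha`, and the example logits are assumptions chosen only to make the arithmetic visible.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL(teacher || student) at a shared temperature.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1 - alpha) * temperature ** 2 * kl

# Illustrative logits (assumed values, three classes, true class 0).
teacher = [6.0, 2.0, -1.0]
matched = distillation_loss([6.0, 2.0, -1.0], teacher, label=0)
mismatched = distillation_loss([0.0, 3.0, 1.0], teacher, label=0)
print(matched, mismatched)
```

A student that reproduces the teacher's logits incurs only the hard-label term, while a student that disagrees is penalized by both terms.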
TinyDL addresses ultra-resource-constrained environments by prioritizing efficiency and scalability metrics alongside traditional accuracy benchmarks. Model compression (e.g., pruning and quantization), compact network architectures, and low-power hardware platforms such as enhanced MCUs and specialized accelerators keep models within limited memory, storage, and energy budgets without sacrificing essential functionality. In addition, frameworks such as Edge AutoML automate design and fine-tuning for these environments and incorporate hardware feedback to improve deployment feasibility. Taken together, this makes TinyDL models better suited to edge conditions where resources are severely limited.
Domain-specific accelerators such as NPUs, ASICs, and FPGAs are designed to handle the computations that dominate deep learning, such as quantized convolution and attention workloads, more efficiently than general-purpose MCUs. They offer significant improvements in inference speed, energy efficiency, and model scalability, all of which matter for TinyDL deployment. These accelerators provide customized data paths and optimized execution for computationally intensive operations such as matrix multiplication, enabling more complex applications such as vision-based inference or speech recognition on constrained devices. This specialization yields faster data processing and lower power consumption, which is essential for prolonged operation in embedded systems.
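The quantized multiply-accumulate pattern that such accelerators implement in hardware can be illustrated in software. The following is a minimal sketch assuming symmetric per-tensor int8 quantization; the weights, inputs, and scales are made-up values, and real kernels (for example in CMSIS-NN) differ in many details.

```python
def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    """Affine quantization: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def int8_dot(q_a, q_b, zp_a=0, zp_b=0):
    """Integer-only dot product with a wide accumulator (int32 on real hardware)."""
    acc = 0
    for a, b in zip(q_a, q_b):
        acc += (a - zp_a) * (b - zp_b)
    return acc

# Illustrative float tensors (assumed values).
weights = [0.12, -0.5, 0.33, 0.9]
inputs = [1.0, 0.25, -0.75, 0.5]

# Symmetric scales: map the largest magnitude onto 127.
w_scale = max(abs(w) for w in weights) / 127
x_scale = max(abs(x) for x in inputs) / 127

q_w = quantize(weights, w_scale)
q_x = quantize(inputs, x_scale)

acc = int8_dot(q_w, q_x)            # all arithmetic here is on integers
approx = acc * w_scale * x_scale    # a single rescale recovers the real-valued result
exact = sum(w * x for w, x in zip(weights, inputs))
print(exact, approx)
```

Only the final rescale touches floating point, which is why the inner loop maps well onto integer data paths.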
Efficiency and compression metrics are crucial for TinyML because they measure how well a model performs under severe resource constraints, which standard performance metrics such as accuracy or F1-score do not capture. These metrics include model size in KB, inference latency in ms, and memory usage, covering both static and dynamic requirements. They determine whether a model can run on hardware with limited RAM or processing power, such as ARM Cortex-M MCUs. Techniques such as weight pruning and quantization are used to minimize model footprint so that inference can run in real time within the device's limits. This emphasis on operational feasibility marks a significant departure from conventional performance evaluation.
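As a rough worked example of these metrics, the sketch below estimates weight storage, peak activation memory, and mean latency. The parameter count, layer shapes, and the 512 KB flash and 64 KB RAM budgets are hypothetical, and the peak-RAM rule (largest adjacent input plus output buffer) is a simplification of what a real runtime allocates.

```python
import time

def model_size_kb(num_params, bits_per_weight):
    """Static storage for the weights alone, in KB (1 KB = 1024 bytes)."""
    return num_params * bits_per_weight / 8 / 1024

def peak_activation_kb(layer_activation_counts, bits_per_activation=8):
    """Rough peak RAM: the largest input+output pair of consecutive layers."""
    pairs = zip(layer_activation_counts, layer_activation_counts[1:])
    peak = max(a + b for a, b in pairs)
    return peak * bits_per_activation / 8 / 1024

def mean_latency_ms(fn, runs=20):
    """Average wall-clock latency of fn() in milliseconds on the current host."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000 / runs

# Hypothetical model and device budgets, chosen only for illustration.
params = 250_000
activations = [96 * 96 * 1, 48 * 48 * 8, 24 * 24 * 16, 12 * 12 * 32, 10]
flash_budget_kb, ram_budget_kb = 512, 64

fp32_kb = model_size_kb(params, 32)
int8_kb = model_size_kb(params, 8)
ram_kb = peak_activation_kb(activations)
fits = int8_kb <= flash_budget_kb and ram_kb <= ram_budget_kb

print(f"fp32 weights: {fp32_kb:.1f} KB, int8 weights: {int8_kb:.1f} KB")
print(f"peak activations: {ram_kb:.1f} KB, fits budget: {fits}")
print(f"host latency of a dummy workload: {mean_latency_ms(lambda: sum(range(1000))):.4f} ms")
```

In this toy setting the float32 weights exceed the assumed flash budget and the int8 weights fit. Latency measured on a development host says little about an MCU, so on-target timing would still be needed.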
TinyDL uses advanced optimization techniques such as Quantization-Aware Training (QAT), Neural Architecture Search (NAS), Hardware-Aware Quantization (HAQ), and knowledge distillation so that models retain high accuracy under strict memory and energy constraints. In contrast, TinyML often relies on foundational methods such as Post-Training Quantization (PTQ) and pruning, alongside manual feature engineering. TinyML strategies therefore focus on operating within basic resource constraints, whereas TinyDL applies more sophisticated techniques to extract greater capability from limited resources without significant accuracy loss.
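The mechanical difference between QAT and PTQ can be shown with a one-weight toy model. This sketch assumes a fixed quantization step of 0.05 and a straight-through estimator for the gradient; it is not the procedure of any specific toolkit. With a single weight, both approaches end on adjacent grid points, so the example shows the mechanism only, not an accuracy gain.

```python
def fake_quantize(w, scale, qmin=-128, qmax=127):
    """Quantize then dequantize, so the forward pass sees int8 rounding error."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

def train_qat(target, x=1.0, scale=0.05, lr=0.1, steps=200):
    """Fit y = w * x to `target` while the forward pass uses the quantized weight."""
    w = 0.0  # latent full-precision weight kept only during training
    for _ in range(steps):
        w_q = fake_quantize(w, scale)
        error = w_q * x - target
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and
        # apply the gradient of the squared error to the latent weight.
        grad = 2 * error * x
        w -= lr * grad
    return w, fake_quantize(w, scale)

latent, deployed = train_qat(target=0.53)

# PTQ analogue: round a weight that was trained in full precision.
ptq = fake_quantize(0.53, 0.05)
print(latent, deployed, ptq)
```

In QAT the rounding error is part of the training loss, so the remaining weights of a larger network can adapt to it. PTQ applies the same rounding once, after training has finished.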
A modular foundational architecture in TinyDL allows a general pre-trained backbone to be reused across multiple task-specific heads, which reduces redundancy and improves adaptability. In practice, it could be implemented with lightweight modular components fine-tuned through on-device or edge AutoML frameworks, so that devices swap task-specific components without replacing the entire model. This improves flexibility, since a device can be updated as user requirements change without extensive retraining. It could also improve resource utilization by deploying only the modules that are needed, which preserves efficiency on constrained hardware. Such modularity could open new deployment strategies for edge AI in which models stay adaptable and contextually relevant.
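The backbone-plus-heads idea can be sketched as a small registry. Everything here is hypothetical: the `backbone` function stands in for a frozen pre-trained feature extractor, and `ModularModel`, `linear_head`, and the task names are illustrative rather than part of any framework.

```python
from typing import Callable, Dict, List

Vector = List[float]
Head = Callable[[Vector], float]

def backbone(x: Vector) -> Vector:
    """Stand-in for a frozen, shared feature extractor: (mean, max, min)."""
    return [sum(x) / len(x), max(x), min(x)]

def linear_head(weights: Vector, bias: float = 0.0) -> Head:
    """Build a tiny task-specific head: a dot product over backbone features."""
    def head(features: Vector) -> float:
        return sum(w * f for w, f in zip(weights, features)) + bias
    return head

class ModularModel:
    """Shared backbone plus a registry of swappable task heads."""

    def __init__(self, backbone_fn: Callable[[Vector], Vector]) -> None:
        self.backbone_fn = backbone_fn
        self.heads: Dict[str, Head] = {}

    def load_head(self, task: str, head: Head) -> None:
        self.heads[task] = head      # only the small head is added or replaced

    def unload_head(self, task: str) -> None:
        self.heads.pop(task, None)   # frees the head, keeps the backbone

    def predict(self, task: str, x: Vector) -> float:
        return self.heads[task](self.backbone_fn(x))

model = ModularModel(backbone)
model.load_head("mean_level", linear_head([1.0, 0.0, 0.0]))
model.load_head("spread", linear_head([0.0, 1.0, -1.0]))

reading = [1.0, 2.0, 6.0]
print(model.predict("mean_level", reading), model.predict("spread", reading))

model.unload_head("spread")  # swap out one task without touching the rest
```

Because heads are small relative to the backbone, an update in this scheme would ship only the head parameters.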
Current TinyDL frameworks lack fully automated deployment pipelines that account for real-time environmental feedback and varied hardware capabilities. Possible improvements include enhancing compiler toolchains to partition TinyDL models across heterogeneous hardware more effectively, increasing runtime flexibility, and ensuring compatibility across platforms such as TFLite Micro and CMSIS-NN. Expanded support for dynamic model adaptation and fine-tuning through on-device AutoML could also improve model performance over time without significant manual intervention. These enhancements would support robust, adaptable, and efficient deployment of TinyDL models in real-world edge environments and address the framework limitations that currently hinder edge integration.
Federated learning in TinyDL faces three main challenges: communication overhead, heterogeneous device capabilities, and resilience to adversarial participants. Addressing them calls for sparsified updates, asynchronous or hierarchical federated learning protocols, and secure aggregation mechanisms that fit TinyML's constraints. Approaches such as on-device aggregation of quantized updates, using frameworks like TinyFedTL and TinyMetaFed, can help reduce communication load. Lightweight protocols that adapt dynamically to varying device capabilities, together with stronger security for data exchange, will be crucial in constrained environments.
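Sparsified updates can be illustrated with top-k selection followed by server-side averaging. This is a generic sketch with made-up client updates; it does not reproduce the TinyFedTL or TinyMetaFed APIs, and it omits the error-feedback residual and secure aggregation that a practical system would likely add.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries; return them as {index: value}."""
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return {i: update[i] for i in ranked[:k]}

def aggregate(sparse_updates, dim):
    """Server side: average client updates, treating dropped entries as zero."""
    total = [0.0] * dim
    for sparse in sparse_updates:
        for i, v in sparse.items():
            total[i] += v
    return [t / len(sparse_updates) for t in total]

# Illustrative per-client weight deltas (assumed values, 4 parameters each).
client_updates = [
    [0.50, -0.02, 0.01, -0.40],
    [0.30, 0.03, -0.01, -0.60],
    [0.40, -0.01, 0.02, -0.50],
]

sparse = [top_k_sparsify(u, k=2) for u in client_updates]
global_update = aggregate(sparse, dim=4)

sent = sum(len(s) for s in sparse)
dense = sum(len(u) for u in client_updates)
print(global_update, f"values sent: {sent} of {dense}")
```

Each client transmits index-value pairs for half of its entries here, so the actual byte savings depend on how the indices are encoded.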
AutoML in TinyDL automates the design, compression, and deployment of models, making them more practical to run in resource-constrained environments. It draws on frameworks such as TinyNAS, which performs hardware-aware NAS, and Once-for-All networks to balance accuracy, memory, and latency constraints. Challenges remain in fully integrating AutoML into deployment pipelines, particularly in combining it with model compression strategies such as quantization and pruning while incorporating real-time hardware feedback. Addressing these challenges would streamline end-to-end model deployment and improve efficiency and performance predictability in TinyDL applications.
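The core loop of a hardware-aware search, filtering candidates by memory budget and then ranking them, can be sketched as an exhaustive search over a tiny space. The cost model, the proxy score, and the search space below are all hypothetical placeholders; systems such as TinyNAS use trained predictors and far larger spaces.

```python
from itertools import product

def estimate_cost(width, depth, resolution):
    """Hypothetical cost model returning (flash_kb, ram_kb, proxy_score)."""
    params = depth * width * width * 9                  # stack of 3x3 convs, `width` channels
    flash_kb = params / 1024                            # int8 weights: one byte per parameter
    ram_kb = 2 * resolution * resolution * width / 1024 # int8 input + output feature maps
    proxy_score = depth * width * resolution            # placeholder for predicted accuracy
    return flash_kb, ram_kb, proxy_score

def search(flash_budget_kb, ram_budget_kb):
    """Return (score, config) for the best candidate within budget, or None."""
    best = None
    for width, depth, resolution in product([8, 16, 32], [2, 4, 6], [32, 48, 64]):
        flash_kb, ram_kb, score = estimate_cost(width, depth, resolution)
        if flash_kb > flash_budget_kb or ram_kb > ram_budget_kb:
            continue  # hardware feedback: reject configurations that cannot be deployed
        if best is None or score > best[0]:
            best = (score, {"width": width, "depth": depth, "resolution": resolution})
    return best

print(search(flash_budget_kb=64, ram_budget_kb=96))
print(search(flash_budget_kb=16, ram_budget_kb=96))
```

Tightening the flash budget from 64 KB to 16 KB moves the selected configuration to a narrower network at a higher input resolution, which is the kind of trade-off a hardware-aware search is meant to expose.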